At Wavii, we use classification for a number of NLP tasks such as disambiguating entities (Bush the band vs. George W. Bush), automatic learning of new entities (new musical artists, politicians, etc.), and relationship extraction between entities (whom did a company acquire).
A common problem when performing classification is deciding which features to generate. Since there’s no silver bullet, no one-size-fits-all feature space that works for every classification task, you have to invest time in the feature generation and selection process.
Like any engineering task, it’s best to approach classification iteratively. Start with a simple set of features that can be rapidly cranked out. Figure out which features are helping and generate more of those. Sometimes, for runtime performance and space reasons, filter out redundant features. Then iterate until you are happy with the results.
I use simple linear classifiers such as Logistic Regression with Regularization and SVMs, as these are robust and resilient to noise. Therefore, I don’t have to bother filtering features out until I’m taking the code to production.
As a side note, I have found that closely inspecting my features gives me a better understanding of the problem domain I am trying to solve. It also serves as a sanity check for signal leakage, bugs in feature generation, and so on.
scikit-learn and feature selection
clf.coef_ : the array of feature coefficients (weights) of a trained linear classifier clf in scikit-learn.
I typically train an L1-regularized Logistic Regression classifier and inspect the weights. You can adjust the C parameter of the regularizer to increase sparseness (in scikit-learn, C is the inverse of the regularization strength, so a smaller C leaves fewer features with nonzero weights) and thus see which features are helping. This presentation is a good resource to understand the effects of regularization.
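As a rough sketch of this workflow, the snippet below trains an L1-regularized logistic regression at a few values of C and counts the nonzero entries of clf.coef_. The synthetic dataset, the feature names f0..f19, and the particular C values are assumptions made for illustration; they are not taken from the Wavii pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 20 features, only 3 carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

nonzero_counts = {}
for C in (0.01, 0.1, 1.0, 10.0):
    # liblinear is one of the solvers that supports the L1 penalty.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    weights = clf.coef_[0]  # binary problem: coef_ has shape (1, n_features)
    nonzero_counts[C] = int(np.count_nonzero(weights))

# Features with a nonzero weight at the last C tried, largest |weight| first.
surviving = sorted(
    ((name, w) for name, w in zip(feature_names, weights) if w != 0.0),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for C, n in nonzero_counts.items():
    print(f"C={C}: {n} nonzero weights")
print(surviving[:5])
```

Sweeping C from small to large and watching which features enter first is one way to see which ones the classifier leans on most.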
Why namespace the features?
In text classification, the number of features can be quite large (in the thousands), which makes inspecting them one by one cumbersome. So I came up with a simple approach: organize my features into a hierarchy, and generate summary statistics for the bag of features at various levels of the hierarchy.
For example, for Wikipedia classifications, I have features such as abstract:bag-of-words:jude, infobox:bigram:Jude Law, etc. The levels of the hierarchy are delimited by ‘:’. (The abstract level contains features generated from a Wikipedia abstract, and the infobox level contains features generated from the Wikipedia infobox.)
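A minimal sketch of how such namespaced feature names could be generated is shown below. The function namespaced_features, its arguments, and the choice to lowercase abstract tokens while leaving infobox values untouched are all hypothetical; the actual Wavii feature generators are not shown in this post.

```python
def namespaced_features(abstract_text, infobox_values):
    """Return a set of 'source:type:value' feature names (hypothetical helper)."""
    features = set()

    # Abstract: lowercase whitespace tokens, then unigrams and bigrams.
    tokens = abstract_text.lower().split()
    for token in tokens:
        features.add(f"abstract:bag-of-words:{token}")
    for first, second in zip(tokens, tokens[1:]):
        features.add(f"abstract:bigram:{first} {second}")

    # Infobox: bigrams over each value, case preserved.
    for value in infobox_values:
        words = value.split()
        for first, second in zip(words, words[1:]):
            features.add(f"infobox:bigram:{first} {second}")

    return features


features = namespaced_features("Jude Law is an English actor", ["Jude Law"])
print(sorted(features))
```

Because ‘:’ is the delimiter, a feature value that itself contains ‘:’ needs care when the name is split back into levels.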
This allows me to break down my top features (in clf.coef_) by levels of the hierarchy and compare abstract vs. infobox, for example. I can also drill down into the abstract features and compare bag-of-words features to bigram features.
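The breakdown described above can be sketched roughly as follows. The helper summarize_by_namespace, the statistics it reports, and the weights in the example are made up for illustration; in practice the weights would come from something like dict(zip(feature_names, clf.coef_[0])).

```python
from collections import defaultdict


def summarize_by_namespace(weights_by_feature, depth):
    """Group features by the first `depth` levels of their name (hypothetical helper)."""
    groups = defaultdict(list)
    for name, weight in weights_by_feature.items():
        # maxsplit=depth leaves any ':' inside the feature value itself intact.
        prefix = ":".join(name.split(":", depth)[:depth])
        groups[prefix].append(weight)

    summary = {}
    for prefix, weights in groups.items():
        nonzero = [w for w in weights if w != 0.0]
        summary[prefix] = {
            "features": len(weights),
            "nonzero": len(nonzero),
            "total_abs_weight": sum(abs(w) for w in nonzero),
        }
    return summary


# Made-up weights standing in for a trained classifier's coefficients.
weights = {
    "abstract:bag-of-words:jude": 0.5,
    "abstract:bag-of-words:actor": 0.0,
    "abstract:bigram:jude law": 1.5,
    "infobox:bigram:Jude Law": 0.0,
}

by_source = summarize_by_namespace(weights, depth=1)  # abstract vs. infobox
by_type = summarize_by_namespace(weights, depth=2)    # bag-of-words vs. bigram
print(by_source)
print(by_type)
```

A namespace in which every weight is zero, like infobox in this toy data, stands out immediately in such a summary and is worth a second look.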
This approach of organizing the features and comparing groups of features makes this feature generation more tractable.
FYI: In this task, I found bigram features to be useful. I also discovered a bug in my Wikipedia scraper, because of which the infobox:* features were effectively being ignored by the classifier.
Thanks for reading. Please share your thoughts and your own approaches, and give me feedback on mine. You can leave a comment below, ping me on Twitter (@mkbubba), or email me at manishatwaviidotcom.