You can now follow all of them on Wavii, because today we’re rolling out sports.
Click an image to view its timeline feed…
…Or, download the Wavii iPhone app to view the full Sports feed!
Let us know what you think.
At Wavii, we use classification for a number of NLP tasks such as disambiguating entities (Bush the band vs. George W. Bush), automatic learning of new entities (new musical artists, politicians, etc.), and relationship extraction between entities (whom did a company acquire).
A common problem when performing classification is deciding which features to generate. Since there's no one-size-fits-all feature space that works for every classification task, you have to invest time in the feature generation and selection process.
Like any engineering task, it's best to approach classification iteratively. Start with a simple set of features that can be cranked out rapidly. Figure out which features are helping and generate more of those. Sometimes, for runtime performance and space reasons, filter out redundant features. Then iterate until you're happy with the results.
I use simple linear classifiers such as Logistic Regression with Regularization and SVMs, as these are robust and resilient to noise…therefore, I don’t have to bother filtering features out until I’m taking the code to production.
As a side note, I have found closely inspecting and understanding my features has given me a better understanding of the problem domain I was trying to solve. It also serves as a sanity check — signal leakage, bugs in feature generation, etc.
scikit-learn and feature selection
clf.coef_ : the array of feature coefficients for a trained classifier clf.
I typically train an L1-regularized Logistic Regression classifier and inspect the weights. You can lower the C parameter of the regularizer to increase sparseness (fewer nonzero features) and thus see which features are helping. This presentation is a good resource for understanding the effects of regularization.
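To make this concrete, here's a minimal sketch of the workflow in scikit-learn. The documents, labels, and C values are toy data invented for this post, but the mechanics (penalty="l1", varying C, inspecting clf.coef_) are the real ones:

```python
# Toy sketch: tightening L1 regularization shrinks the set of nonzero weights.
# Documents and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "jude law stars in a new film",
    "the band releases a new album",
    "senator wins the election",
    "guitarist joins the band on tour",
]
labels = [0, 1, 0, 1]  # 0 = person, 1 = band (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)

sparsity = {}
for C in (0.1, 1.0, 10.0):
    # Smaller C -> stronger L1 penalty -> sparser weight vector.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, labels)
    sparsity[C] = int((clf.coef_ != 0).sum())
    print(f"C={C}: {sparsity[C]} nonzero feature weights")
```

Pairing each surviving feature name (from the vectorizer's vocabulary) with its weight in clf.coef_ is then a one-liner, and that's the list worth eyeballing.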
Why namespace the features?
In text-classification, the number of features can be quite large (in the thousands), making this process cumbersome. So I came up with a simple approach of organizing my features into a hierarchy, and generating summary statistics for the bag of features at various levels of the hierarchy.
For example, for Wikipedia classifications I have features such as abstract:bag-of-words:jude and infobox:bigram:Jude Law. The levels of the hierarchy are delimited by ':'. (abstract:* features are generated from a Wikipedia article's abstract, and infobox:* features from its infobox.)
This allows me to break down my top features (in clf.coef_) by levels of the hierarchy and compare, for example, abstract features against infobox features, or drill down within abstract and compare bag-of-words features to bigram features.
Organizing the features this way and comparing groups of them makes feature generation much more tractable.
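A toy sketch of the summary step; the feature names and weights here are invented, but in practice they'd come from the vectorizer's vocabulary and clf.coef_:

```python
# Toy sketch: summarize learned weights at each level of the ':' hierarchy.
# Feature names and weights are invented for illustration.
from collections import defaultdict

weights = {
    "abstract:bag-of-words:jude": 1.2,
    "abstract:bigram:jude law": 0.8,
    "infobox:bigram:Jude Law": 0.0,
    "abstract:bag-of-words:band": -0.9,
}

def summarize(weights, level):
    """Aggregate absolute weight mass at a given depth of the ':' hierarchy."""
    totals = defaultdict(float)
    for name, w in weights.items():
        prefix = ":".join(name.split(":")[:level])
        totals[prefix] += abs(w)
    return dict(totals)

print(summarize(weights, 1))  # compare abstract vs. infobox
print(summarize(weights, 2))  # drill into bag-of-words vs. bigram features
```

A level-1 summary like this is exactly how a dead namespace jumps out: if infobox:* carries zero weight mass, something upstream is wrong.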
FYI: in this task, I found bigram features to be useful. I also discovered a bug in my Wikipedia scraper that caused the infobox:* features to be effectively ignored by the classifier.
Thanks for reading. Please share your thoughts and your approaches and give me feedback about mine. You can either leave a comment below, ping me on twitter (@mkbubba) or email me at manishatwaviidotcom.
What do YOU think? Wavii’s new polls let you add your voice to the world’s events, and see how you stack up against your friends and everybody else.
Wavii is full of facts, but opinions make them more fun. Are people going to buy the next Apple iPad for $799? Should the U.S. Congress focus on gun control? Is Frank Ocean the one to bet on in a fight…or Justin Bieber? Do people care about that new movie? There's a different side to each story, and we want to hear it!
You can tell the world how you feel about Apple’s new tablet and immediately see what the rest of the world thinks. The social bar above the poll tells you how the people you follow voted, and if you want to defend your vote, add a comment in protest or ask your friends on Facebook or Twitter what they think!
If you can’t get enough of polls in your main feed, switch to the “Popular” or other feeds to find even more! More are on their way…let us know what you think.
Explore the year’s 20 most interesting characters, controversies, and events by viewing their Wavii timelines.
New to Wavii? We create the world’s timelines, so that you can keep up on what’s happening with your interests on Wavii, just like you keep up with your friends on Facebook. Get the Wavii app to explore if you’re on an iPhone!
Click the images to view their timeline:
Wavii Christmas edition is here! Ok, just kidding, but we do have a big announcement. After months of toiling away to create the perfect mobile experience, we’re ready to announce our new iPhone app!
tl;dr looks good, runs fast, is fun …Twitter and email registration now included. Check it out at wavii.com/getapp.
Facebook is a great place to see what’s going on in your friends’ lives and give your opinion. But, it’s a little tougher to find and discuss the latest news over there, and even the most popular stories like Facebook’s purchase of Instagram, the latest iPhone release, or whoever Justin Bieber is dating today, are hard to follow because they’re spread between a lot of conversations and groups of friends.
Outside of Facebook, these and the other 99% of stories on the web typically show up in the news dozens of times, with conversations about them scattered everywhere you look.
Twitter isn’t any better. You follow people to get interesting news, but find your feed filled with noise and duplicates of the same opinion and story repeated ad nauseam. Then, good luck having a conversation about something…it’s hard to track the thread and know where to reply. And even when you find a story you want, you have to read each article to get the details.
The new Wavii is built to address this.
First, we comb through all of this noise and boil it down to simple status updates about what’s happening, i.e., the world’s news events. This is great if you love going deep on the news because it de-duplicates stories and makes it easy to navigate to what you care about…Especially when you’re on the go and don’t have time to read everything, but just want the highlights. There are millions of people like this (we heard from a LOT of them) that just don’t have time to read lots of news articles, but will keep up on their world through quick updates, similar to what they already use to keep up with friends.
Now you have one place to share your opinions with friends about what happened, whether you want to sing kumbaya together or yell and scream.
So, see what your friends think. Of course, your opinion is what really matters…so share your thoughts too!
Most people also have a lot of friends over on Facebook or Twitter, where they like to share their favorite stories, so we make that easy too. You can loop in these groups with a single tap.
Also, as part of this release we added highly requested features like more ways to share and comment on items in your feed. And one of my personal favorites is that you can now also register using Twitter or email!
Last but not least, we wanted to make it seamless to discover new things and add them to your personalized feed. Check out the new feeds at the top of your screen, like Popular, Technology, Entertainment, and a few more. These make it easy to find new topics to follow on the fly, and personalize your experience.
So go grab it today and let us know what you think!!
We recently heard from our users that they were getting too many irrelevant stories about Twitter in their Wavii feed. When we looked at our dataset for Twitter, the issue was clear…while most people want news about Twitter the company (e.g., recent hires, new products, fundraising, etc.), Twitter the social media platform is frequently what our feed items refer to…things that happened ON Twitter. For example, Rihanna might respond to criticism on Twitter, and people following Twitter on Wavii would see that.
Which Twitter matters?
Both matter. While we’ll include mentions of “Twitter” the platform when it is contextually relevant to the story, it needs to be clearly distinguished from ‘regular’ news about Twitter as the primary actor in the story, which is the context most people on Wavii care about. This distinction lets us improve feed relevance, making users happy. The challenge is determining which Twitter is being referenced in each instance.
How we distinguish them
To solve this we looked at the language surrounding “Twitter” in order to find patterns indicative of one use or the other. By doing this we discovered a set of contexts that bind to Twitter when it is referred to as a social media platform, i.e., words and phrases likely to be used with or around Twitter when something happened on the platform. Here are some of the phrases that we found for this context:
Once we know which of these phrases we care about, we can isolate the stories associated with them. For example, in most cases we'll show news about the company, but on occasion we'll use stories about things happening on Twitter to create visualizations about the network.
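As a rough illustration of the idea (this is not our actual system, and the phrase patterns below are invented examples), even a handful of context phrases goes a long way toward separating the two uses:

```python
# Toy sketch: classify a mention of "Twitter" as platform vs. company
# by looking for context phrases around the mention. Patterns invented.
import re

PLATFORM_CONTEXTS = [
    r"(said|wrote|posted|announced) on twitter",
    r"tweeted",
    r"responded .* on twitter",
    r"took to twitter",
]

def is_platform_mention(sentence):
    """True if the sentence looks like something that happened ON Twitter."""
    s = sentence.lower()
    return any(re.search(p, s) for p in PLATFORM_CONTEXTS)

print(is_platform_mention("Rihanna responded to criticism on Twitter"))  # True
print(is_platform_mention("Twitter acquired a small startup today"))     # False
```

The real signal comes from learning these contexts from data rather than hand-listing them, but the binding-phrase intuition is the same.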
As a result, here’s the feed before and after we’ve created this distinction.
Before…Twitter shown in all contexts
After…News about Twitter shown
This solution isn’t only relevant for Twitter, but also Facebook, YouTube, and other topics that are both companies and platforms. With this approach, we can make our feeds even more clear and relevant.
What does this mean for our users?
Ultimately, we want to tailor our technology to allow each person to customize their exposure to Twitter, or any other entity. All types of entities have quirks in the way they are represented online. This is just one example of the many ways we are fine-tuning the power behind Wavii to provide the densest, most relevant headlines.
Later we’ll talk about some interesting conclusions we’re able to draw about Twitter, now that we can better understand the context in which it appears in the news.
We are often asked how we evaluate improvements to Wavii’s rapidly evolving natural language processing (NLP) pipeline. We spend a lot of time planning and considering these evaluations, so here are some insights into how Wavii tackles this problem.
Wavii’s NLP pipeline incorporates many distinct components, each relying on the output of one or more previous components. For example, performing event extraction requires articles marked-up with named-entities, which in turn depends on phrase chunking and part-of-speech (POS) tagging. To manage this we often take a divide-and-conquer approach, whereby we improve upon individual components in isolation.
For each of these NLP components, we create a gold-standard corpus of articles that are marked-up with the component’s ideal output. For example, in our named-entity recognition (NER) articles each topic mention is hand-labelled, and is then used to train our NER algorithms to distinguish between entities and non-entities, and an entity’s type (e.g., politician, band).
Once trained, a component’s performance is evaluated on a separate test corpus by comparing its generated labels to the corpus’ gold labels. This evaluation tells us which gold labels we predict correctly, predict incorrectly, or miss completely, allowing us to quickly see any localised performance changes while we are working on a component.
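As a toy illustration (not our actual evaluation code, and the entity labels are invented), precision and recall fall straight out of comparing the predicted set against the gold set:

```python
# Toy sketch: score predicted entity labels against a gold-standard set.
# Entities and types are invented for illustration.
gold = {("Bush", "band"), ("George W. Bush", "politician"), ("Radiohead", "band")}
predicted = {("Bush", "band"), ("Radiohead", "band"), ("Wavii", "company")}

true_pos = gold & predicted            # correct predictions
precision = len(true_pos) / len(predicted)  # how much of what we said is right
recall = len(true_pos) / len(gold)          # how much of the truth we found

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Inspecting the set differences (gold - predicted, predicted - gold) is just as valuable as the scores themselves, since those are the concrete misses and false alarms to debug.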
Unfortunately, the performance of our models in isolation is not always the performance we observe under real-world circumstances. Our training and testing corpora cannot cover all possible input cases, and thus unexpected behavior and errors can arise.
Furthermore, an error in one component may propagate to other, dependent components. For example, in our pipeline recall and precision errors have multiplicative effects. If we change our NER system and the result is lower recall, we may miss important entity mentions in articles, reducing the number of events being extracted from them, and ultimately reducing the number of feed items for a user. Therefore, we must have a mechanism to evaluate any component that we improve, along with the other components that it affects.
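With made-up recall figures, the compounding effect is easy to see: end-to-end recall is roughly the product of each stage's recall, so a seemingly small regression in one component shaves a larger slice off the final feed.

```python
# Toy numbers: end-to-end recall is roughly the product of stage recalls.
ner_recall, event_recall = 0.90, 0.70
print(f"end-to-end recall ~ {ner_recall * event_recall:.2f}")

# A "small" NER regression compounds downstream:
ner_recall = 0.80
print(f"end-to-end recall ~ {ner_recall * event_recall:.2f}")
```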
In an ideal world, this would be an extensive end-to-end gold corpus, where each article is hand-labelled with each component’s expected output throughout the pipeline. We could then reliably compare any new component’s local and global effects within the pipeline. However, building an end-to-end gold corpus is impractical to maintain at Wavii, as we are constantly adding new topics and events, and improving and adding components.
So how do we go about evaluating an updated component in the pipeline?
One-boxing is a common technique for evaluating a component within a pipeline by deploying the latest software on one of the component’s machines. For instance, in the following diagram there is a set of NER machines, each running version 0, except for the one box that is now running version 1. Each NER box receives different articles (a, b, c) from the NER Queue and, once they are processed, places articles with interesting entities in the Event Queue for downstream event extraction.
In this form of one-boxing, an error introduced into article ‘a’ by version 1 (shown in red), simply continues to pass through the pipeline. Significant errors like a decreased throughput or a downstream component failing are fairly easy to spot, but many subtle and less frequent errors may go undetected and affect our feed. As this approach makes it very difficult to diagnose global errors and their side-effects, we worked on developing an approach to highlight those, whilst maintaining our stable pipeline.
Our main evaluation approach involves a very large database, a replay mechanism, and the version-control system Git.
Every article processed in our pipeline is tagged with each component’s output and version number, and saved in our article-store database. This database supports fast article comparisons and updates using Git, and is used in our new pipeline-testing framework, shown below.
In this approach, when we one-box a component we only send it articles from the article-store that have been recently processed by the main pipeline. Those articles are first re-processed by the updated component, and then replayed through the remaining pipeline stages. But instead of influencing the feed again, the replayed articles are subjected to a strict evaluation. In the evaluation stage, rather than using expensive gold data, we exploit Git’s diff command to quickly identify any differences between an article’s new and original output, across all of the pipeline’s components.
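As a simplified sketch of the diff step, here is the same idea using Python's difflib in place of Git; the component outputs and field names are invented for illustration:

```python
# Toy sketch: diff a component's original vs. replayed output for an article.
# Field names and values are invented; the real system diffs via Git.
import difflib
import json

original = {"ner": ["Rihanna", "Twitter"], "events": ["responded-to-criticism"]}
replayed = {"ner": ["Rihanna"], "events": []}  # new NER version dropped an entity

def diff_outputs(old, new):
    """Return unified-diff lines between two serialized component outputs."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))

for line in diff_outputs(original, replayed):
    print(line)
```

The diff immediately surfaces both the local change (the missing entity) and its downstream side-effect (the lost event), which is exactly what plain one-boxing hides.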
Our new evaluation framework, along with the traditional corpus evaluation method, allows us to quickly iterate on and evaluate our NLP tools, without breaking our current pipeline. We can now answer more probing questions, like will a change in our anaphora resolution component really increase the number of events extracted, and if not, why? All this makes it possible for us to regularly update our pipeline and in turn improve the experience for everyone using our site.
At Wavii our goal is to tell stories with data. A few months ago we created a simple way to visualize the Olympics, and get the real-time results NBC wasn’t providing. With election season upon us, we switched gears and created a straightforward site for all of the latest updates on the races you care about.
It’s easy to be overwhelmed by the sheer volume of stories focused on the top politicians, and difficult to boil it all down and get the facts. The focus on Washington, D.C. also makes it harder to track your local races. To help solve this, Wavii organizes what’s happening by the people involved, their affiliation (e.g., Democrat, from Ohio), and what they did (e.g., released a new attack ad, criticized someone, raised funding). We use the structured data we generate about these stories to visualize them, so you can explore the topics you care about.
Give it a try at https://wavii.com/politics
Our last few posts have focused on the nuts and bolts of our backend pipeline: how do we recognize and deduplicate named entities in articles, and how do we figure out and extract what those entities or topics did? Answers to these questions give a sense for how we put together what we call our global feed, a real time stream of everything happening in the world…but, Wavii is also about personalization. To make your feed, we need to filter everything down to the specific topics and events you care about.
Building your feed
How do we know what is going to be relevant to you? The simplest possible solution is to ask you to follow a few topics, and then show you every story that includes at least one of those topics. This is similar to how Twitter will show you every tweet by someone you follow.
This is a great starting point. It’s a clear and straightforward user experience, and we can be confident about your interests without the need for further analysis and approximations. But it’s also rigid and limited, relying solely on your reported interests. It puts the burden on you to know exactly what you want.
Let’s visualize this. Suppose that we have a many-dimensional topic space, where every topic has unique coordinates, and the more similar or related two topics are, the closer they are to each other. We can represent people and news stories as points or collections of points in this space. For a user without a lot of follows, it will probably look something like this:
This approach is equivalent to representing you as every topic you follow, and representing each story as the topics involved in it. Then we find the intersection between the two (the orange points in the figure above) to generate a personalized feed that you’ll like.
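In code, the follow-intersection feed is about as simple as it sounds. Here's a toy version with invented topics and stories:

```python
# Toy sketch: show a story if it shares at least one topic with the
# user's follows. Topics and stories are invented for illustration.
follows = {"Radiohead", "Apple", "Barack Obama"}

stories = [
    {"title": "Radiohead announces tour", "topics": {"Radiohead"}},
    {"title": "Thom Yorke collaborates", "topics": {"Thom Yorke"}},
    {"title": "Apple releases new iPad", "topics": {"Apple", "iPad"}},
]

# Set intersection between a story's topics and the user's follows.
feed = [s["title"] for s in stories if s["topics"] & follows]
print(feed)  # the Thom Yorke story is silently dropped
```

Note what's lost: the Thom Yorke story never makes the feed, even though a Radiohead fan would almost certainly want it. That gap is exactly the problem the rest of this post tackles.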
Pretty simple, right? On the surface it is, but there are problems that arise with this solution.
Two ends of the spectrum
This approach really only works if a user follows just the right number of topics. Follow too few and your feed will be sparse…follow too many and it can feel flooded, especially if some topics tend to produce far more feed items than others.
In the example above, the user follows only a few topics so they might have a pretty sparse, stale feed, and miss out on a lot of feed items they’d probably like to see. Given that they’re following Radiohead, they’d probably like to know about Thom Yorke’s collaboration as well. And a lot of users would want to know about Neil Armstrong’s death, even though they’d probably never think to follow him.
Then we have another set of users that are following a ton of topics, and for them the situation looks more like this…
In this case, the problem is possible information overload — which kind of defeats the point of Wavii. So how do we decide which events to show you? Just pick a few stories randomly? Pick the most recent ones at your time of login?
As Wavii continues to improve and generate even more updates for your feed, this challenge only increases. So we’re working on a solution that’s dynamic, adaptive, and able to accommodate both ends of the spectrum.
Representing a user
First, when determining whether you’ll want to see a topic in your feed, we can look for signals beyond whether or not you explicitly follow it. For example, we use things like:
Looking at these signals allows us to compute your expected interest in a topic, so we can figure out things you’ll like that you’re not even following!
Second, we can look for other attributes of stories you care about, besides the topics involved. For example, you might be very interested when Kim Kardashian starts dating someone, but less interested when she is merely spotted somewhere. Or maybe you particularly like events based on stories from TechCrunch or Business Insider, but not other sources.
Finally, we use your interest level in one topic to guess it for related topics. For example, if you are interested in the Democratic Party, you’re probably also interested in the Democratic National Convention.
Coming back to our topic space, we can use these approaches to create interest clouds around each topic that are proportionate in size to your interest level in each one. Larger interest clouds extend into nearby topics, indicating possible interest in those topics. Given the same follows from the example above, a user’s full interest profile might look something like this:
Representing the global feed
Similarly, when we generate a new event we try to determine how globally popular and important it is or will become. This gives us additional clues about how much each user might like to see it in their feed. To determine this we consider things like:
Again, we can model each of the stories as a point with a specific magnitude of relevancy, represented here with the most important or interesting stories being the darkest shades of red:
Neil Armstrong’s death and Obama’s first presidential debate are the biggest stories, while a musical tour and an app update are much less globally relevant.
Putting it all together
Now the problem of feed generation becomes clear: we simply overlay the user representation with the global feed representation and find the points with the greatest heat. The simplest way to visualize this is to sum the two heat maps, but we can use any function of the two, depending on how we want user-specific relevance to interact with global relevance. Putting the two plots above together, we get something like this:
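Here's a toy version of that combination step. The heat scores are invented for illustration, and simple addition stands in for whatever combining function is actually used:

```python
# Toy sketch: combine user-specific and global heat into a single ranking.
# All scores are invented; any monotone combining function would work.
user_heat = {
    "Hot Chip tour": 1.0, "Obama debates": 0.8,
    "Thom Yorke collaborates": 0.4, "Wavii updates app": 0.4,
    "Neil Armstrong dies": 0.0, "Jen Aniston engaged": 0.0,
}
global_heat = {
    "Hot Chip tour": 0.3, "Obama debates": 0.4,
    "Thom Yorke collaborates": 0.1, "Wavii updates app": 0.1,
    "Neil Armstrong dies": 0.4, "Jen Aniston engaged": 0.1,
}

combined = {story: user_heat[story] + global_heat[story] for story in user_heat}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

With these made-up numbers, purely global stories like Neil Armstrong's death still earn a slot in the feed, just ranked below stories the user has personal heat for, while "Jen Aniston engaged" stays a dim speck at the bottom.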
Comparing this to our original plot of the user’s feed, what we now have is not merely a set of stories, but a clear ranking of stories. Top-ranked are stories that have high heat for the particular user, as well as some heat in the global feed: “Hot Chip is on tour”, followed by “Obama debates”. These are both about topics that are followed and for which the user has demonstrated a high degree of interest. Then come stories that have a little bit of heat in both: “Thom Yorke collaborates” and “Wavii updates app”. Neither Wavii nor Thom Yorke is actually followed, but for Wavii we’ve gathered other interest signals, while Thom Yorke has gained heat by sitting in the cloud of nearby Radiohead. Lastly are stories with only heat in the global feed, typically major breaking news or something that has gone viral. Sometimes these stories, as in the case of “Neil Armstrong dies” can garner enough heat from the global feed alone to make it into a user’s feed, but most remain just a dim speck like “Jen Aniston engaged”.
Having a ranking, rather than an unordered set, provides us a lot more flexibility in crafting a user’s feed, no matter how many follows they have. In practice, we have to consider thousands of events for a user’s feed, not just the five or so from this example. Being able to quantify both their user-specific relevance and global relevance allows us to make decisions about the best events to show each user.
Most of us don’t care about things that occur all of the time. We don’t need an update that “Microsoft was developing software today,” that “Jane listened to a song on Spotify,” or that “Anand harvested crops on his virtual farm”. That isn’t news. But, we are concerned with things that are less expected, matter on a larger scale, and are aligned with our interests, so these are the types of stories that our team is training Wavii’s tech to understand and produce.
Knowing which story concepts to train into Wavii involves two steps:
1. The data tells us what matters
As Wavii scours the web for stories, it creates summaries of what’s happening to the people, companies, products, and other topics that our system recognizes. Initially, each summary is unstructured, meaning that Wavii believes it’s important and knows which topics are involved, but doesn’t understand exactly what happened. The result is that when something unprecedented occurs (like, say, a landing on Mars), Wavii still gets the story and creates a feed item for it, so you don’t miss out.
Here’s an example of a couple of summaries we produced about interviews given to the press.
As Wavii’s tech produces more and more summaries that involve the same things (e.g., people giving interviews), we cluster them into semantically related groups. This signals which concepts are the most important and popular so that we can prioritize training them into our system. Then, our tech will produce structured feed items for these types going forward.
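As a stand-in for the real clustering (which groups summaries on richer semantic features, not literal keywords), here's a toy grouping that shows how cluster sizes set training priority. The summaries and keyword map are invented:

```python
# Toy sketch: group unstructured summaries by event type and rank the
# groups by size to prioritize training. Data and keywords are invented.
from collections import defaultdict

summaries = [
    "Tim Cook gave an interview to Bloomberg",
    "Rihanna gave an interview about her new album",
    "A senator gave an interview on gun control",
    "Apple acquired a mapping startup",
    "Google acquired a hardware startup",
]

EVENT_KEYWORDS = {"interview": "gives-interview", "acquired": "acquisition"}

clusters = defaultdict(list)
for s in summaries:
    for keyword, label in EVENT_KEYWORDS.items():
        if keyword in s.lower():
            clusters[label].append(s)

# Bigger clusters signal which event types to train into the system first.
priority = sorted(clusters, key=lambda k: len(clusters[k]), reverse=True)
print(priority)
```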
For example, in this case, a more structured feed item for Interviews, showing the topics involved and some other interviews from the same source.
2. Mapping out the rest of the story
This approach points us to the most important story concepts, but it’s not comprehensive enough on its own because it doesn’t take into account related stories that are less popular, but important for capturing the full narrative…the experience that we want people to have on Wavii.
For example, the system might see ‘dating’, ‘break up’ and ‘marriage’ over and over, and tell us to train those concepts. That’s great, because it helps us prioritize relationships as something people care about (and write about), but to really show the bigger picture on our site we also need to identify ‘caught cheating’, ‘engaged’, and ‘divorced’ as well. So, our tech points us to a concept, and then we design a narrative around it.
This fills in the gaps, ensuring that when an interesting storyline starts to unfold and enter the Wavii feed, we pick it up at each stage of its development.
How we train our tech to discern these details is a story for another day…
-Jenna, Sarah, and Marcus