Bush is back!….

Bet you’re wondering which one? Bush 41, Bush 43, the band, the brand, the city…eugh, turns out there’s a million of ’em.

How does Wavii know which one?

Wavii allows you to customize your feed by following your topics of interest…from Politicians to Actors and Entertainers, from Products and Companies to Places and Movies. Each feed item tells you what’s happening with these topics, but in order to figure that out, Wavii first needs to identify topics uniquely when they appear in the news, tweets, and everywhere else on the internet. In Natural Language Processing (NLP) parlance, these Wavii topics are known as “named entities”.

Let’s say Wavii sees some news from MTV.com. Our system has to figure out which artists, companies, or other people it’s about…and from there, try to figure out what’s happening with them. In this case, we found that “Frank Ocean”, “Lollapalooza” and “Chicago” were all mentioned. This feeds into another part of our system that figures out what is happening with those three topics, namely that “Frank Ocean performs at Lollapalooza in Chicago”.

We can’t really tell you what happened in the world unless we can also tell you to whom or what it happened…so it’s really important we get it right. Imagine if our system reported that “George Bush is on a fundraising tour” when really Bush (the band) is going on a musical tour. Uhm…slight difference.

To take care of this, we have a system called “NER” (Named Entity Recognition) whose job it is to figure out which specific topics are mentioned in an article.

So how does this thing work?

Imagine you’re reading this article…and you want to figure out all the topics in it. One obvious approach is to find all the proper nouns and see if we have them in our database: “Frank Ocean”, “Ocean”, “Chicago”, “Lollapalooza”, “Lolla”.

Uh oh…looks like we have a few problems. First off, Chicago is in our database a whole bunch of times…so which one is being mentioned here? Chicago the city, Chicago the band, Chicago the musical, the Chicago Bears, etc. Even worse, “Lolla” is the name of some random town in Sri Lanka and not “Lollapalooza” (who even calls it that?). And don’t even get me started on “Ocean.”

So we’ve got this problem of ambiguity – some of these words in the article match a bunch of different topics in our database. And some of these words match none…as in the case of “Lolla.”

But now imagine if you knew ahead of time the type of the topic you’re trying to look up. For example: only give me cities with the name Chicago. Now I can go ahead and ignore all the musicals, sports teams, bands, etc….and my job is so much easier.
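
To make that concrete, here’s a tiny sketch in Python of a type-filtered lookup. The entity records and type labels are made up for illustration; this isn’t our actual schema or code.

    # A toy entity "database" -- names and types are hypothetical, not Wavii's real schema.
    ENTITIES = [
        {"id": 1, "name": "Chicago", "type": "CITY"},
        {"id": 2, "name": "Chicago", "type": "BAND"},
        {"id": 3, "name": "Chicago", "type": "MUSICAL"},
        {"id": 4, "name": "Chicago Bears", "type": "SPORTS_TEAM"},
    ]

    def candidates(mention, entity_type=None):
        """Return entities whose name matches the mention, optionally filtered by type."""
        matches = [e for e in ENTITIES if e["name"] == mention]
        if entity_type is not None:
            matches = [e for e in matches if e["type"] == entity_type]
        return matches

    print(len(candidates("Chicago")))        # 3 candidates -- ambiguous
    print(candidates("Chicago", "CITY"))     # just the city -- much easier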

Figuring out the type?

But how did we get that type in the first place?

As it turns out, sentences make it pretty obvious what’s coming next. “I was born in ____” is almost always followed by a location, right?

In this article, we see:

  • “Frank Ocean said that”
  • “with Ocean’s voice providing”
  • “Ocean wasn’t afraid”

In that first one, you have to think…what do people usually write as “___ said that”? Most of the time it’s people, and occasionally companies. So Frank Ocean is most likely a person, with a small chance he’s a company. In the second and third sentences, though, it becomes pretty clear that it’s a person. I mean, when was the last time a company had a voice or wasn’t afraid?

We’re lucky enough to have other info as well…for example, we know that Frank is a really common first name, so we might as well use that.

Now we need to translate this sort of thinking into an algorithm the computer can understand. The obvious approach is to hand-craft the rules for each type…a person, a place, a company, etc. But you’ll soon find yourself mired in the sheer volume and complexity of the rules. Instead of a human writing the rules, we train a supervised machine learning algorithm to learn the rules from examples we give it.

Now, since we’re using a computer to figure these things out…why not give it tons of other info and let it decide what’s useful? Maybe capitalization and part of speech are important. Maybe there are some common words that are really important…for example, “Inc.” almost always means a company. Or how about punctuation? The list goes on and on. Instead of thinking about each one, all we do is hand-label example sentences and train an algorithm, and then let it figure out which things matter. We use state-of-the-art conditional random fields to do this.
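
If you want a feel for what that looks like in practice, here’s a minimal sketch using the open-source sklearn-crfsuite library. The features, labels, and training sentence below are illustrative examples, not our production feature set or data.

    # Minimal CRF-based NER sketch with sklearn-crfsuite (pip install sklearn-crfsuite).
    import sklearn_crfsuite

    def word_features(sent, i):
        """Features for the i-th token: the kinds of signals discussed above."""
        word = sent[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),   # capitalization
            "word.isupper": word.isupper(),
            "suffix3": word[-3:],             # catches endings like "Inc."
            "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
            "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
        }

    # Hand-labelled example sentences: tokens plus BIO-style type tags.
    train_sents = [
        (["Frank", "Ocean", "performs", "at", "Lollapalooza", "in", "Chicago", "."],
         ["B-PER", "I-PER", "O", "O", "B-EVENT", "O", "B-LOC", "O"]),
    ]

    X_train = [[word_features(toks, i) for i in range(len(toks))] for toks, _ in train_sents]
    y_train = [tags for _, tags in train_sents]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)

    test = ["Ocean", "said", "that", "he", "was", "tired", "."]
    print(crf.predict([[word_features(test, i) for i in range(len(test))]]))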

And now we’re done, right?

Not quite. Once we have the type, we’ve narrowed the list…but there’s still a little ways to go. What happens if we see Chris Brown…is he the singer or the hockey player?

Guess who?

Think of how you would solve this problem. If you’re reading an article in Vibe Magazine that mentions “Drake” and talks about rap, it’s probably Chris Brown the singer. But if it’s in Inside Hockey Magazine, it’s probably the hockey player.

Our system uses similar techniques to figure out which one is best.
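
As a rough illustration of the idea (not our actual algorithm), you can score each candidate by how much the article’s surrounding words overlap with words you associate with that candidate. The candidate profiles below are invented for the example.

    # Toy disambiguation sketch: pick the candidate whose profile best matches the context.
    CANDIDATES = {
        "Chris Brown (singer)": {"rap", "r&b", "album", "singer", "drake", "tour"},
        "Chris Brown (hockey player)": {"hockey", "nhl", "goal", "ice", "defenseman"},
    }

    def disambiguate(context_words):
        context = {w.lower() for w in context_words}
        # Choose the candidate sharing the most words with the article's context.
        return max(CANDIDATES, key=lambda name: len(CANDIDATES[name] & context))

    article = ["Drake", "joined", "him", "on", "stage", "for", "the", "rap", "verse"]
    print(disambiguate(article))   # -> "Chris Brown (singer)"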

But what if we’re just not convinced that ANY of the Chris Browns in our system is the right one….well, that happens too! We call it a miss…and track it. If we keep seeing the same Chris Brown appear on the web a lot, our Entity Learning system will deal with adding it to the database, so we stay current.

So that’s all, right?

Nope, just a little more to go!

He sucks…

The MTV article has a whole bunch of he, she, it, etc. in it…  Some of the sentences are very informative:

  • “…he captivated the audience…”
  • “He introduced Bad Religion…”

The last step in our process is to figure out which topic these pronouns (or in fancy linguistic speak – anaphors) are talking about.

Sometimes it’s in the same sentence…”Ocean said that he”… Other times, it could be in the prior sentence…”Ocean was tired. He said that…” or even further removed. You also get some clues by looking at the gender. “He” probably refers to a man and not a company, right?

Sometimes, though, this can get pretty complicated: “Ocean said that Flea and he had never played together.” Huh?

“He” in this sentence refers to Ocean and not Flea. To handle this issue, we look at the syntactic structure of the sentence. But that’s for another day!
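
To show just the recency-and-gender part of that reasoning (the syntactic part really is a topic for another day), here’s a toy sketch. The gender lookup table is invented for the example.

    # Toy anaphora sketch: resolve "he"/"she" to the most recent prior mention
    # with a compatible gender. Real resolution also uses syntactic structure.
    MENTION_GENDER = {"Frank Ocean": "male", "Bad Religion": None, "Lollapalooza": None}

    def resolve_pronoun(pronoun, prior_mentions):
        wanted = "male" if pronoun.lower() == "he" else "female"
        for mention in reversed(prior_mentions):   # most recent mention first
            if MENTION_GENDER.get(mention) == wanted:
                return mention
        return None

    mentions_so_far = ["Lollapalooza", "Frank Ocean", "Bad Religion"]
    print(resolve_pronoun("he", mentions_so_far))  # -> "Frank Ocean"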

Done! Fini! No more!

Hopefully this gives you an idea of the sorts of problems we’re trying to solve over at Wavii. They’re tons of fun, but a lot of hard work… Hope you’ve enjoyed the read!

– Manish (no, not the cricketer, the other one)

What’s in a name? that which we call a rose
By any other name would smell as sweet

