Thursday, May 31, 2007

Blogging Peter Norvig's Talk

I've never done a live blog report before, so I thought I'd try my hand at it. However, I'm not going to do one of those transcript blogs where I try to write down everything that's said; I'll just capture what I think is interesting.

I'd first like to point out that the Google conference music is far cooler than the music at the salesforce.com conference.

Poor Peter ended up violating one of Guy Kawasaki's first rules of speaking - try to speak in a tiny room so that the crowd feels more intense. People have started trickling home already, so the crowd is lighter than one would expect.

"Why do you want to go to Google?"
"That's where the data is"

The rise of probability models
Percentage of ACL papers with statistical/probabilistic concepts in the title:
1979: 0%
1989: 6%
2006: 55%

The size of the training corpus is far more important than the algorithm applied. Duh.

The LDC corpus contains about 100 gigabytes of speech -- but the internet contains about 100 trillion words (10^14). Google's N-gram corpus, distributed through the LDC, contains 1,024,908,267,229 tokens and 95,119,665,584 sentences. It's for sale in a lovely 7 CD set. As I've always said, if anyone is going to make machine learning work for extraction, it'll be Google.
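To make the scale concrete, here's a minimal sketch of the kind of counting that produces an N-gram corpus, assuming a plain-text file called corpus.txt (my invention; the real pipeline is a massive distributed job, not one script):

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count every contiguous n-token sequence in a token stream.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # "corpus.txt" is a hypothetical stand-in for a real corpus.
    with open("corpus.txt") as f:
        tokens = f.read().split()

    for gram, count in ngram_counts(tokens, 3).most_common(5):
        print(" ".join(gram), count)

The nice property is that the counting is embarrassingly parallel, which is presumably why it's feasible at web scale at all.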

I wonder if one could identify potential terrorists using Google Sets... Hmm. A set that starts out with only Osama bin Laden as the root identifies "US Presidential Candidates", "Bill Clinton", "George W. Bush", "Taliban Islamic Movement", "Ayman Al Zawahiri", "Terrorism", "John Gotti", "War", and... "Bob Hope". (When given two names, say, Osama bin Laden and Ayman Al Zawahiri, it did a much better job.)

Interesting comparison of the most commonly occurring unique terms in particular categories (such as "drugs") vs. the most common queries in that category. There is a huge mismatch.
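Here's roughly the comparison as I understood it, as a sketch; both filenames are made-up stand-ins for a pile of category pages and a query log:

    from collections import Counter

    def top_terms(path, k=10):
        # Crude whitespace tokenization; a real system normalizes far more carefully.
        with open(path) as f:
            return Counter(f.read().lower().split()).most_common(k)

    print("page terms: ", top_terms("drugs_pages.txt"))
    print("query terms:", top_terms("drugs_queries.txt"))

The mismatch shows up when the two top-k lists barely overlap.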

He covered many things he has spoken about before -- statistical machine translation, human NLP techniques vs. machine learning, etc. On machine translation, Chinese is one of the hardest languages, which is borne out by the ACE test results I've seen. He gave an example showing how incredibly hard translating Chinese is, especially because you need to take into account complex word sequences.

He discussed all kinds of technical nuances and tricks in terms of bit representations, lexical co-occurrence representations, etc. They did a series of experiments that showed that truncating words at 7-8 characters is almost as good as true stemming. Truncating at 4 characters is actually better than true stemming at capturing meaning. (?!?) That's going to bother me for a while.
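For what it's worth, truncation really is just a drastic stemmer, which makes the result slightly less mysterious. A tiny sketch with an invented word list:

    def truncate_stem(token, k):
        # Cheap "stemmer": keep only the first k characters.
        return token[:k]

    words = ["translation", "translating", "translated", "translator"]
    for k in (4, 7):
        print(k, sorted({truncate_stem(w, k) for w in words}))
    # Both k=4 ("tran") and k=7 ("transla") collapse the whole family to one key.

Still, beating real stemming at 4 characters is the surprising part.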

He also suggested that better models would be based on both the writer and the searcher, and the interaction between the two. (Of course, Google and the other search engines have the advantage of seeing both.) But he didn't go into this in any more depth than that.

Lots of interesting questions. One of the most interesting for me personally was around whether Google was investigating predictive analytics in terms of, say, "reading" financial information and being able to predict future stock performance. The answer Peter gave was "no", which either means he's being secretive or that Google has missed a really interesting and cool application of technology. This seems to point up that in "organizing the world's information" Google is still not really organizing all of the world's information.

They were also asked about their current focus on organizing and analyzing textual information. Peter indicated that one of their future forays will be image analysis, both photo and video, now that they have a huge library of each to work from in doing machine-generated analysis.

Someone asked if there were plans to produce an open Google API that webmasters could use to automatically stop comment spam. Peter said no, but said he found the idea highly intriguing. So do I; it sounds like an interesting viral marketing method: "Comments protected by an anti-spam filter powered by XYZ".

Another question was around Google's efforts to measure true user satisfaction - i.e., while they can see whether result #3 got more clickthroughs than result #1, they have no real way of knowing whether, when I click on result #1, I actually liked it once I got there. He said they tried a Google toolbar feature that let you rate results, but people in general only give ratings when they don't like a result, so that wasn't very useful.
