lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bob Carpenter <c...@alias-i.com>
Subject Re: Lucene for Sentiment Analysis
Date Fri, 07 Mar 2008 23:26:33 GMT
Aaron Schon wrote:
> ...I was wondering if taking a bag of words approach might work. For example chunking
the sentences to be analyzed and running a Lucene query against an index storing sentiment
polarity. Has anyone had success with this approach? I do not need a super accurate system,
something that is "reasonably" accurate.

Even the best sentiment analyzers aren't that good.

And they need to be trained per domain (e.g. "easy to
use" is good for electronics and "leaky" is bad, but
your mileage varies in other domains, where "fuel efficient"
or "entertaining" might be good).

You'll take a hit in performance using a bag of
words (or stemmed, stoplisted, case-normalized terms)
because you lose subword generalizations if the stemmer's
not great or if word segmentation varies, and you'll lose
cross-word discriminitive power going a word at a time.

Using TF/IDF to weight the terms can help.

> Also, could you suggest good publicly available training datasets? I am aware of the
Cornell Movie Reviews dataset[1]

The Pang and Lee data from Cornell was collected automatically
from Rotten Tomatoes and IMDB.  Gathering more data like that
from Amazon, C-net, etc. should be easy.  That's what everyone's
doing for evaluations.

But these are all at the review level, not at the sentence
level.  We've actually had customers annotating at the sentence
level, which can produce much tighter training sets.

- Bob Carpenter
   Alias-i

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message