Date: Fri, 07 Mar 2008 18:26:33 -0500
From: Bob Carpenter
To: java-user@lucene.apache.org
Subject: Re: Lucene for Sentiment Analysis

Aaron Schon wrote:
> ...I was wondering if taking a bag of
> words approach might work. For example, chunking the sentences to be
> analyzed and running a Lucene query against an index storing sentiment
> polarity. Has anyone had success with this approach? I do not need a
> super accurate system, just something that is "reasonably" accurate.

Even the best sentiment analyzers aren't that good. And they need to be
trained per domain (e.g., "easy to use" is good for electronics and
"leaky" is bad, but your mileage varies in other domains, where "fuel
efficient" or "entertaining" might be good).

You'll take a hit in performance using a bag of words (or stemmed,
stoplisted, case-normalized terms), because you lose subword
generalizations if the stemmer's not great or if word segmentation
varies, and you lose cross-word discriminative power going a word at a
time. Using TF/IDF to weight the terms can help.

> Also, could you suggest good publicly available training datasets? I
> am aware of the Cornell Movie Reviews dataset[1]

The Pang and Lee data from Cornell was collected automatically from
Rotten Tomatoes and IMDB. Gathering more data like that from Amazon,
C-net, etc. should be easy. That's what everyone's doing for
evaluations. But these are all at the review level, not at the sentence
level. We've actually had customers annotating at the sentence level,
which can produce much tighter training sets.

- Bob Carpenter
  Alias-i
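The bag-of-words lookup Aaron describes, with the TF/IDF weighting Bob
suggests, can be illustrated with a small dependency-free sketch. This
is plain Java rather than actual Lucene scoring, and the lexicon,
polarity values, and corpus counts are invented for illustration only:

```java
import java.util.*;

// Toy sketch of the bag-of-words idea from the thread: score a
// sentence by looking up each case-normalized term in a sentiment
// lexicon and weighting matches by TF-IDF.  Not Lucene code; the
// lexicon and document-frequency numbers below are made up.
public class BagOfWordsSentiment {

    private final Map<String, Double> polarity;   // term -> score in [-1, +1]
    private final Map<String, Integer> docFreq;   // term -> document frequency
    private final int numDocs;                    // corpus size N

    public BagOfWordsSentiment(Map<String, Double> polarity,
                               Map<String, Integer> docFreq, int numDocs) {
        this.polarity = polarity;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
    }

    /** Plain idf = log(N / df); unseen terms default to df = 1. */
    private double idf(String term) {
        return Math.log((double) numDocs / docFreq.getOrDefault(term, 1));
    }

    /** Sum of polarity * tf * idf over the sentence's terms. */
    public double score(String sentence) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : sentence.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        double total = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Double p = polarity.get(e.getKey());
            if (p != null) total += p * e.getValue() * idf(e.getKey());
        }
        return total;
    }

    public static void main(String[] args) {
        // Domain-specific lexicon, per the thread's electronics example.
        Map<String, Double> polarity = new HashMap<>();
        polarity.put("easy", 0.8);
        polarity.put("leaky", -0.9);
        // Made-up corpus statistics for the idf weights.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("easy", 50);
        docFreq.put("leaky", 5);
        docFreq.put("the", 990);
        BagOfWordsSentiment s = new BagOfWordsSentiment(polarity, docFreq, 1000);
        System.out.println(s.score("The camera is easy to use") > 0); // true
        System.out.println(s.score("The tent was leaky") < 0);        // true
    }
}
```

The per-domain hit Bob describes shows up directly here: reusing the
electronics lexicon on another domain just scores the wrong terms.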