Date: Fri, 07 Mar 2008 18:26:33 -0500
From: Bob Carpenter
To: java-user@lucene.apache.org
Subject: Re: Lucene for Sentiment Analysis

Aaron Schon wrote:
> ...I was wondering if taking a bag of
> words approach might work. For example, chunking the sentences to be
> analyzed and running a Lucene query against an index storing sentiment
> polarity. Has anyone had success with this approach? I do not need a
> super accurate system, just something that is "reasonably" accurate.

Even the best sentiment analyzers aren't that good. And they need to be
trained per domain (e.g., "easy to use" is good for electronics and
"leaky" is bad, but your mileage varies in other domains, where "fuel
efficient" or "entertaining" might be good).

You'll take a hit in performance using a bag of words (or stemmed,
stoplisted, case-normalized terms), because you lose subword
generalizations if the stemmer's not great or if word segmentation
varies, and you lose cross-word discriminative power going a word at a
time. Using TF/IDF to weight the terms can help.

> Also, could you suggest good publicly available training datasets? I
> am aware of the Cornell Movie Reviews dataset[1]

The Pang and Lee data from Cornell was collected automatically from
Rotten Tomatoes and IMDB. Gathering more data like that from Amazon,
C-net, etc. should be easy. That's what everyone's doing for
evaluations. But these are all at the review level, not at the sentence
level. We've actually had customers annotating at the sentence level,
which can produce much tighter training sets.

- Bob Carpenter
  Alias-i
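The bag-of-words lookup Aaron describes, with the TF/IDF weighting Bob
suggests, can be illustrated with a small dependency-free sketch. This
is plain Java rather than actual Lucene scoring, and the lexicon,
polarity values, and corpus counts are invented for illustration only:

```java
import java.util.*;

// Toy sketch of the bag-of-words idea from the thread: score a
// sentence by looking up each case-normalized term in a sentiment
// lexicon and weighting matches by TF-IDF.  Not Lucene code; the
// lexicon and document-frequency numbers below are made up.
public class BagOfWordsSentiment {

    private final Map<String, Double> polarity;   // term -> score in [-1, +1]
    private final Map<String, Integer> docFreq;   // term -> document frequency
    private final int numDocs;                    // corpus size N

    public BagOfWordsSentiment(Map<String, Double> polarity,
                               Map<String, Integer> docFreq, int numDocs) {
        this.polarity = polarity;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
    }

    /** Plain idf = log(N / df); unseen terms default to df = 1. */
    private double idf(String term) {
        return Math.log((double) numDocs / docFreq.getOrDefault(term, 1));
    }

    /** Sum of polarity * tf * idf over the sentence's terms. */
    public double score(String sentence) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : sentence.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        double total = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Double p = polarity.get(e.getKey());
            if (p != null) total += p * e.getValue() * idf(e.getKey());
        }
        return total;
    }

    public static void main(String[] args) {
        // Domain-specific lexicon, per the thread's electronics example.
        Map<String, Double> polarity = new HashMap<>();
        polarity.put("easy", 0.8);
        polarity.put("leaky", -0.9);
        // Made-up corpus statistics for the idf weights.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("easy", 50);
        docFreq.put("leaky", 5);
        docFreq.put("the", 990);
        BagOfWordsSentiment s = new BagOfWordsSentiment(polarity, docFreq, 1000);
        System.out.println(s.score("The camera is easy to use") > 0); // true
        System.out.println(s.score("The tent was leaky") < 0);        // true
    }
}
```

The per-domain hit Bob describes shows up directly here: reusing the
electronics lexicon on another domain just scores the wrong terms.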