Return-Path:
Delivered-To: apmail-mahout-dev-archive@www.apache.org
Received: (qmail 79625 invoked from network); 18 Apr 2011 17:32:44 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 18 Apr 2011 17:32:44 -0000
Received: (qmail 76613 invoked by uid 500); 18 Apr 2011 17:32:44 -0000
Delivered-To: apmail-mahout-dev-archive@mahout.apache.org
Received: (qmail 76570 invoked by uid 500); 18 Apr 2011 17:32:43 -0000
Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@mahout.apache.org
Delivered-To: mailing list dev@mahout.apache.org
Received: (qmail 76561 invoked by uid 99); 18 Apr 2011 17:32:43 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 18 Apr 2011 17:32:43 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 18 Apr 2011 17:32:42 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116])
    by hel.zones.apache.org (Postfix) with ESMTP id B0C03A7BFA
    for ; Mon, 18 Apr 2011 17:32:05 +0000 (UTC)
Date: Mon, 18 Apr 2011 17:32:05 +0000 (UTC)
From: "Chris Jordan (JIRA)"
To: dev@mahout.apache.org
Message-ID: <818432793.64963.1303147925720.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1205695927.64947.1303147685753.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (MAHOUT-675) LuceneIterator throws an IllegalStateException when a null TermFreqVector is encountered for a document instead of skipping to the next one
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[
https://issues.apache.org/jira/browse/MAHOUT-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Jordan updated MAHOUT-675:
--------------------------------

    Attachment: MAHOUT-675

I have patched the LuceneIterator. It is a minor change that also adds a Logger instance.

> LuceneIterator throws an IllegalStateException when a null TermFreqVector is encountered for a document instead of skipping to the next one
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-675
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-675
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Chris Jordan
>         Attachments: MAHOUT-675
>
>
> The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an IllegalStateException in its computeNext() method if it encounters a document with a null term frequency vector for the target field. That is a problem for people developing text mining applications on top of Lucene, because it forces them to check that every document they add to their Lucene indexes actually has terms for the target field. While that check may sound reasonable, in practice it is not.
> In most cases Lucene applies an analyzer to a field in a document as it is added to the index. The StandardAnalyzer is fairly lenient and removes very few terms. If you want better text mining performance, though, you will usually create your own custom analyzer. For example, in my current work on document clustering, I use a stop word list specific to my subject domain and filter out terms that contain numbers, in order to generate tighter clusters and more human-readable top terms. The net result is that some of my documents have no terms for the target field, which is a desirable outcome. When I attempt to dump the Lucene vectors, though, I encounter an IllegalStateException because of those documents.
> It is possible for me to check the TokenStream of the target field before I insert a document into my index. However, if we followed that approach, I would have to perform this check in each of my applications. That isn't a great practice, as someone could be experimenting with custom analyzers to improve text mining performance and then encounter this exception without any real indication that it was caused by the custom analyzer.
> I believe a better approach is to log a warning with the field id of the problem document and then skip to the next one. That way, a warning will be in the logs and the Lucene vector dump process will not halt.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
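
The "log a warning and skip" behaviour proposed in the issue can be illustrated with a minimal, self-contained sketch. This is not the attached patch and it does not use the Lucene or Mahout APIs: IndexedDoc, SkippingVectorIterator and SkipNullVectorsDemo are hypothetical names invented for this example, a plain int[] stands in for a term frequency vector, and java.util.logging stands in for whatever Logger the patch actually adds.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.logging.Logger;

// Hypothetical stand-in for an indexed document; not a Lucene or Mahout type.
final class IndexedDoc {
    final String id;
    final int[] termFreqs; // null models a document with no terms in the target field

    IndexedDoc(String id, int[] termFreqs) {
        this.id = id;
        this.termFreqs = termFreqs;
    }
}

// Iterates over term-frequency arrays, skipping documents that have none
// instead of throwing an IllegalStateException.
final class SkippingVectorIterator implements Iterator<int[]> {
    private static final Logger LOG =
        Logger.getLogger(SkippingVectorIterator.class.getName());

    private final Iterator<IndexedDoc> docs;
    private int[] next;   // look-ahead element, or null when exhausted
    private int skipped;  // number of documents skipped so far

    SkippingVectorIterator(Iterable<IndexedDoc> docs) {
        this.docs = docs.iterator();
        advance();
    }

    // Moves to the next document that has a vector, warning about each one that does not.
    private void advance() {
        next = null;
        while (next == null && docs.hasNext()) {
            IndexedDoc doc = docs.next();
            if (doc.termFreqs == null) {
                skipped++;
                LOG.warning("No term vector for document " + doc.id + "; skipping");
            } else {
                next = doc.termFreqs;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public int[] next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        int[] current = next;
        advance();
        return current;
    }

    int skippedCount() {
        return skipped;
    }
}

final class SkipNullVectorsDemo {
    public static void main(String[] args) {
        List<IndexedDoc> docs = Arrays.asList(
            new IndexedDoc("doc-1", new int[] {2, 1}),
            new IndexedDoc("doc-2", null), // every term was filtered out by the analyzer
            new IndexedDoc("doc-3", new int[] {4}));

        SkippingVectorIterator it = new SkippingVectorIterator(docs);
        int emitted = 0;
        while (it.hasNext()) {
            it.next();
            emitted++;
        }
        // prints "2 vectors emitted, 1 skipped"
        System.out.println(emitted + " vectors emitted, " + it.skippedCount() + " skipped");
    }
}
```

With this shape, a document whose terms were all removed by a custom analyzer leaves a warning in the log while the dump carries on with the remaining documents.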