mahout-dev mailing list archives

From "Chris Jordan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-675) LuceneIterator throws an IllegalStateException when a null TermFreqVector is encountered for a document instead of skipping to the next one
Date Thu, 21 Apr 2011 16:12:05 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022808#comment-13022808 ]

Chris Jordan commented on MAHOUT-675:
-------------------------------------

Oops, sorry about that :-/

Yes, it is definitely ok to license this under the Apache License.

> LuceneIterator throws an IllegalStateException when a null TermFreqVector is encountered for a document instead of skipping to the next one
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-675
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-675
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Chris Jordan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-675, MAHOUT-675-1, MAHOUT-675-1.patch, MAHOUT-675.patch, MAHOUT-675.patch, MAHOUT-675.patch
>
>
> The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an IllegalStateException
> in its computeNext() method if it encounters a document with a null term frequency vector for the
> target field. That is a problem for people developing text mining applications on top of Lucene,
> because it forces them to check that every document they add to their Lucene indexes actually has
> terms for the target field. That check may sound reasonable, but in practice it is not.
> In most cases, Lucene applies an analyzer to a field in a document as it is added to the index.
> The StandardAnalyzer is pretty lenient and removes very few terms. If you want better text mining
> performance, though, you will usually create your own custom analyzer. For example, in my current
> work with document clustering, in order to generate tighter clusters and more human-readable top
> terms, I am using a stop word list specific to my subject domain and filtering out terms that
> contain numbers. The net result is that some of my documents have no terms for the target field,
> which is a desirable outcome. When I attempt to dump the Lucene vectors, though, I encounter an
> IllegalStateException because of those documents.
> It is possible for me to check the TokenStream of the target field before I insert into my index.
> However, following that approach means I would have to perform this check in each of my
> applications. That isn't a great practice, as someone could be experimenting with custom analyzers
> to improve text mining performance and then encounter this exception without any real indication
> that it was due to the custom analyzer.
> I believe a better approach is to log a warning with the field id of the problem document
> and then skip to the next one. That way, a warning will be in the logs and the Lucene vector
> dump process will not halt.
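
The log-and-skip behaviour proposed above can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not the attached patch and not Mahout's actual LuceneIterator. The class name SkippingVectorIterator, the vectorLookup function (standing in for the term frequency vector lookup, assumed to return null when a document has no terms for the target field), and skippedCount() are all hypothetical names introduced for the example.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.logging.Logger;

/**
 * Hypothetical stand-in for an iterator such as LuceneIterator.
 * Documents whose vector lookup returns null are logged and skipped
 * instead of aborting the iteration with an IllegalStateException.
 */
final class SkippingVectorIterator<D, V> implements Iterator<V> {
    private static final Logger LOG =
        Logger.getLogger(SkippingVectorIterator.class.getName());

    private final Iterator<D> docs;
    // Assumption: returns null when the document has no terms for the target field.
    private final Function<D, V> vectorLookup;
    private V next;      // buffered result of the last computeNext() call, or null
    private int skipped; // number of documents skipped so far

    SkippingVectorIterator(Iterator<D> docs, Function<D, V> vectorLookup) {
        this.docs = docs;
        this.vectorLookup = vectorLookup;
    }

    /** Advances to the next document that has a vector; returns null when exhausted. */
    private V computeNext() {
        while (docs.hasNext()) {
            D doc = docs.next();
            V vector = vectorLookup.apply(doc);
            if (vector != null) {
                return vector;
            }
            // Warn and move on rather than halting the whole dump.
            skipped++;
            LOG.warning("No term vector for document " + doc + "; skipping");
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            next = computeNext();
        }
        return next != null;
    }

    @Override
    public V next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        V result = next;
        next = null;
        return result;
    }

    /** Number of documents skipped because their vector was null. */
    int skippedCount() {
        return skipped;
    }
}
```

Wrapping a document source this way would let the vector dump run to completion while each skipped document still leaves a warning in the logs. Keeping a count of skipped documents is an extra design choice, not something the issue asks for.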

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
