lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven Rowe <sar...@syr.edu>
Subject Re: Look for strange encodings -- tokenization
Date Wed, 05 Sep 2007 12:51:33 GMT
poeta simbolista wrote:
> I'd want to know the best way to look for strange encodings on a Lucene
> index.
> i have several inputs where input can have been encoded on different sets. I
> not always know if my guess about the encoding has been ok. Hence, I'd
> thought of querying the index for some typical strings that would show bad
> encodings.

In my experience, the best thing to do first is to look at a random
sample of the data you suspect to be problematic, and keep track of what
you find.  Then, decide based on what you find whether it's worth it to
pursue it further.  (Data is messy, and sometimes it's not worth the
effort to find and fix everything, as long as you know that the
probability of problems is relatively low.)

If you do find that it's worth pursuing, I'd guess that the best spot to
find problems is at index time rather than query time, mostly because at
query time, you don't necessarily know what to look for.  If you did,
then you could already improve your guesser at index time, right?

One technique that you might find useful is to see if a document
contains too many previously unseen terms.  You could index documents in
the same language and subject domain as those which might have
problematic charset conversion issues, but which do not have those
issues themselves, and then tokenize potentially problematically
converted documents, checking for the existence of each term in the
index[1] and keeping track of the ratio of previously unseen terms to
the total number of terms.  If you compare this ratio to that for the
average known good document (and/or the worst-case near-last addition to
the index), you could get an idea about whether or not the document in
question has issues.

Steve

[1]
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)>

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message