lucene-java-user mailing list archives

From: poeta simbolista <>
Subject: Re: Look for strange encodings -- tokenization
Date: Wed, 05 Sep 2007 16:12:09 GMT

Thank you Steven,

I am having problems running those searches; I think it is because the
StandardAnalyzer treats the badly encoded characters as separators, so it
never produces such tokens when reading the documents...
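
For example, this is roughly how I checked which tokens the StandardAnalyzer
actually produces from a badly decoded string (just a sketch against the
Lucene 2.x-era API; the sample text and the field name "contents" are made up):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenizeGarbled {
        public static void main(String[] args) throws Exception {
            // Made-up sample: UTF-8 "café" decoded as Latin-1 typically shows
            // up as "cafÃ©", and undecodable bytes as the replacement char U+FFFD.
            String garbled = "caf\u00C3\u00A9 men\uFFFDs";

            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.tokenStream("contents", new StringReader(garbled));

            // Print every token the analyzer emits, to see whether the mojibake
            // characters survive or are treated as separators.
            Token t;
            while ((t = ts.next()) != null) {
                System.out.println("token: [" + t.termText() + "]");
            }
            ts.close();
        }
    }

That at least shows which characters the analyzer keeps and which it treats
as token boundaries.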

Regarding the other idea you suggested: did you mean that if a document
contains many previously unseen terms, that may indicate encoding problems?
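
If I understood you correctly, it would be something along these lines (only a
sketch, again assuming the Lucene 2.x-era API; the index path "index" and the
field name "contents" are made up):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class UnseenTermRatio {

        // Fraction of the tokens in 'text' that never occur in the index
        // under 'field'.
        public static double unseenRatio(IndexReader reader, String field, String text)
                throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.tokenStream(field, new StringReader(text));

            int total = 0;
            int unseen = 0;
            Token t;
            while ((t = ts.next()) != null) {
                total++;
                // docFreq == 0 means no indexed document contains this term
                if (reader.docFreq(new Term(field, t.termText())) == 0) {
                    unseen++;
                }
            }
            ts.close();
            return total == 0 ? 0.0 : (double) unseen / total;
        }

        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("index");  // made-up index path
            double ratio = unseenRatio(reader, "contents",
                    "text of a possibly mis-decoded document");
            System.out.println("unseen-term ratio: " + ratio);
            reader.close();
        }
    }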

Also, what I would like is to at least be able to measure the impact of such
problems, so I can decide whether the effort will pay off :)
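
My first idea for measuring it is simply to scan the indexed terms for
suspicious characters and see how many documents they reach, roughly like this
(again only a sketch; the index path, the field name, and the choice of U+FFFD
and 'Ã' as "suspicious" characters are just my guesses):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class CountSuspiciousTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("index");  // made-up index path

            int total = 0;
            int suspicious = 0;
            TermEnum terms = reader.terms();
            while (terms.next()) {
                Term term = terms.term();
                if (!"contents".equals(term.field())) {  // made-up field name
                    continue;
                }
                total++;
                String text = term.text();
                // U+FFFD and 'Ã' (U+00C3) are two common symptoms of UTF-8
                // text decoded with the wrong charset.
                if (text.indexOf('\uFFFD') >= 0 || text.indexOf('\u00C3') >= 0) {
                    suspicious++;
                    System.out.println(text + " (in " + reader.docFreq(term) + " docs)");
                }
            }
            terms.close();
            reader.close();

            System.out.println(suspicious + " suspicious term(s) out of " + total);
        }
    }

The docFreq of each suspicious term should give a rough idea of how many
documents are affected.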


Steven Rowe wrote:
> poeta simbolista wrote:
>> I'd like to know the best way to look for strange encodings in a Lucene
>> index. I have several inputs that may have been encoded with different
>> character sets, and I don't always know whether my guess about the
>> encoding was right. Hence, I thought of querying the index for some
>> typical strings that would reveal bad encodings.
> In my experience, the best thing to do first is to look at a random
> sample of the data you suspect to be problematic, and keep track of what
> you find.  Then decide, based on what you find, whether it's worth
> pursuing further.  (Data is messy, and sometimes it's not worth the
> effort to find and fix everything, as long as you know that the
> probability of problems is relatively low.)
> If you do find that it's worth pursuing, I'd guess that the best spot to
> find problems is at index time rather than query time, mostly because at
> query time, you don't necessarily know what to look for.  If you did,
> then you could already improve your guesser at index time, right?
> One technique that you might find useful is to see if a document
> contains too many previously unseen terms.  You could index documents in
> the same language and subject domain as those which might have
> problematic charset conversion issues, but which do not have those
> issues themselves, and then tokenize potentially problematically
> converted documents, checking for the existence of each term in the
> index[1] and keeping track of the ratio of previously unseen terms to
> the total number of terms.  If you compare this ratio to that for the
> average known good document (and/or the worst-case near-last addition to
> the index), you could get an idea about whether or not the document in
> question has issues.
> Steve
> [1]
> <>
> -- 
> Steve Rowe
> Center for Natural Language Processing
