lucene-java-user mailing list archives

From poeta simbolista <>
Subject Look for strange encodings -- tokenization
Date Tue, 04 Sep 2007 14:38:40 GMT

Hi all,

I'd like to know the best way to look for strange encodings in a Lucene index. I
have several inputs that may have been encoded with different character sets, and I
don't always know whether my guess about the encoding was right. Hence, I'd
thought of querying the index for some typical strings that would show bad
encodings.
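
For instance, one typical fingerprint of UTF-8 text that was decoded as Latin-1 is 'Ã' or 'Â' followed by another non-ASCII character. A minimal sketch of such a check in plain Java (the class and method names are hypothetical, not part of Lucene; you would run it over the terms pulled from the index):

```java
public class MojibakeCheck {
    // Flags a term containing 'Ã' (U+00C3) or 'Â' (U+00C2) followed by
    // another non-ASCII character -- the usual fingerprint of UTF-8 bytes
    // read as ISO-8859-1. Hypothetical helper, not a Lucene API.
    static boolean looksMojibake(String term) {
        for (int i = 0; i + 1 < term.length(); i++) {
            char c = term.charAt(i);
            if ((c == '\u00C3' || c == '\u00C2') && term.charAt(i + 1) >= '\u0080') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksMojibake("caf\u00C3\u00A9")); // "cafÃ©" -> true
        System.out.println(looksMojibake("caf\u00E9"));       // "café"  -> false
    }
}
```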

The whole index has already been built using the StandardAnalyzer. I
have read that querying with a different analyzer can yield unexpected results... But
I suppose that's acceptable for my purposes - testing the quality of the index.

Which way do you think is better to tackle this issue? I've been taking a
look at the analyzers -- the StandardAnalyzer in particular. I thought about creating a
custom tokenizer that splits on letters, numbers, and spaces, so that it only leaves
"weird" strings as tokens -- those would show bad encodings. Still, and
possibly due to my lack of knowledge of Lucene :) I have the feeling this can
be done better somehow.
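
A minimal sketch of that splitting idea in plain Java, independent of Lucene's Tokenizer API (the class and method names are hypothetical): it discards letters, digits, and whitespace, and keeps the remaining runs as candidate markers of encoding damage. Note that ordinary punctuation survives too, so in practice you'd probably whitelist common punctuation as well.

```java
import java.util.ArrayList;
import java.util.List;

public class WeirdTokenizer {
    // Splits on letters, digits, and whitespace, keeping only the
    // remaining character runs as tokens. Hypothetical sketch, not
    // a Lucene Tokenizer subclass.
    static List<String> weirdTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                // A "normal" character ends the current weird run, if any.
                if (run.length() > 0) {
                    tokens.add(run.toString());
                    run.setLength(0);
                }
            } else {
                run.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // U+FFFD (the replacement character) typically signals a
        // decoding failure; the comma shows punctuation surfacing too.
        System.out.println(weirdTokens("caf\uFFFD latte, ok"));
    }
}
```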

Thanks a lot in advance!
