Message-ID: <12504196.post@talk.nabble.com>
Date: Wed, 5 Sep 2007 09:12:09 -0700 (PDT)
From: poeta simbolista <poetasimbolista@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Look for strange encodings -- tokenization
In-Reply-To: <46DEA655.6070108@syr.edu>

Thank you Steven,

I have trouble running those searches; I think it is because StandardAnalyzer
treats those badly encoded characters as token separators, so the tokens I am
searching for never get created at indexing time...

Regarding the other idea you suggested: did you mean that if a document
contains many previously unseen terms, that may indicate encoding problems?

Also, I would like to be able to at least measure the impact of such problems,
so I can decide whether the effort will pay off :)

Cheers,
P
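For reference, this is roughly how I have been checking what StandardAnalyzer
does with one of the suspect strings (a quick, untested sketch against the 2.x
API I have here; the sample string and field name are just placeholders):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        // "caf\u00C3\u00A9" is "café" whose UTF-8 bytes were decoded as
        // Latin-1, i.e. the kind of mojibake I am trying to find.
        String suspect = "caf\u00C3\u00A9 and some normal words";
        TokenStream ts =
            new StandardAnalyzer().tokenStream("contents", new StringReader(suspect));
        // Print each token so I can see whether the garbled characters
        // survive tokenization or get dropped/split by the analyzer.
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            System.out.println("[" + tok.termText() + "]");
        }
        ts.close();
    }
}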
Steven Rowe wrote:
>
> poeta simbolista wrote:
>> I'd like to know the best way to look for strange encodings in a Lucene
>> index. I have several inputs that may have been encoded with different
>> character sets, and I don't always know whether my guess about the
>> encoding was right. Hence I thought of querying the index for some
>> typical strings that would reveal bad encodings.
>
> In my experience, the best thing to do first is to look at a random
> sample of the data you suspect to be problematic, and keep track of what
> you find. Then decide, based on what you find, whether it's worth
> pursuing further. (Data is messy, and sometimes it's not worth the
> effort to find and fix everything, as long as you know that the
> probability of problems is relatively low.)
>
> If you do find that it's worth pursuing, I'd guess that the best spot to
> find problems is at index time rather than query time, mostly because at
> query time you don't necessarily know what to look for. If you did,
> then you could already improve your guesser at index time, right?
>
> One technique that you might find useful is to see whether a document
> contains too many previously unseen terms. You could index documents in
> the same language and subject domain as those which might have
> problematic charset conversion issues, but which do not have those
> issues themselves, then tokenize the potentially problematic documents,
> checking for the existence of each term in the index [1] and keeping
> track of the ratio of previously unseen terms to the total number of
> terms. If you compare this ratio to that of the average known-good
> document (and/or the worst-case near-last addition to the index), you
> can get an idea of whether or not the document in question has issues.
>
> Steve
>
> [1]
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
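If I understand the unseen-terms idea correctly, this is roughly what I will
try (again an untested sketch against the 2.x API; "goodIndex" and the
"contents" field are placeholders for an index built only from documents whose
encoding I trust):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class UnseenTermRatio {

    // Tokenizes 'text' with StandardAnalyzer and returns the fraction of
    // tokens that never occur in the given field of the known-good index.
    public static double unseenRatio(IndexReader reader, String field, String text)
            throws Exception {
        TokenStream ts =
            new StandardAnalyzer().tokenStream(field, new StringReader(text));
        int total = 0;
        int unseen = 0;
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            total++;
            if (reader.docFreq(new Term(field, tok.termText())) == 0) {
                unseen++;   // term does not exist in the clean index
            }
        }
        ts.close();
        return total == 0 ? 0.0 : (double) unseen / total;
    }

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("goodIndex");
        String suspect = "...text of a document whose encoding I am unsure about...";
        System.out.println("unseen-term ratio: "
                + unseenRatio(reader, "contents", suspect));
        reader.close();
    }
}

My plan is to compute this ratio first for a handful of documents I know are
fine, to get a baseline, and then flag documents whose ratio is much higher
than that.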