Message-ID: <12504196.post@talk.nabble.com>
Date: Wed, 5 Sep 2007 09:12:09 -0700 (PDT)
From: poeta simbolista <poetasimbolista@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Look for strange encodings -- tokenization
In-Reply-To: <46DEA655.6070108@syr.edu>

Thank you Steven,

I have trouble running those searches; I think it is because StandardAnalyzer
treats those badly encoded characters as token separators, so the tokens I am
searching for never get created at indexing time...

Regarding the other idea you suggested: did you mean that if a document
contains many previously unseen terms, that may indicate encoding problems?

Also, I would like to be able to at least measure the impact of such problems,
so I can decide whether the effort will pay off :)

Cheers,
P
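For reference, this is roughly how I have been checking what StandardAnalyzer
does with one of the suspect strings (a quick, untested sketch against the 2.x
API I have here; the sample string and field name are just placeholders):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        // "caf\u00C3\u00A9" is "café" whose UTF-8 bytes were decoded as
        // Latin-1, i.e. the kind of mojibake I am trying to find.
        String suspect = "caf\u00C3\u00A9 and some normal words";
        TokenStream ts =
            new StandardAnalyzer().tokenStream("contents", new StringReader(suspect));
        // Print each token so I can see whether the garbled characters
        // survive tokenization or get dropped/split by the analyzer.
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            System.out.println("[" + tok.termText() + "]");
        }
        ts.close();
    }
}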
Steven Rowe wrote:
>
> poeta simbolista wrote:
>> I'd like to know the best way to look for strange encodings in a Lucene
>> index. I have several inputs that may have been encoded with different
>> character sets, and I don't always know whether my guess about the
>> encoding was right. Hence I thought of querying the index for some
>> typical strings that would reveal bad encodings.
>
> In my experience, the best thing to do first is to look at a random
> sample of the data you suspect to be problematic, and keep track of what
> you find. Then decide, based on what you find, whether it's worth
> pursuing further. (Data is messy, and sometimes it's not worth the
> effort to find and fix everything, as long as you know that the
> probability of problems is relatively low.)
>
> If you do find that it's worth pursuing, I'd guess that the best spot to
> find problems is at index time rather than query time, mostly because at
> query time you don't necessarily know what to look for. If you did,
> then you could already improve your guesser at index time, right?
>
> One technique that you might find useful is to see whether a document
> contains too many previously unseen terms. You could index documents in
> the same language and subject domain as those which might have
> problematic charset conversion issues, but which do not have those
> issues themselves, then tokenize the potentially problematic documents,
> checking for the existence of each term in the index [1] and keeping
> track of the ratio of previously unseen terms to the total number of
> terms. If you compare this ratio to that of the average known-good
> document (and/or the worst-case near-last addition to the index), you
> can get an idea of whether or not the document in question has issues.
>
> Steve
>
> [1]
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
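If I understand the unseen-terms idea correctly, this is roughly what I will
try (again an untested sketch against the 2.x API; "goodIndex" and the
"contents" field are placeholders for an index built only from documents whose
encoding I trust):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class UnseenTermRatio {

    // Tokenizes 'text' with StandardAnalyzer and returns the fraction of
    // tokens that never occur in the given field of the known-good index.
    public static double unseenRatio(IndexReader reader, String field, String text)
            throws Exception {
        TokenStream ts =
            new StandardAnalyzer().tokenStream(field, new StringReader(text));
        int total = 0;
        int unseen = 0;
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            total++;
            if (reader.docFreq(new Term(field, tok.termText())) == 0) {
                unseen++;   // term does not exist in the clean index
            }
        }
        ts.close();
        return total == 0 ? 0.0 : (double) unseen / total;
    }

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("goodIndex");
        String suspect = "...text of a document whose encoding I am unsure about...";
        System.out.println("unseen-term ratio: "
                + unseenRatio(reader, "contents", suspect));
        reader.close();
    }
}

My plan is to compute this ratio first for a handful of documents I know are
fine, to get a baseline, and then flag documents whose ratio is much higher
than that.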