From: gaz77
To: java-user@lucene.apache.org
Subject: Re: Confused with NGRAM results
Date: Thu, 28 Aug 2008 15:50:46 -0700 (PDT)

Thanks for the pointer.
I've gone into this in some depth, using the AnalyzerUtils class from the Lucene in Action book. It seems that the NGramTokenFilter only processes part of the string that goes in: it stops tokenising part way through, which is why the documents weren't found in the results.

I've had a look at the source code, and I think it's because the next() function returns null when it hits a token smaller than the minimum n-gram size. For example, if I set the minimum to 3, then a 2-character token will cause it to return null, and nothing after that token gets processed.

I'm not sure if this is by design or a bug. Either way, at least I know what's causing it now.

Cheers
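
The suspected behaviour can be illustrated with a minimal, self-contained sketch. This is not Lucene's source code: the class NGramSketch and its methods are hypothetical stand-ins that only mimic what is described above, and for simplicity they use a single n-gram size rather than a min/max range. The first method ends the stream at the first token that is too short (analogous to next() returning null), while the second skips the short token and carries on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an n-gram token filter. NOT Lucene's code;
// it only models the behaviour suspected in this message.
final class NGramSketch {

    // All n-grams of exactly size n for one token.
    static List<String> grams(String token, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            out.add(token.substring(i, i + n));
        }
        return out;
    }

    // Suspected behaviour: a token shorter than n ends the whole stream,
    // so every token after it is silently dropped.
    static List<String> stopOnShortToken(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < n) {
                break; // analogous to next() returning null (end of stream)
            }
            out.addAll(grams(token, n));
        }
        return out;
    }

    // Alternative behaviour: skip the short token and keep going.
    static List<String> skipShortToken(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < n) {
                continue; // ignore this token only
            }
            out.addAll(grams(token, n));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "in", "action");
        System.out.println(stopOnShortToken(tokens, 3)); // [luc, uce, cen, ene]
        System.out.println(skipShortToken(tokens, 3));   // [luc, uce, cen, ene, act, cti, tio, ion]
    }
}
```

With a minimum of 3, the 2-character token "in" cuts the first result short, so nothing from "action" is indexed. That would match the symptom of documents not being found for terms that appear after a short word.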