lucene-java-user mailing list archives

From gaz77 <gareth.c...@bit10.net>
Subject Re: Confused with NGRAM results
Date Mon, 01 Sep 2008 11:48:36 GMT

Hi Otis,

The original message text is:



> Hi,
> 
> I'd appreciate if someone could explain the results I'm getting.
> 
> I've written a simple custom analyzer that applies the NGramTokenFilter to
> the token stream during indexing. It's never applied during searching. The
> purpose of this is to match sub-words.
> 
> Without the ngram filter, a search on the word 'postcode' returns 2
> documents, and a search on 'code' returns 6 documents (with no overlap
> with the postcode results).
> 
> If I apply the ngram filter with min 1 and max 10, searching 'postcode'
> returns the same 2 docs, while searching 'code' returns 9 docs. This sort
> of feels right.
> 
> The problem comes when I set the min ngram size to 3, and the max to 5.
> Searching 'postcode' returns no results (as expected), but searching
> 'code' only returns 2 docs (the 2 normally returned by a 'postcode'
> search).
> 
> This last result for 'code' just doesn't seem correct - it should be
> returning at least the 6 docs from the original search.
> 
> I'd really appreciate some advice on what is going on with the ngram
> filter.
> 
> Thanks
> 
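
For reference, here is a minimal sketch of the kind of index-time analyzer
described in the quoted message above, built on the contrib NGramTokenFilter
and the Lucene 2.x Analyzer API. The class name and filter chain are
illustrative, not code taken from this thread. Note that even with min 3 and
max 5 the grams of 'postcode' include 'cod' and 'code', which is consistent
with a search on 'code' still matching the two postcode documents.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

// Illustrative index-time analyzer: tokenize on whitespace, lowercase,
// then expand each token into character n-grams of length minGram..maxGram.
public class NGramIndexAnalyzer extends Analyzer {

    private final int minGram;
    private final int maxGram;

    public NGramIndexAnalyzer(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    // Applied at index time only; the query-time analyzer does no n-gramming.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new NGramTokenFilter(stream, minGram, maxGram);
        return stream;
    }
}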



Otis Gospodnetic wrote:
> 
> This actually sounds like a bug to me, but you removed the text from your
> original email, so I don't know what context this was in.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: gaz77 <gareth.cole@bit10.net>
>> To: java-user@lucene.apache.org
>> Sent: Friday, August 29, 2008 12:50:46 AM
>> Subject: Re: Confused with NGRAM results
>> 
>> 
>> Thanks for the pointer.
>> 
>> I've gone into this in some depth, using the AnalyzerUtils class from the
>> Lucene in Action book.
>> 
>> It seems that the NGramTokenFilter is only processing part of the string
>> that goes in. It stops tokenising the words part way through. That's why
>> the documents weren't found in the results.
>> 
>> I've had a look at the source code, and I think it's because the next()
>> function returns null when it hits a token smaller than the min ngram size.
>> For example, if I set the minimum to 3, then a 2-character token will cause
>> it to return null.
>> 
>> I'm not sure if this is by design or a bug. Either way, at least I know
>> what's causing it now.
>> 
>> Cheers
>> 
>> 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19210665.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
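
To reproduce what is described in the quoted message above, a rough
token-dumping loop in the style of the AnalyzerUtils helper from Lucene in
Action can be run against the same filter chain. The sample text here is made
up for illustration; with the minimum gram size set to 3, the two-letter token
'my' is the kind of short token that reportedly stops the stream.

import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public class DumpNGrams {
    public static void main(String[] args) throws Exception {
        // Same chain as an index-time ngram analyzer: whitespace tokens,
        // lowercased, then 3- to 5-character grams.
        TokenStream stream = new NGramTokenFilter(
                new LowerCaseFilter(
                        new WhitespaceTokenizer(
                                new StringReader("Enter my postcode"))),
                3, 5);

        // Old-style token loop; next() is supposed to return null only at
        // the end of the stream. If it also returns null on the 2-character
        // token "my", the grams of "postcode" never get printed, which would
        // match the behaviour reported above.
        Token token;
        while ((token = stream.next()) != null) {
            System.out.print("[" + token.termText() + "] ");
        }
        System.out.println();
    }
}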

-- 
View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19253386.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

