lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Partial / starts with searching
Date Fri, 13 Feb 2009 09:52:08 GMT
If you attach an NGramTokenFilter to your analyzer at both index and
query time, you should be able to query for parts of a word.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html

The classes are available in the contrib/analyzers module.
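
To make the idea concrete, here is a minimal plain-Java sketch of the
terms an n-gram filter and an edge n-gram filter would produce for one
word. It does not use the Lucene classes; the names NGramSketch, nGrams
and edgeNGrams are made up for illustration, and the real filters may
emit their grams in a different order.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT the Lucene NGramTokenFilter API.
class NGramSketch {

    // All substrings of term whose length is between minGram and maxGram.
    static List<String> nGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= term.length(); start++) {
                grams.add(term.substring(start, start + size));
            }
        }
        return grams;
    }

    // Only the grams anchored at the start of the term (the front edge).
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram && size <= term.length(); size++) {
            grams.add(term.substring(0, size));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("ambulance", 3, 3));     // inner and edge 3-grams
        System.out.println(edgeNGrams("ambulance", 3, 5)); // front-edge grams only
    }
}
```

If "ambulance" is indexed as these grams, a query for "ambu" matches
because "ambu" is itself one of the indexed edge grams, so no wildcard
expansion is needed.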

You might want to boost edge grams a bit more than inner grams. Start
by trying gram sizes of something like 3 to 5 characters.

Be aware that this will produce a rather large index.
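
As a rough illustration of that growth, the sketch below counts how
many grams a single term expands to, assuming every gram between the
minimum and maximum size is indexed. GramCountSketch and gramCount are
hypothetical names, not part of Lucene.

```java
// Plain-Java illustration only; not part of the Lucene API.
class GramCountSketch {

    // Number of n-grams that one term of the given length expands to
    // when all gram sizes from minGram to maxGram are generated.
    static int gramCount(int termLength, int minGram, int maxGram) {
        int count = 0;
        for (int size = minGram; size <= maxGram; size++) {
            // A term of length L contains L - size + 1 grams of this size,
            // or none if the term is shorter than the gram.
            count += Math.max(0, termLength - size + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        // "ambulance" has 9 characters: 7 + 6 + 5 grams of size 3, 4 and 5.
        System.out.println(gramCount("ambulance".length(), 3, 5));
    }
}
```

With 3 to 5 character grams, the 9-letter term "ambulance" becomes 18
index terms instead of one.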


       karl

On 13 Feb 2009, at 10:43, d-fader wrote:

> Karl,
>
> As a matter of fact, I more or less did. I'm not really into n-grams,
> but I read some articles about the technique and eventually ended up
> at the 'Did you mean: Lucene?' article written by Tom White. To make a
> long story short, this partially solved my problem. I now have 2
> indexes, and I've written code that extracts all terms a user entered,
> puts them through the suggestion engine and tries to be clever about
> which suggestion to use: stop words are ignored; a term that already
> occurs more than x times in the index is probably good (so no
> suggestion is needed); and when suggestions are available, the one
> with the most occurrences in the index is presented. After that, the
> original query is built up again, preserving all command codes (like
> ", ( ), AND, OR, etc.).
> As said, this system works pretty well, and when a suggestion is
> available it's usually quite accurate, so thanks for this.
>
> Still, it doesn't fully solve my problem. But I think I now know why
> Lucene can't do a 'truly' partial search. To find a document fast, each
> term is stored with a list of the documents that contain it, and when
> a user searches, Lucene identifies the documents by comparing the
> terms entered to the terms in that list, right? If so, it's
> understandable that a true partial search will never work, but then
> I just don't understand how Google manages to do this :)
>
> Jori.
>
>
>
>
> Karl Wettin wrote:
>> Hi again Jori,
>>
>> did you try N-grams as suggested in the reply on -dev?
>>
>>
>>     karl
>>
>> On 13 Feb 2009, at 09:05, d-fader wrote:
>>
>>> Hi,
>>>
>>> I actually posted this message to the dev mailing list earlier,
>>> because I thought my 'issue' was a limitation of Lucene's
>>> functionality, but they redirected me to this mailing list, so I hope
>>> one of you guys can help me out :)
>>>
>>> Maybe the 'issue' I'm raising has been discussed thoroughly already;
>>> in that case I think I need some pointers to those discussions :)
>>> Anyway, here's the thing.
>>> As far as I know, it's impossible to search for partial words with
>>> Lucene (except with the asterisk method, e.g. with the
>>> StandardAnalyzer -> ambul* to find ambulance). My problem with that
>>> method is that my index consists of quite a few terms. This means
>>> that if a user searches for 'ambu amster' (ambulance amsterdam),
>>> there will be so many terms to search that the waiting time is just
>>> unacceptable. Now I started thinking about why it's impossible to
>>> search for only a 'part' of a term, or even only the 'start' of a
>>> term, and the only reason I could think of was that the index terms
>>> are stored tokenized (in that way you (of course) can't find partial
>>> terms, since the index doesn't actually contain the literal terms,
>>> but tokens instead). But Lucene can also store all terms untokenized,
>>> so in that case, in my humble opinion, a partial search would be
>>> possible, since all terms would be stored 'literally'.
>>>
>>> Maybe my thinking is wrong; I only have a black-box view of Lucene,
>>> so I don't know much about the indexing algorithm and all, but I just
>>> want to know whether this could be done, or else why not :) You see,
>>> the users of my index want to know why they can't search for parts of
>>> the words they enter, and I still can't give them a really good
>>> answer, except the 'it would result in too many OR operators in the
>>> query' statement :) . I've tried using a Dutch stemmer (most of the
>>> data I'm indexing is Dutch), but that didn't work out very well.
>>> Furthermore, users sometimes search for a certain 'filename', and
>>> mostly they just enter a part of the name and thus don't find
>>> anything.
>>>
>>> I hope someone can enlighten me :) Thanks in advance!
>>>
>>> Jori
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>

