lucene-java-user mailing list archives

From Karl Wettin
Subject Re: Partial / starts with searching
Date Sat, 14 Feb 2009 11:27:44 GMT
You probably only want to use n-grams for the text fields, leaving the
user name field untokenized. As for losing text field words shorter than
3 characters: consider letting them through, perhaps by implementing a
filter that passes longer words on to an n-gram filter and returns the
shorter input tokens unchanged.
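
To make the pass-through idea concrete, here is a minimal sketch of the logic in plain Java. It deliberately does not use the Lucene TokenFilter API; the class and method names (ShortTokenPassThrough, ngrams, analyze) are made up for illustration, and a real filter would have to be written against whatever analysis API your Lucene version provides.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the filter logic; hypothetical names, not the Lucene API. */
final class ShortTokenPassThrough {

    /** All substrings of token with length minGram..maxGram, shortest size first. */
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= token.length(); start++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }

    /** Tokens shorter than minGram are emitted unchanged; the rest are n-grammed. */
    static List<String> analyze(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                out.add(token); // a plain n-gram step would produce nothing here
            } else {
                out.addAll(ngrams(token, minGram, maxGram));
            }
        }
        return out;
    }
}
```

With 3-5 grams, the tokens "js" and "defg" would come out as js, def, efg, defg, so the two-letter word survives instead of silently disappearing.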


On 13 Feb 2009, at 14:39, d-fader wrote:

> Well, it worked. I indexed a test database and it indeed grew  
> somewhat (from 16 MiB to 200 MiB :)), and it works flawlessly.  
> Still, I can't use the result in my application :)
> The 'live' index database contains about 2 million documents and is  
> used by a multi-user application. As you probably can imagine, not  
> everyone may see everything, there are documents that can be seen by  
> everyone, documents that can be seen by some and also documents that  
> only can be seen by one person. At design time, since we used the  
> StandardAnalyzer, we decided to create a field in each document in  
> which we store the 'login name' of each user that may see the  
> document (2 to 4 characters per user, in most cases 2) and that's  
> where the hiccup occurs. When I index it with the NGramTokenFilter
> (3-5) it doesn't seem to index anything with 2 letters. I checked in  
> Luke too, if I search for UserInitials:(JS BD), Luke's query  
> explanation is empty. When I search for UserInitials:(ABC) it seems  
> to do the job well, but when I search for DEFG, the query
> explanation looks like UserAccessInitials:"def efg defg" and that is
> unacceptable, since there can be a user DEFG and a user EFG
> available in the system.
> So I think in my case it just won't work, unless I rewrite the 'who  
> may see this document' code pretty drastically, if even possible  
> without losing too much 'searching' speed.
> ...or am I wrong?
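
The two symptoms described above (2-letter names vanish, EFG collides with DEFG) can be reproduced with a small plain-Java sketch. It is only an approximation: matching is simplified to "any query gram equals an indexed gram", whereas the query explanation above shows a phrase query, and all names (AccessFieldSketch, grams, matchesNGrammed, matchesUntokenized) are hypothetical. For the single-gram query EFG the simplification makes no difference.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch of an access field analyzed two ways; hypothetical names, no Lucene. */
final class AccessFieldSketch {

    /** Lower-cased substrings of length minGram..maxGram, in generation order. */
    static Set<String> grams(String value, int minGram, int maxGram) {
        String s = value.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                out.add(s.substring(i, i + n));
            }
        }
        return out;
    }

    /** N-grammed field, simplified: a hit if any query gram is an indexed gram. */
    static boolean matchesNGrammed(String indexedUser, String queryUser) {
        Set<String> indexed = grams(indexedUser, 3, 5);
        for (String g : grams(queryUser, 3, 5)) {
            if (indexed.contains(g)) {
                return true;
            }
        }
        return false;
    }

    /** Untokenized field: the whole login name is the only term. */
    static boolean matchesUntokenized(String indexedUser, String queryUser) {
        return indexedUser.equals(queryUser);
    }
}
```

Under these assumptions, a document visible to DEFG matches a query for EFG when the field is n-grammed, and a document visible to JS matches nothing at all, because "JS" yields no 3-5 grams. With the untokenized variant both cases behave as intended, which is why the access field is better kept out of the n-gram analysis.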
> Karl Wettin wrote:
>> If you attach an NGramTokenFilter to your analyzer at index and
>> query time you should be able to query for parts of the word.
>> The classes are available in the contrib/analyzers module.
>> You might want to boost edges a bit more than inner parts, start  
>> trying out with something like 3-5 grams.
>> Be aware, this will produce a rather large index.
>>      karl
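
As a rough illustration of why the index grows, and of what "edges" means here, the following plain-Java sketch counts the 3-5 grams of a token and lists only the grams anchored at its start. The names (GramCountSketch, gramCount, frontEdgeGrams) are hypothetical and nothing here calls Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: how many terms one token turns into, and its front-edge grams. */
final class GramCountSketch {

    /** Number of n-grams of sizes minGram..maxGram in a token of the given length. */
    static int gramCount(int tokenLength, int minGram, int maxGram) {
        int count = 0;
        for (int n = minGram; n <= maxGram; n++) {
            count += Math.max(0, tokenLength - n + 1);
        }
        return count;
    }

    /** Only the grams that start at the first character of the token. */
    static List<String> frontEdgeGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }
}
```

A 9-letter word such as "ambulance" becomes 7 + 6 + 5 = 18 terms with 3-5 grams instead of 1, while its front-edge grams are just amb, ambu and ambul. If only "starts with" matching is needed, indexing the edge grams alone should keep the growth much smaller, at the cost of losing matches inside the word.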
>> On 13 Feb 2009, at 10:43, d-fader wrote:
>>> Karl,
>>> As a matter of fact I more or less did. I'm not really into  
>>> NGrams, but I read some articles about this technique and I  
>>> eventually ended up at the 'Did you mean: Lucene?' article written  
>>> by Tom White. To make a long story short, this solved my problem  
>>> partially. I have 2 indexes now and I've written code that
>>> extracts all terms a user entered, puts them through the suggestion
>>> engine and tries to be clever about which suggestion should be
>>> used. Stop words are ignored; when the entered term already occurs
>>> more than x times in the index it's probably good (and thus no
>>> suggestion is needed); and when there are suggestions available,
>>> the suggestion with the most occurrences in the index is presented.
>>> After that the original query is built up again, preserving all
>>> command codes (like ", ( ), AND, OR, etc.).
>>> As said, this system works pretty well and mostly if there's a  
>>> suggestion available, it's actually quite accurate, so thanks for  
>>> this.
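
The selection rule described above could look roughly like this in plain Java. This is a guess at the logic from the description only, not the poster's code: the class name, the parameter names and the idea of passing frequencies in as a map are all assumptions, and the candidate list stands in for whatever the suggestion engine returns.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the "which suggestion to use" rule; hypothetical names throughout. */
final class SuggestionPickerSketch {

    /**
     * Returns the term to use in the rebuilt query.
     * docFreq maps a term to how often it occurs in the index (assumed input).
     */
    static String pick(String term, Set<String> stopWords, Map<String, Integer> docFreq,
                       List<String> candidates, int threshold) {
        if (stopWords.contains(term)) {
            return term; // stop words are never replaced
        }
        if (docFreq.getOrDefault(term, 0) > threshold) {
            return term; // frequent enough already, no suggestion needed
        }
        String best = term; // stays as entered when there are no candidates
        int bestFreq = -1;
        for (String candidate : candidates) {
            int freq = docFreq.getOrDefault(candidate, 0);
            if (freq > bestFreq) {
                best = candidate; // keep the candidate seen most often in the index
                bestFreq = freq;
            }
        }
        return best;
    }
}
```

The threshold corresponds to the "x times" mentioned above; its value would have to be tuned against the real index.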
>>> Still, it doesn't solve my problem fully. But I think I now know  
>>> why Lucene can't search 'truly' partially. To find a document
>>> fast, all terms are stored with a list of documents which contain  
>>> the term and when a user searches, Lucene can identify the  
>>> documents by comparing the terms entered to the terms on that  
>>> list, right? If so, it's understandable that a true partial search  
>>> never will work, but then I just don't understand how Google  
>>> manages to do this :)
>>> Jori.
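
The picture of "terms stored with a list of documents" can be sketched as a tiny inverted index in plain Java. This is a toy model under simplifying assumptions (a sorted in-memory map, integer document ids, hypothetical names), not how Lucene actually lays out its term dictionary, but it shows why exact and prefix lookups are cheap while a match in the middle of a term is not.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: sorted term dictionary, each term mapped to document ids. */
final class InvertedIndexSketch {

    private final TreeMap<String, SortedSet<Integer>> postings = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Exact term: a single dictionary lookup. */
    SortedSet<Integer> exact(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
    }

    /** Prefix: a range scan over the sorted dictionary, one list per matching term. */
    SortedSet<Integer> prefix(String prefix) {
        SortedSet<Integer> docs = new TreeSet<>();
        for (SortedSet<Integer> list
                : postings.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
            docs.addAll(list);
        }
        return docs;
    }

    /** Infix: sorting does not help, so every term in the dictionary is inspected. */
    SortedSet<Integer> infix(String part) {
        SortedSet<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, SortedSet<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().contains(part)) {
                docs.addAll(entry.getValue());
            }
        }
        return docs;
    }
}
```

A prefix such as "ambu" still touches every term that starts with it, which is the "too many terms" cost of the wildcard approach on a large dictionary. N-grams sidestep this by storing the word parts themselves as terms, so a partial match becomes an exact lookup again, paid for with a larger index.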
>>> Karl Wettin wrote:
>>>> Hi again Jori,
>>>> did you try N-grams as suggested in the reply on -dev?
>>>>    karl
>>>> On 13 Feb 2009, at 09:05, d-fader wrote:
>>>>> Hi,
>>>>> I've actually posted this message in the dev mailing list earlier,
>>>>> because I thought my 'issue' is a limitation of the functionality of
>>>>> Lucene, but they redirected me to this mailing list, so I hope one of
>>>>> you guys can help me out :)
>>>>> Maybe the 'issue' I'm addressing now has been discussed thoroughly
>>>>> already; in that case I think I need some redirection to the sources
>>>>> of those discussions :) Anyway, here's the thing.
>>>>> As far as I know it's impossible to search partial words with Lucene
>>>>> (except the asterisk method with e.g. the StandardAnalyzer -> ambul*
>>>>> to find ambulance). My problem with that method is that my index
>>>>> consists of quite a few terms. This means that if a user would search
>>>>> for 'ambu amster' (ambulance amsterdam), there will be so many terms
>>>>> to search, the waiting time is just unacceptable. Now I started
>>>>> thinking why it's impossible to search only a 'part' of a term or
>>>>> even only the 'start' of a term and the only reason I could think of
>>>>> was that the index terms are stored tokenized (in that way you (of
>>>>> course) can't find partial terms, since the index doesn't actually
>>>>> contain the literal terms, but tokens instead). But Lucene can also
>>>>> store all terms untokenized, so in that case, in my humble opinion, a
>>>>> partial search would be possible, since all terms would be stored
>>>>> 'literally'.
>>>>> Maybe my thinking is wrong, I only have a black box view of Lucene, so
>>>>> I don't know much about the indexing algorithm and all, but I just want
>>>>> to know if this could be done or else why not :) You see, the users of
>>>>> my index want to know why they can't search parts of the words they
>>>>> enter and I still can't give them a really good answer, except the 'it
>>>>> would result in too many OR operators in the query' statement :). I've
>>>>> tried using a Dutch stemmer (most of the data I'm indexing is Dutch)
>>>>> but that didn't work out very well. Furthermore users sometimes search
>>>>> for a certain 'filename' and mostly they just enter a part of the name
>>>>> and thus don't find anything.
>>>>> I hope someone can enlighten me :) Thanks in advance!
>>>>> Jori
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail: