lucene-java-user mailing list archives

From d-fader <dfa...@gmail.com>
Subject Re: Partial / starts with searching
Date Fri, 13 Feb 2009 13:52:39 GMT
Thanks, I will. It might take a while, but I'll post my findings.

Erick Erickson wrote:
> *really* think about using Filters here for the user permissions.
> I think you'd be surprised how quickly you could construct one,
> and depending upon how many users you have, you might be able
> to get some benefit from CachingWrapperFilter.
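
A minimal sketch of the Filter idea, assuming Lucene 2.4 and a single-token
permission field named "UserInitials" as described further down in the thread
(the class name and the per-user cache are illustrative, not from this message):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PermissionFilters {
    // one cached filter per user, so the underlying bit set is built only
    // once per IndexReader instead of on every search
    private final Map<String, Filter> cache = new HashMap<String, Filter>();

    public synchronized Filter filterFor(String userInitials) {
        Filter f = cache.get(userInitials);
        if (f == null) {
            Query permitted = new TermQuery(new Term("UserInitials", userInitials));
            f = new CachingWrapperFilter(new QueryWrapperFilter(permitted));
            cache.put(userInitials, f);
        }
        return f;
    }

    public TopDocs search(IndexSearcher searcher, Query userQuery, String userInitials)
            throws IOException {
        // the filter restricts the result set without influencing scoring
        return searcher.search(userQuery, filterFor(userInitials), 10);
    }
}

Because CachingWrapperFilter caches per reader, reusing the same Filter
instance for a given user avoids rebuilding the permission bit set on every
query.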
>
> And ignore my entire diatribe about whether you can restrict
> your wildcards to 3 or more characters; that approach doesn't
> fit your problem space at all....
>
> Best
> Erick
>
> On Fri, Feb 13, 2009 at 8:39 AM, d-fader <dfader@gmail.com> wrote:
>
>   
>> Well, it worked. I indexed a test database and it indeed grew somewhat
>> (from 16 MiB to 200 MiB :)), and it works flawlessly. Still, I can't use the
>> result in my application :)
>> The 'live' index contains about 2 million documents and is used by a
>> multi-user application. As you can probably imagine, not everyone may see
>> everything: some documents can be seen by everyone, some by a few people,
>> and some by only one person.
>> At design time, since we used the StandardAnalyzer, we decided to create a
>> field in each document in which we store the 'login name' of every user who
>> may see the document (2 to 4 characters per user, in most cases 2), and
>> that's where the hiccup occurs. When I index it with the NGramTokenFilter
>> (3-5) it doesn't seem to index anything with 2 letters. I checked in Luke
>> too: if I search for UserInitials:(JS BD), Luke's query explanation is
>> empty. When I search for UserInitials:(ABC) it seems to do the job well, but
>> when I search for DEFG, the query explanation looks like
>> UserAccessInitials:"def efg defg", which is unacceptable, since both a user
>> DEFG and a user EFG can exist in the system.
>>
>> So I think in my case it just won't work, unless I rewrite the 'who may see
>> this document' code pretty drastically, if that's even possible without
>> losing too much search speed.
>>
>> ...or am I wrong?
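
One way around the 2-letter problem, offered here only as a hedged suggestion
(it is not something decided in the thread): keep the n-gram analysis for the
text fields but route the permission field through an analyzer that leaves
whole tokens intact, e.g. with PerFieldAnalyzerWrapper in Lucene 2.4:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class AnalyzerSetup {
    public static Analyzer build(Analyzer ngramAnalyzer) {
        // text fields get the n-gram analyzer; the permission field keeps
        // whole tokens, so 2-letter initials like "JS" survive unchanged
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(ngramAnalyzer);
        wrapper.addAnalyzer("UserInitials", new WhitespaceAnalyzer());
        return wrapper;
    }
}

With the permission field left whole, a TermQuery on UserInitials:JS matches
exactly, and the n-gram behaviour of the other fields is unaffected.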
>>
>>
>> Karl Wettin wrote:
>>
>>     
>>> If you attach an NGramTokenFilter to your analyzer at index and query time,
>>> you should be able to query for parts of a word.
>>>
>>>
>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>>>
>>> http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
>>>
>>> The classes are available in the contrib/analyzers module.
>>>
>>> You might want to boost edges a bit more than inner parts; start by trying
>>> something like 3-5 grams.
>>>
>>> Be aware, this will produce a rather large index.
>>>
>>>
>>>      karl
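
A minimal sketch of such an analyzer, assuming Lucene 2.4 with the contrib
analyzers jar on the classpath (the class name is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class NGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // expand every token into all of its 3- to 5-character grams,
        // at index time and again at query time
        return new NGramTokenFilter(stream, 3, 5);
    }
}

The same analyzer must be used when parsing queries, otherwise the query terms
will not line up with the grams stored in the index.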
>>>
>>> On 13 Feb 2009, at 10:43, d-fader wrote:
>>>
>>>> Karl,
>>>>
>>>> As a matter of fact I more or less did. I'm not really into n-grams, but I
>>>> read some articles about the technique and eventually ended up at the
>>>> 'Did you mean: Lucene?' article written by Tom White. To make a long story
>>>> short, this solved my problem partially. I now have 2 indexes, and I've
>>>> written code that extracts all the terms a user entered, puts them through
>>>> the suggestion engine, and tries to be clever about which suggestion should
>>>> be used: stop words are ignored; when the entered term already occurs more
>>>> than x times in the index it's probably fine (and thus a suggestion is not
>>>> needed); and when suggestions are available, the one with the most
>>>> occurrences in the index is presented. After that the original query is
>>>> rebuilt, preserving all command codes (like ", ( ), AND, OR, etc.).
>>>> As said, this system works pretty well, and when a suggestion is available
>>>> it's usually quite accurate, so thanks for this.
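
A compact sketch of that suggestion flow, using the contrib SpellChecker
(which takes a similar n-gram approach) in Lucene 2.4; the field name
"contents" and the frequency threshold are illustrative assumptions, not the
exact code described above:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

public class DidYouMean {
    private final SpellChecker spellChecker;
    private final IndexReader mainReader;

    public DidYouMean(Directory spellIndexDir, IndexReader mainReader) throws IOException {
        this.mainReader = mainReader;
        this.spellChecker = new SpellChecker(spellIndexDir);
        // build the separate n-gram spell index from the terms of the main index
        spellChecker.indexDictionary(new LuceneDictionary(mainReader, "contents"));
    }

    public String suggest(String term) throws IOException {
        // if the entered term already occurs often enough, assume it is fine
        if (mainReader.docFreq(new Term("contents", term)) > 5) {
            return term;
        }
        // otherwise return the candidate with the most occurrences in the index
        String best = term;
        int bestFreq = 0;
        for (String candidate : spellChecker.suggestSimilar(term, 5)) {
            int freq = mainReader.docFreq(new Term("contents", candidate));
            if (freq > bestFreq) {
                best = candidate;
                bestFreq = freq;
            }
        }
        return best;
    }
}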
>>>>
>>>> Still, it doesn't fully solve my problem. But I think I now know why
>>>> Lucene can't do 'truly' partial searches. To find a document fast, every
>>>> term is stored with a list of the documents that contain it, and when a
>>>> user searches, Lucene identifies the documents by comparing the entered
>>>> terms against the terms in that list, right? If so, it's understandable
>>>> that a true partial search will never work, but then I just don't
>>>> understand how Google manages to do this :)
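
That is essentially how the inverted index works: a sorted dictionary of terms,
each pointing at the documents containing it. A rough illustration of why a
prefix can be walked cheaply while an arbitrary middle fragment cannot
(Lucene 2.4 API, field name illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class PrefixWalk {
    public static void listPrefix(IndexReader reader, String field, String prefix)
            throws IOException {
        // terms are stored in sorted order, so we can seek to the prefix
        // and stop at the first term that no longer starts with it
        TermEnum terms = reader.terms(new Term(field, prefix));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) {
                    break;
                }
                System.out.println(t.text() + " appears in " + terms.docFreq() + " docs");
            } while (terms.next());
        } finally {
            terms.close();
        }
    }
}

Matching a fragment in the middle of a term would instead require scanning the
whole term dictionary, which is why unconstrained leading-wildcard queries are
so much more expensive.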
>>>>
>>>> Jori.
>>>>
>>>>
>>>>
>>>>
>>>> Karl Wettin wrote:
>>>>
>>>>         
>>>>> Hi again Jori,
>>>>>
>>>>> did you try N-grams as suggested in the reply on -dev?
>>>>>
>>>>>
>>>>>    karl
>>>>>
>>>>> On 13 Feb 2009, at 09:05, d-fader wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've actually posted this message on the dev mailing list earlier,
>>>>>> because I thought my 'issue' was a limitation of the functionality of
>>>>>> Lucene, but they redirected me to this mailing list, so I hope one of
>>>>>> you guys can help me out :)
>>>>>>
>>>>>> Maybe the 'issue' I'm addressing now has been discussed thoroughly
>>>>>> already; in that case I think I need some redirection to the sources of
>>>>>> those discussions :) Anyway, here's the thing.
>>>>>> As far as I know it's impossible to search partial words with Lucene
>>>>>> (except the asterisk method with e.g. the StandardAnalyzer -> ambul* to
>>>>>> find ambulance). My problem with that method is that my index contains
>>>>>> quite a few terms. This means that if a user searches for 'ambu amster'
>>>>>> (ambulance amsterdam), there will be so many terms to search that the
>>>>>> waiting time is just unacceptable. Now I started wondering why it's
>>>>>> impossible to search only a 'part' of a term, or even only the 'start'
>>>>>> of a term, and the only reason I could think of was that the index terms
>>>>>> are stored tokenized (in that way you (of course) can't find partial
>>>>>> terms, since the index doesn't actually contain the literal terms, but
>>>>>> tokens instead). But Lucene can also store all terms untokenized, so in
>>>>>> that case, in my humble opinion, a partial search would be possible,
>>>>>> since all terms would be stored 'literally'.
>>>>>>
>>>>>> Maybe my thinking is wrong; I only have a black-box view of Lucene, so I
>>>>>> don't know much about the indexing algorithm, but I just want to know
>>>>>> whether this could be done, and if not, why not :) You see, the users of
>>>>>> my index want to know why they can't search for parts of the words they
>>>>>> enter, and I still can't give them a really good answer, except the 'it
>>>>>> would result in too many OR operators in the query' statement :). I've
>>>>>> tried using a Dutch stemmer (most of the data I'm indexing is Dutch) but
>>>>>> that didn't work out very well. Furthermore, users sometimes search for a
>>>>>> certain 'filename', and mostly they just enter part of the name and thus
>>>>>> don't find anything.
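
Since the thread settles on n-grams, a hedged sketch of the 'starts with'
variant for this filename case, using EdgeNGramTokenFilter from the Lucene 2.4
contrib analyzers (the class name is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PrefixAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // index only grams anchored at the front of each token, e.g.
        // "ambulance" -> "amb", "ambu", "ambul" (3 to 5 characters)
        return new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 3, 5);
    }
}

This lets a user type "ambu" and match "ambulance" without any wildcard
expansion at query time, at the cost of a larger index.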
>>>>>>
>>>>>> I hope someone can enlighten me :) Thanks in advance!
>>>>>>
>>>>>> Jori
>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

