lucene-dev mailing list archives

From Varun Dhussa <va...@mapmyindia.com>
Subject Re: Fuzzy search change
Date Sat, 20 Jun 2009 06:18:13 GMT
Hi,

I can port the code to Java. I do not know the Lucene file structures
etc. as of now, so if someone with experience in storing and indexing
trigrams can work on that part, I can port the rest of the code.

Regards

Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com



Michael McCandless wrote:
> This would make an awesome addition to Lucene!
>
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
>
> Would you be able to port it to java?
>
> Mike
>
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<varun@mapmyindia.com> wrote:
>   
>> Hi,
>>
>> I wrote about this a long time ago, but haven't followed up on it. I just finished
>> a C++ implementation of a spell check module in my software. I borrowed the
>> idea from Xapian. It is to use a trigram index to filter results, and then
>> use Edit Distance on the filtered set. Would such a solution be acceptable
>> to the Lucene Community? The details of my implementation are as follows:
>>
>> 1) QDBM data store hash map
>> 2) Trigram tokenizer on the input string
>> 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN>)
>> 4) Use trigram tokenizer and match with the trigram index
>> 5) Get the IDs within the input cutoff
>> 6) Run Edit Distance on the list and return
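
As a starting point for a Java port, the six steps above can be sketched
roughly as follows. This is a minimal sketch under stated assumptions, not
the C++ implementation: an in-memory HashMap stands in for the QDBM store,
the names TrigramFuzzyIndex, add, search, minShared and maxDistance are
hypothetical, the '$' padding of word edges is an assumption, and the
"input cutoff" is read here as a minimum number of shared trigrams.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names are Lucene APIs.
class TrigramFuzzyIndex {
    private final List<String> keywords = new ArrayList<>();
    // Steps 1 and 3: trigram -> IDs of the keywords containing it (HashMap instead of QDBM).
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Step 2: distinct trigrams of a string. Assumption: pad with '$' so word edges count.
    static Set<String> trigrams(String s) {
        String padded = "$$" + s + "$$";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    // Assigns the next keyword ID and records it under each of the keyword's trigrams.
    void add(String keyword) {
        int id = keywords.size();
        keywords.add(keyword);
        for (String gram : trigrams(keyword)) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    // Steps 4-6: count shared trigrams per keyword, apply the cutoff, then verify by edit distance.
    List<String> search(String query, int minShared, int maxDistance) {
        Map<Integer, Integer> shared = new HashMap<>();
        for (String gram : trigrams(query)) {
            for (int id : postings.getOrDefault(gram, Collections.<Integer>emptySet())) {
                shared.merge(id, 1, Integer::sum);
            }
        }
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : shared.entrySet()) {
            String candidate = keywords.get(e.getKey());
            if (e.getValue() >= minShared && editDistance(query, candidate) <= maxDistance) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    // Levenshtein distance, keeping only two rows of the dynamic-programming table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the keywords lucene, lucent, xapian and search added,
search("lucine", 2, 1) returns only lucene: lucent also passes the trigram
cutoff but is at edit distance 2, so the final check drops it.
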
>>
>> In my tests on an Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
>> runs in <0.5 sec with a keyword record count of about 1,000,000 records.
>> This is at least 3-4 times faster than the current search times on Lucene.
>>
>> Since the results can be put in a thread safe hash table structure, the
>> trigram search can be distributed over a thread pool also.
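
One possible way to spread the trigram lookups over a thread pool in Java
is sketched below. It is an assumption about how this could be done, not a
description of the C++ code: ParallelTrigramLookup and countShared are
hypothetical names, the postings map is assumed to be read-only during the
search, and a ConcurrentHashMap plays the role of the thread-safe hash
table that collects the per-keyword counts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names are Lucene APIs.
class ParallelTrigramLookup {
    // Returns keyword ID -> number of query trigrams that keyword shares.
    static Map<Integer, Integer> countShared(Map<String, Set<Integer>> postings,
                                             Set<String> queryTrigrams,
                                             int threads) {
        ConcurrentHashMap<Integer, Integer> shared = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            // One task per query trigram; each task only reads the postings map.
            for (String gram : queryTrigrams) {
                pending.add(pool.submit(() -> {
                    for (int id : postings.getOrDefault(gram, Collections.<Integer>emptySet())) {
                        shared.merge(id, 1, Integer::sum); // atomic per-key update
                    }
                }));
            }
            // Wait for every lookup to finish before returning the counts.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("trigram lookup interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("trigram lookup failed", e);
        } finally {
            pool.shutdown();
        }
        return shared;
    }
}
```

The cutoff and the Edit Distance pass would then run on the returned
counts exactly as in the single-threaded case. Whether the extra threads
pay off would need measuring, since a query only has a handful of trigrams.
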
>>
>> Does this seem like a workable suggestion to the community?
>>
>> Regards
>>
>> --
>> Varun Dhussa
>> Product Architect
>> CE InfoSystems (P) Ltd
>> http://www.mapmyindia.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>     



