lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Varun Dhussa <va...@mapmyindia.com>
Subject Fuzzy search change
Date Thu, 18 Jun 2009 11:12:23 GMT
Hi,

I wrote on this a long time ago, but haven't followed it up. I just 
finished a C++ implementation of a spell check module in my software. I 
borrowed the idea from Xapian. It is to use a trigram index to filter 
results, and then use Edit Distance on the filtered set. Would such a 
solution be acceptable to the Lucene Community? The details of my 
implementation are as follows:

1) QDBM data store hash map
2) Trigram tokenizer on the input string
3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
4) Use trigram tokenizer and match with the trigram index
5) Get the IDs within the input cutoff
6) Run Edit Distance on the list and return

In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, 
it runs in <0.5 sec with a keyword record count of about 1,000,000 
records. This is at least 3-4 times less than the current search times 
on Lucene.

Since the results can be put in a thread safe hash table structure, the 
trigram search can be distributed over a thread pool also.

Does this seem like a workable suggestion to the community?

Regards

-- 
Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message