lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject DoubleMetaphoneQuery
Date Fri, 19 Dec 2003 19:51:44 GMT

I've seen discussions about using the double metaphone algorithm with 
Lucene (basically: like soundex, used to find words that sound similar, 
in English at least) but couldn't find an implementation, so I spent 
a few minutes and wrote a Query and a TermEnum object for this. I may 
have missed the prior art, so sorry if I did...

[1] Here are some mail msgs that mention double metaphone wrt Lucene:

http://www.geocrawler.com/archives/3/2626/2000/10/0/4566951/
http://www.geocrawler.com/archives/3/2626/2001/8/50/6382300/
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg04648.html

[2] And Phonetix has a double metaphone Analyzer, but not a Query, which 
I guess is another angle on things:

http://www.tangentum.biz/en/products/phonetix/api/com/tangentum/phonetix/lucene/PhoneticAnalyzer.html


[3] Attached are 2 files (DoubleMetaphoneQuery and 
DoubleMetaphoneTermEnum) that I think are valid contributions
to the Lucene Sandbox. Hopefully all that has to be done is to change 
the package line if the powers that be accept this.

Note: My impl uses the Jakarta Commons Codec package ( 
http://jakarta.apache.org/commons/codec/ ) for the double metaphone 
algorithm implementation.

Also, any query expansion such as this can produce more clauses than a 
BooleanQuery allows by default, so BooleanQuery.setMaxClauseCount
may need to be called to raise the limit and avoid an exception.
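
To make the two notes above concrete, here is a minimal, self-contained 
sketch of the general idea: encode the query term, walk the index terms, 
and keep the ones whose encoding matches. It is not the attached code. The 
class PhoneticExpansion, the method names, and the crudeKey encoder are 
all hypothetical; crudeKey is a deliberately crude stand-in and is NOT 
double metaphone (the real implementation would plug in the Commons Codec 
encoder instead). The maxClauses check only imitates the kind of limit 
that BooleanQuery enforces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class PhoneticExpansion {

    /** Crude stand-in for a phonetic encoder (assumption; NOT double metaphone). */
    public static String crudeKey(String word) {
        StringBuilder key = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (!Character.isLetter(c)) continue;
            if (c == 'C' || c == 'Q') c = 'K';        // treat c/q like k
            if (c == 'Z') c = 'S';                    // treat z like s
            boolean vowel = "AEIOUY".indexOf(c) >= 0;
            if (vowel && key.length() > 0) continue;  // keep only a leading vowel
            if (key.length() > 0 && key.charAt(key.length() - 1) == c) continue; // collapse repeats
            key.append(c);
        }
        return key.toString();
    }

    /**
     * Returns the index terms whose encoding equals that of queryTerm.
     * Throws if more than maxClauses terms match, mimicking a clause limit.
     */
    public static List<String> expand(String queryTerm, Iterable<String> indexTerms,
                                      UnaryOperator<String> encoder, int maxClauses) {
        String wanted = encoder.apply(queryTerm);
        List<String> matches = new ArrayList<>();
        for (String term : indexTerms) {
            if (encoder.apply(term).equals(wanted)) {
                if (matches.size() == maxClauses) {
                    throw new IllegalStateException(
                        "expansion exceeds " + maxClauses + " clauses");
                }
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("packet", "protocol", "protocols", "router");
        // prints [protocol]
        System.out.println(expand("protokal", terms, PhoneticExpansion::crudeKey, 1024));
    }
}
```

Each matching term would become one clause of the expanded query, which is 
why a common phonetic key over a large index can run into the clause limit.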

[4] I've updated my Lucene demo site which has the ~3500 RFCs indexed 
and searchable by Lucene. I added an "advanced query"
page to try out the DoubleMetaphoneQuery:

It's a few lines down at this URL:

http://www.hostmon.com/rfc/advanced.jsp


[5] Most of the above is redundantly stated here as a kind of perma-link:

http://www.tropo.com/techno/java/lucene/metaphone.html

[6]

While it's easy to write additional Query classes, I suspect they are a 
kind of dead end and won't really be used unless they are integrated 
into the QueryParser. Thus one concept is that the Lucene syntax should 
have some extension mechanism, so you can pass a query like 
"metaphone::protokal" to it and "metaphone::" (note the double colons) 
would mean to use DoubleMetaphoneQuery for this term. Maybe an 
extensible query parser should be the subject of another email?
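
As a rough illustration of the "metaphone::" idea, the sketch below 
dispatches on a double-colon prefix through a small registry. It is an 
assumption about how such an extension point could look, not part of 
Lucene's QueryParser: the class PrefixedTermParser and its methods are 
hypothetical, and the handlers return plain strings where a real parser 
would build Query objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical prefix registry; Q stands in for whatever query type is built. */
public class PrefixedTermParser<Q> {
    private final Map<String, Function<String, Q>> handlers = new HashMap<>();
    private final Function<String, Q> fallback;

    /** fallback builds the query for tokens with no registered prefix. */
    public PrefixedTermParser(Function<String, Q> fallback) {
        this.fallback = fallback;
    }

    /** Registers a handler for tokens of the form prefix::term. */
    public void register(String prefix, Function<String, Q> handler) {
        handlers.put(prefix, handler);
    }

    /** Splits on the first "::" and dispatches; unknown prefixes fall through. */
    public Q parse(String token) {
        int sep = token.indexOf("::");
        if (sep > 0) {
            Function<String, Q> handler = handlers.get(token.substring(0, sep));
            if (handler != null) {
                return handler.apply(token.substring(sep + 2));
            }
        }
        return fallback.apply(token);
    }

    public static void main(String[] args) {
        PrefixedTermParser<String> parser =
            new PrefixedTermParser<>(t -> "TermQuery(" + t + ")");
        parser.register("metaphone", t -> "DoubleMetaphoneQuery(" + t + ")");
        System.out.println(parser.parse("metaphone::protokal")); // DoubleMetaphoneQuery(protokal)
        System.out.println(parser.parse("protocol"));            // TermQuery(protocol)
    }
}
```

Because the separator is a double colon, a single-colon field query such 
as "title:foo" is left alone and goes to the fallback unchanged.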

So: let me know if this is useful and plz enter it into the sandbox...

thx,
 Dave Spencer









