lucene-java-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject double metaphone for misspellings
Date Thu, 18 Dec 2008 04:50:50 GMT
The Apache Commons Codec library includes a Double Metaphone
implementation. I ran a series of experiments in which I stored the
Double Metaphone representations of strings in the index itself, and
then searched those fields using the Double Metaphone version of the
search terms. This works very well. For my test rig, I added four
variants of a field to each document. The four variants were:
 
1) name-tokenized-doublemetaphone
2) name-tokenized
3) name-untokenized-doublemetaphone
4) name-untokenized
 
 
Here is the code where I added the four variants to the index:
 
    private void addProductNamesToDoc(Document poiDocument, IdentityType id) {
        DoubleMetaphone dm = new DoubleMetaphone();
        dm.setMaxCodeLen(100);
        // For each name in the list of names. A name can be
        // "SCHAAD FAMILY ALMONDS", for example.
        for (Object name : id.getNames().getPOIName()) {
            String text = ((POINameType) name).getText();
            if (log.isDebugEnabled()) log.debug(text);
            if (null != text) {
                // Tokenize manually. (Gosh, I thought the analyser would do this.)
                String[] splits = text.split("\\s");
                // Add tokenized double metaphone and plain tokenized variants of name.
                for (String component : splits) {
                    poiDocument.add(new Field("name-tokenized-doublemetaphone",
                            dm.doubleMetaphone(component),
                            Field.Store.YES, Field.Index.ANALYZED));
                    poiDocument.add(new Field("name-tokenized", component,
                            Field.Store.YES, Field.Index.ANALYZED));
                }
                // Add untokenized double metaphone and untokenized plain.
                poiDocument.add(new Field("name-untokenized-doublemetaphone",
                        dm.doubleMetaphone(text),
                        Field.Store.YES, Field.Index.ANALYZED));
                poiDocument.add(new Field("name-untokenized", text,
                        Field.Store.YES, Field.Index.ANALYZED));
            }
        }
    }
 
Testing misspelled terms with PhraseQuery shows that only the
name-tokenized-doublemetaphone field tolerates misspellings. So this
seems to be a nice and efficient way to accept inputs that are wildly
misspelled.
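
To illustrate why per-token phonetic codes tolerate misspellings while a single code for the whole string does not, here is a small dependency-free sketch. It is only an approximation of the setup above: it uses a simplified Soundex as a stand-in for Double Metaphone (it is not the Commons Codec algorithm), and a contiguous-sublist check as a stand-in for a Lucene PhraseQuery. The class and method names (PhoneticKeyDemo, soundex, phoneticTokens, phraseMatches) and the misspelled query are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PhoneticKeyDemo {
    // Soundex digit for each letter A..Z; '0' marks letters that carry no code.
    private static final String CODES = "01230120022455012623010202";

    /** Simplified American Soundex (a stand-in, NOT Double Metaphone). */
    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char last = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') continue; // H and W do not break a run of equal codes
            char d = CODES.charAt(c - 'A');
            if (d != '0' && d != last) out.append(d);
            last = d;
        }
        while (out.length() < 4) out.append('0'); // pad to letter + three digits
        return out.toString();
    }

    /** Splits on whitespace and encodes each token separately. */
    static List<String> phoneticTokens(String text) {
        List<String> codes = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String code = soundex(token);
            if (!code.isEmpty()) codes.add(code);
        }
        return codes;
    }

    /** True if the query's codes occur as a contiguous run in the name's codes. */
    static boolean phraseMatches(String name, String query) {
        return Collections.indexOfSubList(phoneticTokens(name), phoneticTokens(query)) >= 0;
    }

    public static void main(String[] args) {
        String name = "SCHAAD FAMILY ALMONDS";
        System.out.println(phoneticTokens(name));                      // [S300, F540, A455]
        System.out.println(phraseMatches(name, "SHAD FAMLY ALMUNDS")); // true
        // Encoding the whole string as one unit yields a single code that a
        // partial, misspelled query cannot reproduce.
        System.out.println(soundex(name).equals(soundex("FAMLY ALMUNDS"))); // false
    }
}
```

In this sketch each misspelled token collapses to the same code as the correctly spelled one, so the phrase check succeeds. The whole-string code depends on every word in the name, which may be one reason an untokenized phonetic field matches less readily.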
 
Can someone explain to me exactly what Field.Store.YES and
Field.Index.ANALYZED do? Should I tune these values?
 

Geoff Hendrey

Software Architect
deCarta
Four North Second Street, Suite 950
San Jose, CA  95113
office 408.625.3522
www.decarta.com

 

 
