lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Postcode/zipcode search
Date Tue, 06 May 2008 16:38:28 GMT
You might have a look at using a phrase query when you have more than  
one term in the query in addition to your term query, but giving the  
phrase query more weight (i.e. give an exact match more weight) and  
keep your original tokenization process.

Something like:
"NW10 7NY"^5 OR NW10 OR 7NY

or even downweighting the individual terms.  Thus, exact matches on  
the full phrase will weigh much higher, and you can still do  
individual term matching for the single term case (NW10)

-Grant

On May 6, 2008, at 12:28 PM, Chris Mannion wrote:

> Hi all
>
> I've got a bit of a niggling problem with how one of my searches is  
> working
> as opposed to how my users would like it too work.  We're indexing  
> on UK
> postcodes, which are in the format of a 3 or 4 character area code  
> followed
> by a 3 or 4 character street specific code, e.g. "NW10 7NY" or "M11  
> 1LQ".
> We originally had the values being indexed as tokenized and used a  
> very
> simple search string in the format "postcode:xxx xxx", with no  
> grouping or
> boosting or fuzzy searching, just an straight search on whatever the  
> user
> answered.  This had the benefit of finding exact matches to searches  
> and
> allowing us to search just on the area part of the code to return all
> records with that area code, eg a search on "NW2" returning anything
> starting NW2, like "NW2 6TB", "NW2 1ER" etc etc.
>
> However, the downside to that was that searches could also return  
> records
> only tenuously related to what was searched for, eg. a search for  
> "NW10 7NY"
> would also return a record with a postcode "SE9 6NY" because of the  
> slight
> match of the "NY".  Obviously this was technically correct but users
> complained because their searches were returning records from  
> completely
> different areas.  Our first step to put this right was to take off the
> tokenization of the field, which we also weren't happy with so have
> continued to fiddle.
>
> The current status is as follows - we index the values by stripping  
> out
> spaces and tokeniing them and use a keywordAnalyzer.  In searching  
> we also
> strip spaces from the search term entered and search with a
> keywordAnalyzer.  Searches for full postcodes, e.g. "NW10 7NY" find  
> all
> exact matches but also any full values that are partial matches  
> (e.g. some
> records just have "NW10" as their postcode field and the "NW10 7NY"  
> search
> pulls them back too), but searches for partial postcodes e.g. "NW10"  
> still
> only finds exact matches, e.g. it only pulls back those record that  
> have
> just "NW10" as their postcode, rather than anything *starting* with  
> NW10 as
> we'd like it to do.
>
> Can anyone help me get this working in the way we need it too please?
>
> -- 
> Chris Mannion
> iCasework and LocalAlert implementation team
> 0208 144 4416

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message