From: Will Johnson
Subject: RE: Postcode/zipcode search
Date: Tue, 6 May 2008 12:53:38 -0400

You could split the field into two separate fields:

Postcode:NW10 7NY -> post1:NW10 post2:7NY

Then rewrite users' queries using the same logic: if they enter one term, 'NW10', it gets rewritten to post1:NW10; if they enter two terms, it gets rewritten to post1:NW10 AND post2:7NY. It also lets you do fuzzy matching with wildcards, e.g. post1:NW10 post2:7?Y, and so on.

- will

-----Original Message-----
From: Chris Mannion [mailto:chris.mannion@nonstopgov.com]
Sent: Tuesday, May 06, 2008 12:28 PM
To: java-user@lucene.apache.org
Subject: Postcode/zipcode search

Hi all

I've got a bit of a niggling problem with how one of my searches is working as opposed to how my users would like it to work. We're indexing on UK postcodes, which are in the format of a 3 or 4 character area code followed by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
We originally had the values indexed as tokenized and used a very simple search string in the format "postcode:xxx xxx", with no grouping, boosting or fuzzy searching, just a straight search on whatever the user entered. This had the benefit of finding exact matches and of letting us search on just the area part of the code to return all records with that area code, e.g. a search on "NW2" returned anything starting with NW2, like "NW2 6TB", "NW2 1ER" etc. However, the downside was that searches could also return records only tenuously related to what was searched for, e.g. a search for "NW10 7NY" would also return a record with the postcode "SE9 6NY" because of the slight match on the "NY". Obviously this was technically correct, but users complained because their searches were returning records from completely different areas.

Our first step to put this right was to take off the tokenization of the field, which we also weren't happy with, so we have continued to fiddle.

The current status is as follows: we index the values by stripping out spaces and tokenizing them, using a KeywordAnalyzer. When searching, we also strip spaces from the search term entered and search with a KeywordAnalyzer. Searches for full postcodes, e.g. "NW10 7NY", find all exact matches but also any full values that are partial matches (e.g. some records just have "NW10" as their postcode field and the "NW10 7NY" search pulls them back too). Searches for partial postcodes, e.g. "NW10", still only find exact matches: they only pull back those records that have just "NW10" as their postcode, rather than anything *starting* with NW10 as we'd like.

Can anyone help me get this working in the way we need it to, please?
--
Chris Mannion
iCasework and LocalAlert implementation team
0208 144 4416
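
For illustration, the two-field idea from the reply in this thread (index the area part as post1 and the street part as post2, then rewrite user input the same way) could be sketched roughly as follows. This is only a sketch under stated assumptions: the class and method names (PostcodeQuery, split, rewrite) are hypothetical, it assumes both fields are indexed untokenized in upper case, and it only builds a query string in Lucene query syntax rather than calling any Lucene API.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene. The field names post1/post2
// come from the reply in this thread; everything else is illustrative.
final class PostcodeQuery {

    /** Splits "NW10 7NY" into {"NW10", "7NY"}; a lone area code gives one part. */
    static String[] split(String raw) {
        // Assumption: values are indexed in upper case, so normalise the input.
        String trimmed = raw.trim().toUpperCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    /** Rewrites user input into a query string over the two fields. */
    static String rewrite(String userInput) {
        String[] parts = split(userInput);
        switch (parts.length) {
            case 1:
                // Area only: matches every record in that area.
                return "post1:" + parts[0];
            case 2:
                // Full postcode: both parts must match.
                return "post1:" + parts[0] + " AND post2:" + parts[1];
            default:
                throw new IllegalArgumentException(
                        "expected one or two terms, got " + parts.length);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("NW10"));
        System.out.println(rewrite("nw10 7ny"));
        System.out.println(rewrite("NW10 7?Y")); // wildcard is passed through unchanged
    }
}
```

The same split would be applied at index time so that each record gets a post1 value and, where present, a post2 value. Input typed without a space (e.g. "NW107NY") is not handled here and would need its own rule.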