Date: Mon, 20 Dec 2010 00:35:09 +0300
Subject: Re: Custom scoring for searhing geographic objects
From: Alexey Serba
To: java-user@lucene.apache.org
Cc: solr-user@lucene.apache.org

Hi Pavel,

I had a similar problem several years ago: I had to find geographical locations in textual descriptions, geocode them to lat/long during indexing, and let users filter/sort search results by specific geographical areas.

The important issue was that there were several types of geographical objects: street < town < region < country. The idea was to geocode to the narrowest geographical area possible. Relevance logic in this case could be specified as "find the narrowest result that is uniquely identified by your text or search query". So I came up with a custom algorithm that was quite good in terms of performance and precision/recall. Here's a simple description:

* Intersect all text/search query terms with a locations dictionary to keep only the geo terms.
* Search your locations Lucene index and filter to street objects only (the narrowest areas). Thanks to the tf*idf formula you'll get the most relevant results first. Then post-process the top N (3/5/10) results and verify that they really are matches. I intersected the search terms with each result's terms and ran another Lucene search to verify that these terms uniquely identify the match.
  If they do, return the matching street. If there is no match, repeat the same algorithm with towns, then regions, then countries.

HTH,
Alexey

On Wed, Dec 15, 2010 at 6:28 PM, Pavel Minchenkov wrote:
> Hi,
> Please give me advice on how to create custom scoring. I need the result
> documents to be ordered by how popular each term in the document is
> (popular = how many times it appears in the index) and by the length of the
> document (fewer terms = higher in the search results).
>
> For example, the index contains the following data:
>
> ID    | SEARCH_FIELD
> ------------------------------
> 1     | Russia
> 2     | Russia, Moscow
> 3     | Russia, Volgograd
> 4     | Russia, Ivanovo
> 5     | Russia, Ivanovo, Altayskaya street 45
> 6     | Russia, Moscow, Kremlin
> 7     | Russia, Moscow, Altayskaya street
> 8     | Russia, Moscow, Altayskaya street 15
> 9     | Russia, Moscow, Altayskaya street 15/26
>
>
> And I should get the following results:
>
>
> Query                | Document result set
> ----------------------------------------------
> Russia               | 1,2,4,3,6,7,8,9,5
> Moscow               | 2,6,7,8,9
> Ivanovo              | 4,5
> Altayskaya           | 7,8,9,5
>
> In fact, it is a search for geographic objects (cities, streets, houses).
> Only part of the address may be given, and the most relevant results
> should appear first.
>
> Thanks.
> --
> Pavel Minchenkov
>
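
For illustration, the narrowest-first lookup described in the reply at the top of this message could be sketched roughly as below. This is only a minimal in-memory sketch in Python, not the original implementation and not Lucene code. The `Location` record, the sample `LOCATIONS` data, and all function names are hypothetical. A shared-term count stands in for the tf*idf ranking. The uniqueness check is one possible reading of "verify that these terms uniquely identify the match": the shared terms must match exactly one location at the current level or any broader one.

```python
import re
from collections import namedtuple

# Hypothetical record; the reply keeps these in a Lucene index instead.
Location = namedtuple("Location", ["name", "level", "terms"])

# Narrowest level first, as in "street < town < region < country".
LEVELS = ["street", "town", "region", "country"]

# Made-up sample data, loosely based on the addresses in the quoted question.
LOCATIONS = [
    Location("Altayskaya street, Moscow", "street", frozenset({"altayskaya", "moscow"})),
    Location("Altayskaya street, Ivanovo", "street", frozenset({"altayskaya", "ivanovo"})),
    Location("Moscow", "town", frozenset({"moscow"})),
    Location("Ivanovo", "town", frozenset({"ivanovo"})),
    Location("Russia", "country", frozenset({"russia"})),
]


def geo_terms(text, locations):
    """Step 1: keep only the terms that occur in the locations dictionary."""
    dictionary = set().union(*(loc.terms for loc in locations))
    return {t for t in re.findall(r"\w+", text.lower()) if t in dictionary}


def top_candidates(terms, locations, level, n):
    """Step 2: rank locations of one level; shared-term count replaces tf*idf."""
    scored = [(len(terms & loc.terms), loc) for loc in locations if loc.level == level]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [loc for _, loc in scored[:n]]


def uniquely_identifies(shared, candidate, locations):
    """Step 3 (assumed reading): the shared terms must match only this candidate
    among locations at its own level or broader."""
    scope = LEVELS[LEVELS.index(candidate.level):]
    hits = [loc for loc in locations if loc.level in scope and shared <= loc.terms]
    return len(hits) == 1 and hits[0] == candidate


def geocode(text, locations, n=5):
    """Return the narrowest location uniquely identified by the text, or None."""
    terms = geo_terms(text, locations)
    for level in LEVELS:
        for candidate in top_candidates(terms, locations, level, n):
            if uniquely_identifies(terms & candidate.terms, candidate, locations):
                return candidate
    return None  # nothing matched at any level
```

With this sample data, `geocode("flat on Altayskaya street in Moscow", LOCATIONS)` resolves to the Moscow street, `geocode("office in Moscow, Russia", LOCATIONS)` falls back to the town, and a bare `"Altayskaya"` returns `None` because two streets share that name.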