From: Karl Wettin
To: java-user@lucene.apache.org
Subject: Re: Exact match on entire field
Date: Wed, 6 May 2009 18:03:59 +0200

You should probably tell us why you need this functionality.

Provided you only load the stored comparison field for the first hit, it doesn't really have to be that expensive. If you know that the first hit was not a perfect match, then you know that any matching document with a lower score isn't a perfect match either. Stemming etc. could however mess things up for you.

There is nothing in Lucene that tells you whether the query yielded a perfect match or not, only how much greater precision one hit has compared to another. Depending on your needs and your corpus, it's possible to use this information to solve the problem. You could try to find a delta score threshold that tells you where perfect matches begin and end in the results. With some luck the length normalization built into Lucene is enough to find this.
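A minimal sketch of the delta score threshold idea, assuming the hit scores have already been collected into an array sorted from best to worst. The class name, the method name, the minDelta parameter and the sample scores are all made up for illustration; nothing here is Lucene API, and whether the largest gap really separates perfect matches from the rest depends on your corpus.

```java
import java.util.Arrays;

// Illustration only: hypothetical helper, not part of Lucene.
final class ScoreGapSketch {

    /**
     * Returns how many leading hits sit above the largest drop between two
     * consecutive scores, or 0 if no drop is at least minDelta.
     * The scores must be sorted in descending order.
     */
    static int hitsAboveLargestGap(float[] scores, float minDelta) {
        int cut = 0;
        float best = 0f;
        for (int i = 1; i < scores.length; i++) {
            float delta = scores[i - 1] - scores[i];
            // Keep the first position with the largest qualifying drop.
            if (delta >= minDelta && delta > best) {
                best = delta;
                cut = i;
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // Made-up scores; real values would come from the search results.
        float[] scores = {2.10f, 2.05f, 0.90f, 0.85f, 0.40f};
        int cut = hitsAboveLargestGap(scores, 0.5f);
        System.out.println("Candidate perfect matches: "
                + Arrays.toString(Arrays.copyOf(scores, cut)));
    }
}
```

The hits above the cut are only candidates. Comparing the stored field of the first hit, as described above, is still the way to confirm them.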
If the score threshold is not enough, you can look at more expensive solutions that increase the score of perfect matches by adding BOL and EOL token markers to your index and to a (0-slop) query:

index:
  "^", "bloemendaal", "$"
  "^", "adele", "bloemendaal", "$"

query:
  ("bloemendaal") OR ("^", "bloemendaal") OR ("bloemendaal", "$") OR ("^", "bloemendaal", "$")

You could use either span queries or shingles, and you'll probably have to fiddle around with boosts on the clauses. Be aware that it's rather expensive to search for tokens that exist in all documents, so it's probably a lot speedier to use shingles and skip the single BOL/EOL tokens in the index that span queries require. But shingles will make your index explode in size. And lots of BOL/EOL tokens can mess with the idf(t).

There has been a bit of talk about adding functionality to retrieve which queries matched a specific document. If this was in place you could simply check whether the ("^", "bloemendaal", "$") clause matched, and you would know it was a perfect match. At the current rate such a patch might be available a few months from now. You are of course more than welcome to implement and contribute such a patch if you have the time.

I hope this helped,

karl

On 6 May 2009, at 10:50, Laura Hollink wrote:

> Hi,
>
> I am trying to distinguish between a document that matches the query
> because the query *appears* in one of the fields, and a document
> that matches the query because the query equals the complete field.
> I do want to use an Analyzer for case- and punctuation
> normalization. For example:
>
> The query "bloemendaal" matches the complete field "Bloemendaal" in
> a document in my result list.
> The query "adele" only partly matches the field "Adele Bloemendaal"
> in another document.
>
> What is the best way to do this?
>
> I currently solve it by first searching in a normal way, and then
> using the QueryParser on both the query and the relevant field in
> the documents in my result list.
> Finally, I simply compare the
> parsed query and the parsed field.
>
> QueryParser parser = new QueryParser(field, new StandardAnalyzer());
> Query query = parser.parse(q);
> Hits hits = is.search(query);
> ...
> Document doc = hits.doc(i);
> Query myfield = parser.parse(doc.get("skos:prefLabel"));
> if (myfield.equals(query)) System.out.println("Query exactly matches the entire field.");
> else System.out.println("The field contains the query.");
>
> Is there a better way?
>
> Thanks,
> Laura
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
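
For reference, the BOL/EOL marker idea from the reply can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: analyze() is a crude stand-in for a real Analyzer (lower-casing and splitting on non-alphanumerics), the class and method names are hypothetical, and a real index would express the phrase check with span queries or shingles rather than list comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical helper, not part of Lucene.
final class BoundaryMarkerSketch {
    static final String BOL = "^";
    static final String EOL = "$";

    /** Stand-in for an Analyzer: lower-case, split on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Tokens as they would be indexed: "^", the analyzed tokens, "$". */
    static List<String> withMarkers(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(BOL);
        tokens.addAll(analyze(text));
        tokens.add(EOL);
        return tokens;
    }

    /** 0-slop phrase match: the phrase occurs as a contiguous run of tokens in the field. */
    static boolean containsPhrase(List<String> field, List<String> phrase) {
        return Collections.indexOfSubList(field, phrase) >= 0;
    }

    /** True only when the ("^", query tokens, "$") clause matches the marked field. */
    static boolean isExactFieldMatch(String fieldValue, String query) {
        return containsPhrase(withMarkers(fieldValue), withMarkers(query));
    }

    public static void main(String[] args) {
        for (String field : Arrays.asList("Bloemendaal", "Adele Bloemendaal")) {
            List<String> indexed = withMarkers(field);
            System.out.println(indexed
                    + " contains bloemendaal: " + containsPhrase(indexed, analyze("bloemendaal"))
                    + ", exact: " + isExactFieldMatch(field, "bloemendaal"));
        }
    }
}
```

The exact-match check corresponds to the ("^", "bloemendaal", "$") clause, and the plain containsPhrase call to the ("bloemendaal") clause. Boosts, idf effects and index size are not modelled here.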