From: Thomas Rewig
Organization: mufin
Date: Tue, 19 Jul 2011 09:37:12 +0200
To: java-user@lucene.apache.org
Subject: Re: TermQuery - ExactMatching, Lucene 3.1.0 vs. 3.3.0, special character behavior
Message-ID: <4E253428.8080402@mufin.com>
References: <4E204844.6020502@mufin.com> <4E24056F.6090908@mufin.com> <002401cc4533$4e4d7710$eae86530$@thetaphi.de>
In-Reply-To: <002401cc4533$4e4d7710$eae86530$@thetaphi.de>

Hi Uwe,

the docIds are the internal Lucene ones, and each index was built with its corresponding Lucene version.

So if I sort with e.g. 'Sort.INDEXORDER', and I understand the sorting principle correctly, the following result is correct because the 'aim溝脇しほみ' term comes first in the index:

0 Score=12,2324 Doc.Id=8060 id=709579 name=aim溝脇しほみ
1 Score=12,2324 Doc.Id=227606 id=716893 name=aim

To avoid these problems right from the start, do I need to use a different analyzer for indexing, so that the docs 'aim溝脇しほみ' and 'aim' get different scores?

Thanks
Thomas

On 18.07.2011 12:13, Uwe Schindler wrote:
> Hi Thomas,
>
> Just one question: Are these docIds from Lucene or your own ones? And second, are the underlying indexes also built with the corresponding Lucene versions?
>
> The reason I ask: nothing in Lucene guarantees the order of docIds for equal scores; it can be arbitrary. One change in Lucene 3.3, for example, is the use of TieredMergePolicy, which reorders documents during indexing for more efficient merging. So if you also indexed with Lucene 3.3 and the displayed document IDs are your own application-specific ones (not the internal Lucene ones), the different order of search results can simply be caused by the fact that the indexer in 3.3 may reorder the documents during merging (TieredMergePolicy). There is nothing wrong with that.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Thomas Rewig [mailto:trewig@mufin.com]
>> Sent: Monday, July 18, 2011 12:06 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: TermQuery - ExactMatching, Lucene 3.1.0 vs. 3.3.0, special character behavior
>>
>> Hi Ian,
>>
>> yes, the scores are identical, but the relative order of documents with the same score seems to differ between the versions.
>>
>> In Lucene 3.3.0 it seems that terms with special characters are ranked before the exact hit.
>>
>> My code is:
>>
>> PhraseQuery query = new PhraseQuery();
>> query.add(new Term("name", strQueryName));
>>
>> //topDocs = this.indexSeacher.search(query, 10);
>> //topDocs = this.indexSeacher.search(query, 10, Sort.RELEVANCE);
>> topDocs = this.indexSeacher.search(query, 10, Sort.INDEXORDER);
>>
>> All variants show similar ordering problems, even though they do not always occur for the same query.
>> E.g. if I sort by Sort.RELEVANCE, the "queen" doc problem doesn't occur, but the hits for the token aim (query name:aim) are ordered wrongly:
>>
>> 0 Score=12,2324 Doc.Id=8060 id=709579 name=aim溝脇しほみ
>> 1 Score=12,2324 Doc.Id=227606 id=716893 name=aim
>>
>> Is there a way to guarantee the relative order of documents with the same score? Or how can I prevent documents with special characters from getting the same score as documents that match exactly?
>>
>> Thanks in advance!
>> Thomas
>>
>> On 18.07.2011 10:08, Ian Lea wrote:
>>> I'm not sure what you are getting at. A search using 3.1.0 and 3.3.0
>>> returns the same docs with identical scores, except that one gives
>>> them in order A,B and the other in order B,A? What search method are
>>> you using? Does it guarantee anything about the order of returning
>>> docs with identical scores?
>>>
>>> --
>>> Ian.
>>>
>>> On Fri, Jul 15, 2011 at 3:01 PM, Thomas Rewig wrote:
>>>> Hello,
>>>>
>>>> there is an index with a lot of docs; two of them are:
>>>>
>>>> doc1:
>>>>
>>>> 1.Field=id ITSVopfOLB=ITS---f0-- Value= 192
>>>> 2.Field=name ITSVopfOLB=ITS----0-- Value= queen
>>>>
>>>> doc2:
>>>>
>>>> 1.Field=id ITSVopfOLB=ITS---f0-- Value= 701492
>>>> 2.Field=name ITSVopfOLB=ITS----0-- Value= queen板野友美 (here are Chinese characters - hopefully you can see them)
>>>>
>>>> If I search the index with a TermQuery, Lucene 3.1.0 and 3.3.0 behave differently:
>>>>
>>>> Query:
>>>>
>>>> Term:field='name' text='queen'
>>>>
>>>> Result Lucene 3.1.0:
>>>>
>>>> 0 Score=13,2132 Doc.Id=176002 id=192 name=queen
>>>> 1 Score=13,2132 Doc.Id=523407 id=701492 name=queen板野友美
>>>>
>>>> Result Lucene 3.3.0:
>>>>
>>>> 0 Score=13,2132 Doc.Id=523407 id=701492 name=queen板野友美
>>>> 1 Score=13,2132 Doc.Id=176002 id=192 name=queen
>>>>
>>>> The result from Lucene 3.1.0 is what I would expect from an 'exact matching' TermQuery.
>>>> Each index was built with its associated Lucene version.
>>>> I tested it with Luke and with my own code - the result was always the same.
>>>> Is this a new feature in Lucene 3.3.0 or a bug?
>>>>
>>>> Thanks in advance!
>>>> Thomas
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
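
Editor's note: the thread turns on Uwe's point that nothing guarantees the order of hits with equal scores. One way to get a stable order is to apply an explicit tie-break. The following is only a minimal sketch in plain Java, not Lucene API: the class TieBreakSketch, the holder Hit and the method rank are hypothetical names introduced here for illustration. It assumes the hits have already been fetched and that the application wants exact name matches ahead of other hits with the same score, with the docId as the last tie-break. Inside Lucene the same idea would typically be expressed with a Sort built from several SortFields, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {

    /** Hypothetical holder for one search hit; this is not a Lucene class. */
    static final class Hit {
        final float score;
        final int docId;
        final String name;

        Hit(float score, int docId, String name) {
            this.score = score;
            this.docId = docId;
            this.name = name;
        }
    }

    /**
     * Returns a sorted copy of the hits: highest score first. Hits with equal
     * scores are ordered so that names equal to queryText come first, and any
     * remaining ties are resolved by ascending docId.
     */
    static List<Hit> rank(List<Hit> hits, final String queryText) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        sorted.sort(
            Comparator.comparingDouble((Hit h) -> -h.score)
                .thenComparingInt((Hit h) -> h.name.equals(queryText) ? 0 : 1)
                .thenComparingInt((Hit h) -> h.docId));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        // The two "aim" hits from the thread; the CJK suffix is written with
        // Unicode escapes so the file compiles under any source encoding.
        hits.add(new Hit(12.2324f, 8060, "aim\u6e9d\u8107\u3057\u307b\u307f"));
        hits.add(new Hit(12.2324f, 227606, "aim"));
        for (Hit h : rank(hits, "aim")) {
            System.out.println(h.docId + " " + h.name);
        }
    }
}
```

With the two "aim" hits from the thread, both scores are equal, so the exact-match rule decides and the plain 'aim' document is listed first regardless of its position in the index.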