lucene-java-user mailing list archives

From andy <yhl...@sohu.com>
Subject RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)
Date Thu, 13 Feb 2014 03:03:21 GMT
Hi Uwe, 

thanks a lot, I will try with that. 


Uwe Schindler wrote
> Hi andy,
> 
> unfortunately, that is not easy to show with one simple code example. You
> have to change the Similarity used.
> 
> Before starting to do this, you should be sure that this affects your
> users. The example you gave shows very short documents. Lucene is
> optimized to handle larger documents; for short documents, the document
> statistics do not behave in an ideal way. That is the main issue here.
> Instead of trying to change the very basic Lucene statistics, you should
> first verify that this affects a large part of your user queries and
> documents, not just this example, which looks like a special case.
> Otherwise it is not an option.
> 
> Please read the Lucene documentation on how to change the similarity,
> specifically the length norm, while indexing/searching:
> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/package-summary.html#changingScoring
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@
> 
> 
>> -----Original Message-----
>> From: andy [mailto:yhlweb@]
>> Sent: Wednesday, February 12, 2014 10:53 AM
>> To: java-user@.apache
>> Subject: RE: Length of the filed does not affect the doc score accurately
>> for chinese analyzer(SmartChineseAnalyzer)
>> 
>> Thanks Uwe. Could you please give me a more detailed example of how to
>> change the Lucene behavior?
>> 
>> 
>> Uwe Schindler wrote
>> > Hi Erick,
>> >
>> > a statement like "Adding &debug=all to the query will show you if
>> > this is the case" will not help a Lucene user, as it is only available
>> > in the Solr server. But Andy uses Lucene directly. In his case he
>> > should use IndexSearcher's explain functionality to retrieve a
>> > structured output of how the documents are scored for this query, for
>> > debugging:
>> >
>> > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)
>> >
>> > But yes, the length norm is encoded with loss of precision in Lucene
>> > (it is a float value encoded to 1 byte only). With Lucene 4 there are
>> > ways to change that behavior, but that includes changing the
>> > similarity implementation and using a different DocValues type for
>> > encoding the norms.
>> > In most cases this is not needed, because users won't notice.
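
The precision loss described here can be reproduced without Lucene. The
sketch below is a standalone reimplementation, not Lucene code. It assumes
that the default similarity computes the length norm as 1/sqrt(number of
terms) and stores it with the bit layout of SmallFloat.floatToByte315. The
class and method names (NormPrecisionSketch, encode, decode, storedNorm) are
made up for this illustration.

```java
// Standalone sketch with no Lucene dependency. The bit layout is an
// assumption modelled on SmallFloat.floatToByte315 / byte315ToFloat.
final class NormPrecisionSketch {

    // Offset that maps the supported exponent range onto a single byte.
    private static final int ZERO_OFFSET = (63 - 15) << 3;

    /** Encodes a float into one byte by keeping only its highest bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_OFFSET) {
            // Zero or negative maps to 0; tiny positive values map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_OFFSET + 0x100) {
            return (byte) -1; // too large: clamp to the biggest code
        }
        return (byte) (small - ZERO_OFFSET);
    }

    /** Decodes a byte produced by encode() back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Length norm 1/sqrt(numTerms) after a round trip through one byte. */
    static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(exact));
    }

    public static void main(String[] args) {
        // Field lengths 2..5 cover the documents in the example index.
        for (int numTerms = 2; numTerms <= 5; numTerms++) {
            float exact = (float) (1.0 / Math.sqrt(numTerms));
            System.out.println(numTerms + " terms: exact=" + exact
                    + " stored=" + storedNorm(numTerms));
        }
    }
}
```

Under these assumptions the stored norm is 0.625 for 2 terms, 0.5 for both
3 and 4 terms, and 0.4375 for 5 terms. Fields with 3 and 4 terms therefore
end up with the same norm, which is consistent with docs 3 to 6 receiving
identical scores.
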
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: uwe@
>> >
>> >
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerickson@]
>> >> Sent: Wednesday, January 15, 2014 1:30 PM
>> >> To: java-user
>> >> Subject: Re: Length of the filed does not affect the doc score
>> >> accurately for chinese analyzer(SmartChineseAnalyzer)
>> >>
>> >> The lengths of fields are encoded and lose some precision, so I
>> >> suspect the field lengths calculated for the two documents are
>> >> the same after encoding.
>> >>
>> >> Adding &debug=all to the query will show you if this is the case.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Jan 15, 2014 at 3:39 AM, andy <yhlweb@> wrote:
>> >> > Hi guys,
>> >> >
>> >> > As the topic says, it seems that the length of the field does not
>> >> > affect the doc score accurately for the Chinese analyzer in my
>> >> > source code.
>> >> >
>> >> > Index source code:
>> >> >
>> >> >     private static Directory DIRECTORY;
>> >> >
>> >> >     @BeforeClass
>> >> >     public static void before() throws IOException {
>> >> >         DIRECTORY = new RAMDirectory();
>> >> >         Analyzer chineseanalyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
>> >> >         IndexWriterConfig indexWriterConfig =
>> >> >                 new IndexWriterConfig(Version.LUCENE_40, chineseanalyzer);
>> >> >         FieldType nameType = new FieldType();
>> >> >         nameType.setIndexed(true);
>> >> >         nameType.setStored(true);
>> >> >         nameType.setOmitNorms(false);
>> >> >         try {
>> >> >             IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);
>> >> >
>> >> >             List<String> nameList = new ArrayList<String>();
>> >> >             nameList.add("咨询公司");
>> >> >             nameList.add("飞鹰咨询管理咨询公司");
>> >> >             nameList.add("北京中标咨询公司");
>> >> >             nameList.add("重庆咨询公司");
>> >> >             nameList.add("商务咨询服务公司");
>> >> >             nameList.add("法律咨询公司");
>> >> >             for (int i = 0; i < nameList.size(); i++) {
>> >> >                 Document document = new Document();
>> >> >                 document.add(new Field("name", nameList.get(i), nameType));
>> >> >                 document.add(new Field("id", String.valueOf(i + 1), nameType));
>> >> >                 indexWriter.addDocument(document);
>> >> >             }
>> >> >             indexWriter.commit();
>> >> >         } catch (IOException e) {
>> >> >             // TODO Auto-generated catch block
>> >> >             e.printStackTrace();
>> >> >         }
>> >> >     }
>> >> >
>> >> > search snippet:
>> >> >     @Test
>> >> >     public void testChinese() throws IOException, ParseException {
>> >> >         String keyword = "咨询公司";
>> >> >         System.out.println("Searching for:" + keyword);
>> >> >         System.out.println();
>> >> >         IndexReader indexReader = DirectoryReader.open(DIRECTORY);
>> >> >         IndexSearcher indexSearcher = new IndexSearcher(indexReader);
>> >> >         Query query = new QueryParser(Version.LUCENE_40, "name",
>> >> >                 new SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
>> >> >         TopDocs topDocs = indexSearcher.search(query, 15);
>> >> >         System.out.println("Search Result:");
>> >> >         if (null != topDocs && 0 < topDocs.totalHits) {
>> >> >             for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>> >> >                 System.out.println("doc id:" + indexSearcher.doc(scoreDoc.doc).get("id"));
>> >> >                 String name = indexSearcher.doc(scoreDoc.doc).get("name");
>> >> >                 System.out.println("content of Field:" + name);
>> >> >                 dumpCNTokens(name);
>> >> >                 System.out.println("score:" + scoreDoc.score);
>> >> >                 System.out.println("-------------------------------------------");
>> >> >             }
>> >> >         } else {
>> >> >             System.out.println("no results");
>> >> >         }
>> >> >     }
>> >> >
>> >> >
>> >> > And the search results are as follows:
>> >> > Searching for:咨询公司
>> >> >
>> >> > Search Result:
>> >> > doc id:1
>> >> > content of Field:咨询公司
>> >> > Terms:咨询        公司
>> >> > score:0.74763227
>> >> > -------------------------------------------
>> >> > doc id:2
>> >> > content of Field:飞鹰咨询管理咨询公司
>> >> > Terms:飞鹰        咨询      管理      咨询      公司
>> >> > score:0.6317303
>> >> > -------------------------------------------
>> >> > doc id:3
>> >> > content of Field:北京中标咨询公司
>> >> > Terms:北京        中标      咨询      公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:4
>> >> > content of Field:重庆咨询公司
>> >> > Terms:重庆        咨询      公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:5
>> >> > content of Field:商务咨询服务公司
>> >> > Terms:商务        咨询      服务      公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:6
>> >> > content of Field:法律咨询公司
>> >> > Terms:法律        咨询      公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> >
>> >> > Docs 3, 4, 5 and 6 have the same score, but I think docs 4 and 6
>> >> > should have a higher score than docs 3 and 5, because docs 4 and 6
>> >> > have three terms while docs 3 and 5 have four terms.
>> >> > Am I right? Can anyone give me an explanation? And how do I get the
>> >> > expected result?
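
The listed scores can be checked with a little arithmetic. The sketch below
is an illustration only, with no Lucene dependency. It assumes classic
TF-IDF scoring, where each matching term contributes sqrt(tf) times the
length norm 1/sqrt(number of terms), and where all other factors are the
same for every document because both query terms occur in all six
documents. The truncate method is a stand-in for the one-byte norm storage:
it simply drops the low 21 bits of the float. ScoreRatioSketch and its
methods are hypothetical names.

```java
// Illustration only: relative scores for the query terms 咨询 and 公司,
// with and without a lossy length norm. Not Lucene code.
final class ScoreRatioSketch {

    /** Stand-in for one-byte norm storage: keep only the top float bits. */
    static float truncate(float f) {
        int bits = Float.floatToRawIntBits(f);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    /**
     * Score up to a constant factor shared by all documents.
     * tfA and tfB are the frequencies of the two query terms in the field.
     */
    static double relativeScore(int tfA, int tfB, int numTerms, boolean lossy) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        if (lossy) {
            norm = truncate(norm);
        }
        return (Math.sqrt(tfA) + Math.sqrt(tfB)) * norm;
    }

    public static void main(String[] args) {
        // {tf of 咨询, tf of 公司, number of terms} for doc ids 1..6,
        // read off the "Terms:" lines in the search output.
        int[][] docs = {
            {1, 1, 2}, {2, 1, 5}, {1, 1, 4}, {1, 1, 3}, {1, 1, 4}, {1, 1, 3}
        };
        for (int i = 0; i < docs.length; i++) {
            int[] d = docs[i];
            System.out.println("doc " + (i + 1)
                    + " lossy=" + relativeScore(d[0], d[1], d[2], true)
                    + " exact=" + relativeScore(d[0], d[1], d[2], false));
        }
    }
}
```

With the lossy norm, docs 3 to 6 all come out at 1.0 and doc 1 at 1.25,
which matches the ratio of the posted scores (0.74763227 / 0.5981058 is
about 1.25). With the exact norm, docs 4 and 6 would rank above docs 3 and
5, which is the ordering expected in the question.
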
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390.html
>> >> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@.apache
>> >> > For additional commands, e-mail: java-user-help@.apache
>> >> >
>> >>
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4116850.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 





--
View this message in context: http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4117051.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

