From dev-return-80618-apmail-lucene-dev-archive=lucene.apache.org@lucene.apache.org Sat Oct 1 12:47:57 2011
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Reply-To: dev@lucene.apache.org
Date: Sat, 1 Oct 2011 12:47:34 +0000 (UTC)
From: "sebastian L.
(Commented) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <26053823.14.1317473254568.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1920138710.45487.1316511189594.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118784#comment-13118784 ]

sebastian L. commented on LUCENE-3440:
--------------------------------------

Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 4.0-SNAPSHOT.

Another patch, another idea! :)

Some thoughts:

- With the last patch, sum-of-distinct-weights will be calculated anyhow, even if ScoreOrderFragmentsBuilder is used.
- Also, regardless of further calculations, FieldTermsStack retrieves the document frequency for each term from the IndexReader in any case.
- Solr developers have no chance to implement a FragmentsBuilder-plugin with their own custom scoring for fragments, because the weighting formula is "hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on this patch anyway.

Possible Solution:

1. Collect and pass all needed information to the BaseFragmentsBuilder-implementation
- Introduction of TermInfo.fieldName
- Introduction of WeightedFragInfo.phraseInfos
- Passing an instance of IndexReader as argument to BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed statistical data from the index

2. Move the calculation of sum-of-boosts to ScoreOrderFragmentsBuilder.calculateScore()

{code}
/**
 * Compute WeightedFragInfo.score based on query-boosts
 * @throws IOException
 */
public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
  for( WeightedFragInfo wfi : weightedFragInfos ){
    for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
      wfi.score += wpi.boost;
    }
  }
  return weightedFragInfos;
}
{code}

3. Calculation of sum-of-distinct-weights with WeightOrderFragmentsBuilder.calculateScore()
- In this patch WeightOrderFragmentsBuilder is a subclass of ScoreOrderFragmentsBuilder.
- But I think the introduction of an abstract class OrderedFragmentsBuilder as superclass of BoostOrderFragmentsBuilder and WeightOrderFragmentsBuilder would be a better strategy.
- Moving calculateScore() into BaseFragmentsBuilder and making it abstract would be another idea.
- The _sum-of-distinct-weight_-approach is the same as presented in the last patch.

{code}
/**
 * Compute WeightedFragInfo.score based on IDF-weighted terms
 * @throws IOException
 */
@Override
public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{

  // cache of term text -> IDF weight, shared across all fragments
  Map<String,Float> lookup = new HashMap<String,Float>();
  // terms already counted in the current fragment
  HashSet<String> distinctTerms = new HashSet<String>();

  // numDocs() already excludes deleted documents
  int numDocs = reader.numDocs();
  int docFreq;
  int length;
  float boost;
  float weight;

  for( WeightedFragInfo wfi : weightedFragInfos ){
    distinctTerms.clear();
    length = 0;
    boost = 0;

    for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
      for( TermInfo ti : wpi.termInfos ) {
        length++;
        // weight each distinct term only once per fragment
        if( !distinctTerms.add( ti.text ) )
          continue;

        if ( lookup.containsKey( ti.text ) )
          weight = lookup.get( ti.text ).floatValue();
        else {
          docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
          weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
          lookup.put( ti.text, new Float( weight ) );
        }
        boost += Math.pow( weight, 2 ) * wpi.boost;
      }
    }
    wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
  }

  return weightedFragInfos;
}
{code}

With this approach programmers can implement their own fragments-weighting with ease, simply by overriding calculateScore().

I think the major drawback of this idea is that the FragmentsBuilder must traverse the whole stack of WeightedFragInfo once again. Since we have tomes with more than 3000 pages of OCR, this _could_ be a problem. But I can't confirm that for sure.

One way to avoid this would be to make FieldFragList "pluggable" via an interface "FragList", so that the FragmentsBuilder-plugin could be parameterized with the intended implementation of FragList:

{code:xml} {code}

Further notes:

- As shown in this patch, "WeightedFragInfo.totalBoost" should be renamed to "WeightedFragInfo.score".
- As shown in this patch, "ScoreOrderFragmentsBuilder" should be renamed to "BoostOrderFragmentsBuilder".

> FastVectorHighlighter: IDF-weighted terms for ordered fragments
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3440
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3440
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 3.5, 4.0
>            Reporter: sebastian L.
>            Priority: Minor
>              Labels: FastVectorHighlighter
>             Fix For: 3.5, 4.0
>
>         Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-3.patch, WeightOrderFragmentsBuilder_table01.html, WeightOrderFragmentsBuilder_table02.html
>
>
> The FastVectorHighlighter uses an equal weight for every term found in a fragment, which causes a higher ranking for fragments with a high number of words or, in the worst case, a high number of very common words than for fragments that contain *all* of the terms used in the original query.
> This patch provides ordered fragments with IDF-weighted terms:
> total weight = total weight + IDF for unique term per fragment * boost of query;
> The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer.
> The patch is simple, but it works for us.
> Some ideas:
> - A better approach would be moving the whole fragments-scoring into a separate class.
> - Switch scoring via parameter
> - Exact phrases should be given an even better score, regardless of whether a phrase-query was executed or not
> - edismax/dismax-parameters pf, ps and pf^boost should be observed and corresponding fragments should be ranked higher
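
For readers who want to try the numbers without a Lucene index, the fragment scoring discussed in this thread can be sketched roughly as below. This is only an illustration, not the patch itself: the class IdfFragmentScoreSketch, its method names, the plain Map of document frequencies (standing in for IndexReader.docFreq()) and the single queryBoost parameter (standing in for the per-phrase boost) are all assumptions made for the example. The weight formula and the length normalization follow the calculateScore() variant posted in the comment.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, Lucene-free stand-in for the IDF-weighted fragment scoring. */
final class IdfFragmentScoreSketch {

    private IdfFragmentScoreSketch() {}

    /** IDF-style weight as used in the thread: ln(numDocs / (docFreq + 1)) + 1. */
    static double idfWeight(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Scores one fragment. Every term occurrence counts towards the length,
     * but only the first occurrence of each distinct term adds weight.
     *
     * @param terms      term texts found in the fragment, in order (assumed input)
     * @param docFreqs   document frequency per term (assumed to be precomputed)
     * @param numDocs    number of live documents in the index
     * @param queryBoost one boost for all terms (simplification of the per-phrase boost)
     */
    static double score(List<String> terms, Map<String, Integer> docFreqs,
                        int numDocs, double queryBoost) {
        Set<String> distinct = new HashSet<>();
        double sum = 0.0;
        for (String term : terms) {
            if (!distinct.add(term)) {
                continue; // repeated term: contributes to length only
            }
            double weight = idfWeight(numDocs, docFreqs.getOrDefault(term, 0));
            sum += weight * weight * queryBoost;
        }
        int length = terms.size();
        if (length == 0) {
            return 0.0; // avoid dividing by zero for an empty fragment
        }
        return sum * length * (1.0 / Math.sqrt(length));
    }
}
```

Under these assumptions, a fragment holding one rare term should score higher than a fragment that repeats a very common term several times, which is the ranking behaviour the issue asks for. Calling IdfFragmentScoreSketch.score() with the same term list but different document frequencies is a quick way to see how strongly the squared weight favours rare terms.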