lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4498) pulse docfreq=1 DOCS_ONLY for 4.1 codec
Date Mon, 22 Oct 2012 20:48:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481768#comment-13481768 ]

Michael McCandless commented on LUCENE-4498:
--------------------------------------------

Looks good:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Respell       86.70      (3.0%)       84.04      (2.6%)   -3.1% (  -8% -    2%)
               OrHighMed       41.52      (5.8%)       40.44      (6.1%)   -2.6% ( -13% -    9%)
               OrHighLow       25.43      (6.0%)       24.77      (6.4%)   -2.6% ( -14% -   10%)
              OrHighHigh        9.38      (5.9%)        9.15      (6.4%)   -2.5% ( -14% -   10%)
                Wildcard       93.94      (4.1%)       92.36      (2.0%)   -1.7% (  -7% -    4%)
                 MedTerm      211.10     (12.3%)      208.78     (13.4%)   -1.1% ( -23% -   27%)
                  IntNRQ       10.74     (11.3%)       10.62      (7.8%)   -1.1% ( -18% -   20%)
                HighTerm       25.59     (14.0%)       25.35     (15.0%)   -1.0% ( -26% -   32%)
             MedSpanNear       13.77      (2.3%)       13.68      (1.6%)   -0.7% (  -4% -    3%)
        HighSloppyPhrase        4.09      (5.4%)        4.07      (5.2%)   -0.5% ( -10% -   10%)
            HighSpanNear        6.84      (2.9%)        6.81      (2.1%)   -0.4% (  -5% -    4%)
                 Prefix3       17.81      (5.7%)       17.74      (1.5%)   -0.4% (  -7% -    7%)
                  Fuzzy1       77.54      (2.5%)       77.25      (2.7%)   -0.4% (  -5% -    4%)
              AndHighLow      719.17      (2.7%)      716.49      (2.3%)   -0.4% (  -5% -    4%)
                  Fuzzy2       68.94      (2.4%)       68.69      (2.8%)   -0.4% (  -5% -    5%)
             LowSpanNear       12.89      (1.8%)       12.85      (1.3%)   -0.3% (  -3% -    2%)
         MedSloppyPhrase       29.92      (3.4%)       29.85      (3.4%)   -0.2% (  -6% -    6%)
                 LowTerm      500.58      (5.9%)      500.52      (7.0%)   -0.0% ( -12% -   13%)
         LowSloppyPhrase        9.57      (4.4%)        9.60      (4.3%)    0.4% (  -7% -    9%)
               LowPhrase        9.64      (2.8%)        9.70      (3.0%)    0.7% (  -4% -    6%)
              AndHighMed       86.68      (1.2%)       87.26      (1.2%)    0.7% (  -1% -    3%)
               MedPhrase        7.07      (4.3%)        7.15      (4.6%)    1.1% (  -7% -   10%)
              HighPhrase        4.79      (4.8%)        4.84      (5.6%)    1.1% (  -8% -   12%)
             AndHighHigh       25.81      (1.7%)       26.20      (1.2%)    1.5% (  -1% -    4%)
                PKLookup      193.31      (2.1%)      204.74      (1.6%)    5.9% (   2% -    9%)
{noformat}
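
For anyone reading the table: the "Pct diff" figure appears to be the relative change in mean QPS from base to comp. A minimal sketch of that arithmetic, assuming that definition (the class and method names here are hypothetical and not part of the benchmark tooling; the parenthesized range in the last column is not reproduced):

```java
public class PctDiffSketch {
    // Assumed definition: relative change of the comp QPS versus the base QPS, in percent.
    public static double pctDiff(double qpsBase, double qpsComp) {
        return (qpsComp - qpsBase) / qpsBase * 100.0;
    }

    public static void main(String[] args) {
        // Values taken from the PKLookup and Respell rows of the table.
        System.out.printf("PKLookup: %.1f%%%n", pctDiff(193.31, 204.74));
        System.out.printf("Respell:  %.1f%%%n", pctDiff(86.70, 84.04));
    }
}
```

Under that assumption, PKLookup is the only task whose change clearly exceeds the reported standard deviations.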

                
> pulse docfreq=1 DOCS_ONLY for 4.1 codec
> ---------------------------------------
>
>                 Key: LUCENE-4498
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4498
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Robert Muir
>         Attachments: LUCENE-4498_lazy.patch, LUCENE-4498.patch, LUCENE-4498.patch
>
>
> We have the pulsing codec, but currently it has some downsides:
> * It's very general, wrapping an arbitrary PostingsFormat and pulsing everything in the postings for an arbitrary docFreq/totalTermFreq cutoff.
> * Reuse is hairy: because it specializes its enums based on these cutoffs, when walking through terms (e.g. when merging) there is a lot of sophisticated logic to avoid the worst cases where we clone IndexInputs for tons of terms.
> On the other hand, the way the 4.1 codec encodes "primary key" fields is pretty silly: we write the docStartFP vlong in the term dictionary metadata, which tells us where to seek in the .doc file to read our one lonely vint.
> I think it's worth investigating whether, in the DOCS_ONLY docfreq=1 case, we can just write the lone doc delta where we would write docStartFP.
> We can avoid the hairy reuse problem too, by just supporting this in refillDocs() in BlockDocsEnum instead of specializing.
> This would remove the additional seek for "primary key" fields without really any of the downsides of pulsing today.
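
The idea in the issue description can be illustrated with a small standalone sketch. This is not the patch and does not use Lucene's classes: `SingletonDocSketch`, `TermState`, `encodeTerm` and `decodeTerm` are hypothetical names, and the vlong helpers are a simplified stand-in written for this example. It assumes the reader already knows docFreq from the term dictionary, so no extra flag is needed to tell the two cases apart.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class SingletonDocSketch {
    // Hypothetical helper: variable-length long, 7 bits per byte, low-order bits first.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static long readVLong(ByteArrayInputStream in) {
        long v = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // Hypothetical per-term state as the reader would see it.
    public static final class TermState {
        public final int docFreq;
        public final long docStartFP;    // -1 when the doc was inlined
        public final int singletonDocID; // -1 when the postings live in the .doc file

        TermState(int docFreq, long docStartFP, int singletonDocID) {
            this.docFreq = docFreq;
            this.docStartFP = docStartFP;
            this.singletonDocID = singletonDocID;
        }
    }

    // Writer side: for docFreq == 1, store the lone doc ID in the slot where
    // the file pointer would otherwise go (DOCS_ONLY case assumed).
    public static void encodeTerm(ByteArrayOutputStream meta, int docFreq,
                                  long docStartFP, int singletonDocID) {
        writeVLong(meta, docFreq == 1 ? singletonDocID : docStartFP);
    }

    // Reader side: docFreq decides how to interpret the same vlong slot.
    public static TermState decodeTerm(ByteArrayInputStream meta, int docFreq) {
        long v = readVLong(meta);
        return docFreq == 1
            ? new TermState(1, -1L, (int) v)
            : new TermState(docFreq, v, -1);
    }
}
```

With a state like this, a docs enum could return `singletonDocID` directly when `docFreq == 1` and only seek to `docStartFP` otherwise, which is the extra seek the issue proposes to eliminate for primary-key lookups.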
