lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
Date Fri, 05 Jan 2018 14:26:02 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
---------------------------------
    Attachment: LUCENE-4198.patch

New patch. This time it has tests, does basic testing in CheckIndex, and does not clone too much.

Results are very good on queries that score on a single term, almost too good. I'm currently
thinking about how we could change the API into something that is easier to propagate through
boolean queries, even if it means term queries can't be as fast.

{noformat}
                 Task   QPS baseline   StdDev   QPS patch   StdDev   Pct diff
           AndHighLow        2050.37   (4.2%)     1745.54   (2.0%)   -14.9% ( -20% -   -9%)
            OrHighLow         922.62   (3.7%)      844.54   (2.4%)    -8.5% ( -14% -   -2%)
           AndHighMed         277.85   (1.8%)      258.11   (2.6%)    -7.1% ( -11% -   -2%)
         OrNotHighLow        1105.41   (3.6%)     1044.69   (2.0%)    -5.5% ( -10% -    0%)
          AndHighHigh         128.97   (1.1%)      121.89   (2.7%)    -5.5% (  -9% -   -1%)
               Fuzzy2         166.62   (6.2%)      158.38   (6.3%)    -4.9% ( -16% -    8%)
            OrHighMed         177.56   (2.3%)      170.05   (1.9%)    -4.2% (  -8% -    0%)
               Fuzzy1         199.16   (4.4%)      193.05   (5.5%)    -3.1% ( -12% -    7%)
      MedSloppyPhrase          53.92   (2.2%)       52.40   (2.3%)    -2.8% (  -7% -    1%)
            LowPhrase         201.13   (1.7%)      195.87   (1.0%)    -2.6% (  -5% -    0%)
          LowSpanNear         363.85   (3.0%)      355.07   (2.5%)    -2.4% (  -7% -    3%)
           HighPhrase          62.68   (1.6%)       61.32   (1.2%)    -2.2% (  -4% -    0%)
    HighTermMonthSort         218.42   (9.8%)      214.35   (8.3%)    -1.9% ( -18% -   18%)
          MedSpanNear          46.65   (1.4%)       45.89   (1.5%)    -1.6% (  -4% -    1%)
            MedPhrase         178.02   (1.5%)      175.24   (1.2%)    -1.6% (  -4% -    1%)
         HighSpanNear          10.21   (3.4%)       10.11   (3.4%)    -1.0% (  -7% -    6%)
     HighSloppyPhrase          32.32   (7.3%)       32.01   (7.1%)    -1.0% ( -14% -   14%)
      LowSloppyPhrase          18.01   (2.7%)       17.85   (2.7%)    -0.9% (  -6% -    4%)
              Respell         320.99   (2.1%)      321.02   (2.4%)     0.0% (  -4% -    4%)
               IntNRQ          29.29  (11.6%)       29.42  (12.5%)     0.4% ( -21% -   27%)
             Wildcard         189.97   (4.6%)      191.87   (3.9%)     1.0% (  -7% -    9%)
              Prefix3         166.43   (6.2%)      169.95   (5.4%)     2.1% (  -8% -   14%)
           OrHighHigh          48.00   (3.7%)       49.09   (3.9%)     2.3% (  -5% -   10%)
HighTermDayOfYearSort         146.88   (7.4%)      150.76   (8.0%)     2.6% ( -11% -   19%)
              LowTerm         830.79   (2.6%)     2246.40   (9.9%)   170.4% ( 153% -  187%)
         OrNotHighMed         180.11   (1.5%)     1454.55  (15.7%)   707.6% ( 680% -  735%)
              MedTerm         216.16   (1.7%)     3834.73  (37.0%)  1674.0% (1608% - 1742%)
             HighTerm         109.49   (2.0%)     1944.44  (45.3%)  1675.9% (1597% - 1757%)
         OrHighNotMed          57.55   (1.1%)     1292.66  (57.7%)  2146.2% (2064% - 2229%)
         OrHighNotLow          84.00   (1.1%)     1996.82  (75.4%)  2277.2% (2176% - 2379%)
        OrNotHighHigh          58.22   (1.3%)     1479.53  (53.5%)  2441.4% (2356% - 2528%)
        OrHighNotHigh          66.91   (1.2%)     2042.54  (55.1%)  2952.6% (2862% - 3045%)
{noformat}
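
For readers checking the numbers: the headline figure in the Pct diff column appears to be the relative change in QPS from baseline to patch. The sketch below reproduces it for two rows of the table. The class and method names are illustrative only and are not part of the benchmark tooling; the bracketed range that follows each figure is not computed here.

```java
/** Illustrative helper, not part of any benchmark tool: relative QPS change in percent. */
final class PctDiff {

    /** Returns (patch - baseline) / baseline * 100. */
    static double pctDiff(double baselineQps, double patchQps) {
        return (patchQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the AndHighLow and LowTerm rows of the table.
        System.out.printf("AndHighLow: %.1f%%%n", pctDiff(2050.37, 1745.54));
        System.out.printf("LowTerm:    %.1f%%%n", pctDiff(830.79, 2246.40));
    }
}
```

With the AndHighLow and LowTerm rows as input, this rounds to the -14.9% and 170.4% reported in the table.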

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation
> currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to
> support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite'
> your index; he passes the IndexReader and Similarity to it. But it would be better if we fixed
> the codec API.
> One problem is that the postings writer needs to have access to the Similarity. Another
> problem is that it needs access to the term and collection statistics up front, rather than
> after the fact.
> This might have some cost (hopefully minimal), so I'm thinking of experimenting in a branch
> with these changes to see if we can make it work well.
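
To make the two problems in the quoted description concrete, here is a minimal toy sketch, not Lucene's codec API and not the attached patch. It computes a per-term maximum score at write time, which is only possible if the writer already holds a similarity and the term and collection statistics. Every name here (TermImpactSketch, ToySimilarity, Posting, TF_IDF) and the scoring formula are assumptions made for illustration.

```java
import java.util.List;

/** Toy model (hypothetical, not Lucene's API): a per-term score bound computed at write time. */
final class TermImpactSketch {

    /** Hypothetical stand-in for a similarity: scores one posting using term and collection stats. */
    interface ToySimilarity {
        double score(int freq, int docLength, long docFreq, long docCount);
    }

    /** One posting: the term frequency in a document plus that document's length. */
    static final class Posting {
        final int docId;
        final int freq;
        final int docLength;

        Posting(int docId, int freq, int docLength) {
            this.docId = docId;
            this.freq = freq;
            this.docLength = docLength;
        }
    }

    /** Illustrative tf-idf style formula; assumes docFreq > 0 and docLength > 0. */
    static final ToySimilarity TF_IDF = (freq, docLength, docFreq, docCount) ->
            Math.sqrt(freq) * Math.log(1.0 + (double) docCount / docFreq) / Math.sqrt(docLength);

    /**
     * Upper bound on the score of any document in this term's postings.
     * Needs the similarity plus docFreq and docCount before the postings are
     * written, which is the "up front" requirement from the description.
     * Assumes scores are non-negative, so 0 is a safe starting value.
     */
    static double maxScore(List<Posting> postings, ToySimilarity sim, long docFreq, long docCount) {
        double max = 0.0;
        for (Posting p : postings) {
            max = Math.max(max, sim.score(p.freq, p.docLength, docFreq, docCount));
        }
        return max;
    }
}
```

A query that scores on a single term could then skip the whole postings list whenever this bound falls below the lowest score currently in the top hits; storing one such value per term is the "max for the entire term" variant mentioned above.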



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

