lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7979) Move disjunctions to a radix heap
Date Tue, 26 Sep 2017 17:14:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7979:
---------------------------------
    Attachment: LUCENE-7979.patch

Here is a patch. It does not pass all tests, e.g. the new priority queue does not behave exactly
as MinShouldMatchSumScorer expects, but it should be enough for benchmarking.

I ran wikimedium10m with the following tasks file, with bulk scoring disabled:

{noformat}
OrHighHigh: several following # freq=436129 freq=416515
OrHighHigh: publisher end # freq=1289029 freq=526636
OrHighHigh: 2009 film # freq=887702 freq=432758
OrHighHigh: http known # freq=3493581 freq=607158
OrHighHigh: south county # freq=560468 freq=521126
OrHighMed: international chris # freq=418261 freq=85523
OrHighMed: right million # freq=630423 freq=175554
OrHighMed: known created # freq=607158 freq=220831
OrHighMed: its universal # freq=1173450 freq=47078
OrHighMed: 9 network # freq=574434 freq=164997
OrHighLow: 2005 valois # freq=835460 freq=2277
OrHighLow: until universalist # freq=425389 freq=1230
OrHighLow: made forays # freq=742313 freq=799
OrHighLow: do bush's # freq=511178 freq=2681
OrHighLow: 10 racedetail.html # freq=918339 freq=870
Or5High5Med5Low: several publisher 2009 http south chris million created universal network valois universalist forays bush's racedetail.html
Or5High5Med5Low: id title s called 2 reform face draft summary 1923 weed violently cantrell 10.1371 veneration
Or128Med: second june several october december july high because 20 general government m books him language february end august list issue same often area november 15 county international 2000 2004 times u.s although based small british group like each series film 18 place now against death her until pp 25 j great west major ii 13 london 14 long e 16 30 us 2003 center large day citation references could x d example population b even another style found do 2012 n 2002 what form those 2001 br public four 17 22 much following east 24 very needed article modern 19 country around f french v according old king within include still did jpg set music doi 21 age power family external using links order own house home german
Or128Med: common r different non among 23 due science class reflist 28 27 political 26 ndash line way military law william kingdom 1999 development she company back central 29 en began period story without england president link original zh category roman short europe party white further image david though given h along human top society ja france school james 01 make 1998 best pdf late point robert man named service research information term local european led w western members present union convert la published important 1997 various popular l off former america text official control water considered uk black third river near five become army just usually established single how said result george down st others edition retrieved 02 land 1996 church support air full few 03 free less
{noformat}

{noformat}
                    Task   QPS baseline      StdDev   QPS patch      StdDev                Pct diff
               OrHighMed          27.75      (2.0%)       19.46      (0.6%)  -29.9% ( -31% -  -27%)
               OrHighLow          45.52      (2.5%)       32.28      (0.4%)  -29.1% ( -31% -  -26%)
              OrHighHigh          34.59      (1.5%)       25.91      (0.5%)  -25.1% ( -26% -  -23%)
         Or5High5Med5Low           3.08      (1.6%)        2.84      (0.6%)   -7.9% (  -9% -   -5%)
                Or128Med           0.24      (1.8%)        0.27      (0.5%)   12.8% (  10% -   15%)
{noformat}
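
As a reading aid for the Pct diff column, here is a small sketch of the arithmetic, assuming the column is the relative change in mean QPS from baseline to patch. That definition is an assumption, and `PctDiffSketch` and `pctDiff` are hypothetical names, not part of the benchmark tooling. With the rounded QPS values from the table, this gives about -29.9% for OrHighMed and 12.5% for Or128Med; the small gap to the reported 12.8% is presumably because the table was computed from unrounded means.

```java
final class PctDiffSketch {
    // Relative change of the patched QPS against the baseline, in percent.
    // Assumption: this is how the "Pct diff" column is defined.
    static double pctDiff(double baselineQps, double patchQps) {
        return 100.0 * (patchQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // Mean QPS values copied (already rounded) from the results table.
        System.out.println("OrHighMed " + pctDiff(27.75, 19.46));
        System.out.println("Or128Med  " + pctDiff(0.24, 0.27));
    }
}
```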

This matches my intuition that the radix heap performs better when there are many terms, but
the threshold looks quite high: even with 15 terms (Or5High5Med5Low) the regular binary heap
is still faster, and the radix heap only wins on the 128-term queries (Or128Med).
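
For readers unfamiliar with the data structure, the following is a minimal sketch of a radix heap. It is not the code from the attached patch. `RadixHeapSketch` and its methods are hypothetical names. It assumes non-negative int keys (such as doc IDs) and monotone use: a pushed key is never smaller than the last popped key. That assumption fits disjunctions, where an iterator only moves forward after being popped. Keys are bucketed by the highest bit in which they differ from the last popped minimum, so a pop occasionally redistributes one bucket instead of sifting through a binary heap.

```java
import java.util.ArrayList;
import java.util.List;

/** Monotone min-priority queue over non-negative int keys (illustrative sketch only). */
final class RadixHeapSketch {
    // buckets.get(0) holds keys equal to `last`; buckets.get(i) for i >= 1 holds
    // keys whose highest bit differing from `last` is bit i-1.
    private final List<List<Integer>> buckets = new ArrayList<>();
    private int last = 0; // most recently popped minimum
    private int size = 0;

    RadixHeapSketch() {
        for (int i = 0; i <= 32; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    private int bucketOf(int key) {
        // 0 when key == last, otherwise 1 + index of the highest differing bit.
        return 32 - Integer.numberOfLeadingZeros(key ^ last);
    }

    void push(int key) {
        if (key < last) {
            throw new IllegalArgumentException("key must be >= last popped key");
        }
        buckets.get(bucketOf(key)).add(key);
        size++;
    }

    int pop() {
        if (size == 0) {
            throw new IllegalStateException("heap is empty");
        }
        if (buckets.get(0).isEmpty()) {
            // Find the first non-empty bucket; it contains the global minimum.
            int i = 1;
            while (buckets.get(i).isEmpty()) {
                i++;
            }
            List<Integer> bucket = buckets.get(i);
            int min = bucket.get(0);
            for (int k : bucket) {
                min = Math.min(min, k);
            }
            last = min;
            // Relative to the new `last`, every key of this bucket falls into a
            // strictly lower bucket, so the list being iterated is never modified.
            for (int k : bucket) {
                buckets.get(bucketOf(k)).add(k);
            }
            bucket.clear();
        }
        List<Integer> zero = buckets.get(0);
        size--;
        return zero.remove(zero.size() - 1);
    }

    int size() {
        return size;
    }
}
```

Each key can only move to a lower bucket when it is redistributed, which bounds the total work per key by the number of bits. The per-operation constant is not obviously smaller than a sift in a small binary heap, which is consistent with the benchmark favouring the binary heap for few terms.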

Maybe there are ways we could make it perform better for common numbers of terms in a disjunction?

> Move disjunctions to a radix heap
> ---------------------------------
>
>                 Key: LUCENE-7979
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7979
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-7979.patch
>
>
> An Elasticsearch user argued that we should look into using radix heaps in order to run
> disjunctions so I wanted to give it a try. I'm creating this issue to share findings. Spoiler:
> so far it does not seem to help but maybe I'm just doing it wrong?




