lucene-dev mailing list archives

From "Han Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
Date Fri, 02 Aug 2013 15:57:51 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: LUCENE-3069.patch

Uploaded patch.

The patch is optimized for WildcardQuery. I ran a quick test on 1M wiki data:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      314.63      (1.5%)      314.64      (1.2%)    0.0% (  -2% -    2%)
                  Fuzzy1       91.32      (3.7%)       92.50      (1.6%)    1.3% (  -3% -    6%)
                 Respell      104.54      (3.9%)      106.97      (1.6%)    2.3% (  -2% -    8%)
                  Fuzzy2       38.22      (4.1%)       39.16      (1.2%)    2.5% (  -2% -    8%)
                Wildcard      109.56      (3.1%)      273.42      (5.0%)  149.6% ( 137% -  162%)
{noformat}

And TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Respell      134.85      (3.7%)      106.30      (0.6%)  -21.2% ( -24% -  -17%)
                  Fuzzy2       47.78      (4.1%)       39.03      (0.9%)  -18.3% ( -22% -  -13%)
                  Fuzzy1      112.02      (3.0%)       91.95      (0.6%)  -17.9% ( -20% -  -14%)
                Wildcard      326.68      (3.5%)      273.41      (1.9%)  -16.3% ( -20% -  -11%)
                PKLookup      194.61      (1.8%)      314.24      (0.7%)   61.5% (  57% -   65%)
{noformat}

But I'm not happy with it :(. The hack I did here is to consume another big block to store
the last byte of each term. So for a wildcard query like ab*c, we have external information
that tells us the ord of the nearest term matching *c. Knowing the ord, we can use an approach
similar to getByOutput to jump to the next target term.

Previously, we had to walk the FST all the way to the stop node to find out whether the last
byte is 'c', so this optimization makes a big difference.
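
To illustrate the idea, here is a minimal sketch and not the code in the attached patch. It assumes the terms are already sorted so that a term's array position is its ord, and it keeps the extra per-ord "last byte" block in a plain byte array. The class and method names (LastByteIndex, nextOrdEndingWith, matchPrefixStarSuffix) are hypothetical, and a linear scan over the byte array stands in for the getByOutput-style jump on the FST:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; names and layout are assumptions, not Lucene API. */
final class LastByteIndex {
    private final byte[][] terms;  // sorted, non-empty terms; array position is the ord
    private final byte[] lastByte; // the extra block: last byte of each term, by ord

    LastByteIndex(String[] sortedTerms) {
        terms = new byte[sortedTerms.length][];
        lastByte = new byte[sortedTerms.length];
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            byte[] t = sortedTerms[ord].getBytes(StandardCharsets.UTF_8);
            terms[ord] = t;
            lastByte[ord] = t[t.length - 1];
        }
    }

    /** Smallest ord >= fromOrd whose term ends with the given byte, or -1 if none. */
    int nextOrdEndingWith(int fromOrd, byte suffix) {
        // Only the side array is read here; no term is walked to its end.
        for (int ord = Math.max(fromOrd, 0); ord < lastByte.length; ord++) {
            if (lastByte[ord] == suffix) {
                return ord;
            }
        }
        return -1;
    }

    /** Terms matching prefix*suffix, e.g. ab*c, visiting only candidates that end in suffix. */
    List<String> matchPrefixStarSuffix(String prefix, char suffix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        List<String> hits = new ArrayList<>();
        int ord = nextOrdEndingWith(0, (byte) suffix);
        while (ord != -1) {
            byte[] t = terms[ord];
            // The term needs room for the prefix plus the final suffix byte.
            if (t.length > p.length && hasPrefix(t, p)) {
                hits.add(new String(t, StandardCharsets.UTF_8));
            }
            ord = nextOrdEndingWith(ord + 1, (byte) suffix);
        }
        return hits;
    }

    private static boolean hasPrefix(byte[] term, byte[] prefix) {
        if (term.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (term[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the sorted terms abc, abd, abxc, ac, b, a query ab*c skips abd and b without reading them past the side array. The cost is one extra byte per term, which is the kind of index growth described in this message.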

However, I don't really like this patch :(. We have to increase the index size (521M => 530M),
and the code becomes messy, since we always have to foresee the next arc on the current
stack.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch,
>                      LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
>                      LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a delta codec
> file for scanning to terms. Some environments have enough memory available to keep the entire
> FST based term dict in memory. We should add a TermDictionary implementation that encodes
> all needed information for each term into the FST (custom fst.Output) and builds a FST from
> the entire term not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

