lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2588) terms index should not store useless suffixes
Date Thu, 05 Aug 2010 01:35:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895506#action_12895506 ]

Robert Muir commented on LUCENE-2588:
-------------------------------------

I think this patch as-is is a good improvement (at least as a defensive measure against "noise"
terms and other things). It also seems to buy more savings on the non-latin data I tested
(60 KB -> 40 KB). +1 to commit.

{quote}
In the future we could do crazier things. EG there's no real reason why the indexed terms
must be regular (every N terms), so, we could instead pick terms more carefully, say "approximately"
every N, but favor terms that have a smaller net prefix
{quote}

I think we should explore this in the future. Blindly selecting every Nth term isn't optimal;
allowing a "fudge" of the interval, maybe +/- 5 or 10%, could intentionally select terms
that differ very quickly from their previous term, without wasting a bunch of CPU or unbalancing
the terms index...

If additional smarts like this could save enough size on average, maybe we could rethink lowering
the default interval of 128?
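As a rough illustration of the fudged-interval idea, here is a minimal sketch (not the actual Lucene code; all names, the window size, and the cost function are hypothetical): around each multiple of the interval, scan a small window of positions and pick the term whose distinguishing prefix against its predecessor is shortest.

```java
// Hypothetical sketch of "approximately every N" index-term selection.
// Instead of indexing exactly the N-th, 2N-th, ... terms, look +/- `fudge`
// positions around each target and pick the term that is cheapest to store.
import java.util.ArrayList;
import java.util.List;

public class FudgedTermSelector {
    // Length of the shortest prefix of `term` that still sorts strictly
    // after `prev` (i.e. what we would actually have to store).
    static int prefixLen(String prev, String term) {
        int i = 0, limit = Math.min(prev.length(), term.length());
        while (i < limit && prev.charAt(i) == term.charAt(i)) i++;
        return Math.min(i + 1, term.length());
    }

    // Pick indexed-term positions from a sorted term list: around each
    // target position (a multiple of `interval`), search +/- `fudge`
    // positions for the term with the smallest distinguishing prefix.
    static List<Integer> pickIndexedTerms(List<String> terms, int interval, int fudge) {
        List<Integer> picks = new ArrayList<>();
        for (int target = interval; target < terms.size(); target += interval) {
            int best = target, bestCost = Integer.MAX_VALUE;
            for (int pos = Math.max(1, target - fudge);
                 pos <= Math.min(terms.size() - 1, target + fudge); pos++) {
                int cost = prefixLen(terms.get(pos - 1), terms.get(pos));
                if (cost < bestCost) { bestCost = cost; best = pos; }
            }
            picks.add(best);
        }
        return picks;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("aa", "aab", "q", "qq", "r", "x", "xy", "xyz");
        // With interval=3 the regular picks would be positions 3 and 6
        // (cost 2 each); fudging by 1 finds positions 2 and 5 (cost 1 each).
        System.out.println(pickIndexedTerms(terms, 3, 1)); // prints "[2, 5]"
    }
}
```

The scan stays O(fudge) per indexed term, so the selection adds little CPU and the index never drifts more than `fudge` positions from balanced.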

> terms index should not store useless suffixes
> ---------------------------------------------
>
>                 Key: LUCENE-2588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2588
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2588.patch
>
>
> This idea came up when discussing w/ Robert how to improve our terms index...
> The terms dict index today simply grabs whatever term was at a 0 mod 128 index (by default).
> But this is wasteful because you often don't need the suffix of the term at that point.
> EG if the 127th term is aa and the 128th (indexed) term is abcd123456789, instead of
> storing that full term you only need to store ab.  The suffix is useless, and uses up RAM
> since we load the terms index into RAM.
> The patch is very simple.  The optimization is particularly easy because terms are now
> byte[] and we sort in binary order.
> I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms index (tii) file
> from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer, indexing body field tokenized
> but title / date fields untokenized).  I expect on noisier terms dicts, especially ones w/
> bad terms accidentally indexed, that the savings will be even more.
> In the future we could do crazier things.  EG there's no real reason why the indexed
> terms must be regular (every N terms), so we could instead pick terms more carefully, say
> "approximately" every N, but favor terms that have a smaller net prefix.  We can also index
> more sparsely in regions where the net docFreq is lowish, since we can afford somewhat higher
> seek+scan time to these terms since enuming their docs will be much faster.
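The suffix-truncation optimization the issue describes can be sketched as follows (a minimal illustration, not the attached patch; the class and method names are hypothetical). Since terms are byte[] compared in binary order, the indexed term can be cut down to its shortest prefix that still sorts strictly after the previous term:

```java
// Hypothetical sketch: shorten an indexed term to the minimal prefix that
// still sorts after its predecessor in binary order.
public class IndexTermPrefix {
    // Returns the minimal prefix of `indexed` that sorts strictly after
    // `prev`: keep shared bytes plus the first differing byte (or, when
    // `indexed` merely extends `prev`, one extra byte is enough).
    static byte[] minimalPrefix(byte[] prev, byte[] indexed) {
        int i = 0;
        int limit = Math.min(prev.length, indexed.length);
        while (i < limit && prev[i] == indexed[i]) {
            i++;
        }
        int keep = Math.min(i + 1, indexed.length);
        byte[] out = new byte[keep];
        System.arraycopy(indexed, 0, out, 0, keep);
        return out;
    }

    public static void main(String[] args) {
        // The example from the issue: prev term "aa", indexed term
        // "abcd123456789" -- only "ab" needs to go into the terms index.
        byte[] prev = "aa".getBytes();
        byte[] indexed = "abcd123456789".getBytes();
        System.out.println(new String(minimalPrefix(prev, indexed))); // prints "ab"
    }
}
```

The truncated term still routes a binary search correctly: any term <= "aa" falls before it and any term >= "abcd123456789" falls at or after it, so seek behavior is unchanged while the index stores far fewer bytes.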

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

