lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4764) Faster but more RAM/Disk consuming DocValuesFormat for facets
Date Sat, 09 Feb 2013 15:59:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575189#comment-13575189 ]

Shai Erera commented on LUCENE-4764:
------------------------------------

bq. I wonder how it would perform if it wrote and kept in RAM packed ints, since it knows
what's in the byte[]

We've tried that in the past. I don't remember on which issue we posted the results, but they
were not compelling. What we tried was keeping the ords as an int[] vs. as packed ints. int[]
performed (IIRC) 50% faster, while packed ints were only ~6-10% faster. Also, their RAM footprints
were very close. The problem is that packed ints only pay off if you know something about the
numbers, e.g. their size or distribution. But with category ordinals, on this Wikipedia
index, there's nothing "special" about them. Really, every document keeps close to arbitrary
integers between 1 and 2.2M ...
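
To make the "know something about the numbers" point concrete, here is a minimal sketch of fixed-width bit packing in Python. It is not Lucene's packed-ints implementation; the function names and the sample ordinals are made up for illustration. It only shows that the width is dictated by the largest value, so ordinals ranging up to 2.2M need 22 bits each, and that every read costs a shift and a mask.

```python
def bits_required(max_value: int) -> int:
    """Smallest fixed width (in bits) that can hold every value in [0, max_value]."""
    return max(1, max_value.bit_length())


def pack(values, bits):
    """Pack non-negative ints into one big integer, `bits` bits per value."""
    packed = 0
    for i, v in enumerate(values):
        if v >> bits:  # non-zero means v is negative or too wide for the slot
            raise ValueError(f"{v} does not fit in {bits} bits")
        packed |= v << (i * bits)
    return packed


def unpack(packed, bits, count):
    """Recover `count` values; each read is a shift followed by a mask."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]


# Hypothetical ordinals for one document, drawn from the 1 .. 2.2M range.
ords = [1, 2_199_999, 1_048_576, 42]
bits = bits_required(2_200_000)
assert unpack(pack(ords, bits), bits, len(ords)) == ords
```

With values spread over the whole range, the fixed width cannot drop below what the maximum ordinal needs, which is why nothing "special" about the numbers leaves little for this kind of packing to exploit.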

If the following math holds -- 25 ords per document at 4 bytes each (that's 100 bytes/doc) x 6.6M
documents -- that's going to be ~660MB (offsets not included). I suspect that packed ints will consume
approximately the same amount of RAM (at least, per past results) but won't yield significantly better
performance. Therefore, if we want to cache anything at the int level, we should do an int[]
caching aggregator.
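
The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch only: the helper name is made up, MB means 10^6 bytes, and per-document offsets are left out, as in the estimate above. The last line is an idealized fixed-width figure that assumes every ordinal is stored in exactly as many bits as the largest ordinal (2.2M) needs. It ignores any overhead of a real packed format, so it neither confirms nor refutes the measured footprints mentioned above.

```python
def footprint_mb(docs: int, ords_per_doc: int, bits_per_ord: int) -> float:
    """Estimated payload size in MB (10**6 bytes); per-document offsets excluded."""
    total_bits = docs * ords_per_doc * bits_per_ord
    return total_bits / 8 / 1_000_000


DOCS = 6_600_000       # documents in the index (figure from the comment)
ORDS_PER_DOC = 25      # ordinals per document (figure from the comment)
MAX_ORD = 2_200_000    # upper end of the ordinal range (figure from the comment)

# Plain int[]: 32 bits per ordinal.
int_array_mb = footprint_mb(DOCS, ORDS_PER_DOC, 32)

# Idealized fixed-width packing: as many bits as MAX_ORD needs (an assumption).
packed_mb = footprint_mb(DOCS, ORDS_PER_DOC, MAX_ORD.bit_length())

print(f"int[]: {int_array_mb:.2f} MB, idealized packed: {packed_mb:.2f} MB")
```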

Mike, correct me if I'm wrong.
                
> Faster but more RAM/Disk consuming DocValuesFormat for facets
> -------------------------------------------------------------
>
>                 Key: LUCENE-4764
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4764
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.2, 5.0
>
>         Attachments: LUCENE-4764.patch
>
>
> The new default DV format for binary fields has much more
> RAM-efficient encoding of the address for each document ... but it's
> also a bit slower at decode time, which affects facets because we
> decode for every collected docID.
