lucene-dev mailing list archives

From eks dev <>
Subject Re: [jira] Commented: (LUCENE-1410) PFOR implementation
Date Tue, 06 Oct 2009 21:59:12 GMT
The example I gave was extreme, but realistic. Imagine 100 million docs sorted on the
field user_rights, where a term user_rights:XX selects 40 million of them (user rights...).
To encode this, you need a format with just two integers (for more such intervals you would
need slightly more, but still far less than for OpenBitSet, VInts, PFor...). Strictly
speaking this term is dense, but it is highly compressible and could be inlined with the pulsing trick...
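
A minimal sketch of the interval idea, assuming the postings of a term arrive as a sorted array of distinct doc ids. DocIdIntervals, encode and contains are hypothetical names for illustration only, not Lucene classes or methods. Runs of consecutive doc ids collapse into [start, endExclusive) pairs, so a single contiguous run of 40 million docs is stored as two integers:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): interval view of a sorted doc id list. */
final class DocIdIntervals {

    /**
     * Collapses sorted, distinct doc ids into [start, endExclusive) pairs.
     * Assumes every doc id is below Integer.MAX_VALUE so that start + 1 cannot overflow.
     */
    static List<int[]> encode(int[] sortedDocIds) {
        List<int[]> intervals = new ArrayList<>();
        int i = 0;
        while (i < sortedDocIds.length) {
            int start = sortedDocIds[i];
            int end = start + 1;
            i++;
            // Extend the current run while the next doc id is adjacent.
            while (i < sortedDocIds.length && sortedDocIds[i] == end) {
                end++;
                i++;
            }
            intervals.add(new int[] {start, end});
        }
        return intervals;
    }

    /** Membership test by binary search over the (disjoint, sorted) intervals. */
    static boolean contains(List<int[]> intervals, int docId) {
        int lo = 0;
        int hi = intervals.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int[] iv = intervals.get(mid);
            if (docId < iv[0]) {
                hi = mid - 1;
            } else if (docId >= iv[1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```

The cost grows with the number of runs rather than the number of matching docs, which is why this only pays off when the index is sorted on the field in question.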

cheers, eks  

>From: Paul Elschot <>
>Sent: Tuesday, 6 October, 2009 23:33:03
>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>>>     [
>>> Eks Dev commented on LUCENE-1410:
>>> ---------------------------------
>>> Mike, 
>>> That is definitely the way to go: distribution-dependent encoding, where every
>>> Term gets individual treatment.
>>> Take for example the simple, but not all that rare, case where the Index gets sorted
>>> on some of the indexed fields (we use it really extensively, e.g. a presorted doc collection
>>> on user_rights/zip/city, all indexed). There you get perfectly "compressible" postings by
>>> simply managing intervals of set bits. Updates distort this picture, but we rebuild the index
>>> periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets.
>>> If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such
>>> super dense fields was killing us, even in RAMDirectory) ...
>You could try switching the Filter to OpenBitSet when that takes fewer bytes than SortedVIntList.
>>Paul Elschot
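
A rough sketch of the size comparison suggested above, assuming a bit set costs one bit per document (rounded up to whole 64-bit words) and a VInt list stores the delta between consecutive doc ids in 7-bit groups. FilterSizeEstimate and its methods are hypothetical names; the code only estimates byte counts and does not call OpenBitSet or SortedVIntList, whose exact layouts may differ:

```java
/** Hypothetical size estimator (not part of Lucene) for choosing a Filter representation. */
final class FilterSizeEstimate {

    /** Bytes for a bit set covering maxDoc documents, rounded up to whole 64-bit words. */
    static long bitSetBytes(int maxDoc) {
        return 8L * ((maxDoc + 63L) / 64L);
    }

    /** Bytes needed to write a non-negative int as a VInt (7 payload bits per byte). */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes for a delta-encoded VInt list of sorted, distinct doc ids. */
    static long vIntListBytes(int[] sortedDocIds) {
        long total = 0;
        int previous = 0;
        for (int docId : sortedDocIds) {
            total += vIntBytes(docId - previous);
            previous = docId;
        }
        return total;
    }

    /** True when the bit set estimate is strictly smaller than the VInt list estimate. */
    static boolean preferBitSet(int[] sortedDocIds, int maxDoc) {
        return bitSetBytes(maxDoc) < vIntListBytes(sortedDocIds);
    }
}
```

Under these assumptions, a term matching 40 million consecutive docs out of 100 million needs about 12.5 MB as a bit set and about 40 MB as one-byte deltas, so the bit set wins, though both are far larger than a single interval.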
