lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: [jira] Commented: (LUCENE-1410) PFOR implementation
Date Tue, 06 Oct 2009 22:17:24 GMT
On Tuesday 06 October 2009 23:59:12 eks dev wrote:
> Paul,
> the point I was trying to make with this example was extreme, but realistic. Imagine
> 100 million docs, sorted on field user_rights; a term user_rights:XX selects 40 million
> of them (user rights...). To encode this, you need a format with two integers (for more
> such intervals you would need slightly more, but nevertheless much less than for
> OpenBitSet, VInts, PFor...). Strictly speaking this term is dense, but highly
> compressible, and could be inlined with the pulsing trick...

Well, I've been considering adding compressed consecutive ranges to SortedVIntList, but I
did not get further than considering it. This sounds like the perfect use case for that.
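
A minimal sketch of the interval idea, assuming strictly increasing doc IDs held in a
plain int array. IntervalDocIdSketch and its methods are hypothetical names used only
for illustration; this is not SortedVIntList or any other Lucene class, and the interval
bounds themselves are not compressed further:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Lucene class: sorted doc IDs stored as [start, end] runs. */
final class IntervalDocIdSketch {
    // Flattened pairs: run starts at even positions, inclusive run ends at odd positions.
    private final int[] bounds;

    private IntervalDocIdSketch(int[] bounds) {
        this.bounds = bounds;
    }

    /** Builds the runs from strictly increasing doc IDs (an assumption of this sketch). */
    static IntervalDocIdSketch fromSorted(int[] sortedDocIds) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < sortedDocIds.length; i++) {
            int start = sortedDocIds[i];
            // Extend the current run while the next doc ID is consecutive.
            while (i + 1 < sortedDocIds.length && sortedDocIds[i + 1] == sortedDocIds[i] + 1) {
                i++;
            }
            out.add(start);
            out.add(sortedDocIds[i]);
        }
        int[] flat = new int[out.size()];
        for (int k = 0; k < flat.length; k++) {
            flat[k] = out.get(k);
        }
        return new IntervalDocIdSketch(flat);
    }

    /** Number of [start, end] runs; each run costs two integers. */
    int intervalCount() {
        return bounds.length / 2;
    }

    /** Binary search over the runs. */
    boolean contains(int docId) {
        int lo = 0;
        int hi = intervalCount() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docId < bounds[2 * mid]) {
                hi = mid - 1;
            } else if (docId > bounds[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```

With a presorted index, a term matching one contiguous block of docs collapses to a
single [start, end] pair, which is the two-integer case described in the quoted message.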

Regards,
Paul Elschot


> 
> cheers, eks  
> 
> 
> 
> 
> >
> >From: Paul Elschot <paul.elschot@xs4all.nl>
> >To: java-dev@lucene.apache.org
> >Sent: Tuesday, 6 October, 2009 23:33:03
> >Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
> >
> >Eks,
> >
> >
> >> 
> >>>     [ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742 ]
> >>> 
> >>> Eks Dev commented on LUCENE-1410:
> >>> ---------------------------------
> >>> 
> >>> Mike, 
> >>> That is definitely the way to go, distribution dependent encoding, where
> >>> every Term gets individual treatment.
> >>> 
> >>> Take for an example simple, but not all that rare case where Index gets
> >>> sorted on some of the indexed fields (we use it really extensively, e.g. presorted doc
> >>> collection on user_rights/zip/city, all indexed). There you get perfectly "compressible"
> >>> postings by simply managing intervals of set bits. Updates distort this picture, but we
> >>> rebuild index periodically and all gets good again. At the moment we load them into RAM
> >>> as Filters in IntervalSets. if that would be possible in lucene, we wouldn't bother with
> >>> Filters (VInt decoding on such super dense fields was killing us, even in RAMDirectory) ...
> >
> >
> >You could try switching the Filter to OpenBitSet when that takes fewer bytes than
> >SortedVIntList.
> >
> >
> >Regards,
> >Paul Elschot
> >
> >
> >
> 
> 
>       
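
The suggestion quoted above, switching the Filter to OpenBitSet when that takes fewer
bytes than SortedVIntList, amounts to a size comparison. A rough sketch, assuming one bit
per document rounded up to 64-bit words for the bit set, and 7 data bits per byte for each
VInt-encoded gap between doc IDs. FilterSizeSketch and its methods are hypothetical and do
not call into Lucene; object and array overhead is ignored:

```java
/** Hypothetical size estimate for choosing a filter representation; not Lucene code. */
final class FilterSizeSketch {

    /** Bytes for a bit set over maxDoc documents, stored in 64-bit words. */
    static long bitSetBytes(int maxDoc) {
        return (((long) maxDoc + 63) / 64) * 8;
    }

    /** Bytes for one non-negative value in a variable-length encoding with 7 data bits per byte. */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes for strictly increasing doc IDs stored as VInt-encoded gaps. */
    static long vIntListBytes(int[] sortedDocIds) {
        long total = 0;
        int previous = 0;
        for (int docId : sortedDocIds) {
            total += vIntBytes(docId - previous);
            previous = docId;
        }
        return total;
    }

    /** True when the bit set estimate is strictly smaller than the VInt list estimate. */
    static boolean bitSetIsSmaller(int[] sortedDocIds, int maxDoc) {
        return bitSetBytes(maxDoc) < vIntListBytes(sortedDocIds);
    }
}
```

Under these assumptions a VInt list never drops below one byte per set doc, so a term
that matches most documents favors the bit set, while a term matching only a handful of
documents in a large index favors the VInt list.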

