lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: [jira] Commented: (LUCENE-1410) PFOR implementation
Date Tue, 06 Oct 2009 22:13:45 GMT
If you drove this example further, in combination with flex indexing permitting a per-term
postings format, I could imagine some nice tools for optimizeHard(): normal index construction
works with defaults, as planned for the solid mixed-performance case, and at the end you run optimizeHard(),
where postings get re-sorted on such fields (basically enabling RLE encoding to work) and, at
the same time, all other terms get the optimal encoding format for their postings... perfect for
read-only indexes where you want to max out performance and reduce index size.
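
For anyone skimming the thread, here is a rough sketch of the interval idea discussed below. It is plain Python, not Lucene code, and the names to_intervals and contains are made up for illustration. It assumes doc ids arrive strictly increasing, as they do in a postings list. On an index sorted by user_rights, the 40 million matching docs would collapse to a single (start, end) pair.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def to_intervals(doc_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse strictly increasing doc ids into inclusive (start, end) runs."""
    intervals: List[Tuple[int, int]] = []
    start = prev = None
    for doc in doc_ids:
        if start is None:
            start = prev = doc          # first doc opens the first run
        elif doc == prev + 1:
            prev = doc                  # doc extends the current run
        else:
            intervals.append((start, prev))
            start = prev = doc          # gap found: open a new run
    if start is not None:
        intervals.append((start, prev))
    return intervals


def contains(intervals: List[Tuple[int, int]], doc: int) -> bool:
    """Binary-search the run list for the last run starting at or before doc."""
    i = bisect_right(intervals, (doc, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= doc <= intervals[i][1]
```

A contiguous range such as range(1000, 5000) comes back as one pair, so the cost depends on the number of runs rather than on the number of matching docs.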


>
>From: eks dev <eksdev@yahoo.co.uk>
>To: java-dev@lucene.apache.org
>Sent: Tuesday, 6 October, 2009 23:59:12
>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>
>
>Paul,
>the point I was trying to make with this example was extreme, but realistic. Imagine
>100 million docs, sorted on the field user_rights, where a term user_rights:XX selects 40 million of them (user
>rights...). To encode this, you need a format with two integers (for more such intervals
>you would need slightly more, but nevertheless much less than for OpenBitSet, VInts, PFor...).
>Strictly speaking this term is dense, but highly compressible, and could be inlined with
>the pulsing trick...
>
>cheers, eks  
>
>
>
>
>>
>>From: Paul Elschot <paul.elschot@xs4all.nl>
>>To: java-dev@lucene.apache.org
>>Sent: Tuesday, 6 October, 2009 23:33:03
>>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>>
>>Eks,
>>
>>
>>> 
>>>>>     [ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742 ]
>>>>> 
>>>>> Eks Dev commented on LUCENE-1410:
>>>>> ---------------------------------
>>>>> 
>>>>> Mike, 
>>>>> That is definitely the way to go: distribution-dependent encoding, where
>>>>> every Term gets individual treatment.
>>>>> 
>>>>> Take, for example, the simple but not all that rare case where the index gets
>>>>> sorted on some of the indexed fields (we use it really extensively, e.g. a presorted doc collection
>>>>> on user_rights/zip/city, all indexed). There you get perfectly "compressible" postings by
>>>>> simply managing intervals of set bits. Updates distort this picture, but we rebuild the index
>>>>> periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets.
>>>>> If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such
>>>>> super dense fields was killing us, even in RAMDirectory) ...
>>
>>
>>You could try switching the Filter to OpenBitSet when that takes fewer bytes than
>>SortedVIntList.
>>
>>
>>Regards,
>>Paul Elschot
>>
>>
>>
>
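
Paul's suggestion above, switching to OpenBitSet whenever it takes fewer bytes than SortedVIntList, can be sketched as a simple size comparison. This is plain Python rather than Lucene code, and every function name here is hypothetical. It assumes the bit set uses one bit per doc in 64-bit words up to maxDoc, and that the VInt list stores deltas between successive doc ids at 7 payload bits per byte. Object overhead is ignored.

```python
from typing import Iterable


def vint_len(value: int) -> int:
    """Bytes needed for a non-negative int at 7 payload bits per byte."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def sorted_vint_bytes(doc_ids: Iterable[int]) -> int:
    """Estimated size of a delta-encoded VInt list of increasing doc ids."""
    total = 0
    prev = 0
    for doc in doc_ids:
        total += vint_len(doc - prev)   # store the gap, not the absolute id
        prev = doc
    return total


def bitset_bytes(max_doc: int) -> int:
    """Estimated size of a bit set covering max_doc docs in 64-bit words."""
    return ((max_doc + 63) // 64) * 8


def pick_filter(doc_ids: Iterable[int], max_doc: int) -> str:
    """Return which representation the estimate says is smaller."""
    if bitset_bytes(max_doc) < sorted_vint_bytes(doc_ids):
        return "bitset"
    return "vint_list"
```

Under these assumptions a dense term costs about one byte per doc as VInt deltas but only one bit per doc in the bit set, which is why the dense user_rights case favours the bit set, and why intervals beat both.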


      