Thanks for the comments…
My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet
Yes I was also thinking about this. But we are on 4.x and did not take the
plunge. But as you said, it should be a good idea to test on SFBS
I'm curious if you already performed any kind of benchmarking of this
> approach?
Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most
of our data as RAM resident and then CPU hot-spots came up...
There were few take-aways from the test. I am listing down few of them..
It's kind of lengthy. Please read through...
a) Postings-List issue, as discussed above…
b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
Our segments are already sorted before participating in a merge. On mostly
linear merge, we ended up decoding the same chunk again and again. Simply
storing the last chunk resulted in good speed-ups for us...
c) Once above steps were corrected, the CPU hotspot shifted to
BlockDocsEnum. Here most of our postings-list < 128 docs. So
Lucene41Postings started using vInts… I did try ForUtil encoding even for
< 128 docs. It definitely went easy on CPU. But failed to measure resulting
file-size increase.
I realised not just SMP but any other merge must face the same issue and
left it at that..
--
Ravi
On Mon, Apr 20, 2015 at 12:56 PM, Adrien Grand <jpountz@gmail.com> wrote:
> I like these ideas, the int[] we are using today are wasteful. My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet (added in 5.0, given your code snippets I
> assume you are still on 4.x)?
>
> I'm curious if you already performed any kind of benchmarking of this
> approach?
>
>
> On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan
> <ravikumar.govindarajan@gmail.com> wrote:
> > We were experimenting with SortingMergePolicy and came across an
> alternate
> > solution to TimSort of postings-list using FBS & GrowableWriter.
> >
> > I have attached relevant code-snippet. It would be nice if someone can
> > clarify whether it is a good idea to implement...
> >
> > public class SortingAtomicReader {
> > …
> > …
> > class SortingDocsEnum {
> >
> > //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
> > //constructor. It is assumed that these 2 variables are init during
> > //merge start & they are then re-used till merge completes...
> >
> >
> > public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
> > final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
> > olddocToFreq) throws IOException {
> >
> > ….
> >
> > …
> >
> > while (true) {
> >
> > //Instead of Tim-Sorting as in existing code
> >
> > doc = in.nextDoc();
> >
> > int newdoc = docMap.oldToNew(doc);
> >
> > newdoclist.set(newdoc);
> >
> > if(withFreqs) {
> >
> > olddocToFreq.set(doc, in.freq());
> >
> > }
> >
> > }
> >
> >
> > @Override
> >
> > public int nextDoc() throws IOException {
> >
> > if (++docIt >= upto) {
> >
> > return NO_MORE_DOCS;
> >
> > }
> >
> > currDoc = newdoclist.nextSetBit(++currDoc);
> >
> > if(currDoc == -1) {
> >
> > return NO_MORE_DOCS;
> >
> > }
> >
> > //clear the set-bit here before returning...
> >
> > newdoclist.clear(currDoc);
> >
> > return currDoc;
> >
> > }
> >
> >
> > @Override
> >
> > public int freq() throws IOException {
> >
> > if(withFreqs && docIt < upto) {
> >
> > return (int)olddocToFreq.getMutable()
> >
> > .get(docMap.newToOld(currDoc));
> >
> > }
> >
> > return 1;
> >
> > }
> >
> > }
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
|