lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
Date Thu, 18 Mar 2010 18:38:27 GMT


Michael McCandless commented on LUCENE-2329:

bq. Actually, when I talked about the TermVectors I meant we should explore storing the termIDs
on disk, rather than the strings. It would help things like similarity search and facet counting.

Ahhhh that would be great!

bq. Actually we wouldn't need a second hashtable for the secondary TermsHash anymore, right?
Like the primary TermsHash, it would just have a parallel array holding the things that the
TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)?
And the index into that array would be the termID, of course.

Hmm, the challenge is that the tracking done for term vectors is just within a single doc.
I.e., the hash used for term vectors only holds the terms for that one doc (so it's much smaller),
vs the primary hash that holds terms for all docs in the current RAM buffer.  So we'd be burning
up much more RAM if we also key into the term vector's parallel arrays using the primary termID.

But I do think we should cutover to parallel arrays for TVTW, too.
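
To make the RAM concern concrete, here is a minimal sketch (not Lucene code) of per-document
term vector tracking with parallel arrays. The class name, field names, and the use of a boxed
HashMap for the small per-doc hash are assumptions for illustration only. The point is that the
arrays are indexed by a dense per-doc slot, so they grow with the unique terms of one doc and
not with the primary termID space of the whole RAM buffer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-doc postings state kept in parallel arrays. */
final class PerDocTermVectorPostings {
    // Small per-doc hash: primary termID -> dense per-doc slot.
    // (A boxed HashMap is a simplification; real code would avoid the boxing.)
    private final Map<Integer, Integer> slotByTermID = new HashMap<>();

    // Parallel arrays, all indexed by the per-doc slot.
    int[] termIDs = new int[4];
    int[] freq = new int[4];
    int[] lastOffset = new int[4];
    int[] lastPosition = new int[4];
    private int numSlots;

    /** Records one occurrence of the term with the given primary termID. */
    void add(int primaryTermID, int position, int endOffset) {
        Integer slot = slotByTermID.get(primaryTermID);
        if (slot == null) {
            if (numSlots == freq.length) {
                grow();
            }
            slot = numSlots++;
            slotByTermID.put(primaryTermID, slot);
            termIDs[slot] = primaryTermID;
            freq[slot] = 0; // arrays are reused across docs, so clear stale state
        }
        freq[slot]++;
        lastPosition[slot] = position;
        lastOffset[slot] = endOffset;
    }

    int uniqueTerms() {
        return numSlots;
    }

    /** Called after each doc: keeps the arrays, forgets the contents. */
    void reset() {
        slotByTermID.clear();
        numSlots = 0;
    }

    private void grow() {
        int n = freq.length * 2;
        termIDs = Arrays.copyOf(termIDs, n);
        freq = Arrays.copyOf(freq, n);
        lastOffset = Arrays.copyOf(lastOffset, n);
        lastPosition = Arrays.copyOf(lastPosition, n);
    }
}
```

With this layout a doc that contains primary termID 1000000 still only needs a handful of array
slots. Indexing the same arrays directly by the primary termID would drop the second hash but
would require arrays as long as the number of terms in the RAM buffer.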

bq. How does the read performance of packed ints compare to "normal" int[] arrays? I think
nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice
a bit more RAM for higher search performance?

It's definitely slower to read/write to/from packed ints, and I agree, indexing and searching
speed trumps RAM efficiency.
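
For readers unfamiliar with the trade-off, the following is a rough sketch of a fixed-width
packed int array backed by a long[]. It is not Lucene's packed ints implementation; the class
name and layout are assumptions. It only illustrates why each access costs more than a plain
int[] read: every get/set needs a multiply, shifts, a mask, and sometimes a second memory access
when a value straddles two longs.

```java
/** Hypothetical sketch: values of bitsPerValue bits packed back to back in a long[]. */
final class PackedIntArray {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedIntArray(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {
            // The value straddles two longs: fetch the high bits from the next block.
            value |= blocks[block + 1] << bitsInFirst;
        }
        return value & mask;
    }

    void set(int index, long value) {
        if ((value & ~mask) != 0) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {
            // Write the remaining high bits into the low end of the next block.
            blocks[block + 1] =
                (blocks[block + 1] & ~(mask >>> bitsInFirst)) | (value >>> bitsInFirst);
        }
    }

    /** Bytes used by the backing array (ignoring object headers). */
    long ramBytes() {
        return 8L * blocks.length;
    }
}
```

As an example of the RAM side of the trade-off, 100 values at 13 bits each fit in 21 longs
(168 bytes), where an int[100] would use 400 bytes of payload.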

bq. Oh man, will we need flexible indexing for the in-memory index too?

EG custom attrs appearing in the TokenStream?  Yes we will need to... but hopefully once we
get serialization working cleanly for the attrs this'll be easy?  With ByteSliceWriter/Reader
you just .writeBytes and .readBytes...

I don't think we should allow Codecs to be used in the RAM buffer anytime soon though... ;)

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>                 Key: LUCENE-2329
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in TermsHashPerField
> we want to switch to parallel arrays.  The termsHash will simply be an int[] which maps each
> term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in parallel
> arrays, where the termID is the index into the arrays.  This will avoid the need for object
> pooling and will remove the overhead of object initialization and garbage collection.  Especially
> garbage collection should benefit significantly when the JVM runs out of memory, because in
> such a situation the gc mark times can get very long if there is a big number of long-living
> objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid the need
> of having to store the term string per document in the TermVector.  Instead we could just
> store the segment-wide termIDs.  This would reduce the size and also make it easier to implement
> efficient algorithms that use TermVectors, because no term mapping across documents in a segment
> would be necessary.  Though we can make this improvement in a separate JIRA issue.
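
The layout described in the issue can be sketched roughly as follows. This is an illustrative
toy, not the actual TermsHashPerField code: the class name, the String[] used to hold term text
(a real implementation would more likely keep term bytes in a shared pool), and the linear
probing scheme are all assumptions. It shows an int[] hash whose slots hold dense termIDs, with
per-term data in parallel arrays indexed by termID, so no per-term object is allocated.

```java
import java.util.Arrays;

/** Hypothetical sketch: int[] hash of dense termIDs plus parallel arrays of postings state. */
final class ParallelPostings {
    private int[] hash = new int[16];       // slot -> termID, or -1 when empty
    private String[] terms = new String[8]; // termID -> term text (simplification)
    int[] freq = new int[8];                // termID -> occurrence count
    int[] lastPosition = new int[8];        // termID -> last position seen
    private int numTerms;

    ParallelPostings() {
        Arrays.fill(hash, -1);
    }

    int size() {
        return numTerms;
    }

    /** Returns the termID for the term, assigning the next dense ID if it is new. */
    int termID(String term) {
        int mask = hash.length - 1;
        int slot = term.hashCode() & mask;
        while (hash[slot] != -1) {
            if (terms[hash[slot]].equals(term)) {
                return hash[slot];
            }
            slot = (slot + 1) & mask; // linear probing
        }
        if (numTerms == terms.length) {
            growArrays();
        }
        int id = numTerms++;
        terms[id] = term;
        hash[slot] = id;
        if (numTerms * 2 > hash.length) {
            rehash(); // keep the load factor at or below 0.5
        }
        return id;
    }

    void addOccurrence(String term, int position) {
        int id = termID(term);
        freq[id]++;
        lastPosition[id] = position;
    }

    private void growArrays() {
        int n = terms.length * 2;
        terms = Arrays.copyOf(terms, n);
        freq = Arrays.copyOf(freq, n);
        lastPosition = Arrays.copyOf(lastPosition, n);
    }

    private void rehash() {
        int[] bigger = new int[hash.length * 2];
        Arrays.fill(bigger, -1);
        int mask = bigger.length - 1;
        for (int id = 0; id < numTerms; id++) {
            int slot = terms[id].hashCode() & mask;
            while (bigger[slot] != -1) {
                slot = (slot + 1) & mask;
            }
            bigger[slot] = id;
        }
        hash = bigger;
    }
}
```

Growing the structure copies a few primitive arrays instead of allocating one object per term,
which is where the reduced GC marking work mentioned in the issue would come from.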
