lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations
Date Wed, 29 Sep 2010 09:43:34 GMT


Michael McCandless commented on LUCENE-2575:

bq. Correct. The example of where everything could go wrong is the
rewriting of a byte slice forwarding address while a reader is
traversing the same slice.

Ahh right that's a real issue.
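That race can be sketched deterministically (a hedged, single-threaded illustration only; the class, sizes, and address value below are all invented, not Lucene's actual layout): once the writer rewrites the tail of a slice into a forwarding address, a reader still holding an old position inside that slice decodes address bytes as if they were postings data.

```java
// Deterministic illustration of the forwarding-address rewrite hazard.
// All names and sizes are invented for this sketch.
public class ForwardingRewrite {
    public static byte[] slice = new byte[8];

    // Writer fills the slice with data bytes 1..8.
    public static void writeData() {
        for (int i = 0; i < 8; i++) {
            slice[i] = (byte) (i + 1);
        }
    }

    // Simulates the allocSlice-style rewrite: the last 4 bytes are
    // overwritten in place with a forwarding address (here, 1000).
    public static void rewriteForwardingAddress() {
        int addr = 1000;
        slice[4] = (byte) (addr >>> 24);
        slice[5] = (byte) (addr >>> 16);
        slice[6] = (byte) (addr >>> 8);
        slice[7] = (byte) addr;
    }

    // A reader that captured a position before the rewrite sees
    // whatever is there now -- data before, address bytes after.
    public static byte readAt(int pos) {
        return slice[pos];
    }
}
```

A reader that recorded position 4 before the rewrite expected the data byte 5 there; afterwards it silently reads part of the forwarding address instead.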

bq. It's not like 3.x's situation with FieldCache or terms dict index, for example....

What's the GC issue with FieldCache and terms dict?

In 3.x, the string index FieldCache and the terms index generate tons
of garbage, ie allocate zillions of tiny objects.  (This is fixed in ...)

My only point was that having 32 KB arrays as garbage is much less GC
load than having the same net KB across zillions of tiny objects...

bq. the term-freq parallel array, however if getReader is never
called, it's a single additional array that's essentially
innocuous, if useful.

Hmm, the full copy of the tf parallel array is going to put a highish
cost on reopen?  So some sort of transactional (incremental
copy-on-write) data structure is needed (eg PagedInts)...
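The incremental copy-on-write idea could look roughly like this (a hedged sketch of the PagedInts notion, not its actual implementation; the class name, the tiny page size, and the method names are all invented): a snapshot shares every page with the writer, and the writer clones only the single page it touches, so reopen cost is proportional to pages modified rather than to total terms.

```java
import java.util.Arrays;

// Hypothetical copy-on-write paged int array; names and sizes invented.
public class CowPagedInts {
    private static final int PAGE_SIZE = 4; // tiny page, for illustration only
    private int[][] pages;

    public CowPagedInts(int numPages) {
        pages = new int[numPages][];
        for (int i = 0; i < numPages; i++) {
            pages[i] = new int[PAGE_SIZE];
        }
    }

    private CowPagedInts(int[][] pages) {
        this.pages = pages;
    }

    // A snapshot shares all pages with the writer:
    // cost is O(numPages), not O(numValues).
    public CowPagedInts snapshot() {
        return new CowPagedInts(pages.clone());
    }

    // The writer clones only the page it modifies, so earlier
    // snapshots keep seeing the old values unchanged.
    public void set(int index, int value) {
        int page = index / PAGE_SIZE;
        pages[page] = Arrays.copyOf(pages[page], PAGE_SIZE);
        pages[page][index % PAGE_SIZE] = value;
    }

    public int get(int index) {
        return pages[index / PAGE_SIZE][index % PAGE_SIZE];
    }
}
```

After `snapshot()`, a `set` on the live copy leaves the snapshot's view intact, which is exactly the property a reopened reader needs.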

We don't store tf now, do we?  Adding 4 bytes per unique term isn't ...

bq. OK, I think there's a solution to copying the actual byte[],
we'd need to alter the behavior of BBPs. It would require always
allocating 3 empty bytes at the end of a slice for the
forwarding address, ...

Good idea -- this'd make the byte[] truly write-once.

This would really decrease RAM efficiency for low-doc-freq (eg 1) terms,
though, because today they make use of those 3 bytes.  We'd need to
increase the level 0 slice size...
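The write-once scheme could be sketched like so (a hedged illustration under invented names and sizes, not Lucene's BBP code): each slice permanently reserves its tail bytes, which stay zero until the slice fills, at which point only those reserved bytes are written with the forwarding address -- data bytes are never rewritten. The tiny 8-byte slice with a 4-byte reserved tail also makes the RAM-efficiency concern visible: half of each level-0 slice is overhead, which is why the level-0 slice size would need to grow.

```java
// Hypothetical write-once slice allocator; all names/sizes invented.
public class WriteOnceSlices {
    public static final int SLICE_SIZE = 8; // illustrative level-0 slice
    public static final int RESERVED = 4;   // tail kept empty for the address
    public static byte[] pool = new byte[64];
    public static int sliceStart = 0;
    public static int upto = 0;             // next writable position

    public static void writeByte(byte b) {
        if (upto == sliceStart + SLICE_SIZE - RESERVED) {
            // Data portion is full: write the forwarding address into the
            // reserved (still-zero) tail, then continue in a fresh slice.
            // No data byte is ever overwritten.
            int next = sliceStart + SLICE_SIZE;
            int tail = sliceStart + SLICE_SIZE - RESERVED;
            pool[tail]     = (byte) (next >>> 24);
            pool[tail + 1] = (byte) (next >>> 16);
            pool[tail + 2] = (byte) (next >>> 8);
            pool[tail + 3] = (byte) next;
            sliceStart = next;
            upto = next;
        }
        pool[upto++] = b;
    }
}
```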

bq. The reason this would work is, past readers that are
iterating their term docs concurrently with the change to the
posting-upto array, will stop at the maxdoc anyways. This'll
be fun to implement.

Hmm... but the reader needs to read 'beyond' the end of a given slice,
still?  Ie say global maxDoc is 42, and a given posting just read doc
27 (which in fact is its last doc).  It would then try to read the
next doc?

Oh, except, the next byte would be a 0 (because we always clear the
byte[]), which [I think] is never a valid byte value in the postings
stream, except as a first byte, which we would not hit here (since we
know we always have at least a first byte).  So maybe we can get by
w/o a full copy of postingUpto?
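The zero-byte argument can be made concrete (a hedged sketch, not Lucene's actual postings format; the class and method names are invented): doc deltas are >= 1, and the vInt encoding of any value >= 1 has a nonzero first byte, so a reader that hits a 0 in the cleared byte[] where the next delta's first byte would be knows no further doc has been written.

```java
// Sketch of the zero-byte-sentinel termination idea; names invented.
public class ZeroSentinelReader {
    // Standard vInt encoding: 7 data bits per byte, high bit = continuation.
    public static int writeVInt(byte[] buf, int pos, int value) {
        while ((value & ~0x7F) != 0) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[pos++] = (byte) value;
        return pos;
    }

    // Decodes doc deltas until the next byte is 0 (the cleared byte[]).
    // Since every delta is >= 1, its first vInt byte is never 0, so 0
    // unambiguously means "end of written postings".
    public static int lastDoc(byte[] buf) {
        int pos = 0;
        int doc = -1; // first delta is docID + 1, keeping deltas >= 1
        while (buf[pos] != 0) {
            int shift = 0, delta = 0;
            byte b;
            do {
                b = buf[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += delta;
        }
        return doc;
    }
}
```

The reader never needs to know the writer's exact `postingUpto`: the cleared-to-zero byte[] itself is the fence.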

> Concurrent byte and int block implementations
> ---------------------------------------------
>                 Key: LUCENE-2575
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?
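The seekable, random-access view the description asks about could look roughly like this (a hedged sketch; the class name and the illustrative 8-byte block size are invented -- real pools use much larger blocks): a long position is split into (block, offset) with shift/mask, so the underlying byte[] block management stays hidden behind a file-like seek/read API.

```java
// Hypothetical seekable reader over a pool of byte[] blocks.
public class BlockPoolReader {
    public static final int BLOCK_SHIFT = 3;           // 8-byte blocks, illustrative
    public static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    public static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final byte[][] blocks;
    private long pos;

    public BlockPoolReader(byte[][] blocks) {
        this.blocks = blocks;
    }

    // Position in the logical "file"; block boundaries are invisible here.
    public void seek(long position) {
        pos = position;
    }

    // Maps the global position to (block index, offset within block).
    public byte readByte() {
        byte b = blocks[(int) (pos >>> BLOCK_SHIFT)][(int) (pos & BLOCK_MASK)];
        pos++;
        return b;
    }
}
```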

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

