lucene-dev mailing list archives

From "Ben van Klinken" <>
Subject VInt's as prefix. Was: bytecount as prefix
Date Thu, 11 May 2006 10:24:10 GMT

I'm the author of CLucene (a C++ port of Lucene). I've been following
the 'using byte count as prefix' discussion, and I think it ties into
something we are trying to achieve.

We are trying to optimise the way index writing works, and we also
want to be able to index and store fields whose values come from a
Reader.

The second part has, in theory, a very easy solution: we can use a
stream filter to buffer the reads that the analyser makes, and
integrate the FieldsWriter into the invertDocument function so that
the buffers are written while the analysers run. Since there is no
way of knowing the length of the reader in advance, we would then have
to go back and write the field length. Here is where the problem is,
though: this is not currently possible because we are using a VInt for
the field data length, and the number of bytes a VInt occupies is not
known until the value itself is known.
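
To make the problem concrete, here is a minimal sketch of how the
width of a VInt depends on the value being written. It assumes the
usual Lucene VInt layout (seven data bits per byte, low-order group
first, high bit set on every byte except the last); the class and
method names are hypothetical and are not part of Lucene or CLucene.

```java
import java.io.ByteArrayOutputStream;

class VIntWidth {
    // Encode a non-negative int as a VInt: seven data bits per byte,
    // low-order group first, high bit set on all bytes but the last.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Field lengths on either side of each width boundary.
        int[] lengths = {0, 127, 128, 16383, 16384};
        for (int len : lengths) {
            System.out.println(len + " -> " + encodeVInt(len).length + " byte(s)");
        }
    }
}
```

A length of 127 takes one byte and a length of 128 takes two, so a
writer cannot reserve the right amount of space for the length before
the field data has been read.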

If we could use fixed-length integers for the field data length, two
things would become much easier:

1) memory optimisations like the compressed field can benefit from
this: we don't have to store the entire compressed output in memory,
but can rather write it directly to the fields output stream.
2) it makes it possible to store AND index a field using a reader in a
single pass, thus removing the need to read twice (which might not
always be possible for some reader implementations).
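
A rough sketch of the single-pass idea in point 2, using
java.io.RandomAccessFile as a stand-in for a seekable index output.
Everything here is an assumption made for illustration: the class and
method names are hypothetical, and the chars are written as two bytes
each rather than in Lucene's own string encoding.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BackPatchedField {
    // Stream the reader to the output once, then go back and fill in
    // the length. This works because writeInt always takes 4 bytes.
    static int writeField(RandomAccessFile out, Reader reader) throws IOException {
        long lengthPos = out.getFilePointer();
        out.writeInt(0);                       // fixed-width placeholder
        char[] buf = new char[1024];
        int total = 0;
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                out.writeChar(buf[i]);         // simplification: 2 bytes per char
            }
            total += n;
        }
        long end = out.getFilePointer();
        out.seek(lengthPos);
        out.writeInt(total);                   // patch the real length
        out.seek(end);
        return total;
    }

    static String readField(RandomAccessFile in) throws IOException {
        int length = in.readInt();
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = in.readChar();
        }
        return new String(chars);
    }

    // Write a value from a Reader, then read it back from the same file.
    static String roundTrip(String text) {
        try {
            File f = File.createTempFile("field", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                writeField(raf, new StringReader(text));
                raf.seek(0);
                return readField(raf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("streamed once"));
    }
}
```

The reader is consumed exactly once; the only extra work is one seek
back to the placeholder after the data has been written.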

The second feature is very important for us!

So I would like to propose a discussion on how this could be achieved:

My idea is to set a bit in the field flags, like FIELD_DONT_USE_VINT.
I don't think using a fixed-width Int for every field is necessary;
those few extra (unnecessary) bytes for each field would add up to a
lot. A fixed-width Int would be used only when strictly necessary, and
the implementation could decide when to use it.
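
As a self-contained illustration of the flag idea, the sketch below
writes a flag byte followed by either a fixed-width Int or a VInt, and
reads it back. The bit value 0x8 for FIELD_DONT_USE_VINT is an
assumption (no value has been chosen), and the class and helper names
are hypothetical; the java.io data streams only stand in for the
fields stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LengthPrefix {
    // Assumed bit value, chosen only for this sketch.
    static final byte FIELD_DONT_USE_VINT = 0x8;

    static void writeVInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Flag byte, then the length as a 4-byte Int or as a VInt.
    static byte[] writeLength(int length, boolean fixedWidth) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(fixedWidth ? FIELD_DONT_USE_VINT : 0);
            if (fixedWidth) {
                out.writeInt(length);
            } else {
                writeVInt(out, length);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The reader picks the decoding from the flag bit alone.
    static int readLength(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            byte bits = in.readByte();
            boolean dontUseVint = (bits & FIELD_DONT_USE_VINT) != 0;
            return dontUseVint ? in.readInt() : readVInt(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeLength(300, false).length); // flag + 2-byte VInt
        System.out.println(writeLength(300, true).length);  // flag + 4-byte Int
        System.out.println(readLength(writeLength(300, true)));
    }
}
```

Fields written without the flag cost nothing extra, so only the fields
that need back-patching pay for the fixed-width length.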

These are the rough changes that I think would need to be made:

final Document doc(int n) throws IOException {
	...
	byte bits = fieldsStream.readByte();
	boolean dontUseVint = (bits & FieldsWriter.FIELD_DONT_USE_VINT) != 0;
	<<Binary fields like compressed or binary are an easy change...>>
	if ((bits & FieldsWriter.FIELD_IS_BINARY) != 0) {
		final byte[] b = new byte[dontUseVint ?
			fieldsStream.readInt() : fieldsStream.readVInt()]; << CHANGE HERE
		...
	}
	...
	if (compressed) {
		final byte[] b = new byte[dontUseVint ?
			fieldsStream.readInt() : fieldsStream.readVInt()]; << CHANGE HERE
		...
	}
	<<Reading a field value as a string>>
	String value;
	if (dontUseVint) {
		<< I'm not completely sure about this section,
			since changes relating to 'bytecount as prefix' would affect this >>
		int length = fieldsStream.readInt();
		char[] chars = new char[length];
		fieldsStream.readChars(chars, 0, length);
		value = new String(chars, 0, length);
	} else {
		value = fieldsStream.readString();
	}
	Field f = new Field(,     // name
	    value, // read value  << CHANGE HERE - use different string length
	    ...

Now is probably the best time to implement something like this, before
Lucene 2.0 is released. I think it wouldn't be a complicated change;
for now, we don't need to make any changes to the FieldsWriter
(optimisations using this can be done later).


On 5/7/06, Marvin Humphrey <> wrote:
> Got it.
> This was the problem, in TermInfosWriter.writeTerm():
> -    lastTerm = term;
> +    lastBytes = bytes;
>    }
> Without lastTerm being updated, the auxiliary term dictionary got
> screwed up.  This problem only manifested on large tests because small
> tests never moved past the first entry, which is always a field number
> of -1 and an empty string.
> I'll post a full working patch to JIRA as soon as I'm at a location
> where I can connect my laptop to the net.
> Marvin Humphrey
> Rectangular Research
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
