lucene-dev mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Proper use of TermsEnum.seek?
Date Mon, 21 Feb 2011 15:00:54 GMT
Hey Toke,

On Mon, Feb 21, 2011 at 3:27 PM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> My low-memory sorting/faceting-hacking requires terms to be accessed by
> ordinals. With Lucene 4.0 I cannot depend on TermsEnums supporting ord()
> and seek(long), so the code switches to a cache that keeps track of
> every X terms if they are not implemented. When the term for an ordinal
> is requested, it jumps to the nearest previously cached term and calls
> next() from there until the ordinal matches. So far so good.
>
> Two methods for seeking terms are:
> seek(BytesRef text) and seek(BytesRef term, TermState state).
>
> The JavaDoc indicates that the seek with TermState is (potentially) the
> fastest in this scenario as implementations can seek very efficiently
> using a custom TermState.
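
A minimal sketch of the checkpoint cache described in the first quoted
paragraph, assuming a plain NavigableSet<String> as a stand-in for a terms
dictionary that lacks ord() and seek(long). The names SparseOrdinalCache,
interval and termForOrdinal are hypothetical and not part of Lucene; here
tailSet(...) plays the role of seek(BytesRef) and the iterator's next()
plays the role of TermsEnum.next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Illustrative only: maps ordinals to terms by caching every interval-th term. */
public class SparseOrdinalCache {
    private final NavigableSet<String> terms;   // stand-in for the terms dictionary
    private final int interval;                 // the "X" in "every X terms"
    private final List<String> checkpoints = new ArrayList<>();

    public SparseOrdinalCache(NavigableSet<String> terms, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.terms = terms;
        this.interval = interval;
        // One full pass over the sorted terms, remembering terms 0, X, 2X, ...
        int ord = 0;
        for (String term : terms) {
            if (ord % interval == 0) {
                checkpoints.add(term);
            }
            ord++;
        }
    }

    /** Returns the term at the given ordinal in sorted order. */
    public String termForOrdinal(int ord) {
        if (ord < 0 || ord >= terms.size()) {
            throw new IndexOutOfBoundsException("ordinal " + ord);
        }
        int slot = ord / interval;
        // "Seek" to the nearest checkpoint at or before the requested ordinal.
        Iterator<String> it = terms.tailSet(checkpoints.get(slot), true).iterator();
        String term = it.next();
        // Step forward with next() until the ordinal matches.
        for (int current = slot * interval; current < ord; current++) {
            term = it.next();
        }
        return term;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "apple", "banana", "cherry", "date", "elder", "fig", "grape"));
        SparseOrdinalCache cache = new SparseOrdinalCache(terms, 3);
        System.out.println(cache.termForOrdinal(4)); // prints "elder"
    }
}
```

In this sketch memory grows with (number of terms / interval) cached terms,
while a lookup costs one seek plus at most interval - 1 calls to next(), so
the interval is the knob that trades memory against lookup time.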

For all real codecs seek(BR, TermState) should be as fast as it gets.
Some codecs simply forward to seek(BR), so if you already have the
TermState you won't lose anything. This might also answer your other
question: if you pass an empty BytesRef to a codec that does not
override seek(BR, TermState), it will seek to the empty term and your
code might not work anymore. Historically, this was needed for the
standard TermsReader to initialize the DeltaBytesReader, which is gone
now. I am not sure whether it is only needed for initialization, though
it seems so. I think we can now reinvestigate whether we still really
need it, since we now read terms in blocks.
>
> My problem is that I am going for low memory and it seems that I need to
> keep track of both BytesRef term and TermState state in order to use
> this method. This is quite a burden, memory-wise.

Yeah, I agree. This is a problem without an immediate solution.

>
> I tried calling with an empty BytesRef term. This gave me an empty
> result back for the call itself, but the correct terms for subsequent
> calls to next. This works perfectly for my scenario. However, that was
> just an experiment using the default variable gap codec, so I am unsure
> if I can count on this behavior for any given codec?

what do you mean by an empty result for the call itself?

>
> Any thoughts on how to reduce the memory needed for ordinal-based
> lookup, without killing performance, would be appreciated.

can't you use a codec that supports ord for your facet / sort fields?

simon

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org