lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction
Date Fri, 07 Aug 2009 10:22:14 GMT


Michael McCandless commented on LUCENE-1789:

It is nice that DocValues gives us the freedom to do this, but.... I'm
not sure we should, because it's a sizable performance trap.

Ie, we'll be silently inserting a call to ReaderUtil.subSearcher on
every doc value lookup (vs previously when it was a single top-level
array lookup).
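
To make the cost concrete, here is a minimal sketch of what a composite lookup has to do. It does not use Lucene's API: the class MultiDocValuesSketch, its fields, and its subIndex method are hypothetical stand-ins, and it assumes each segment's values are a plain float array and that no segment is empty. The point is only that every floatVal(doc) call pays for a binary search before it reaches the array.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a composite "MultiDocValues"; not Lucene's API. */
public class MultiDocValuesSketch {
    private final float[][] segmentValues; // one value array per segment
    private final int[] docStarts;         // top-level doc ID of each segment's first doc

    public MultiDocValuesSketch(float[][] segmentValues) {
        this.segmentValues = segmentValues;
        this.docStarts = new int[segmentValues.length];
        int base = 0;
        for (int i = 0; i < segmentValues.length; i++) {
            docStarts[i] = base;
            base += segmentValues[i].length; // assumes non-empty segments
        }
    }

    /** Binary search for the segment that contains a top-level doc ID. */
    int subIndex(int doc) {
        int idx = Arrays.binarySearch(docStarts, doc);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the containing segment is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Every lookup pays for subIndex() before touching the array. */
    public float floatVal(int doc) {
        int seg = subIndex(doc);
        return segmentValues[seg][doc - docStarts[seg]];
    }

    public static void main(String[] args) {
        float[][] segs = { {1f, 2f, 3f}, {4f, 5f}, {6f} };
        MultiDocValuesSketch values = new MultiDocValuesSketch(segs);
        System.out.println(values.floatVal(4)); // prints 5.0
    }
}
```

A single top-level array would answer the same call with one indexed read; the extra search is the hidden per-lookup cost.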

While client code that has relied on this in the past will nicely
continue to function properly, if we make this change, its performance
is going to silently take a [possibly sizable] hit.

In general, with Lucene, we can do the per-segment switching "up high"
(which is what the core now does, exclusively), or we can do it "down
low" (creating MultiTermDocs, MultiTermEnum, MultiTermPositions,
MultiDocValues, etc.), which has sizable performance costs.  It's also
costly for us because we'll have N different places where we must
create & maintain a MultiXXX class.  I would love to someday deprecate
all of the "down low" switching classes :)

In the core I think we should always switch "up high".  We've already
done this w/ searching and collection/sorting.  In LUCENE-1771 we're
fixing IndexSearcher.explain to do so as well.

With external code, I'd like over time to strongly encourage only
switching "up high" as well.

Maybe it'd be best if we could somehow allow this "down low" switching
for 2.9, but 1) warn that you'll see a performance hit right off, 2)
deprecate it, and 3) somehow state that in 3.0 you'll have to send
only a SegmentReader to this API, instead.

EG, imagine an app that created an external custom HitCollector that
calls say FloatFieldSource on the top reader in order to use a
float value per doc in each collect() call.  On upgrading to 2.9, this
app will already have to make the switch to the Collector API, which'd
be a great time for them to also then switch to pulling these float
values per-segment.  But, if we make the proposed change here, the app
could in fact just keep working off the top-level values (eg if the
ctor in their class is pulling these values), thinking everything is
fine when in fact there is a sizable, silent perf hit.  I'd prefer in
2.9 for them to also switch their DocValues lookup to be per segment.
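
A rough sketch of the per-segment pattern, for comparison. It is not written against Lucene: PerSegmentCollectorSketch and setNextSegment are hypothetical names that only mimic the shape of a collector that is told about each segment before receiving that segment's hits, and the values are assumed to be plain float arrays. The switch happens once per segment, so collect() stays a direct array read.

```java
/** Hypothetical per-segment collector; mimics the pattern without using Lucene. */
public class PerSegmentCollectorSketch {
    private float[] current;   // values for the segment currently being collected
    private int docBase;       // top-level doc ID of that segment's first doc
    private float best = Float.NEGATIVE_INFINITY;
    private int bestDoc = -1;  // top-level doc ID of the highest-valued hit so far

    /** Called once per segment: the only place where switching happens. */
    public void setNextSegment(float[] segmentValues, int docBase) {
        this.current = segmentValues;
        this.docBase = docBase;
    }

    /** Called per hit with a segment-relative doc ID: a plain array lookup. */
    public void collect(int doc) {
        float v = current[doc];
        if (v > best) {
            best = v;
            bestDoc = docBase + doc; // rebase only when a top-level ID is needed
        }
    }

    public int bestDoc() { return bestDoc; }
    public float bestValue() { return best; }

    public static void main(String[] args) {
        float[][] segs = { {1f, 7f, 3f}, {4f, 9f}, {6f} };
        PerSegmentCollectorSketch c = new PerSegmentCollectorSketch();
        int base = 0;
        for (float[] seg : segs) {
            c.setNextSegment(seg, base);
            for (int doc = 0; doc < seg.length; doc++) {
                c.collect(doc);
            }
            base += seg.length;
        }
        System.out.println(c.bestDoc() + " " + c.bestValue()); // prints 4 9.0
    }
}
```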

[Aside: once we gain clarity on LUCENE-831, hopefully we can do away
with {Byte,Short,Int,Ord,ReverseOrd}FieldSource, etc.  Ie these classes
basically copy what FieldCache does, but expose a per-doc method call
instead of a fixed array lookup.]

> getDocValues should provide a MultiReader DocValues abstraction
> ---------------------------------------------------------------
>                 Key: LUCENE-1789
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Hoss Man
>            Priority: Minor
>             Fix For: 2.9
> When scoring a ValueSourceQuery, the scoring code calls ValueSource.getValues(reader)
> on *each* leaf level subreader -- so DocValues instances are backed by the individual
> FieldCache entries of the subreaders -- but if client code were to inadvertently call
> getValues() on a MultiReader (or DirectoryReader) it would wind up using the "outer"
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here that we don't
> have with the FieldCache API (which is required to provide direct array access).
> getValues(IndexReader) could be implemented so that *IF* a caller inadvertently passes
> in a reader with non-null subReaders, getValues could generate a DocValues instance for
> each of the subReaders, and then wrap them in a composite "MultiDocValues".
