lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction
Date Fri, 07 Aug 2009 14:47:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740574#action_12740574
] 

Hoss Man commented on LUCENE-1789:
----------------------------------

{quote}
While client code that has relied on this in the past will nicely
continue to function properly, if we make this change, its performance
is going to silently take a [possibly sizable] hit.
{quote}

Correct: a change like this could cause 2.9 to introduce a _time_ based performance hit from
the added method call to resolve the sub(reader|docvalue) on each method call ... but if we
don't have a change like this, 2.9 could introduce a _memory_ based performance hit from the
other FieldCache changes as it client code accessing DocValues for the  top level reader will
create a duplication of the whole array.

Incidently: I'm willing to believe you that the time based perf hit would be high, but my
instinct is that it wouldn't be that bad: the DocValues API already introduces at least one
method call per doc lookup (two depending on datatype).  adding a second method call to delegate
to a sub-DocValues isntance doesn't seem that bad (especially since a new MultDocValues class
could get the subReader list and compute the docId offsets on init, and then reuse them on
each method call)

bq. In the core I think we should always switch "up high".

(In case there is any confusion: wasn't suggesting that we stop using "up high" switching
on DocValues in code included in the Lucene dist, i was suggesting that if someone uses DocValues
directly in their code (against a top level reader) then we help them out by giving them the
"down low" switching ... so "expected" usages wouldn't pay the added time based hit, just
"unexpected" usages (which would be saved from the memory hit))

{quote}
Maybe it'd be best if we could somehow allow this "down low" switching
for 2.9, but 1) warn that you'll see a performance hit right off, 2)
deprecate it, and 3) and somehow state that in 3.0 you'll have to send
only a SegmentReader to this API, instead.
{quote}

that would get into really sticky territory for people writting custom IndexReaders (or using
FilteredIndexReader)

bq. But, if we make the proposed change here, the app could in fact just keep working off
the top-level values (eg if the ctor in their class is pulling these values), thinking everything
is fine when in fact there is a sizable, silent perf hit.

I agree ... but unless i'm missing something about the code on the trunk, that situation already
exists: the developer might switch to using the Collector API, but nothing about the   current
trunk will prevent/warn him that this...

{code}
ValueSource vs = new ValueSource("aFieldIAlsoSortOn");
IndexReader r = getCurrentReaderThatCouldBeAMultiReader();
DocValues vals = vs.getDocValues(r);
{code}

...could have a sizable, silent, _memory_ perf hit in 2.9

(ValueSource.getValues has a javadoc indicating that caching will be done on the IndexReader
passed in, but your comment suggests that if 2.9 were released today (with hte current trunk)
people upgrading would have some obvious way of noticing that they need to pass a sub reader
to getValues)






> getDocValues should provide a MultiReader DocValues abstraction
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1789
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1789
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Hoss Man
>            Priority: Minor
>             Fix For: 2.9
>
>
> When scoring a ValueSourceQuery, the scoring code calls ValueSource.getValues(reader)
on *each* leaf level subreader -- so DocValue instances are backed by the individual FieldCache
entries of the subreaders -- but if Client code were to inadvertently  called getValues()
on a MultiReader (or DirectoryReader) they would wind up using the "outer" FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here that we don't
have with FieldCache API (which is required to provide direct array access). getValues(IndexReader)
could be implimented so that *IF* some a caller inadvertently passes in a reader with non-null
subReaders, getValues could generate a DocValues instance for each of the subReaders, and
then wrap them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message