lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4694) Add back IndexReader.fields() -> Multi*, or discourage term vectors in some better way
Date Sun, 10 Mar 2013 08:31:13 GMT


Robert Muir commented on LUCENE-4694:

I personally think it's ok if IndexReader lets you get docsValues(doc), document(doc), getTV(doc)
and termDocsEnum(term). There's nothing inefficient about supporting them, as far as I can

this is not correct at all. 

for the sorted types we need to iterate through all of the values and create a data structure
mapping per-segment ordinals to global ones, and also cache this somewhere.
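
to illustrate the cost, here is a minimal sketch of building such a mapping. it is not Lucene's implementation: the class name, the method name, and the use of plain sorted String arrays as per-segment term dictionaries are all assumptions made for the example. the point is only that every value of every segment has to be visited before a single global ordinal can be answered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not a Lucene class.
class GlobalOrdinalsSketch {

    /**
     * segmentTerms[s] is assumed to be the sorted, de-duplicated term
     * dictionary of segment s; a term's index in that array is its
     * per-segment ordinal. Returns map[s][segmentOrd] = global ordinal.
     */
    static int[][] buildOrdinalMap(String[][] segmentTerms) {
        // Visit every term of every segment to build the merged dictionary.
        TreeSet<String> merged = new TreeSet<>();
        for (String[] terms : segmentTerms) {
            for (String term : terms) {
                merged.add(term);
            }
        }
        List<String> global = new ArrayList<>(merged);

        int[][] map = new int[segmentTerms.length][];
        for (int s = 0; s < segmentTerms.length; s++) {
            map[s] = new int[segmentTerms[s].length];
            int g = 0;
            for (int ord = 0; ord < segmentTerms[s].length; ord++) {
                // Both lists are sorted the same way, so g only moves forward.
                while (!global.get(g).equals(segmentTerms[s][ord])) {
                    g++;
                }
                map[s][ord] = g;
            }
        }
        return map;
    }
}
```

the resulting int[][] is the thing that would have to be cached somewhere, and rebuilt whenever the set of segments changes.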

additionally, all docvalues types and norms on a composite reader would pay the cost of a binary search
for *each* docid access, and because of the way they are used, many docids are typically accessed.
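
a rough sketch of what that per-access lookup looks like, assuming a composite reader keeps an ascending array holding the first global docid of each segment. the names (CompositeDocLookupSketch, docStarts, valueAt) are hypothetical, and the sketch assumes every segment holds at least one document:

```java
import java.util.Arrays;

// Hypothetical sketch, not a Lucene class.
class CompositeDocLookupSketch {

    /**
     * docStarts[i] is the first global docid of segment i (ascending,
     * docStarts[0] == 0, no empty segments). Returns the index of the
     * segment that owns globalDoc.
     */
    static int subIndex(int globalDoc, int[] docStarts) {
        int idx = Arrays.binarySearch(docStarts, globalDoc);
        // On a miss, binarySearch returns -(insertionPoint) - 1, and the
        // owning segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Reads one per-document value through the composite view. */
    static long valueAt(int globalDoc, int[] docStarts, long[][] perSegmentValues) {
        int seg = subIndex(globalDoc, docStarts);             // binary search, every call
        return perSegmentValues[seg][globalDoc - docStarts[seg]];
    }
}
```

calling valueAt once per hit during scoring or sorting repeats the binary search for every document, whereas working segment by segment resolves the segment once and then indexes directly.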

stored fields are used for summary results, so on a 100 million doc index, who cares if you
do 10 or 20 binary searches?

term vectors are used for highlighting summary results, MoreLikeThis, etc: both of which are
small top-N just like the stored fields case. so it's also fine.

but docvalues is used in scoring and sorting, so this would be 100 million binary searches.
it's a big damn difference.

the postings is pretty much just an additional check per document, so it's a little more up
in the air what to do. but as mentioned in the description, users look at
and the only postings api they see is term vectors.

> Add back IndexReader.fields() -> Multi*, or discourage term vectors in some better way
> --------------------------------------------------------------------------------------
>                 Key: LUCENE-4694
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-4694.patch
> Users can easily get term vectors from any indexreader, but not postings lists. this
> encourages them to do really slow things: like pulling term vectors for every single document.
> this is really really so much worse than going through multifields or whatever.
