From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
Date Tue, 17 May 2011 16:21:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034850#comment-13034850 ]

Michael McCandless commented on LUCENE-3112:
--------------------------------------------

{quote}
We should really think through the consequences of this though.

If core features of lucene become implemented in a way that they rely upon these sequential
docids, we then lock ourselves out of future optimizations such as reordering docids for optimal
index compression.
{quote}

I agree it's somewhat dangerous that we are making an (experimental)
guarantee that these docIDs will remain adjacent "forever".  We
normally are very protective about letting apps rely on docID
assignment/order.

But, I think it will not be "core" functionality that relies on
sub-docs (adjacent docs), but rather modules -- grouping, faceting,
nested queries.  And, even if you use these modules, it's
optional whether the app uses sub-docs.  Ie we would still have the
"generic" grouping collector, but then also an optimized one that
takes advantage of sub-docs.
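
To make the difference concrete, here is a minimal sketch in plain Java
(not Lucene code) of what a block-aware collector could exploit.  The
class name, the method name and the `blockEnd` marker are all
assumptions made up for this illustration.  Because sub-docs are
adjacent, hits arriving in docID order can be counted per group by
watching for block boundaries, with no hash lookup per hit as a generic
collector would need.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; this is not the Lucene grouping module.
final class BlockGrouping {
    /**
     * Counts hits per block in a single pass.
     *
     * @param hits     matching docIDs in increasing order
     * @param blockEnd assumed marker: one bit set at the last docID of every block
     * @return number of hits in each block that has at least one hit, in docID order
     */
    static List<Integer> hitsPerBlock(int[] hits, BitSet blockEnd) {
        List<Integer> counts = new ArrayList<>();
        int currentEnd = -1; // last docID of the block currently being counted
        for (int docID : hits) {
            if (docID > currentEnd) {
                // This hit lies past the current block, so it starts a new group.
                currentEnd = blockEnd.nextSetBit(docID);
                counts.add(1);
            } else {
                // Still inside the current block: bump the latest count.
                int last = counts.size() - 1;
                counts.set(last, counts.get(last) + 1);
            }
        }
        return counts;
    }
}
```

The sketch assumes every hit falls inside a well-formed block; an app
that did not index sub-docs would still go through the generic
collector.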

Finally, I think doing this today would not preclude doing docID
reordering in the future because the sub-docs would be recomputable
based on the "identifier" field which grouped them in the first
place.

Ie the worst case future scenario (an app uses this new sub-docs
feature, but then has a big index they don't want to reindex and wants
to take advantage of a future docID reordering compression we add) would
still be solvable because we could use this identifier field to find
blocks of sub-docs.
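
As a rough illustration of that recovery step, here is a minimal sketch
in plain Java, not Lucene code.  It assumes a hypothetical array
`groupIds` where `groupIds[docID]` holds the identifier field value;
the class and method names are invented for this example.  After a
reordering the sub-docs of one block may no longer be adjacent, but a
scan over the identifier values is enough to find every block again.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and data layout are assumptions.
final class SubDocBlocks {
    /** Groups docIDs by identifier value, preserving first-seen order. */
    static Map<String, List<Integer>> recoverBlocks(String[] groupIds) {
        Map<String, List<Integer>> blocks = new LinkedHashMap<>();
        for (int docID = 0; docID < groupIds.length; docID++) {
            blocks.computeIfAbsent(groupIds[docID], k -> new ArrayList<>()).add(docID);
        }
        return blocks;
    }

    /** True if the docIDs form one adjacent run, i.e. the block is still intact. */
    static boolean isContiguous(List<Integer> docIDs) {
        for (int i = 1; i < docIDs.size(); i++) {
            if (docIDs.get(i) != docIDs.get(i - 1) + 1) {
                return false;
            }
        }
        return true;
    }
}
```

A migration tool could use something like `isContiguous` to decide
which blocks need to be moved back together after reordering.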

I suppose we could consider changing the index format today to record
which docs are subs... but I think we don't need to.  Maybe I should
strengthen the @experimental to explain the risk that a future
reindexing could be required?


> Add IW.add/updateDocuments to support nested documents
> ------------------------------------------------------
>
>                 Key: LUCENE-3112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3112
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3112.patch
>
>
> I think nested documents (LUCENE-2454) is a very compelling addition
> to Lucene.  It's also a popular (many votes) issue.
> Beyond supporting nested document querying, which is already an
> incredible addition since it preserves the relational model on
> indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
> should also enable speedups in grouping implementation when you group
> by a nested field.
> For the same reason, it can also enable very fast post-group facet
> counting impl (LUCENE-3097) when you want to
> count(distinct(nestedField)), instead of unique documents, as your
> "identifier".  I expect many apps that use faceting need this ability
> (to count(distinct(nestedField)) not distinct(docID)).
> To support these use cases, I believe the only core change needed is
> the ability to atomically add or update multiple documents, which you
> cannot do today since in between add/updateDocument calls a flush (eg
> due to commit or getReader()) could occur.
> This new API (addDocuments(Iterable<Document>), updateDocuments(Term
> delTerm, Iterable<Document>)) would also further guarantee that the
> documents are assigned sequential docIDs in the order the iterator
> provided them, and that the docIDs all reside in one segment.
> Segment merging never splits segments apart, so this invariant would
> hold even as merges/optimizes take place.
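
The atomicity argument in the description can be sketched with a toy
model in plain Java.  This is not IndexWriter and not how the patch is
implemented; `ToyBlockWriter`, its methods and the use of strings as
documents are assumptions for illustration.  The point it shows is that
when the whole block is added under one lock, a concurrent flush cannot
land between two sub-docs, so the block gets sequential IDs inside a
single segment.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: documents are plain strings, a segment is a list,
// and a document's in-segment ID is its position in that list.
final class ToyBlockWriter {
    private final List<List<String>> flushedSegments = new ArrayList<>();
    private List<String> buffer = new ArrayList<>();

    /** Adds one document; a flush may happen between two such calls. */
    synchronized void addDocument(String doc) {
        buffer.add(doc);
    }

    /**
     * Adds all documents while holding the lock, so flush() cannot split them.
     * Returns the in-segment ID assigned to the first document of the block.
     */
    synchronized int addDocuments(Iterable<String> docs) {
        int firstId = buffer.size();
        for (String doc : docs) {
            buffer.add(doc);
        }
        return firstId;
    }

    /** Turns the buffered documents into a new segment. */
    synchronized void flush() {
        if (!buffer.isEmpty()) {
            flushedSegments.add(buffer);
            buffer = new ArrayList<>();
        }
    }

    /** Returns a copy of the flushed segments, oldest first. */
    synchronized List<List<String>> segments() {
        return new ArrayList<>(flushedSegments);
    }
}
```

With separate addDocument calls, a flush triggered by another thread
could put the children and the parent into different segments; with
addDocuments in this model that cannot happen.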

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
