lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4272) another idea for updatable fields
Date Mon, 30 Jul 2012 20:15:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425182#comment-13425182 ]

Robert Muir commented on LUCENE-4272:
-------------------------------------

Well I think there are a few other advantages: 

Complexity: not having to stack segments keeps the number of "dimensions" the same.
The general structure of the index would be unchanged as well.

To IndexSearcher/Similarity/etc., everything would appear just as if someone had deleted and
re-added the document completely, like today. This means we don't have to change our search
APIs to have maxDoc(field) or anything else: scoring works just fine.

It seems possible we could also support tryXXX incremental updates by docid, just like
LUCENE-4203, though that's just an optimization.

As far as tiny fields on otherwise massive docs, I think we can break this down into 3 layers:
# document 'build' <-- retrieving from your SQL database / sending over the wire / etc
# field 'analyze' <-- actually doing the text analysis etc on the doc
# field 'indexing' <-- consuming the already-analyzed pieces thru the indexer chain/codec flush/etc

Today people 'pay' for 1, 2, and 3. If they use the Solr/ES approach, they only pay for 2 and 3,
I think. With this approach it's just 3. I think for the vast majority of apps it will be fast
enough, as I am totally convinced 1 and 2 are the biggest burden on people. I think these are
totally possible to fix without hurting search performance. I can't imagine many real-world apps
where 3, not 1 and 2, is the bottleneck AND they are willing to trade off significant search
performance for that.
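
The cost breakdown above can be written down as a small model. This is only an illustrative sketch, not Lucene code: the stage and strategy names are made up for the example, and the per-stage costs are arbitrary placeholder units, not measurements. For the term-vector approach it models the cost of the unaffected fields only.

```python
# Hypothetical names for the three layers listed in the comment.
BUILD, ANALYZE, INDEX = "build", "analyze", "index"

# Which layers each update strategy pays for, as described in the comment.
# The strategy names are invented for this sketch.
STRATEGIES = {
    "delete_and_readd": (BUILD, ANALYZE, INDEX),      # today: 1, 2 and 3
    "reindex_from_stored_fields": (ANALYZE, INDEX),   # Solr/ES approach: 2 and 3
    "reuse_term_vectors": (INDEX,),                   # this proposal: only 3
}


def update_cost(strategy, stage_cost):
    """Sum the cost of every layer the given strategy has to run."""
    return sum(stage_cost[stage] for stage in STRATEGIES[strategy])


if __name__ == "__main__":
    # Arbitrary placeholder units, chosen only to make the comparison visible.
    placeholder_cost = {BUILD: 5, ANALYZE: 3, INDEX: 1}
    for name in STRATEGIES:
        print(name, update_cost(name, placeholder_cost))
```

Whatever the real per-layer costs are, the third strategy is never more expensive than the other two in this model, because it runs a subset of their layers.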

                
> another idea for updatable fields
> ---------------------------------
>
>                 Key: LUCENE-4272
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4272
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Robert Muir
>
> I've been reviewing the ideas for updatable fields and have an alternative
> proposal that I think would address my biggest concern:
> * not slowing down searching
> When I look at what Solr and Elasticsearch do here, by basically reindexing from stored fields, I think they solve a lot of the problem: users don't have to "rebuild" their document from scratch just to update one tiny piece.
> But I think we can do this more efficiently: by avoiding reindexing of the unaffected fields.
> The basic idea is that we would require term vectors for this approach (as they already store a serialized indexed version of the doc), and so we could just take the other pieces from the existing vectors for the doc.
> I think we would have to extend vectors to also store the norm (so we don't recompute that), and payloads, but it seems feasible at a glance.
> I don't think we should discard the idea because vectors are slow/big today; this seems like something we could fix.
> Personally I like the idea of not slowing down search performance to solve the problem. I think we should really start from that angle and work towards making the indexing side more efficient, not vice-versa.
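
The term-vector idea in the issue description can be illustrated with a toy model. This is a sketch under loud assumptions, not Lucene's API: `ToyIndex`, `analyze`, and `update_field` are hypothetical names, the "term vector" is reduced to a per-field term-frequency map, and positions, norms, and payloads (which the proposal says vectors would need to carry) are left out.

```python
from collections import Counter


def analyze(text):
    """Stand-in analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


class ToyIndex:
    """Keeps a per-document, per-field 'term vector' (term -> frequency)."""

    def __init__(self):
        self.vectors = {}        # doc_id -> {field name: Counter of terms}
        self.analyze_calls = 0   # counts how often a field had to be analyzed

    def _analyze_field(self, text):
        self.analyze_calls += 1
        return Counter(analyze(text))

    def add(self, doc_id, fields):
        """Full add: every field is analyzed."""
        self.vectors[doc_id] = {
            name: self._analyze_field(text) for name, text in fields.items()
        }

    def update_field(self, doc_id, field, text):
        """Update one field; unaffected fields are copied from the stored vectors.

        To the rest of the system this still looks like a delete and re-add
        of the whole document, but only the changed field is analyzed again.
        """
        old = self.vectors[doc_id]
        new = {name: vec for name, vec in old.items() if name != field}
        new[field] = self._analyze_field(text)
        self.vectors[doc_id] = new
```

For example, after adding a document with a `title` and a `body` field and then calling `update_field` on `title`, `analyze_calls` goes up by one rather than two, and the `body` entry is the same already-analyzed data as before.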

