lucene-solr-user mailing list archives

From Per Steffensen <st...@designware.dk>
Subject How to migrate content of a collection to a new collection
Date Wed, 23 Jul 2014 06:45:34 GMT
Hi

We have numerous collections, each with numerous shards spread across 
numerous machines. We just discovered that all documents have a field 
with a wrong value, and besides that we would like to add a new field to 
all documents:
* The field with the wrong value is a long, DocValued, Indexed and 
Stored. Some (about half) of the documents need to have a constant added 
to their current value
* The field we want to add will be an int, DocValued, Indexed and 
Stored. It needs to be added to all documents, but will have different 
values among the documents
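
To make the two changes concrete, here is a minimal sketch of the 
per-document transformation in Python. The field names (`timestamp_l`, 
`category_i`), the constant `OFFSET` and the two callbacks are 
hypothetical placeholders, not our real schema; the callbacks stand in 
for whatever logic decides which documents need the correction and what 
the new int value should be.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]

# Hypothetical constant to add to the wrong long values.
OFFSET = 1000


def migrate_doc(
    doc: Doc,
    needs_offset: Callable[[Doc], bool],
    new_int_value: Callable[[Doc], int],
    long_field: str = "timestamp_l",   # hypothetical field name
    int_field: str = "category_i",     # hypothetical field name
    offset: int = OFFSET,
) -> Doc:
    """Return a corrected copy of doc; the input dict is left untouched."""
    out = dict(doc)

    # Assumption: the stored document carries Solr's _version_ field.
    # Drop it so the copy is indexed as a plain add in the new collection.
    out.pop("_version_", None)

    # Only some (about half) of the documents need the constant added.
    if needs_offset(doc):
        out[long_field] = doc[long_field] + offset

    # Every document gets the new int field, with a per-document value.
    out[int_field] = new_int_value(doc)
    return out
```

Since the long field is also part of the id, a real version would 
rebuild the id in the same step.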

How can we achieve this in the easiest possible way?

We thought about spooling/streaming from the existing collection into a 
"twin"-collection, then deleting the existing collection and finally 
renaming the "twin"-collection to have the same name as the original 
collection. Basically indexing all documents again. If that is the 
easiest way, how do we query so that we get all documents streamed? 
We cannot just do a *:* query that returns everything into memory and 
then index from there, because we have billions of documents (not 
enough memory). Please note that we are on 4.4, which does not contain 
the new CURSOR-feature. Please also note that speed is an important 
factor for us.
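
One way the streaming could be done without cursors is to sort on the 
unique key and, for each page, filter on "key greater than the last key 
seen". The Python sketch below shows only that paging logic. `q`, `fq`, 
`sort` and `rows` are ordinary Solr request parameters, but `fetch` is a 
hypothetical callable standing in for the actual HTTP/SolrJ request, and 
the sketch assumes the key field is called `id`, is a sortable string, 
and that quoting it as shown is sufficient escaping for our ids.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Doc = Dict[str, Any]


def page_params(last_id: Optional[str], rows: int, key: str = "id") -> Dict[str, str]:
    """Build the request parameters for the page that follows last_id."""
    params = {"q": "*:*", "sort": f"{key} asc", "rows": str(rows)}
    if last_id is not None:
        # Assumption: backslash-escaping and quoting is enough for our ids.
        escaped = last_id.replace("\\", "\\\\").replace('"', '\\"')
        # Exclusive lower bound, open upper bound: id:{"last" TO *]
        params["fq"] = f'{key}:{{"{escaped}" TO *]'
    return params


def stream_all(
    fetch: Callable[[Dict[str, str]], List[Doc]],  # hypothetical request function
    rows: int = 1000,
    key: str = "id",
) -> Iterator[Doc]:
    """Yield every document, holding at most one page in memory at a time."""
    last_id: Optional[str] = None
    while True:
        docs = fetch(page_params(last_id, rows, key))
        if not docs:
            return
        for doc in docs:
            yield doc
        # Resume strictly after the last key of this page.
        last_id = docs[-1][key]
```

This avoids an ever-growing `start` offset, but whether it is fast 
enough on billions of documents is exactly what we are unsure about.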

I guess this could also be achieved by doing a 1-1 migration on 
shard-level instead of collection-level, keeping everything in the new 
collections on the same machine as where it lived in the old 
collections. That could probably complete faster than the 1-1 on 
collection-level approach. But this 1-1 on shard-level approach is not 
very good for us, because the long field we need to change is also part 
of the id (controlling the routing to a particular shard), and therefore 
we actually also need to change the id on all documents. So if we do the 
1-1 on shard-level approach, we will end up having documents in shards 
that they do not belong to (they would not have been routed there by the 
routing system in Solr). We might be able to live with this disadvantage 
if 1-1 on shard-level can be easily achieved and is much faster than 
1-1 on collection-level.

Any input is very much appreciated! Thanks

Regards, Per Steffensen
