lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Performance potential for updating (reindexing) documents
Date Thu, 24 Mar 2016 16:25:19 GMT
Impossible to say if for no other reason than you haven't told us
how many physical machines this is spread over ;).

For the process you've outlined to work, all the fields are stored,
right? So why not use Atomic Updates? You still have to query
for the docs, but you only send the new field values rather than
whole documents.
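
For illustration, here is a minimal sketch of what an atomic update looks like as a JSON body posted to the collection's /update handler: each doc carries its unique key plus a "set" modifier per field. It assumes the unique key field is named id; the class name, the field names field_x and field_y, and the values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicUpdateJson {
    // Minimal JSON string escaping (quotes, backslashes, control characters).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') sb.append('\\').append(c);
            else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.append('"').toString();
    }

    // Builds one atomic-update doc: {"id":"...","field":{"set":"value"},...}
    // Assumes the collection's unique key field is called "id".
    static String setFields(String id, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{\"id\":").append(quote(id));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(',').append(quote(e.getKey()))
              .append(":{\"set\":").append(quote(e.getValue())).append('}');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("field_x", "X"); // hypothetical field names and values
        fields.put("field_y", "Y");
        // The /update handler accepts a JSON array of such docs.
        System.out.println("[" + setFields("doc1", fields) + "]");
    }
}
```

Only the listed fields change; Solr rebuilds the rest of the document from its stored values, which is why the fields have to be stored.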

About querying. If I'm reading this right, you'll form some query
like q=whatever_identifies_docs_that_should_get_values_X_Y_Z
then process each one of those. So, really, all you need here is
the ids of all the docs that satisfy that clause. You should
consider the /export handler (Streaming Aggregation). It's
designed to return large result sets with minimal memory.
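
As a rough sketch, an /export request that returns only the ids could be assembled like this. The /export handler needs explicit fl and sort parameters, and the fields it returns or sorts on need docValues. The base URL, collection name and query below are hypothetical, and the unique key is assumed to be id.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ExportUrl {
    // URL-encodes one query parameter value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds an /export request that returns only the id of each matching doc.
    // Assumes "id" is the unique key and has docValues enabled.
    static String build(String solrBaseUrl, String collection, String query) {
        return solrBaseUrl + "/" + collection + "/export"
                + "?q=" + enc(query)
                + "&fl=id"
                + "&sort=" + enc("id asc");
    }

    public static void main(String[] args) {
        // Hypothetical host, collection and query.
        System.out.println(build("http://localhost:8983/solr", "coll1", "category:widgets"));
    }
}
```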

So the process I'm thinking of is this (and it assumes all your
fields are stored so Atomic updates work).

Use the CloudSolrStream for each query. As the stream
comes back, you get the IDs you need and use them
to do an atomic update that adds the relevant fields.
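
The loop could be sketched roughly as follows, with a plain Iterator of ids standing in for the tuples read from the stream. Grouping the updates into batches is my assumption, not something spelled out above, and the class name, batch size, field name and value are all hypothetical. The sketch only builds the JSON request bodies; sending them to /update is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class StreamToUpdates {
    // Builds one JSON request body per batch of ids. Every doc in a batch gets
    // the same pre-rendered "set" clauses, e.g. "field_x":{"set":"X"}.
    // Assumes the ids contain no characters that need JSON escaping.
    static List<String> requestBodies(Iterator<String> ids, String setClauses, int batchSize) {
        List<String> bodies = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        int inBatch = 0;
        while (ids.hasNext()) {
            body.append(inBatch == 0 ? "[" : ",")
                .append("{\"id\":\"").append(ids.next()).append("\",")
                .append(setClauses).append('}');
            if (++inBatch == batchSize) {
                bodies.add(body.append(']').toString());
                body.setLength(0); // start a fresh batch
                inBatch = 0;
            }
        }
        if (inBatch > 0) bodies.add(body.append(']').toString()); // final partial batch
        return bodies;
    }

    public static void main(String[] args) {
        // Stand-in for the ids coming back from the stream.
        Iterator<String> ids = Arrays.asList("a", "b", "c").iterator();
        for (String b : requestBodies(ids, "\"field_x\":{\"set\":\"X\"}", 2)) {
            System.out.println(b);
        }
    }
}
```

Because the ids are consumed as they arrive, memory use stays bounded by the batch size rather than by the size of the result set.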

Note that when _adding_ fields, you can change the schema
to include the new fields on an existing collection. All that
means is that any new docs added can have these fields.

Now, if all the fields are _not_ stored at least once, you can't
use atomic updates and you'll have to re-index from the system
of record.

Best,
Erick

On Thu, Mar 24, 2016 at 7:18 AM, tedsolr <tsmith@sciquest.com> wrote:
> With a properly tuned solr cloud infrastructure and less than 1B total docs
> spread out over 50 collections where the largest collection is 100M docs,
> what is a reasonable target goal for entirely reindexing a single
> collection?
>
> I understand there are a lot of variables, so I'm hypothetically wiping them
> away by assuming "a properly tuned infrastructure". So the hardware, RAM,
> etc. is configured correctly (not so in my case).
>
> The scenario is to add 3 fields to all the existing docs in one collection.
> The fields are the same but the values vary based on the docs. So a search
> is performed and finds 100 matches - all 100 docs will get the same updates.
> Then another search is performed that matches 15000 docs, and these are
> updated. This continues 10-20,000 times until essentially all the docs have
> been updated.
>
> The docs all have 100 - 200 fields, mostly text and mostly small in size.
> What's the best possible throughput I can expect? 1000 docs/sec? 5000
> docs/sec?
>
> Using SolrJ for querying and indexing against a v5.2.1 cloud.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Performance-potential-for-updating-reindexing-documents-tp4265861.html
> Sent from the Solr - User mailing list archive at Nabble.com.
