lucene-solr-user mailing list archives

From Alexander Wallace ...@rwmotloc.com>
Subject Re: Strange missing docs when reindexing with threads.
Date Fri, 12 Jun 2009 20:34:02 GMT
Right after I sent the email I went on and checked for uniqueness of 
documents...

In theory they were all supposed to be unique... But I've realized that 
the platform I'm using to reindex is delaying sending the requests. 
This, in combination with my reindexers reusing document fields (instead 
of creating new instances, to save on GC), led to the same document being 
sent many times with invalid data...

I am fairly sure now that this is the source of my problem... My 
reindexers originally used LuceneWriter directly, which blocks thread 
execution until the document is added to the index. The new 
framework I'm using uses messaging, which releases control back to the 
thread before the documents are actually sent to be indexed. My threads 
update the document fields in the meantime, so the data written to the 
index is in transition and invalid...
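
The effect described above can be reproduced in a small, deterministic sketch. Everything here is hypothetical and only illustrates the pattern: `DeferredSender` stands in for the messaging layer (it is not Solr's or Lucene's API), and the `id` and `title` field names are assumptions. The sender queues a reference to the document and only reads it later, so a reused document object is read after it has already been overwritten.

```python
from collections import deque


class DeferredSender:
    """Hypothetical stand-in for a messaging layer that sends documents later."""

    def __init__(self):
        self.pending = deque()
        self.index = {}  # unique key -> stored fields

    def add(self, doc):
        # Only a reference is queued; the fields are not copied or sent yet.
        self.pending.append(doc)

    def flush(self):
        # The fields are read here, long after add() returned to the caller.
        while self.pending:
            doc = self.pending.popleft()
            self.index[doc["id"]] = dict(doc)


def reindex_reusing(sender, n):
    doc = {}  # one instance reused for every add, to save on allocation
    for i in range(n):
        doc["id"] = i
        doc["title"] = f"title {i}"
        sender.add(doc)
    sender.flush()
    return len(sender.index)


def reindex_fresh(sender, n):
    for i in range(n):
        # A new instance per add: later iterations cannot touch queued documents.
        sender.add({"id": i, "title": f"title {i}"})
    sender.flush()
    return len(sender.index)


reused_count = reindex_reusing(DeferredSender(), 1000)
fresh_count = reindex_fresh(DeferredSender(), 1000)
print(reused_count, fresh_count)  # prints "1 1000"
```

With real threads the outcome is less extreme and varies from run to run, because some documents are sent before the next update lands. A writer that blocks until the document is indexed never exposes the problem, since the fields are read before the caller can change them.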

I've adjusted my reindexing threads to ensure new instances 
of everything are used... I will test it shortly...

But you pointed out exactly why I have fewer documents than 'add' requests...
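
A minimal sketch of why the two counts can differ, assuming the index keeps one document per unique key and a later add with the same key replaces the earlier one. The `apply_adds` helper and the `id` field name are hypothetical and only model that behaviour; they are not Solr code.

```python
def apply_adds(docs, unique_key="id"):
    """Count add requests and the documents left after overwrites by key."""
    index = {}
    adds = 0
    for doc in docs:
        adds += 1  # every request is counted, like the cumulative adds stat
        index[doc[unique_key]] = doc  # same key: the earlier document is replaced
    return adds, len(index)


# Ten add requests that only carry three distinct keys.
docs = [{"id": i % 3, "body": f"doc {i}"} for i in range(10)]
adds, stored = apply_adds(docs)
print(adds, stored)  # prints "10 3"
```

No error is raised at any point, which matches clean logs alongside a short index.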

Thanks!

Shalin Shekhar Mangar wrote:
> On Fri, Jun 12, 2009 at 11:40 PM, Alexander Wallace <aw@rwmotloc.com> wrote:
>
>   
>> Hi all!
>>
>> I'm using Solr 1.3 and currently testing reindexing...
>>
>> In my client app, I am sending 17494 requests to add documents, in 3
>> different scenarios:
>>
>> a) not using threads
>> b) using 1 thread
>> c) using 2 threads
>>
>> In scenario a), everything seems to work fine... In my client log, I see
>> 17494 requests sent to Solr; in Solr's log, I see the same number of 'add'
>> requests received; and if I search the index, I can see the same number of
>> documents.
>>
>> However, if I use 1 thread, I see the right number of requests in the logs,
>> but I only find 15k or so documents (this varies a bit every time I run this
>> scenario).
>>
>> It gets way worse if I use 2 threads... I can see the right number of
>> requests in both logs, but I end up with ~ 600 docs in the index!
>>
>> In all scenarios, I don't see any errors in the logs...
>>
>> As you can imagine, I need to be able to use multiple threads to speed up
>> the process... It is also very concerning that I don't get any errors
>> anywhere...
>>
>> Looking at Solr's admin stats, I also see 17494 cumulative adds, but only a
>> tiny fraction of the actual documents can be found...
>>
>> Any clues?
>>
>>     
>
> What is the uniqueKey in your schema.xml? Is it possible that those 17494
> documents have a common uniqueKey and are therefore getting overwritten?
>
>   
