lucene-solr-user mailing list archives

From "Wawok, Brian" <Brian.Wa...@cmegroup.com>
Subject solr best practice to submit many documents
Date Wed, 07 Apr 2010 15:47:57 GMT
Hello,

I am using SOLR for some proof-of-concept work and was wondering if anyone has guidance
on best practice.

Background:
Each night we get a delivery of a few thousand reports. Each report is between 1 and 500,000 pages.
For my proof of concept I am using a single 100,000-page report.
I want to see how fast I can make SOLR handle this single report, and then see how we
can scale out to meet the total indexing demand (if needed).

Trial 1:

1)      Set up a solr server on server A with the default settings. Added a few new fields
to index, including a full text index of the report.

2)      Set up a simple Python script on server B. It splits the report into 100,000 small
documents, pulls out a few key fields to be sent along for indexing, and uses a Python implementation
of curl to shove the documents into the server (with 4 threads posting away).

3)      After all 100,000 documents are posted, we post an index and let the server index.
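
For readers who want something concrete, the posting step in Trial 1 might look roughly like the sketch below. It is not the original script: the field names `id` and `text` and the helpers `doc_to_add_xml` and `post_all` are assumptions made for illustration, and the actual HTTP call is passed in as `post` so the sketch does not depend on any particular client.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape


def doc_to_add_xml(doc_id, text):
    """Wrap one page in a Solr <add> message (field names are assumptions)."""
    return (
        "<add><doc>"
        f'<field name="id">{escape(str(doc_id))}</field>'
        f'<field name="text">{escape(text)}</field>'
        "</doc></add>"
    )


def post_all(pages, post, workers=4):
    """Send one request per page using a pool of worker threads.

    `post` is any callable that takes an XML string and sends it to the
    update handler; it is injected so the sketch stays client-agnostic.
    """
    messages = [doc_to_add_xml(i, page) for i, page in enumerate(pages)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every post to finish and re-raises any error.
        return list(pool.map(post, messages))
```

With this shape, 100,000 pages mean 100,000 separate HTTP requests, which is the overhead in question.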


I was able to get this method to work; it took around 340 seconds for the posting and
10 seconds for the indexing. I am not sure if that indexing time is a red herring, and the
server was really doing some of the indexing during the posts, or what.
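
To put those timings in perspective, the arithmetic below converts them to throughput. It uses only the approximate figures quoted above (100,000 documents, about 340 seconds posting, about 10 seconds indexing), so the results are rough as well.

```python
docs = 100_000
post_seconds = 340
index_seconds = 10

docs_per_second = docs / post_seconds          # posting phase only
ms_per_document = 1000 * post_seconds / docs   # wall-clock time per document
total_seconds = post_seconds + index_seconds

print(f"{docs_per_second:.0f} docs/s, "
      f"{ms_per_document:.1f} ms per document, "
      f"{total_seconds} s total")
```

That works out to roughly 294 documents per second, or about 3.4 ms of wall-clock time per document across the 4 posting threads.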

Regardless, it seems less than ideal to make 100,000 requests to the server to index 100,000
documents. Does anyone have an idea for how to make this process more efficient? Should I
look into making a single XML document with all 100,000 documents enclosed? Or what will give me the
best performance? Will this be much better than what I am seeing with my post method? I
am not against writing a custom parser on the SOLR side, but if there is already a way in
SOLR to send many documents efficiently, that would be better.
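
As a minimal sketch of the batching idea raised here: Solr's XML update format allows many `<doc>` elements inside a single `<add>` message, so documents can be grouped into a few large requests followed by one `<commit/>`. The update URL, the batch size of 1000, and the helper names `build_add_batch`, `chunks`, `post_xml` and `index_in_batches` are assumptions for illustration, not measured recommendations.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_batch(docs):
    """Build one <add> message holding many <doc> elements.

    `docs` is an iterable of dicts mapping field name -> value.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="utf-8")


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_xml(url, body):
    """POST an XML body to the update handler and return the HTTP status."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_in_batches(docs, url, batch_size=1000, post=post_xml):
    """Send docs in batches, then issue a single commit at the end."""
    sent = 0
    for batch in chunks(docs, batch_size):
        post(url, build_add_batch(batch))
        sent += 1
    post(url, b"<commit/>")
    return sent
```

Called as `index_in_batches(docs, "http://serverA:8983/solr/update")` (a hypothetical URL), this would turn 100,000 requests into 100 plus one commit. Whether that is actually faster on a given setup would need to be measured.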


Thanks!

Brian Wawok

