lucene-solr-user mailing list archives

From Kranti Parisa <kranti.par...@gmail.com>
Subject Re: Indexing huge data
Date Thu, 06 Mar 2014 23:38:45 GMT
That's what I do: pre-create JSONs following the schema and save them in
MongoDB as part of the ETL process. After that, just dump the JSONs into
Solr using batching etc. With this you can do both full and incremental
indexing.
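The batching step described above can be sketched as follows. This is a minimal, hypothetical example that only shows the partitioning logic; the documents are plain JSON strings, and the actual SolrJ/HTTP add call is left as a comment since it needs a running Solr server.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching step: pre-built JSON documents are sent to Solr
// in fixed-size batches rather than one add-and-commit per document.
public class BatchIndexer {
    // Split a list of documents into batches of at most batchSize.
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("{\"id\":\"" + i + "\"}");
        }
        for (List<String> batch : partition(docs, 4)) {
            // In real code: send the batch to Solr (e.g. via SolrJ or an
            // HTTP POST to /update), and commit once at the end.
            System.out.println("batch of " + batch.size());
        }
    }
}
```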

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu <rallavagu@gmail.com> wrote:

> Yeah. I have thought about spitting out JSON and running it against Solr
> using parallel HTTP threads separately. Thanks.
>
>
> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>
> >> One more suggestion is to collect/prepare the data in CSV format (a 1-2
> >> million document sample, depending on size) and then import the data
> >> directly into Solr using the CSV handler & curl.  This will give you the
> >> pure indexing time & the differences.
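The CSV-handler-plus-curl approach maps to a one-line POST against Solr's CSV update endpoint. A hypothetical sketch, assuming a default localhost install and a core named collection1 with a sample.csv file; the command is printed rather than executed here so the example stands alone without a running server:

```shell
# Hypothetical host, core name, and file path -- adjust for your setup.
SOLR_URL="http://localhost:8983/solr/collection1/update/csv"
CSV_FILE="sample.csv"

# Print the curl command that would stream the CSV file into Solr and
# commit at the end. Remove the leading 'echo' to actually run it.
echo curl "${SOLR_URL}?commit=true" -H "Content-type:text/csv" --data-binary @"${CSV_FILE}"
```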
>>
>> Thanks,
>> Susheel
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Wednesday, March 05, 2014 8:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing huge data
>>
> >> Here's the easiest thing to try to figure out where to concentrate your
> >> energies: just comment out the server.add call in your SolrJ program.
> >> Well, that and any commits you're doing from SolrJ.
>>
> >> My bet: your program will run at about the same speed it does when you
> >> actually index the docs, indicating that your problem is on the data
> >> acquisition side. Of course, the older I get, the more times I've been
> >> wrong :).
>>
>> You can also monitor the CPU usage on the box running Solr. I often see
>> it idling along < 30% when indexing, or even < 10%, again indicating that
>> the bottleneck is on the acquisition side.
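The diagnostic above can be sketched as a small timing harness. This is hypothetical scaffolding, not Erick's actual program: fetchDoc stands in for whatever fetches or builds your documents, and the real SolrJ add call is the commented-out line.

```java
// Hypothetical harness for the diagnostic above: time the acquisition loop
// with the Solr add call commented out, then compare against a run with it
// enabled. If the times are close, acquisition is the bottleneck.
public class IndexTimer {
    // Placeholder for the real data-acquisition work (db queries, ETL, etc.).
    static String fetchDoc(int i) {
        return "{\"id\":\"" + i + "\"}";
    }

    public static void main(String[] args) {
        int count = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            String doc = fetchDoc(i);
            // server.add(doc);  // <- comment out the real SolrJ add to
            //                   //    measure acquisition time alone
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("fetched " + count + " docs in " + ms + " ms");
    }
}
```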
>>
> >> Note I haven't mentioned any solutions; I'm a believer in identifying
> >> the _problem_ before worrying about a solution.
>>
>> Best,
>> Erick
>>
>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky <jack@basetechnology.com>
>> wrote:
>>
> >>> Make sure you're not doing a commit on each individual document add.
> >>> Committing every few minutes or every few hundred or few thousand
> >>> documents is sufficient. You can set up auto commit in solrconfig.xml.
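The auto commit suggestion corresponds to an autoCommit block inside the updateHandler section of solrconfig.xml. The values below are illustrative, not recommendations; tune them for your load.

```xml
<!-- In solrconfig.xml, inside <updateHandler>: commit automatically
     instead of committing on every document add. Values are examples. -->
<autoCommit>
  <maxDocs>10000</maxDocs>            <!-- commit after this many docs -->
  <maxTime>60000</maxTime>            <!-- or after 60s, whichever comes first -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher per hard commit -->
</autoCommit>
```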
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Rallavagu
>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing huge data
>>>
>>>
>>> All,
>>>
> >>> Wondering about best/common practices to index/re-index a huge amount
> >>> of data in Solr. The data is about 6 million entries in the db and
> >>> other sources (the data is not located in one place). Trying a
> >>> SolrJ-based solution to collect data from the different sources and
> >>> index it into Solr. It takes hours to index.
>>>
>>> Thanks in advance
>>>
>>
