lucene-solr-user mailing list archives

From Wolfgang Hoschek <whosc...@cloudera.com>
Subject Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents
Date Wed, 18 Jun 2014 19:11:47 GMT
Consider giving the MR tasks more RAM, for example via 

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx2000m' ...

Wolfgang.

On May 26, 2014, at 10:48 AM, Costi Muraru <costimuraru@gmail.com> wrote:

> Hey Erick,
> 
> The job reducers began to die with "Error: Java heap space" after 1 hour and
> 22 minutes of being stuck at ~80%.
> 
> I did a few more tests:
> 
> Test 1.
> 80,000 documents
> Each document had *20* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 33 seconds.
> 
> Test 2.
> 80,000 documents
> Each document had *20* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: successful
> Execution time: 643 seconds.
> 
> Test 3.
> 80,000 documents
> Each document had *50* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 45.96 seconds.
> 
> Test 4.
> 80,000 documents
> Each document had *50* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: failed
> Execution time: reducers failed after 1 hour.
> Unfortunately, this is my use case.
> 
> My guess is that the reduce time (to perform the merges) depends on whether
> the field names are the same across the documents. If they are different, the
> merge time increases dramatically. I don't have any knowledge of the Solr
> merge operation internals, but is it possible that it tries to group the
> fields with the same name across all the documents?
> In the first case, when the field names are the same across documents, the
> number of buckets is equal to the number of unique field names which is 20.
> In the second case, where all the field names are different (my use case),
> it creates a lot more buckets (80k documents * 50 different field names = 4
> million buckets) and the process gets slowed down significantly.
> Is this assumption correct / Is there any way to get around it?
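The arithmetic above can be sketched as a quick back-of-the-envelope check (a toy illustration, not Solr code; Lucene does track per-field metadata, so index-side work grows with the number of unique field names, though the real cost model is more involved than this):

```python
# Toy comparison of the two scenarios from Test 3 and Test 4.
docs = 80_000
fields_per_doc = 50

# Test 3: every document reuses the same 50 field names.
unique_fields_shared = fields_per_doc            # 50

# Test 4: every document brings its own distinct field names.
unique_fields_distinct = docs * fields_per_doc   # 4,000,000

print(unique_fields_shared)    # 50
print(unique_fields_distinct)  # 4000000
```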
> 
> Thanks again for reaching out. Hope this is more clear now.
> 
> This is what one of the 80k documents looks like (JSON format):
> {
> "id" : "442247098240414508034066540706561683636",
> "items" : {
>   "IT49597_1180_i" : 76,
>   "IT25363_1218_i" : 4,
>   "IT12418_1291_i" : 95,
>   "IT55979_1051_i" : 31,
>   "IT9841_1224_i" : 36,
>   "IT40463_1010_i" : 87,
>   "IT37932_1346_i" : 11,
>   "IT17653_1054_i" : 37,
>   "IT59414_1025_i" : 96,
>   "IT51080_1133_i" : 5,
>   "IT7369_1395_i" : 90,
>   "IT59974_1245_i" : 25,
>   "IT25374_1345_i" : 75,
>   "IT16825_1458_i" : 28,
>   "IT56643_1050_i" : 76,
>   "IT46274_1398_i" : 50,
>   "IT47411_1275_i" : 11,
>   "IT2791_1000_i" : 97,
>   "IT7708_1053_i" : 96,
>   "IT46622_1112_i" : 90,
>   "IT47161_1382_i" : 64
>   }
> }
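A document of that shape could be produced by a short generator like the following (a hypothetical sketch assuming the `IT<item>_<n>_i` naming pattern seen above; the point is that the random field names are effectively unique per document, so each document adds new fields to the index):

```python
import json
import random

def make_doc(num_items=20):
    """Build one document shaped like the sample above. The randomized
    field names mean almost no two documents share a field."""
    items = {
        f"IT{random.randint(0, 59999)}_{random.randint(1000, 1499)}_i": random.randint(0, 99)
        for _ in range(num_items)
    }
    return {"id": str(random.getrandbits(128)), "items": items}

print(json.dumps(make_doc(), indent=2))
```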
> 
> Costi
> 
> 
> On Mon, May 26, 2014 at 7:45 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
>> The MapReduceIndexerTool is really intended for very large data sets,
>> and by today's standards 80K doesn't qualify :).
>> 
>> Basically, MRIT creates N sub-indexes, then merges them, which it
>> may do in a tiered fashion. That is, it may merge gen1 to gen2, then
>> merge gen2 to gen3, etc. This is great when indexing a bazillion
>> documents into 20 shards, but all that copying around may take
>> more time than you really gain for 80K docs.
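
The tiered merging described above can be modeled with a small sketch (a toy cost model under the assumption that each merge generation rewrites every byte it touches; the fanout and sub-index sizes here are made up, not MRIT's actual values):

```python
def tiered_merge_cost(num_indexes, fanout):
    """Units of data copied when merging sub-indexes in generations,
    where each generation rewrites everything it merges."""
    copied = 0
    level = [1] * num_indexes          # each sub-index = 1 unit of data
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            next_level.append(sum(group))
            copied += sum(group)       # the merge rewrites the whole group
        level = next_level
    return copied

print(tiered_merge_cost(80, 10))  # 160: 80 units rewritten in gen1, then again in gen2
```

With 80 sub-indexes and a fanout of 10, the data is rewritten twice; that copying overhead is fixed work that small document counts cannot amortize.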
>> 
>> Also be aware that MRIT does NOT update docs with the same ID; this is
>> due to an inherent limitation of the Lucene mergeIndex process.
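
That limitation can be pictured with a minimal sketch (not Lucene code; it just shows that an addIndexes-style merge concatenates segments rather than replacing documents by id the way a live SolrCloud update would):

```python
def merge_segments(*segments):
    """Concatenate segments the way an index merge does: there is no
    lookup by id, so two docs with the same id both survive the merge."""
    merged = []
    for seg in segments:
        merged.extend(seg)
    return merged

seg_a = [{"id": "doc1", "value": 1}]
seg_b = [{"id": "doc1", "value": 2}]  # same id, newer value
print(len(merge_segments(seg_a, seg_b)))  # 2 -- the older doc is not replaced
```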
>> 
>> How long is "a long time"? Attachments tend to get filtered out, so if you
>> want us to see the graph you might paste it somewhere and provide a link.
>> 
>> Best,
>> Erick
>> 
>> On Mon, May 26, 2014 at 8:51 AM, Costi Muraru <costimuraru@gmail.com>
>> wrote:
>>> Hey guys,
>>> 
>>> I'm using the MapReduceIndexerTool to import data into a SolrCloud
>>> cluster made up of 3 decent machines.
>>> Looking in the JobTracker, I can see that the mapper jobs finish quite
>>> fast. The reduce jobs get to ~80% quite fast as well. It is here that
>>> they get stuck for a long period of time (picture + log attached).
>>> I'm only trying to insert ~80k documents with 10-50 different fields
>>> each. Why is this happening? Am I not setting something correctly? Is
>>> it the fact that most of the documents have different field names, or
>>> too many of them, for that matter?
>>> Any tips are gladly appreciated.
>>> 
>>> Thanks,
>>> Costi
>>> 
>>> From the reduce logs:
>>> 60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [IW][main]: commit: start
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [IW][main]: commit: enter lock
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [IW][main]: commit: now prepare
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [IW][main]: prepareCommit: flush
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [IW][main]:   index before flush
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [DW][main]: main startFullFlush
>>> 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
>>> hasTickets:false pendingChangesInFullFlush: false
>>> 60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [DWFC][main]: addFlushableState DocumentsWriterPerThread
>>> [pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
>>> bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
>>> deleteQueue=DWDQ: [ generation: 0 ]]
>>> 61542 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
>>> [DWPT][main]: flush postings as segment _0 numDocs=25603
>>> 61664 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 125115 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 199408 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 271088 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 336754 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 417810 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 479495 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 552357 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 621450 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 683173 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
>>> heart beat for 1 threads
>>> 
>>> This is the run command I'm using:
>>> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
>>> org.apache.solr.hadoop.MapReduceIndexerTool \
>>> --log4j /home/cmuraru/solr/log4j.properties \
>>> --morphline-file morphline.conf \
>>> --output-dir hdfs://nameservice1:8020/tmp/outdir \
>>> --verbose --go-live --zk-host localhost:2181/solr \
>>> --collection collection1 \
>>> hdfs://nameservice1:8020/tmp/indir
>> 

