Subject: Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents
From: Wolfgang Hoschek <whoschek@cloudera.com>
Date: Wed, 18 Jun 2014 12:11:47 -0700
To: solr-user@lucene.apache.org

Consider giving the MR tasks more RAM, for example via

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx2000m' ...

Wolfgang.
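If the cluster runs MR2/YARN, the roughly equivalent, reducer-specific settings would look like the sketch below (the heap and container sizes are illustrative only; mapred.child.java.opts above is the older MR1-style name and applies to both map and reduce tasks):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapreduce.reduce.java.opts=-Xmx2000m' \
  -D 'mapreduce.reduce.memory.mb=3072' \
  ...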
On May 26, 2014, at 10:48 AM, Costi Muraru wrote:

> Hey Erick,
>
> The job reducers began to die with "Error: Java heap space" after 1 hour
> and 22 minutes of being stuck at ~80%.
>
> I ran a few more tests:
>
> Test 1.
> 80,000 documents
> Each document had *20* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 33 seconds.
>
> Test 2.
> 80,000 documents
> Each document had *20* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: successful
> Execution time: 643 seconds.
>
> Test 3.
> 80,000 documents
> Each document had *50* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 45.96 seconds.
>
> Test 4.
> 80,000 documents
> Each document had *50* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: failed
> Execution time: the reducers failed after 1 hour.
> Unfortunately, this is my use case.
>
> My guess is that the reduce time (to perform the merges) depends on whether
> the field names are the same across documents. If they are different, the
> merge time increases dramatically. I don't know the internals of the Solr
> merge operation, but is it possible that it tries to group the fields with
> the same name across all the documents?
> In the first case, when the field names are the same across documents, the
> number of buckets is equal to the number of unique field names, which is 20.
> In the second case, where all the field names are different (my use case),
> it creates far more buckets (80k documents * 50 different field names = 4
> million buckets) and the process slows down significantly.
> Is this assumption correct? Is there any way to get around it?
>
> Thanks again for reaching out. I hope this is clearer now.
>
> This is what one of the 80k documents looks like (JSON format):
> {
>   "id" : "442247098240414508034066540706561683636",
>   "items" : {
>     "IT49597_1180_i" : 76,
>     "IT25363_1218_i" : 4,
>     "IT12418_1291_i" : 95,
>     "IT55979_1051_i" : 31,
>     "IT9841_1224_i" : 36,
>     "IT40463_1010_i" : 87,
>     "IT37932_1346_i" : 11,
>     "IT17653_1054_i" : 37,
>     "IT59414_1025_i" : 96,
>     "IT51080_1133_i" : 5,
>     "IT7369_1395_i" : 90,
>     "IT59974_1245_i" : 25,
>     "IT25374_1345_i" : 75,
>     "IT16825_1458_i" : 28,
>     "IT56643_1050_i" : 76,
>     "IT46274_1398_i" : 50,
>     "IT47411_1275_i" : 11,
>     "IT2791_1000_i" : 97,
>     "IT7708_1053_i" : 96,
>     "IT46622_1112_i" : 90,
>     "IT47161_1382_i" : 64
>   }
> }
>
> Costi
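A quick way to confirm how many distinct field names the input actually contains is sketched below (assumptions: the input files under /tmp/indir are JSON documents in the format shown above, and jq is available on the machine running the check). With shared field names it prints 20 or 50; with per-document field names it approaches 80,000 * 50 = 4,000,000, which is the failing case described above.

hadoop fs -cat hdfs://nameservice1:8020/tmp/indir/* \
  | jq -r '.items | keys[]' \
  | sort -u \
  | wc -l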
> On Mon, May 26, 2014 at 7:45 PM, Erick Erickson wrote:
>
>> The MapReduceIndexerTool is really intended for very large data sets,
>> and by today's standards 80K doesn't qualify :).
>>
>> Basically, MRIT creates N sub-indexes, then merges them, which it
>> may do in a tiered fashion. That is, it may merge gen1 to gen2, then
>> merge gen2 to gen3, etc. Which is great when indexing a bazillion
>> documents into 20 shards, but all that copying around may take
>> more time than you really gain for 80K docs.
>>
>> Also be aware that MRIT does NOT update docs with the same ID; this
>> is due to the inherent limitation of the Lucene mergeIndex process.
>>
>> How long is "a long time"? Attachments tend to get filtered out, so if you
>> want us to see the graph you might paste it somewhere and provide a link.
>>
>> Best,
>> Erick
>>
>> On Mon, May 26, 2014 at 8:51 AM, Costi Muraru wrote:
>>> Hey guys,
>>>
>>> I'm using the MergeReduceIndexerTool to import data into a SolrCloud
>>> cluster made up of 3 decent machines.
>>> Looking in the JobTracker, I can see that the mapper jobs finish quite
>>> fast. The reduce jobs get to ~80% quite fast as well. It is here where
>>> they get stuck for a long period of time (picture + log attached).
>>> I'm only trying to insert ~80k documents with 10-50 different fields
>>> each. Why is this happening? Am I not setting something correctly? Is
>>> it the fact that most of the documents have different field names, or
>>> too many of them for that matter?
>>> Any tips are gladly appreciated.
>>>
>>> Thanks,
>>> Costi
>>>
>>> From the reduce logs:
>>> 60208 [main] INFO org.apache.solr.update.UpdateHandler - start
>>> commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: start
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: enter lock
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: now prepare
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: prepareCommit: flush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: index before flush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DW][main]: main startFullFlush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
>>> hasTickets:false pendingChangesInFullFlush: false
>>> 60209 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DWFC][main]: addFlushableState DocumentsWriterPerThread
>>> [pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
>>> bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
>>> deleteQueue=DWDQ: [ generation: 0 ]]
>>> 61542 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DWPT][main]: flush postings as segment _0 numDocs=25603
>>> 61664 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 125115 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 199408 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 271088 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 336754 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 417810 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 479495 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 552357 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 621450 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 683173 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>>
>>> This is the run command I'm using:
>>> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
>>> org.apache.solr.hadoop.MapReduceIndexerTool \
>>> --log4j /home/cmuraru/solr/log4j.properties \
>>> --morphline-file morphline.conf \
>>> --output-dir hdfs://nameservice1:8020/tmp/outdir \
>>> --verbose --go-live --zk-host localhost:2181/solr \
>>> --collection collection1 \
>>> hdfs://nameservice1:8020/tmp/indir
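For completeness, a sketch of that same run command with Wolfgang's extra reducer heap folded in (the -Xmx value is illustrative; generic -D options must appear before the tool-specific flags):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx2000m' \
  --log4j /home/cmuraru/solr/log4j.properties \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1:8020/tmp/outdir \
  --verbose --go-live --zk-host localhost:2181/solr \
  --collection collection1 \
  hdfs://nameservice1:8020/tmp/indir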