Subject: Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents
From: Wolfgang Hoschek <whoschek@cloudera.com>
Date: Wed, 18 Jun 2014 12:11:47 -0700
To: solr-user@lucene.apache.org

Consider giving the MR tasks more RAM, for example via

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx2000m' ...

Wolfgang.
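If the cluster runs MR2/YARN, the roughly equivalent, reducer-specific settings would look like the sketch below (the heap and container sizes are illustrative only; mapred.child.java.opts above is the older MR1-style name and applies to both map and reduce tasks):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapreduce.reduce.java.opts=-Xmx2000m' \
  -D 'mapreduce.reduce.memory.mb=3072' \
  ...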
On May 26, 2014, at 10:48 AM, Costi Muraru wrote:

> Hey Erick,
>
> The job reducers began to die with "Error: Java heap space" after 1 hour
> and 22 minutes of being stuck at ~80%.
>
> I ran a few more tests:
>
> Test 1.
> 80,000 documents
> Each document had *20* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 33 seconds.
>
> Test 2.
> 80,000 documents
> Each document had *20* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: successful
> Execution time: 643 seconds.
>
> Test 3.
> 80,000 documents
> Each document had *50* fields. The field names were *the same* for all the
> documents. Values were different.
> Job status: successful
> Execution time: 45.96 seconds.
>
> Test 4.
> 80,000 documents
> Each document had *50* fields. The field names were *different* for all the
> documents. Values were also different.
> Job status: failed
> Execution time: the reducers failed after 1 hour.
> Unfortunately, this is my use case.
>
> My guess is that the reduce time (to perform the merges) depends on whether
> the field names are the same across documents. If they are different, the
> merge time increases dramatically. I don't know the internals of the Solr
> merge operation, but is it possible that it tries to group the fields with
> the same name across all the documents?
> In the first case, when the field names are the same across documents, the
> number of buckets is equal to the number of unique field names, which is 20.
> In the second case, where all the field names are different (my use case),
> it creates far more buckets (80k documents * 50 different field names = 4
> million buckets) and the process slows down significantly.
> Is this assumption correct? Is there any way to get around it?
>
> Thanks again for reaching out. I hope this is clearer now.
>
> This is what one of the 80k documents looks like (JSON format):
> {
>   "id" : "442247098240414508034066540706561683636",
>   "items" : {
>     "IT49597_1180_i" : 76,
>     "IT25363_1218_i" : 4,
>     "IT12418_1291_i" : 95,
>     "IT55979_1051_i" : 31,
>     "IT9841_1224_i" : 36,
>     "IT40463_1010_i" : 87,
>     "IT37932_1346_i" : 11,
>     "IT17653_1054_i" : 37,
>     "IT59414_1025_i" : 96,
>     "IT51080_1133_i" : 5,
>     "IT7369_1395_i" : 90,
>     "IT59974_1245_i" : 25,
>     "IT25374_1345_i" : 75,
>     "IT16825_1458_i" : 28,
>     "IT56643_1050_i" : 76,
>     "IT46274_1398_i" : 50,
>     "IT47411_1275_i" : 11,
>     "IT2791_1000_i" : 97,
>     "IT7708_1053_i" : 96,
>     "IT46622_1112_i" : 90,
>     "IT47161_1382_i" : 64
>   }
> }
>
> Costi
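A quick way to confirm how many distinct field names the input actually contains is sketched below (assumptions: the input files under /tmp/indir are JSON documents in the format shown above, and jq is available on the machine running the check). With shared field names it prints 20 or 50; with per-document field names it approaches 80,000 * 50 = 4,000,000, which is the failing case described above.

hadoop fs -cat hdfs://nameservice1:8020/tmp/indir/* \
  | jq -r '.items | keys[]' \
  | sort -u \
  | wc -l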
> On Mon, May 26, 2014 at 7:45 PM, Erick Erickson wrote:
>
>> The MapReduceIndexerTool is really intended for very large data sets,
>> and by today's standards 80K doesn't qualify :).
>>
>> Basically, MRIT creates N sub-indexes, then merges them, which it
>> may do in a tiered fashion. That is, it may merge gen1 to gen2, then
>> merge gen2 to gen3, etc. Which is great when indexing a bazillion
>> documents into 20 shards, but all that copying around may take
>> more time than you really gain for 80K docs.
>>
>> Also be aware that MRIT does NOT update docs with the same ID; this
>> is due to the inherent limitation of the Lucene mergeIndex process.
>>
>> How long is "a long time"? Attachments tend to get filtered out, so if you
>> want us to see the graph you might paste it somewhere and provide a link.
>>
>> Best,
>> Erick
>>
>> On Mon, May 26, 2014 at 8:51 AM, Costi Muraru wrote:
>>> Hey guys,
>>>
>>> I'm using the MergeReduceIndexerTool to import data into a SolrCloud
>>> cluster made up of 3 decent machines.
>>> Looking in the JobTracker, I can see that the mapper jobs finish quite
>>> fast. The reduce jobs get to ~80% quite fast as well. It is here where
>>> they get stuck for a long period of time (picture + log attached).
>>> I'm only trying to insert ~80k documents with 10-50 different fields
>>> each. Why is this happening? Am I not setting something correctly? Is
>>> it the fact that most of the documents have different field names, or
>>> too many of them for that matter?
>>> Any tips are gladly appreciated.
>>>
>>> Thanks,
>>> Costi
>>>
>>> From the reduce logs:
>>> 60208 [main] INFO org.apache.solr.update.UpdateHandler - start
>>> commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: start
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: enter lock
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: commit: now prepare
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: prepareCommit: flush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [IW][main]: index before flush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DW][main]: main startFullFlush
>>> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
>>> hasTickets:false pendingChangesInFullFlush: false
>>> 60209 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DWFC][main]: addFlushableState DocumentsWriterPerThread
>>> [pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
>>> bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
>>> deleteQueue=DWDQ: [ generation: 0 ]]
>>> 61542 [main] INFO org.apache.solr.update.LoggingInfoStream -
>>> [DWPT][main]: flush postings as segment _0 numDocs=25603
>>> 61664 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 125115 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 199408 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 271088 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 336754 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 417810 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 479495 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 552357 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 621450 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>> 683173 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing
>>> heart beat for 1 threads
>>>
>>> This is the run command I'm using:
>>> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
>>> org.apache.solr.hadoop.MapReduceIndexerTool \
>>> --log4j /home/cmuraru/solr/log4j.properties \
>>> --morphline-file morphline.conf \
>>> --output-dir hdfs://nameservice1:8020/tmp/outdir \
>>> --verbose --go-live --zk-host localhost:2181/solr \
>>> --collection collection1 \
>>> hdfs://nameservice1:8020/tmp/indir
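For completeness, a sketch of that same run command with Wolfgang's extra reducer heap folded in (the -Xmx value is illustrative; generic -D options must appear before the tool-specific flags):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx2000m' \
  --log4j /home/cmuraru/solr/log4j.properties \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1:8020/tmp/outdir \
  --verbose --go-live --zk-host localhost:2181/solr \
  --collection collection1 \
  hdfs://nameservice1:8020/tmp/indir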