lucene-solr-user mailing list archives

From Brett Hoerner <br...@bretthoerner.com>
Subject Solr mapred MTree merge stage ~6x slower in 4.10
Date Thu, 25 Sep 2014 17:36:03 GMT
As an update to this thread, it seems my MTree merge wasn't completely
hanging; it was just much slower in 4.10.

If I replace 4.9.0 with 4.10 in my jar, the MTree merge stage is 6x (or
more) slower (in my case, 20 min becomes 2 hours). I hope to bisect this in
the future, but the jobs I'm running take a long time. I haven't tried to
see if the issue shows on smaller jobs yet (does 1 minute become 6
minutes?).

Brett




On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner <brett@bretthoerner.com>
wrote:

> I have a very weird problem that I'm going to try to describe here to see
> if anyone has any "ah-ha" moments or clues. I haven't created a small
> reproducible project for this but I guess I will have to try in the future
> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
> jobs...)
>
> So, the facts:
>
> * Have been successfully using Solr mapred to build very large Solr
> clusters for months
> * As of Solr 4.10, *some* job sizes repeatably hang in the MTree merge
> phase
> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
> if I only change my Solr deps to 4.9
> * The job *does succeed* in 4.10 if I use the same data to create more,
> but smaller shards (e.g. 12x as many shards each 1/12th the size of the job
> that fails)
> * When creating my "normal size" shards (the size I want, that works in
> 4.9), the job hangs with 2 mappers running and 0 reducers in the MTree
> merge phase
> * There are no errors or warnings in the syslog/stderr of the MTree
> mappers, and no errors are ever echoed back to the "interactive run" of the
> job (mapper says 100%, reduce says 0%, will stay forever)
> * No CPU being used on the boxes running the merge, no GC happening, JVM
> waiting on a futex, all threads blocked on various queues
> * No disk usage problems, nothing else obviously wrong with any box in the
> cluster
>
> I diff'ed around between 4.10 and 4.9 and barely saw any changes in mapred
> contrib, mostly some test stuff. I didn't see any transitive dependency
> changes in Solr/Lucene that look like they would affect me.
>
