mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Speeding RowSimilarityJob-CooccurrencesMapper-SimilarityReducer up in RecommenderJob
Date Mon, 11 Apr 2011 15:45:43 GMT
Yes, the combiner phase can be a solid bottleneck in some cases here.
It helps to increase both io.sort.factor and io.sort.mb, so that it does,
say, a 100-way merge instead of a 10-way one, and uses maybe half your
worker's heap for in-memory spills rather than the default 100MB.

RecommenderJob already does this on the aggregate-and-recommend step
-- see the call to setIOSort(). You could apply the same call to
this step and see what happens. If it helps, we can commit that.

There are a load of factors involved in tuning this, and yes, it seems
like this is running far slower than it should. Having fewer than one
worker per machine can't be optimal, I think, so you may have to
tinker with that setting too.

I'd also up the child workers' memory, since you have 24GB of RAM on
each machine. That, plus the change to io.sort.mb, lets it do much more
merging in memory.
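
If you'd rather not touch code, you can pass the same overrides on the
command line (the numbers here are illustrative -- tune them to the heap
you actually give each child):

hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dio.sort.factor=100 \
  -Dio.sort.mb=1000 \
  -Dmapred.child.java.opts=-Xmx2000m \
  -Dmapred.input.dir=/user/kjack/input/ \
  -Dmapred.output.dir=/user/kjack/output/ \
  --similarityClassname org.apache.mahout.math.hadoop.similarity.vector.DistributedUncenteredCosineVectorSimilarity \
  --booleanData

Note that the number of task slots per node
(mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum) is a tasktracker setting in
mapred-site.xml, so that one has to be changed cluster-side rather than
per job.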



On Mon, Apr 11, 2011 at 2:49 PM, Kris Jack <mrkrisjack@gmail.com> wrote:
> Sure.  I called the Mahout 0.5 job with these parameters:
>
> $ hadoop jar mahout-core-0.5-SNAPSHOT-job.jar
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> -Dmapred.input.dir=/user/kjack/input/
> -Dmapred.output.dir=/user/kjack/output/ --similarityClassname
> org.apache.mahout.math.hadoop.similarity.vector.DistributedUncenteredCosineVectorSimilarity
> --booleanData
>
> The data set is quite sparse.  ~3M items have been rated by at least 2
> users.  ~50K items have been rated by at least 10 users.  ~500 items have
> been rated by at least 50 users.
>
> Running on a cluster with 15 nodes, 24GB ram each.
>
> It's not that the job didn't finish, it's just that it ran very slowly.  By
> letting Hadoop assign mappers and reducers automatically, only 9 mappers and
> 1 reducer were assigned.  I might have to up these too.
>
> I looked at the output from the MaybePruneRowsMapper and CooccurrencesMapper
> jobs and couldn't find any counters with those names.  I just got the basic
> ones (e.g. FILE_BYTES_READ, HDFS_BYTES_READ, FILE_BYTES_WRITTEN, etc.).  I'm
> looking at the counters from the job output through the jobtracker.  Am I
> looking in the wrong place?
>
> Thanks,
> Kris
>
>
>
> 2011/4/11 Sebastian Schelter <ssc@apache.org>
>
>> Could you give us some more details? Like all the parameters you called the
>> job with and some hints about the cluster you ran this on? And some hints
>> about your data, especially how dense/sparse it is?
>>
>> My first suspicion would be that far too many items need to be compared
>> with each other. Are you using Mahout 0.5? (You should be, because 0.4 has a
>> bug in this job.) Also, can you give us the values of the following counters?
>>
>>  - MaybePruneRowsMapper Elements.USED and ELEMENTS.NEGLECTED
>>  - CooccurrencesMapper Counter.COOCCURRENCES
>>
>> --sebastian
>>
>>
>>
>> On 11.04.2011 14:17, Kris Jack wrote:
>>
>>> Hi everyone,
>>>
>>> I'm running the RecommenderJob with a fairly large dataset (e.g. 175K+
>>> users, 3M+ items, 40M+ preferences).
>>>
>>> When it gets to the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer
>>> job, it takes a long time to run.
>>>
>>> I think that the main problem is that there's a lot of spilling going on
>>> when the record buffer fills up.  Here's an extract from the logs:
>>>
>>>
>>> 2011-04-01 21:53:24,245 WARN org.apache.hadoop.conf.Configuration:
>>>
>>> /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a
>>> attempt to override final parameter: mapred.local.dir;  Ignoring.
>>> 2011-04-01 21:53:25,630 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>> Initializing JVM Metrics with processName=MAP, sessionId=
>>> 2011-04-01 21:53:25,792 WARN org.apache.hadoop.conf.Configuration:
>>>
>>> /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a
>>> attempt to override final parameter: mapred.local.dir;  Ignoring.
>>> 2011-04-01 21:53:25,968 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb
>>> = 100
>>> 2011-04-01 21:53:26,040 INFO org.apache.hadoop.mapred.MapTask: data
>>> buffer = 79691776/99614720
>>> 2011-04-01 21:53:26,041 INFO org.apache.hadoop.mapred.MapTask: record
>>> buffer = 262144/327680
>>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 0; bufend = 10492787; bufvoid = 99614720
>>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 0; kvend = 262144; length = 327680
>>> 2011-04-01 21:53:31,420 INFO org.apache.hadoop.util.NativeCodeLoader:
>>> Loaded the native-hadoop library
>>> 2011-04-01 21:53:31,422 INFO
>>> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
>>> initialized native-zlib library
>>> 2011-04-01 21:53:31,424 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new compressor
>>> 2011-04-01 21:53:32,747 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 0
>>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 10492787; bufend = 21060166; bufvoid = 99614720
>>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 262144; kvend = 196607; length = 327680
>>> 2011-04-01 21:53:40,004 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 1
>>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 21060166; bufend = 31021600; bufvoid = 99614720
>>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 196607; kvend = 131070; length = 327680
>>> 2011-04-01 21:53:54,636 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 2
>>> 2011-04-01 21:53:54,921 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 31021600; bufend = 41605937; bufvoid = 99614720
>>> 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 131070; kvend = 65533; length = 327680
>>> 2011-04-01 21:54:04,370 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 3
>>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 41605937; bufend = 52550080; bufvoid = 99614720
>>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 65533; kvend = 327677; length = 327680
>>> 2011-04-01 21:54:17,717 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 4
>>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 52550080; bufend = 63431004; bufvoid = 99614720
>>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 327677; kvend = 262140; length = 327680
>>> 2011-04-01 21:54:24,260 INFO org.apache.hadoop.mapred.MapTask: Finished
>>> spill 5
>>>
>>> ...
>>>
>>> 2011-04-06 11:18:09,932 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 42207
>>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 53614451; bufend = 64698505; bufvoid = 99614720
>>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 97306; kvend = 31769; length = 327680
>>> 2011-04-06 11:18:18,905 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 42208
>>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask:
>>> Spilling map output: record full = true
>>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask:
>>> bufstart = 64698505; bufend = 75588890; bufvoid = 99614720
>>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 31769; kvend = 293913; length = 327680
>>> 2011-04-06 11:18:32,523 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 42209
>>> 2011-04-06 11:18:32,683 INFO org.apache.hadoop.mapred.MapTask:
>>> Starting flush of map output
>>> 2011-04-06 11:18:35,894 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 42210
>>> 2011-04-06 11:19:11,530 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 42211 sorted segments
>>> 2011-04-06 11:19:11,595 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,605 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,621 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,636 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,646 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,656 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,667 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,685 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,686 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,704 INFO org.apache.hadoop.io.compress.CodecPool:
>>> Got brand-new decompressor
>>> 2011-04-06 11:19:11,705 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42211
>>> 2011-04-06 11:19:37,409 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42202
>>> 2011-04-06 11:19:59,087 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42193
>>> 2011-04-06 11:20:19,060 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42184
>>> 2011-04-06 11:20:36,387 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42175
>>> 2011-04-06 11:21:00,826 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42166
>>> 2011-04-06 11:21:20,049 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42157
>>> 2011-04-06 11:21:42,120 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42148
>>> 2011-04-06 11:21:58,431 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42139
>>> 2011-04-06 11:22:15,594 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 42130
>>>
>>> ...
>>>
>>>
>>> 2011-04-09 00:43:01,590 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 73
>>> 2011-04-09 01:20:59,688 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 64
>>> 2011-04-09 01:57:23,938 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 55
>>> 2011-04-09 02:26:39,086 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 46
>>> 2011-04-09 05:08:55,609 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 37
>>> 2011-04-09 09:59:05,399 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 28
>>> 2011-04-09 14:59:58,730 INFO org.apache.hadoop.mapred.Merger: Merging
>>> 10 intermediate segments out of a total of 19
>>> 2011-04-09 19:28:18,921 INFO org.apache.hadoop.mapred.Merger: Down to
>>> the last merge-pass, with 10 segments left of total size: 24786384532
>>> bytes
>>> 2011-04-10 14:55:46,726 INFO org.apache.hadoop.mapred.TaskRunner:
>>> Task:attempt_201103291618_0156_m_000000_1 is done. And is in the
>>> process of commiting
>>> 2011-04-10 14:55:46,763 INFO org.apache.hadoop.mapred.TaskRunner: Task
>>> 'attempt_201103291618_0156_m_000000_1' done.
>>>
>>>
>>> This task took 10 days to run.  I'm guessing that it should run faster?
>>>
>>> Any general advice for speeding this up?
>>>
>>> Should I increase the io.sort.mb parameter?  It's currently set at
>>> 100; should it be more like 1,000?  I see that Mahout attempts to
>>> increase this parameter for the last job in RecommenderJob, but it
>>> doesn't seem to for the preceding jobs.  Is that right?
>>>
>>> Thanks,
>>>
>>> Kris
>>>
>>>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
>
