mahout-user mailing list archives

From: Sebastian Schelter <...@apache.org>
Subject: Re: LDA/CVB Performance
Date: Thu, 13 Jun 2013 20:36:27 GMT
I'd have to recheck with my colleague, but I'm pretty sure it worked
when we tried it. What should prevent the jobtracker from scheduling
only one map task per machine/tasktracker for a specific job?


-sebastian

On 13.06.2013 22:31, Alan Gardner wrote:
> mapred.tasktracker.map.tasks.maximum is loaded at tasktracker startup; I
> don't think you can configure it per job. More granular resource control is
> a job for Mesos or YARN; plain MapReduce doesn't support this sort of thing.
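(For reference: the setting in question lives in mapred-site.xml on each
worker node and, as noted above, is only read when the tasktracker daemon
starts; the value 8 below is just an example.)

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>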
> 
> I think for our deployment we'll carve out a chunk of the cluster, sized
> and configured exclusively to do ML 24/7. If we don't need that much
> capacity, it might be better to spin up an Elastic MapReduce cluster for a
> few hours every day.
> 
> 
> On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <ssc@apache.org> wrote:
> 
>> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
>>> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>
>>>> This table is read-only, right? We could try to apply the trick from our
>>>> ALS code: instead of running one mapper per core (and thus having one
>>>> copy of the table per core), run a multithreaded mapper and share the
>>>> table among its threads. This works very well for ALS.
>>>
>>>
>>> Just out of my ignorance, how will you tell MR that your mapper is using
>>> more than 1 core and that it doesn't have to run more than 1 mapper of
>>> that type per box?
>>
>> You need to use a MultithreadedMapper, for which you can set the size of
>> the thread pool via MultithreadedMapper.setNumberOfThreads(...).
>>
>> You can configure the maximum number of mappers to run per task tracker
>> with -Dmapred.tasktracker.map.tasks.maximum=x
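A minimal sketch of that wiring, for illustration: CvbMapper below is a
placeholder name for the actual mapper class, while MultithreadedMapper and
its static setters come from org.apache.hadoop.mapreduce.lib.map.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedCvbSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "cvb-multithreaded");
    // The map task itself is a MultithreadedMapper, which fans input records
    // out to a pool of worker threads inside one JVM.
    job.setMapperClass(MultithreadedMapper.class);
    // The mapper that does the real work (placeholder name); all threads share
    // the task's JVM, so one in-memory copy of the table can serve all of them.
    MultithreadedMapper.setMapperClass(job, CvbMapper.class);
    // e.g. one map task with 8 threads instead of 8 single-threaded map tasks.
    MultithreadedMapper.setNumberOfThreads(job, 8);
    return job;
  }
}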
>>
>>>
>>>
>>>> We can also cache
>>>> the table in a static variable and make Hadoop reuse JVMs, which
>>>> increases performance if the number of blocks to process is larger than
>>>> the number of map slots.
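A minimal sketch of that caching pattern, assuming a hypothetical
TermTopicTable class with a load() helper: with JVM reuse enabled (e.g.
-Dmapred.job.reuse.jvm.num.tasks=-1 on classic MapReduce), later tasks of the
same job that land in a reused JVM find the static field already populated.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

public class CachingCvbMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  // Static, so it survives across map tasks of the same job that run in a reused JVM.
  private static TermTopicTable table;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    synchronized (CachingCvbMapper.class) {
      if (table == null) {
        // Hypothetical loader for the read-only term-topic table; runs once per JVM.
        table = TermTopicTable.load(context.getConfiguration());
      }
    }
  }

  @Override
  protected void map(IntWritable docId, VectorWritable doc, Context context)
      throws IOException, InterruptedException {
    // The map logic reads from the shared, read-only table as usual.
  }
}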
>>>>
>>>
>>> This is usually (or might be) something the admin doesn't let us override.
>>> Also, I am not sure if JVM reuse in Hadoop is isolated between different
>>> jobs (so other tasks may inherit stuff they probably don't want).
>>
>> The JVM is only reused within a single job.
>>
>>>>
>>>> -sebastian
>>>>
>>>> On 13.06.2013 21:56, Ted Dunning wrote:
>>>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>>>>>
>>>>>> Andy, note that he said he's running with a 1.6M-term dictionary. That's
>>>>>> going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
>>>>>> Still not hitting 8GB, but getting closer.
>>>>>>
>>>>>
>>>>> It will likely be even worse unless this table is shared between mappers.
>>>>> With 8 mappers per node, this goes to 41GB. The OP didn't mention machine
>>>>> configuration, but this could easily cause swapping.
>>>>>
>>>>
>>>>
>>>
>>
>>
> 
> 

