mahout-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Date Thu, 02 Feb 2012 19:25:11 GMT
Hi Nicholas,

On Feb 2, 2012, at 10:56am, Nicholas Kolegraff wrote:

> Ok, I took a deeper look into this, having changed some parameters and
> kicked off the new job..
> 
> Seems plausible that I didn't have enough memory for some of the mappers --
> unless I'm missing something here.
> An upper bound on the memory would be (assuming my original parameter of 25
> features):
> 8M rows * 25 features = 200M entries
> (multiply by 8 bytes, assuming double-precision floating point) and we get:
> 1.6 billion bytes
> 1.6B / (1024^3) = ~1.5GB memory needed
> 
> The tasktracker heapsize and datanode heap sizes were only set to: 1GB
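
For reference, that back-of-envelope estimate can be written as a few runnable lines (a minimal sketch; the 8M-row and 25-feature figures are the ones from this thread):

```python
def feature_matrix_gb(num_rows, num_features, bytes_per_entry=8):
    """Size of a dense feature matrix (rows x features doubles), in GiB."""
    return num_rows * num_features * bytes_per_entry / 1024**3

# 8M rows x 25 features, double precision -> ~1.49 GiB
print(f"{feature_matrix_gb(8_000_000, 25):.2f} GiB")
```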

The memory you need for this task is based on the mapred.child.java.opts setting (the -Xmx
setting), not what's allocated for the NameNode, JobTracker, DataNode or TaskTracker.

In fact, increasing the DataNode & TaskTracker heap sizes removes memory that could/should be
used by the child JVMs that the TaskTracker creates to run your map & reduce tasks.
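
That child-JVM heap is typically set in mapred-site.xml (or via -D on the command line); a minimal sketch, with the 4096m value only as an example:

```xml
<!-- mapred-site.xml: heap given to each child JVM that the
     TaskTracker spawns for map/reduce tasks (example value) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
```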

Currently it looks like you have 4GB allocated per child task on m2.2xlarge, which should be
sufficient given your analysis above.

-- Ken

> 
> So I have changed the bootstrap action on EC2 as follows (this is a diff
> between the original and the changes I made)
> # Parameters of the array:
> # [mapred.child.java.opts, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum]
> 29c29
> <   "m2.2xlarge"  => ["-Xmx4096m", "6",  "2"],
> ---
>>  "m2.2xlarge"  => ["-Xmx8192m", "3",  "2"],
> # Parameters of the array (Vars modified in hadoop.env.sh)
> # [HADOOP_JOBTRACKER_HEAPSIZE, HADOOP_NAMENODE_HEAPSIZE, HADOOP_TASKTRACKER_HEAPSIZE, HADOOP_DATANODE_HEAPSIZE]
> 47c47
> <   "m2.2xlarge"  => ["2048", "8192", "1024", "1024"],
> ---
>>  "m2.2xlarge"  => ["4096", "16384", "2048", "2048"]
> 
> 
> 
> On Thu, Feb 2, 2012 at 8:40 AM, Sebastian Schelter <ssc@apache.org> wrote:
> 
>> Hmm, are you sure that the mappers have enough memory? You can set that
>> via -Dmapred.child.java.opts=-Xmx[some number]m
>> 
>> --sebastian
>> 
>> On 02.02.2012 17:37, Nicholas Kolegraff wrote:
>>> Sounds good. Thanks Sebastian
>>> 
>>> The interesting thing is -- I tried to sample the matrix down one time to
>>> about 10% of the non-zeros -- and it worked with no problem.
>>> 
>>> On Thu, Feb 2, 2012 at 8:31 AM, Sebastian Schelter <ssc@apache.org>
>> wrote:
>>> 
>>>> Your parameters look good, except if you have binary data, you should
>>>> set --implicitFeedback=true. You could also set numFeatures to a very
>>>> small value (like 5) just to see if that helps.
>>>> 
>>>> The mappers load one of the feature matrices into memory, which is dense
>>>> (#items x #features entries or #users x #features entries). Are you sure
>>>> that the mappers have enough memory for that?
>>>> 
>>>> It's really strange that you have problems with such small data; I
>>>> tested this with Netflix (> 100M non-zeros) on a few machines and it
>>>> worked quite well.
>>>> 
>>>> --sebastian
>>>> 
>>>> 
>>>> 
>>>> On 02.02.2012 17:25, Nicholas Kolegraff wrote:
>>>>> I will up the ante with the timeout and report back -- thanks all for the
>>>>> suggestions
>>>>> 
>>>>> Hey, Sebastian -- Here are the arguments I am using:
>>>>> --input matrix --output ALS --numFeatures 25 --numIterations 10 --lambda 0.065
>>>>> When the mapper loads the matrix into memory it only loads the actual
>>>>> non-zero data, correct?
>>>>> 
>>>>> Hey Ted -- I messed up on the sparsity.  Turns out there are only 70M
>>>>> non-zero elements.
>>>>> 
>>>>> Oh, and, I only have binary data -- I wasn't sure of the implications with
>>>>> ALS-WR on binary data -- I couldn't find anything to suggest otherwise.
>>>>> I am using data of the format user,item,1
>>>>> I have read about probabilistic factorization -- which works with binary
>>>>> data -- and perhaps naively, thought ALS-WR was similar so what-the-heck :-)
>>>>> 
>>>>> I'd love nothing more than to share the data, however, I'd probably get in
>>>>> some trouble :-)
>>>>> Perhaps I could generate a matrix with a similar distribution? -- I'll have
>>>>> to check on that and see if it is ok #bureaucracy
>>>>> 
>>>>> Stay tuned...
>>>>> 
>>>>> On Thu, Feb 2, 2012 at 1:47 AM, Sebastian Schelter <ssc@apache.org> wrote:
>>>>> 
>>>>>> Nicholas,
>>>>>> 
>>>>>> can you give us the detailed arguments you start the job with? I'd
>>>>>> especially be interested in the number of features (--numFeatures) you
>>>>>> use. Do you use the job with implicit feedback data
>>>>>> (--implicitFeedback=true)?
>>>>>> 
>>>>>> The memory requirements of the job are the following:
>>>>>> 
>>>>>> In each iteration either the item-features matrix (items x features) or
>>>>>> the user-features matrix (users x features) is loaded into the memory of
>>>>>> each mapper. Then the original user-item matrix (or its transpose) is
>>>>>> read row-wise by the mappers and they recompute the features via
>>>>>> AlternatingLeastSquaresSolver/ImplicitFeedbackAlternatingLeastSquaresSolver.
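
For intuition, one such per-row recomputation in the explicit-feedback case looks roughly like this -- a minimal NumPy sketch of the weighted-lambda-regularized solve, not Mahout's actual Java implementation:

```python
import numpy as np

def recompute_row(ratings, fixed_features, lam):
    """One ALS-WR row update: solve the regularized least-squares
    problem for a single user (or item), holding the other feature
    matrix fixed.  `ratings` maps column id -> rating (the non-zeros
    of one row); `fixed_features` is the dense (n x k) matrix each
    mapper holds in memory."""
    ids = sorted(ratings)
    M = fixed_features[ids]               # (n_u x k) rows we actually need
    r = np.array([ratings[i] for i in ids])
    n_u, k = M.shape
    A = M.T @ M + lam * n_u * np.eye(k)   # weighted-lambda regularization
    return np.linalg.solve(A, M.T @ r)    # new k-dim feature vector
```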
>>>>>> 
>>>>>> --sebastian
>>>>>> 
>>>>>> 
>>>>>> On 02.02.2012 09:53, Sean Owen wrote:
>>>>>>> I have seen this happen in "normal" operation when the sorting on the
>>>>>>> mapper is taking a long, long time, because the output is large. You can
>>>>>>> tell it to increase the timeout. If this is what is happening, you won't
>>>>>>> have a chance to update a counter as a keep-alive ping, but yes that is
>>>>>>> generally right otherwise. If this is the case it's that a mapper is
>>>>>>> outputting a whole lot of info, perhaps 'too much'. I don't know for sure,
>>>>>>> just another guess for the pile.
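
The timeout Sean mentions is controlled by mapred.task.timeout (milliseconds); a sketch of raising it from the 10-minute default, in mapred-site.xml or via -D on the command line (the 30-minute value is just an example):

```xml
<!-- mapred-site.xml: ms a task may go without reporting progress
     before it is killed (default 600000 = 10 min; example: 30 min) -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```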
>>>>>>> 
>>>>>>> On Thu, Feb 2, 2012 at 1:44 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Status reporting happens automatically when output is generated. In a
>>>>>>>> long computation, it is good form to occasionally update a counter or
>>>>>>>> otherwise indicate that the computation is still progressing.
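
In a Java mapper that keep-alive would be a counter update or progress call on the task context; with Hadoop Streaming the same sign of life can be sent as a counter line on stderr. A minimal sketch of the Streaming form (group/counter names here are arbitrary examples):

```python
import sys

def keep_alive(group="MyJob", counter="heartbeat", amount=1):
    """Emit a Hadoop Streaming counter update on stderr; the framework
    counts it as progress and resets the task timeout."""
    sys.stderr.write(f"reporter:counter:{group},{counter},{amount}\n")
    sys.stderr.flush()
```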
>>>>>>>> 
>>>>>>>> On Wed, Feb 1, 2012 at 5:23 PM, Nicholas Kolegraff <nickkolegraff@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Do you know if it should still report status in the midst of a complex
>>>>>>>>> task?  Seems questionable that it wouldn't just send a friendly hello?
>>>>>>>>> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




