mahout-user mailing list archives

From lastarsenal <lastarse...@163.com>
Subject Re:Re: Hadoop SSVD OutOfMemory Problem
Date Tue, 28 Apr 2015 03:39:24 GMT
Hi, Dmitriy Lyubimov 


OK, I have submitted a JIRA issue at https://issues.apache.org/jira/browse/MAHOUT-1700


I'm a newbie to Mahout, so what should I do next with this issue? Thank you!

At 2015-04-28 02:16:37, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>Thank you for this analysis. I can't immediately confirm this since it's
>been a while, but it sounds credible.
>
>Would you mind filing a JIRA with all this information, and perhaps also
>opening a PR on GitHub?
>
>thank you.
>
>On Mon, Apr 27, 2015 at 4:32 AM, lastarsenal <lastarsenal@163.com> wrote:
>
>> Hi, All,
>>
>>
>>      Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0)
>> and hit a java heap space OutOfMemory problem in ABtDenseOutJob. I
>> found the cause; the ABtDenseOutJob map code is as follows:
>>
>>
>>     protected void map(Writable key, VectorWritable value, Context context)
>>       throws IOException, InterruptedException {
>>
>>       Vector vec = value.get();
>>
>>       int vecSize = vec.size();
>>       if (aCols == null) {
>>         aCols = new Vector[vecSize];
>>       } else if (aCols.length < vecSize) {
>>         aCols = Arrays.copyOf(aCols, vecSize);
>>       }
>>
>>       if (vec.isDense()) {
>>         for (int i = 0; i < vecSize; i++) {
>>           extendAColIfNeeded(i, aRowCount + 1);
>>           aCols[i].setQuick(aRowCount, vec.getQuick(i));
>>         }
>>       } else if (vec.size() > 0) {
>>         for (Vector.Element vecEl : vec.nonZeroes()) {
>>           int i = vecEl.index();
>>           extendAColIfNeeded(i, aRowCount + 1);
>>           aCols[i].setQuick(aRowCount, vecEl.get());
>>         }
>>       }
>>       aRowCount++;
>>     }
>>
>>
>> If the input is a RandomAccessSparseVector, as is common with big data, its
>> vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new Vector[vecSize]
>> triggers the OutOfMemory problem. The usual workaround is to enlarge every
>> tasktracker's maximum heap:
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx1024m</value>
>> </property>
>> However, if you are NOT a Hadoop administrator or in ops, you have no
>> permission to modify that config. So I modified the ABtDenseOutJob map
>> code to handle the RandomAccessSparseVector case: I use a HashMap to
>> represent aCols instead of the original Vector[] aCols array. The modified
>> code is as follows:
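For intuition about why the allocation alone blows the task heap, here is a back-of-the-envelope sketch. This is plain JDK code, not Mahout code, and it assumes a 64-bit JVM with 8-byte (uncompressed) object references; with compressed oops the figure halves but still dwarfs a 1 GiB heap.

```java
// Estimate the size of the EMPTY aCols array when vecSize == Integer.MAX_VALUE.
// Assumption: 8 bytes per object reference (64-bit JVM, uncompressed oops).
public class ArraySizeEstimate {
    public static void main(String[] args) {
        long slots = Integer.MAX_VALUE;   // nominal cardinality of a RandomAccessSparseVector
        long refBytes = 8;                // reference size assumption
        long bytes = slots * refBytes;
        // prints "15 GiB just for the empty Vector[] array"
        System.out.println(bytes / (1024L * 1024 * 1024)
            + " GiB just for the empty Vector[] array");
    }
}
```

That is roughly 16 GiB of references before a single column vector is even created, which is why no realistic -Xmx setting can absorb it.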
>>
>>
>> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>>
>>     protected void map(Writable key, VectorWritable value, Context context)
>>       throws IOException, InterruptedException {
>>
>>       Vector vec = value.get();
>>       int vecSize = vec.size();  // needed by the dense branch below
>>       if (vec.isDense()) {
>>         for (int i = 0; i < vecSize; i++) {
>>           //extendAColIfNeeded(i, aRowCount + 1);
>>           if (aColsMap.get(i) == null) {
>>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>>           }
>>           aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>>           //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>>         }
>>       } else if (vec.size() > 0) {
>>         for (Vector.Element vecEl : vec.nonZeroes()) {
>>           int i = vecEl.index();
>>           //extendAColIfNeeded(i, aRowCount + 1);
>>           if (aColsMap.get(i) == null) {
>>             aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>>           }
>>           aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>>           //aCols[i].setQuick(aRowCount, vecEl.get());
>>         }
>>       }
>>       aRowCount++;
>>     }
>>
>>
>> With this change, the OutOfMemory problem goes away.
>>
>>
>> Thank you!
>>
>>
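The lazy-allocation idea in the patch above can be illustrated without Mahout at all. The sketch below is a hypothetical stand-in that replaces Mahout's Vector[] and RandomAccessSparseVector with plain JDK maps; the class and method names are illustrative, not part of ABtDenseOutJob. The point it demonstrates: columns are created only when a nonzero entry touches them, so memory scales with the number of distinct columns actually seen, not with the vector's nominal cardinality of Integer.MAX_VALUE.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy column allocation (hypothetical stand-in, not Mahout code).
// Storage grows with the number of columns holding data, not with the
// nominal column index range.
public class LazyColumns {
    // column index -> (row index -> value)
    private final Map<Integer, Map<Integer, Double>> aCols = new HashMap<>();
    private int aRowCount = 0;

    void mapRow(Map<Integer, Double> sparseRow) {  // nonzeroes of one input row
        for (Map.Entry<Integer, Double> e : sparseRow.entrySet()) {
            // create the column on first touch, mirroring the aColsMap null check
            aCols.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .put(aRowCount, e.getValue());
        }
        aRowCount++;
    }

    int allocatedColumns() { return aCols.size(); }

    public static void main(String[] args) {
        LazyColumns lc = new LazyColumns();
        Map<Integer, Double> row = new HashMap<>();
        row.put(3, 1.5);
        row.put(2_000_000_000, 2.5);       // a huge column index costs nothing extra
        lc.mapRow(row);
        // prints "2": two columns allocated, not two billion slots
        System.out.println(lc.allocatedColumns());
    }
}
```

The trade-off is per-entry boxing and hashing overhead versus the array version's O(1) indexed access, which is why this only pays off when the column space is huge and sparsely populated, as in the RandomAccessSparseVector case described above.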