Subject: Re: OutOfMemory during Plain Java MapReduce
From: Michael Segel <michael_segel@hotmail.com>
Date: Fri, 8 Mar 2013 08:39:53 -0600
To: user@hadoop.apache.org

"A potential problem could be, that a reduce is going to write files >600MB
and our mapred.child.java.opts is set to ~380MB."

Isn't the minimum heap normally 512MB?

Why not just increase your child heap size, assuming you have enough memory
on the box...

On Mar 8, 2013, at 4:57 AM, Harsh J wrote:

> Hi,
>
> When you implement code that starts memory-storing value copies for
> every record (even if of just a single key), things are going to break
> in big-data-land. Practically, post-partitioning, the # of values for
> a given key can be huge given the source data, so you cannot hold it
> all in and then write in one go. You'd probably need to write out
> something continuously if you really really want to do this, or use an
> alternative form of key-value storage where updates can be made
> incrementally (Apache HBase is such a store, as one example).
>
> This has been discussed before IIRC, and if the goal were to store the
> outputs onto a file then it's better to just directly serialize them
> with a file opened instead of keeping it in a data structure and
> serializing it at the end. The caveats that'd apply if you were to
> open your own file from a task are described at
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F.
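[A minimal sketch of the "write something continuously" idea above: a reducer that streams each value to a side file on HDFS as it arrives instead of collecting values in memory. The class name, output path, and NullWritable output types are illustrative assumptions, not from the thread, and the speculative-execution caveats from the FAQ link above still apply.]

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only: streams (appId, userId) pairs to a per-task side file
// instead of buffering all userIds of a key on the heap.
public class StreamingUserToAppReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private FSDataOutputStream out;

    @Override
    protected void setup(final Context context) throws IOException, InterruptedException {
        // One side file per reduce attempt, named after the task attempt to avoid clashes
        // (see the FAQ caveats about writing to HDFS directly from tasks).
        final FileSystem fs = FileSystem.get(context.getConfiguration());
        final Path path = new Path("/tmp/user-to-app/" + context.getTaskAttemptID().toString());
        this.out = fs.create(path);
    }

    @Override
    protected void reduce(final Text appId, final Iterable<Text> userIds, final Context context)
            throws IOException, InterruptedException {
        // Each value goes straight to HDFS; nothing is accumulated in a Set or List.
        for (final Text userId : userIds) {
            this.out.writeBytes(appId.toString() + "\t" + userId.toString() + "\n");
        }
    }

    @Override
    protected void cleanup(final Context context) throws IOException, InterruptedException {
        this.out.close();
    }
}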
>
> On Fri, Mar 8, 2013 at 4:35 AM, Christian Schneider wrote:
>> I had a look at the stack trace and it says the problem is at the reducer:
>> userSet.add(iterator.next().toString());
>>
>> Error: Java heap space
>> attempt_201303072200_0016_r_000002_0: WARN : mapreduce.Counters - Group
>> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
>> org.apache.hadoop.mapreduce.TaskCounter instead
>> attempt_201303072200_0016_r_000002_0: WARN :
>> org.apache.hadoop.conf.Configuration - session.id is deprecated. Instead,
>> use dfs.metrics.session-id
>> attempt_201303072200_0016_r_000002_0: WARN :
>> org.apache.hadoop.conf.Configuration - slave.host.name is deprecated.
>> Instead, use dfs.datanode.hostname
>> attempt_201303072200_0016_r_000002_0: FATAL: org.apache.hadoop.mapred.Child
>> - Error running child : java.lang.OutOfMemoryError: Java heap space
>> attempt_201303072200_0016_r_000002_0: at
>> java.util.Arrays.copyOfRange(Arrays.java:3209)
>> attempt_201303072200_0016_r_000002_0: at
>> java.lang.String.<init>(String.java:215)
>> attempt_201303072200_0016_r_000002_0: at
>> java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
>> attempt_201303072200_0016_r_000002_0: at
>> java.nio.CharBuffer.toString(CharBuffer.java:1157)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.io.Text.decode(Text.java:394)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.io.Text.decode(Text.java:371)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.io.Text.toString(Text.java:273)
>> attempt_201303072200_0016_r_000002_0: at
>> com.myCompany.UserToAppReducer.reduce(RankingReducer.java:21)
>> attempt_201303072200_0016_r_000002_0: at
>> com.myCompany.UserToAppReducer.reduce(RankingReducer.java:1)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>> attempt_201303072200_0016_r_000002_0: at
>> java.security.AccessController.doPrivileged(Native Method)
>> attempt_201303072200_0016_r_000002_0: at
>> javax.security.auth.Subject.doAs(Subject.java:396)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>> attempt_201303072200_0016_r_000002_0: at
>> org.apache.hadoop.mapred.Child.main(Child.java:262)
>>
>> But how to solve this?
>>
>>
>> 2013/3/7 Christian Schneider
>>>
>>> Hi,
>>> during the Reduce phase or afterwards (I don't really know how to debug
>>> it) I get a heap out-of-memory exception.
>>>
>>> I guess this is because the value of the reduce task (a custom Writable)
>>> holds a Set with a lot of user ids.
>>> The setup is quite simple.
>>> These are the related classes I used:
>>>
>>> //-----------------------------------------------
>>> // The Reducer
>>> // It just adds all userIds of the Iterable to the UserSetWritable
>>> //-----------------------------------------------
>>> public class UserToAppReducer extends Reducer<Text, Text, Text, UserSetWritable> {
>>>
>>>     @Override
>>>     protected void reduce(final Text appId, final Iterable<Text> userIds,
>>>             final Context context) throws IOException, InterruptedException {
>>>         final UserSetWritable userSet = new UserSetWritable();
>>>
>>>         final Iterator<Text> iterator = userIds.iterator();
>>>         while (iterator.hasNext()) {
>>>             userSet.add(iterator.next().toString());
>>>         }
>>>
>>>         context.write(appId, userSet);
>>>     }
>>> }
>>>
>>> //-----------------------------------------------
>>> // The Custom Writable
>>> // Needed to implement my own toString() method to bring the output into the
>>> // right format. Maybe I could also do this with my own OutputFormat class.
>>> //-----------------------------------------------
>>> public class UserSetWritable implements Writable {
>>>     private final Set<String> userIds = new HashSet<String>();
>>>
>>>     public void add(final String userId) {
>>>         this.userIds.add(userId);
>>>     }
>>>
>>>     @Override
>>>     public void write(final DataOutput out) throws IOException {
>>>         out.writeInt(this.userIds.size());
>>>         for (final String userId : this.userIds) {
>>>             out.writeUTF(userId);
>>>         }
>>>     }
>>>
>>>     @Override
>>>     public void readFields(final DataInput in) throws IOException {
>>>         final int size = in.readInt();
>>>         for (int i = 0; i < size; i++) {
>>>             final String readUTF = in.readUTF();
>>>             this.userIds.add(readUTF);
>>>         }
>>>     }
>>>
>>>     @Override
>>>     public String toString() {
>>>         String result = "";
>>>         for (final String userId : this.userIds) {
>>>             result += userId + "\t";
>>>         }
>>>
>>>         result += this.userIds.size();
>>>         return result;
>>>     }
>>> }
>>>
>>> As OutputFormat I used the default TextOutputFormat.
>>>
>>> A potential problem could be that a reduce is going to write files >600MB
>>> and our mapred.child.java.opts is set to ~380MB.
>>> I dug deeper into the TextOutputFormat and saw that the
>>> HdfsDataOutputStream does not implement .flush(),
>>> and .flush() is also not used in TextOutputFormat. Does this mean that the
>>> whole file is kept in RAM and only persisted at the end of processing?
>>> That would of course lead to the exception.
>>>
>>> With Pig I am able to query the same data, even with only one reducer.
>>> But I have a bet to make it faster with plain MapReduce :)
>>>
>>> Could you help me debug this and maybe point me in the right
>>> direction?
>>>
>>> Best Regards,
>>> Christian.
>>
>>
>
>
>
> --
> Harsh J
>
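[For the quick fix Michael Segel suggests at the top of the thread (raising the per-task child heap rather than restructuring the job), the relevant knob is mapred.child.java.opts. A minimal driver-side sketch follows; the driver class name, job name, and the 1024 MB figure are illustrative assumptions, not values from the thread, and the heap must still fit the memory available per task slot.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative driver sketch: raise the child JVM heap for map and reduce tasks.
public class UserToAppDriver {
    public static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // On MRv1 this single property sizes both the map and reduce child JVMs.
        conf.set("mapred.child.java.opts", "-Xmx1024m");

        final Job job = new Job(conf, "user-to-app");
        // ... set jar, mapper, reducer, key/value classes and input/output paths as usual ...
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}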