hadoop-mapreduce-user mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)
Date Sun, 07 Aug 2016 19:12:07 GMT
Hi,

Recently, after upgrading to CDH 5.8.0, I ran into a performance
issue when reading data from AWS S3 (via s3a).

A job [1] reads tens of thousands of files ("objects") from S3 and
writes extracted data back to S3. Every file/object is about 1 GB in
size; processing is CPU-intensive and takes a couple of minutes per
file/object. Each file/object is processed by one task using
FilenameInputFormat.
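
For illustration, a rough sketch of the one-task-per-file setup (the
real job uses FilenameInputFormat from ia-hadoop-tools; the class below
is only a hypothetical stand-in that disables input splitting, so each
~1 GB object ends up in exactly one map task):

  // Hypothetical stand-in for FilenameInputFormat: never split inputs,
  // so every S3 object is read by exactly one map task.
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class WholeObjectInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
          return false;  // one file/object == one task
      }
  }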

After the upgrade to CDH 5.8.0, the job made slow progress, 5-6 times
slower overall than in previous runs. A significant number of tasks
hung without progress for up to one hour. These hung tasks dominated
the job, and most nodes in the cluster showed little or no CPU
utilization. The tasks were not killed/restarted because the task
timeout is set to a very large value (S3 is known to be slow at
times). Attaching to a couple of the hung tasks with jstack showed
that they hang while reading from S3 [3].

The problem was finally fixed by setting
  fs.s3a.connection.timeout = 30000  (default: 200000 ms)
  fs.s3a.attempts.maximum   = 5      (default: 20)
Tasks now take 20 minutes in the worst case; the majority finish
within minutes.
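
For reference, the same values can also be set programmatically on the
job configuration; a minimal sketch (the property names are the
standard s3a keys, the class/method names are just for illustration):

  import org.apache.hadoop.conf.Configuration;

  public class S3ATuning {
      // Lower the s3a socket timeout (milliseconds) and the number of
      // retries the embedded AWS SDK client performs before failing.
      public static Configuration tighten(Configuration conf) {
          conf.setInt("fs.s3a.connection.timeout", 30000);
          conf.setInt("fs.s3a.attempts.maximum", 5);
          return conf;
      }
  }

If the driver goes through ToolRunner, passing
-Dfs.s3a.connection.timeout=30000 -Dfs.s3a.attempts.maximum=5 on the
command line, or adding the entries to core-site.xml, should work as
well.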

Is this the correct way to fix the problem?
These settings were recently increased in HADOOP-12346 [2].
What could be the drawbacks of a lower timeout?

Thanks,
Sebastian

[1] https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java

[2] https://issues.apache.org/jira/browse/HADOOP-12346

[3] "main" prio=10 tid=0x00007fad64013000 nid=0x4ab5 runnable [0x00007fad6b274000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at com.cloudera.org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
        at com.cloudera.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
        at com.cloudera.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at com.cloudera.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
        at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
        at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
        - locked <0x00000007765604f8> (a org.apache.hadoop.fs.s3a.S3AInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        ...


