hadoop-mapreduce-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)
Date Tue, 16 Aug 2016 18:59:56 GMT
Hello Sebastian,

This is an interesting finding.  Thank you for reporting it.

Are you able to share a bit more about your deployment architecture?  Are these EC2 VMs? 
If so, are they co-located in the same AWS region as the S3 bucket?  If the cluster is not
running in EC2 (e.g. on-premises physical hardware), then are there any notable differences
on nodes that experienced this problem (e.g. smaller capacity on the outbound NIC)?

This is just a theory, but if your bandwidth to the S3 service is intermittently saturated,
throttled, or otherwise compromised, then I could see how longer timeouts and more retries
might increase overall job time.  With the shorter settings, individual task attempts would
fail sooner.  If the next attempt then gets scheduled to a different node with better
bandwidth to S3, it starts making progress sooner, and the overall job execution time could
improve as a result.
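
As a rough back-of-the-envelope bound (assuming the socket timeout applies per attempt and a
blocked read is retried up to the configured maximum, which I have not verified against the
SDK internals):

    worst-case stall per read  ~=  fs.s3a.attempts.maximum * fs.s3a.connection.timeout
    CDH 5.8.0 defaults:  20 * 200,000 ms  ~=  67 minutes
    your settings:        5 *  30,000 ms   =   2.5 minutes

That would be roughly consistent with tasks hanging for up to an hour before making progress
again.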

--Chris Nauroth

On 8/7/16, 12:12 PM, "Sebastian Nagel" <wastl.nagel@googlemail.com> wrote:

    Hi,
    
    recently, after upgrading to CDH 5.8.0, I've run into a performance
    issue when reading data from AWS S3 (via s3a).
    
    A job [1] reads tens of thousands of files ("objects") from S3 and writes
    extracted data back to S3. Every file/object is about 1 GB in size; processing
    is CPU-intensive and takes a couple of minutes per file/object. Each
    file/object is processed by one task using FilenameInputFormat.
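    
    As a minimal sketch of the idea only (not the actual FilenameInputFormat from [1]):
    an input format that keeps each file in a single split, so exactly one map task
    reads one S3 object end-to-end.
    
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    
      // Whole-file-per-task input format: with splitting disabled, every ~1 GB object
      // is assigned to a single map task. The class name is a placeholder.
      public class WholeFileInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
              return false;
          }
      }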
    
    After the upgrade to CDH 5.8.0, the job showed slow progress, 5-6 times
    slower overall than in previous runs. A significant number of tasks hung
    without progress for up to one hour. These tasks dominated the job, and
    most nodes in the cluster showed little or no CPU utilization. The tasks
    were not killed/restarted because the task timeout is set to a very large
    value (S3 is known to be slow sometimes). Attaching jstack to a couple of
    the hung tasks showed that they hang while reading from S3 [3].
    
    The problem was finally fixed by setting
      fs.s3a.connection.timeout = 30000  (default: 200000 ms)
      fs.s3a.attempts.maximum = 5        (default: 20)
    Tasks now take 20 min. in the worst case; the majority finish within minutes.
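    
    For reference, one way these could be applied per job (a sketch, assuming the driver
    builds its own Configuration; class and job names are placeholders, and the values are
    simply the ones that worked here):
    
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;
    
      public class S3aTimeoutExample {
          public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              // Give up on a stuck S3 read after 30 s instead of 200 s ...
              conf.setInt("fs.s3a.connection.timeout", 30000);
              // ... and let the S3A client retry at most 5 times instead of 20.
              conf.setInt("fs.s3a.attempts.maximum", 5);
              Job job = Job.getInstance(conf, "WEATGenerator");
              // ... set input/output formats and paths as in [1], then submit the job.
          }
      }
    
    The same properties can also be passed with -D on the command line when the driver
    goes through ToolRunner/GenericOptionsParser.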
    
    Is this the correct way to fix the problem?
    The default values of these settings were increased recently in HADOOP-12346 [2].
    What could be the drawbacks of a lower timeout?
    
    Thanks,
    Sebastian
    
    [1]
    https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java
    
    [2] https://issues.apache.org/jira/browse/HADOOP-12346
    
    [3] "main" prio=10 tid=0x00007fad64013000 nid=0x4ab5 runnable [0x00007fad6b274000]
       java.lang.Thread.State: RUNNABLE
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at com.cloudera.org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
            at com.cloudera.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
            at com.cloudera.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
            - locked <0x00000007765604f8> (a org.apache.hadoop.fs.s3a.S3AInputStream)
            at java.io.DataInputStream.read(DataInputStream.java:149)
            ...
    
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
    For additional commands, e-mail: user-help@hadoop.apache.org
    
    
    
