hadoop-mapreduce-user mailing list archives

From max scalf <oracle.bl...@gmail.com>
Subject Re: Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)
Date Sat, 20 Aug 2016 14:35:32 GMT
Just out of curiosity, have you enabled a VPC endpoint for S3?  Hopefully you
are running this cluster inside a VPC; if so, an endpoint would help, as the
S3 traffic would not go out over the Internet.

Have any new policies been put in place for your S3 bucket?  Others have
mentioned possible throttling.

On Wed, Aug 17, 2016, 3:22 PM Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi Dheeren, hi Chris,
>
>
> >> Are you able to share a bit more about your deployment architecture?  Are these EC2 VMs?
> >> If so, are they co-located in the same AWS region as the S3 bucket?
>
> Running a cluster of 100 m1.xlarge EC2 instances with Ubuntu 14.04
> (ami-41a20f2a).
> The cluster is running in a single availability zone (us-east-1d); the S3 bucket
> is in the same region (us-east-1).
>
> % lsb_release -d
> Description:    Ubuntu 14.04.3 LTS
>
> % uname -a
> Linux ip-10-91-235-121 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> > Did you change the Java JDK version as well, as part of the upgrade?
>
> Java is taken as provided by Ubuntu:
>
> % java -version
> java version "1.7.0_111"
> OpenJDK Runtime Environment (IcedTea 2.6.7) (7u111-2.6.7-0ubuntu0.14.04.3)
> OpenJDK 64-Bit Server VM (build 24.111-b01, mixed mode)
>
> Cloudera CDH is installed from
>
> http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
>
> After the jobs are done the cluster is shut down and bootstrapped (bash +
> cloudinit) anew on demand.
> A new launch of the cluster may, of course, include updates of
>  - the underlying Amazon machine image
>  - Ubuntu packages
>  - Cloudera packages
>
> And the real cause of the problem may lie in any of these changes.
> The update to Cloudera CDH 5.8.0 was the most obvious change made around the
> time the problems appeared (first seen 2016-08-01).
>
> >> If the cluster is not running in EC2 (e.g. on-premises physical hardware), then are there any
> >> notable differences on nodes that experienced this problem (e.g. smaller capacity on the
> >> outbound NIC)?
>
> Probably not, although I cannot exclude this. Over the last few days I've run into
> problems which could be related: a few tasks are slow or even seem to hang, e.g.
> reducers during copy. But that also looks more like a Hadoop (configuration)
> problem. Network throughput between nodes measured with iperf is not great
> but generally ok (5-20 MBit/s).
>
> >> This is just a theory, but if your bandwidth to the S3 service is intermittently saturated or
> >> throttled or somehow compromised, then I could see how longer timeouts and more retries might
> >> increase overall job time.  With the shorter settings, it might cause individual task attempts to
> >> fail sooner.  Then, if the next attempt gets scheduled to a different node with better bandwidth to
> >> S3, it would start making progress faster in the second attempt.  Then, the effect on overall job
> >> execution might be faster.
>
> That's also my assumption. When connecting to S3, a server is selected
> which is fast at that moment. While copying 1 GB, which takes a couple of
> minutes just because of general network throughput, that server may become
> more loaded. On reconnecting, a better server is chosen.
>
> Btw., tasks do not fail with a moderate timeout: 30 sec. is ok, but with
> lower values (a few seconds) file uploads frequently fail.
>
> I've seen this behavior with a simple distcp from S3: with the default
> values, it took 1 day to copy 300 GB from S3 to HDFS. After choosing a
> shorter timeout the job finished within 5 hours.
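
To put the distcp numbers above in perspective, here is a rough sketch of the average throughput they imply. It assumes "1 day" means 24 hours, takes "within 5 hours" as exactly 5 hours, and uses 1 GB = 1000 MB; the function name is made up for illustration.

```python
def throughput_mb_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in MB/s, taking 1 GB = 1000 MB."""
    return gigabytes * 1000 / (hours * 3600)

# Figures from the message: 300 GB in about 1 day vs. within 5 hours.
slow = throughput_mb_per_s(300, 24)  # default timeout and retry settings
fast = throughput_mb_per_s(300, 5)   # shorter timeout

print(f"default timeout: {slow:.1f} MB/s")
print(f"shorter timeout: {fast:.1f} MB/s")
print(f"speed-up:        {fast / slow:.1f}x")
```

Since the second run finished "within" 5 hours, the real speed-up may well be somewhat higher than this estimate.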
>
> Thanks,
> Sebastian
>
> On 08/16/2016 09:11 PM, Dheeren Bebortha wrote:
> > Did you change the Java JDK version as well, as part of the upgrade?
> > Dheeren
> >
> >> On Aug 16, 2016, at 11:59 AM, Chris Nauroth <cnauroth@hortonworks.com> wrote:
> >>
> >> Hello Sebastian,
> >>
> >> This is an interesting finding.  Thank you for reporting it.
> >>
> >> Are you able to share a bit more about your deployment architecture?
> >> Are these EC2 VMs?  If so, are they co-located in the same AWS region as
> >> the S3 bucket?  If the cluster is not running in EC2 (e.g. on-premises
> >> physical hardware), then are there any notable differences on nodes that
> >> experienced this problem (e.g. smaller capacity on the outbound NIC)?
> >>
> >> This is just a theory, but if your bandwidth to the S3 service is
> >> intermittently saturated or throttled or somehow compromised, then I could
> >> see how longer timeouts and more retries might increase overall job time.
> >> With the shorter settings, it might cause individual task attempts to fail
> >> sooner.  Then, if the next attempt gets scheduled to a different node with
> >> better bandwidth to S3, it would start making progress faster in the second
> >> attempt.  Then, the effect on overall job execution might be faster.
> >>
> >> --Chris Nauroth
> >>
> >> On 8/7/16, 12:12 PM, "Sebastian Nagel" <wastl.nagel@googlemail.com> wrote:
> >>
> >>    Hi,
> >>
> >>    recently, after upgrading to CDH 5.8.0, I've run into a performance
> >>    issue when reading data from AWS S3 (via s3a).
> >>
> >>    A job [1] reads tens of thousands of files ("objects") from S3 and
> >>    writes extracted data back to S3. Every file/object is about 1 GB in
> >>    size; processing is CPU-intensive and takes a couple of minutes per
> >>    file/object. Each file/object is processed by one task using
> >>    FilenameInputFormat.
> >>
> >>    After the upgrade to CDH 5.8.0, the job showed slow progress, 5-6
> >>    times slower overall than in previous runs. A significant number
> >>    of tasks hung without progress for up to one hour. These tasks
> >>    dominated the run time, and most nodes in the cluster showed little
> >>    or no CPU utilization. Tasks are not killed/restarted because the
> >>    task timeout is set to a very large value (because S3 is known to be
> >>    slow sometimes). Attaching to a couple of the hung tasks with jstack
> >>    showed that these tasks hang when reading from S3 [3].
> >>
> >>    The problem was finally fixed by setting
> >>      fs.s3a.connection.timeout = 30000  (default: 200000 ms)
> >>      fs.s3a.attempts.maximum = 5        (default: 20)
> >>    Tasks now take 20 min. in the worst case; the majority finish within minutes.
> >>
> >>    Is this the correct way to fix the problem?
> >>    These settings have been increased recently in HADOOP-12346 [2].
> >>    What could be the drawbacks of a lower timeout?
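
A back-of-the-envelope sketch of why these two settings matter together. It assumes (this is not confirmed anywhere in the thread) that a stalled read can block for the full connection timeout on every attempt, and that attempts follow one another without backoff. The function name is made up for illustration; only the four numbers come from the message.

```python
def worst_case_stall_seconds(timeout_ms: int, max_attempts: int) -> float:
    """Rough upper bound: every attempt blocks for the full timeout."""
    return timeout_ms * max_attempts / 1000.0

# fs.s3a.connection.timeout (ms) and fs.s3a.attempts.maximum from the message.
defaults = worst_case_stall_seconds(200_000, 20)  # defaults quoted in the message
tuned = worst_case_stall_seconds(30_000, 5)       # values that fixed the job

print(f"defaults: {defaults / 60:.1f} min")
print(f"tuned:    {tuned / 60:.1f} min")
```

Under these assumptions the defaults allow a stall of roughly an hour, the same order of magnitude as the hangs reported above, while the tuned values cap it at a few minutes. The real retry behaviour may differ, so treat this as an estimate only.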
> >>
> >>    Thanks,
> >>    Sebastian
> >>
> >>    [1] https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java
> >>
> >>    [2] https://issues.apache.org/jira/browse/HADOOP-12346
> >>
> >>    [3] "main" prio=10 tid=0x00007fad64013000 nid=0x4ab5 runnable [0x00007fad6b274000]
> >>       java.lang.Thread.State: RUNNABLE
> >>            at java.net.SocketInputStream.socketRead0(Native Method)
> >>            at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >>            at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>            at com.cloudera.org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
> >>            at com.cloudera.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
> >>            at com.cloudera.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at com.cloudera.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
> >>            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> >>            at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
> >>            - locked <0x00000007765604f8> (a org.apache.hadoop.fs.s3a.S3AInputStream)
> >>            at java.io.DataInputStream.read(DataInputStream.java:149)
> >>            ...
> >>
> >>    ---------------------------------------------------------------------
> >>    To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> >>    For additional commands, e-mail: user-help@hadoop.apache.org
