hadoop-common-user mailing list archives

From Chen He <airb...@gmail.com>
Subject Re: Data Locality Importance
Date Sun, 23 Mar 2014 05:55:26 GMT
Hi Mike

Data locality rests on an assumption: that storage access (disk, SSD, etc.)
is faster than transferring data over the network. Vinod has already
explained the benefits. But locality in the map stage does not always help.
If a fat node stores a large file, the current MR framework may assign a lot
of map tasks from a single job to this node and then congest its network
during the shuffle.
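
To make the trade-off concrete, here is a rough back-of-envelope sketch in
Python. It is only an illustration: every number (block size, throughputs,
task counts, node count) is a made-up assumption rather than a measurement,
and the model assumes that all map output leaves the node and that the NIC is
the only bottleneck.

```python
def read_seconds(block_mb, throughput_mb_s):
    """Seconds to read one input block at a sustained throughput."""
    return block_mb / throughput_mb_s


def shuffle_serve_seconds(tasks_on_node, output_mb_per_task, nic_mb_s):
    """Seconds for one node to push all of its map output through its NIC.

    Simplification: every byte of map output leaves the node, and the NIC
    is the only bottleneck.
    """
    return tasks_on_node * output_mb_per_task / nic_mb_s


# All figures below are hypothetical assumptions, not measurements.
BLOCK_MB = 128
LOCAL_DISK_MB_S = 100.0
REMOTE_READ_MB_S = 60.0
NIC_MB_S = 100.0
MAP_TASKS = 40
OUTPUT_MB_PER_TASK = 64
NODES = 4

local_read = read_seconds(BLOCK_MB, LOCAL_DISK_MB_S)
remote_read = read_seconds(BLOCK_MB, REMOTE_READ_MB_S)

# Case 1: map tasks spread evenly over the nodes.
spread = shuffle_serve_seconds(MAP_TASKS // NODES, OUTPUT_MB_PER_TASK, NIC_MB_S)
# Case 2: every map task runs data-local on the single fat node.
fat_node = shuffle_serve_seconds(MAP_TASKS, OUTPUT_MB_PER_TASK, NIC_MB_S)

print(f"per-block read: local {local_read:.2f}s, remote {remote_read:.2f}s")
print(f"shuffle serve time: spread {spread:.1f}s, fat node {fat_node:.1f}s")
```

Under these assumptions each local read is faster than a remote one, but the
fat node needs four times as long to serve its map output as any node in the
evenly spread case, so the read-side gain can be lost again in the shuffle.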

I am not sure how EMR is implemented at the physical layer. If the nodes are
all virtual machines, it is possible that your "separate" HDFS cluster and MR
cluster still benefit from local data access.

Chen


On Sat, Mar 22, 2014 at 11:07 PM, Sathya <sathya@morisonmenon.com> wrote:

> "VOTE FOR MODI" or teach me how not to get mails
>
> -----Original Message-----
> From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com] On Behalf Of
> Vinod Kumar Vavilapalli
> Sent: Sunday, March 23, 2014 12:20 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Data Locality Importance
>
> Like you said, it depends both on the kind of network you have and the type
> of your workload.
>
> Given your point about S3, I'd guess your input files/blocks are not large
> enough that moving code to data trumps moving data itself to the code. When
> that balance tilts a lot, especially when moving large input data
> files/blocks, data-locality will help improve performance significantly.
> That, or when the read throughput from a remote disk << the read throughput
> from a local disk.
>
> HTH
> +Vinod
>
> On Mar 21, 2014, at 7:06 PM, Mike Sam <mikesam460@gmail.com> wrote:
>
> > How important is Data Locality to Hadoop? I mean, if we prefer to
> > separate the HDFS cluster from the MR cluster, we will lose data
> > locality but my question is how bad is this assuming we provide a
> > reasonable network connection between the two clusters? EMR kills data
> > locality when using S3 as storage but we do not see a significant job
> > time difference running the same job from the HDFS cluster of the same
> > setup. So, I am wondering how important is Data Locality to Hadoop in
> > practice?
> >
> > Thanks,
> > Mike
>
