hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: fuse-dfs then samba mount
Date Thu, 13 Aug 2009 07:12:18 GMT
On Thu, Aug 13, 2009 at 12:04 AM, Manhee Jo <jo@nttdocomo.com> wrote:

> Hi all,
> I've succeeded in sharing hdfs files from windows xp through fuse-dfs then
> samba mount.
> When I tried to copy (read and write) 1GB text file from fuse-dfs over
> samba, it took around 50 secs.
> Then, I tried "dfs get" the same file to a data node's local file system
> and tried to copy the file
> from the data node (without fuse-dfs this time) over samba, again, which
> took around 30 seconds.
> Since the disk reads are parallel and distributed, shouldn't it be faster
> than reading from one node?

Nope - the file is stored distributed, but a single reader (using dfs -get
or the DFSClient API from Java) won't do a parallel read from multiple
replicas. What you've seen seems about right - there's a measurable overhead
of going through the datanode compared to just using local disk.
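
To put the two timings quoted above in the same units, here is a rough back-of-the-envelope calculation. It assumes "1GB" means 1024 MiB and takes the reported 50 and 30 seconds at face value; both numbers are approximate, so treat the result as an estimate only.

```python
# Rough throughput for the two timings reported in the question.
# Assumption: "1GB" is taken as 1024 MiB; timings are the approximate ones quoted.
FILE_MIB = 1024


def throughput_mib_per_s(size_mib, seconds):
    """Average throughput in MiB/s for a transfer of size_mib taking `seconds`."""
    return size_mib / seconds


via_fuse_dfs = throughput_mib_per_s(FILE_MIB, 50)    # fuse-dfs, then samba
via_local_disk = throughput_mib_per_s(FILE_MIB, 30)  # local disk, then samba
slowdown = via_fuse_dfs and via_local_disk / via_fuse_dfs

print(f"fuse-dfs path:   {via_fuse_dfs:.1f} MiB/s")
print(f"local-disk path: {via_local_disk:.1f} MiB/s")
print(f"local disk is about {slowdown:.2f}x faster")
```

Under these assumptions the gap is well under a factor of two, which fits the picture of a fixed per-read overhead rather than a missing parallel speedup.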

> Well, I know it must depend on the file size. So then, here is my question.
> What is actually happening in fuse-dfs read? and samba?

It's a single connection to one datanode at a time. At the end of each
block, it connects to the DN that stores the next block and reads from that
one. At no time does it transfer in parallel from multiple replicas. Some
people have requested this as a feature, but it hasn't been given high
priority yet, for a number of reasons.
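
The read pattern described above can be sketched as a toy model. This is not the DFSClient or fuse-dfs code: the `Block` class, the datanode names, and the "take the first replica" rule are all hypothetical simplifications, and how the real client chooses among replicas is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Toy stand-in for an HDFS block (hypothetical, for illustration only)."""
    block_id: int
    data: bytes
    replicas: List[str]  # made-up names of datanodes holding a copy


def read_file_sequentially(blocks: List[Block]) -> Tuple[bytes, List[str]]:
    """Read blocks in order, talking to exactly one datanode per block."""
    out = bytearray()
    connections = []
    for block in blocks:
        # Simplification: always use the first listed replica.
        # The remaining replicas of this block are never contacted.
        datanode = block.replicas[0]
        connections.append(datanode)
        out += block.data  # the whole block comes from that single datanode
    return bytes(out), connections


blocks = [
    Block(0, b"aaaa", ["dn1", "dn2", "dn3"]),
    Block(1, b"bbbb", ["dn2", "dn3", "dn4"]),
    Block(2, b"cc", ["dn4", "dn1", "dn2"]),
]
data, connections = read_file_sequentially(blocks)
print(data)
print(connections)
```

The point of the sketch is that the number of connections equals the number of blocks and they are opened one after another, so a single reader is bounded by what one datanode connection can deliver at a time.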

