hadoop-common-dev mailing list archives

From "Craig Macdonald (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4) tool to mount dfs on linux
Date Tue, 19 Feb 2008 22:30:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570484#action_12570484 ]

Craig Macdonald commented on HADOOP-4:
--------------------------------------

Pete,

I have been experimenting with fuse_dfs.c and have a few questions:

(1) I am using a previous version of fuse_dfs.c, mainly because I don't have bootstrap.sh.
However, regarding the new fuse_dfs.c option parsing: is it compatible with being invoked
via mount.fuse and autofs?

This is how I currently mount, using an autofs map containing:
{code}
hdfs            -fstype=fuse,rw,nodev,nonempty,noatime,allow_other  :/path/to/fuse_dfs_mount/fuse_dfs.sh\#dfs\://namenode\:9000
{code}
fuse_dfs.sh is just a shell script that sets CLASSPATH and LD_LIBRARY_PATH and essentially
just execs fuse_dfs. If I changed to the more recent version, I think I would probably have
to put the dfs://namenode:9000 configuration into the wrapper script instead.
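
For illustration, here is a rough, untested sketch of how the dfs:// target could be accepted as the first non-option argument via libfuse's fuse_opt_parse(), which is what mount.fuse / autofs ends up passing; the names (dfs_options, dfs_opt_proc) are mine for illustration and not taken from the current fuse_dfs.c:
{code}
/* Untested sketch: accept "dfs://host:port" as the first non-option
 * argument, which is how mount.fuse / autofs hands the target over.
 * dfs_options and dfs_opt_proc are made-up names for illustration. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <fuse_opt.h>
#include <stdio.h>
#include <string.h>

struct dfs_options {
    char *server;               /* e.g. "dfs://namenode:9000" */
};

static struct dfs_options options;

static struct fuse_opt dfs_opts[] = {
    FUSE_OPT_END
};

static int dfs_opt_proc(void *data, const char *arg, int key,
                        struct fuse_args *outargs)
{
    (void) data; (void) outargs;
    /* The first bare argument that looks like dfs://... is the filesystem. */
    if (key == FUSE_OPT_KEY_NONOPT && options.server == NULL &&
        strncmp(arg, "dfs://", 6) == 0) {
        options.server = strdup(arg);
        return 0;               /* consume it: don't pass it on to FUSE */
    }
    return 1;                   /* keep the mountpoint and -o options */
}

int main(int argc, char *argv[])
{
    struct fuse_args args = FUSE_ARGS_INIT(argc, argv);

    if (fuse_opt_parse(&args, &options, dfs_opts, dfs_opt_proc) == -1)
        return 1;
    if (options.server == NULL) {
        fprintf(stderr, "usage: fuse_dfs dfs://namenode:port mountpoint [options]\n");
        return 1;
    }
    /* ... set up struct fuse_operations and call
     *     fuse_main(args.argc, args.argv, &oper, NULL) ... */
    return 0;
}
{code}
With something along those lines, the autofs map could keep passing dfs://namenode:9000 on the command line rather than baking it into the wrapper script.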

(2) Have you done any sort of performance testing? I'm experimenting with HDFS for use in
a mixed environment (hadoop and non-hadoop jobs), and the throughput I see is miserable. For
example, I use a test cluster of 8 P3-1GHz nodes, with a similar client machine on a 100 Mbit network.

Below, I compare cat-ing a 512MB file via (a) an NFS mount on the same network as the cluster
nodes, (b) the hadoop fs -cat frontend, and (c) the FUSE HDFS mount.

{noformat}
# (a)
$ time cat /mnt/tmp/data.df > /dev/null

real 0m47.280s
user 0m0.059s
sys 0m2.476s

# (b)
$ time bin/hadoop fs -cat hdfs:///user/craigm/data.df > /dev/null

real 0m48.839s
user 0m16.256s
sys 0m7.001s

# (c)
$ time cat /misc/hdfs/user/craigm/data.df >/dev/null

real    1m41.686s
user    0m0.135s
sys     0m2.302s
{noformat}

Note that NFS and hadoop fs -cat both achieve about 10.5 MB/s (512MB in roughly 48s), while the
HDFS FUSE mount (in /misc/hdfs) manages only about 5 MB/s. Is this an expected overhead for FUSE?

I did try tuning rd_buf_size to match the size of the reads the kernel was requesting, i.e.
128KB instead of 32KB; however, this made matters worse:

{noformat}
# with 128KB buffer size
$ time cat /misc/hdfs/user/craigm/data.df >/dev/null

real    2m11.080s
user    0m0.113s
sys     0m2.180s
{noformat}

Perhaps one option would be to keep the HDFS file open between reads and time out the connection
when it goes unused; another would be to read ahead more than is needed and keep it in memory.
Both would overly complicate the neat code though!
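
To be concrete about the first idea, here is a rough, untested sketch (using the libhdfs C API; dfs_fs is a hypothetical global connection handle, and error handling is trimmed) of caching the open file handle in fi->fh so it survives between reads:
{code}
/* Untested sketch of the "keep the HDFS file open between reads" idea:
 * open once, stash the libhdfs handle in fi->fh, reuse it for every
 * read, close on release.  dfs_fs is a hypothetical global hdfsFS
 * connection established at startup; error handling is trimmed. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <fcntl.h>
#include <errno.h>
#include <stdint.h>
#include "hdfs.h"

static hdfsFS dfs_fs;           /* connected once, e.g. at startup */

static int dfs_open(const char *path, struct fuse_file_info *fi)
{
    hdfsFile f = hdfsOpenFile(dfs_fs, path, O_RDONLY, 0, 0, 0);
    if (f == NULL)
        return -ENOENT;
    fi->fh = (uint64_t) (uintptr_t) f;          /* remember the handle */
    return 0;
}

static int dfs_read(const char *path, char *buf, size_t size, off_t offset,
                    struct fuse_file_info *fi)
{
    (void) path;
    hdfsFile f = (hdfsFile) (uintptr_t) fi->fh; /* reuse the open handle */
    tSize n = hdfsPread(dfs_fs, f, offset, buf, (tSize) size);
    return n < 0 ? -EIO : n;
}

static int dfs_release(const char *path, struct fuse_file_info *fi)
{
    (void) path;
    hdfsCloseFile(dfs_fs, (hdfsFile) (uintptr_t) fi->fh);
    return 0;
}
{code}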

(3) If I use autofs for HDFS, mounts will time out quickly (after 30 seconds) and then reconnect
on demand. Perhaps fuse_dfs.c could implement the FUSE destroy operation to free up the
connection to the namenode, etc.?
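
Something along these lines, again only a sketch reusing the hypothetical dfs_fs handle from above:
{code}
/* Untested sketch: a destroy handler so the namenode connection is
 * released when autofs expires the mount.  dfs_fs is the same
 * hypothetical global handle as in the previous sketch. */
static void dfs_destroy(void *private_data)
{
    (void) private_data;
    if (dfs_fs != NULL) {
        hdfsDisconnect(dfs_fs);   /* drop the connection to the namenode */
        dfs_fs = NULL;
    }
}

static struct fuse_operations dfs_oper = {
    .open    = dfs_open,
    .read    = dfs_read,
    .release = dfs_release,
    .destroy = dfs_destroy,      /* passed to fuse_main() at startup */
};
{code}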

Cheers

Craig

> tool to mount dfs on linux
> --------------------------
>
>                 Key: HADOOP-4
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.5.0
>         Environment: linux only
>            Reporter: John Xing
>            Assignee: Doug Cutting
>         Attachments: fuse-dfs.tar.gz, fuse-dfs.tar.gz, fuse-dfs.tar.gz, fuse-hadoop-0.1.0_fuse-j.2.2.3_hadoop.0.5.0.tar.gz,
>                      fuse-hadoop-0.1.0_fuse-j.2.4_hadoop.0.5.0.tar.gz, fuse-hadoop-0.1.1.tar.gz, fuse-j-hadoopfs-03.tar.gz,
>                      fuse_dfs.c, fuse_dfs.c, Makefile
>
>
> tool to mount dfs on linux

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

