hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: parallel cat
Date Wed, 06 Jul 2011 11:35:41 GMT
On 06/07/11 11:08, Rita wrote:
> I have many large files, ranging from 2GB to 800GB, and I use hadoop fs -cat a
> lot to pipe them to various programs.
>
> I was wondering if it's possible to prefetch the data for clients with more
> bandwidth. Most of my clients have 10G interfaces while the datanodes are 1G.
>
> I was thinking: prefetch x blocks (even though it will cost extra memory)
> while reading block y. After block y is read, read the prefetched block
> and then throw it away.
>
> It should be used like this:
>
>
> export PREFETCH_BLOCKS=2 #default would be 1
> hadoop fs -pcat hdfs://namenode/verylarge file | program
>
> Any thoughts?
>
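
As a rough sketch of the read-ahead pattern being described (hypothetical
code, not an existing hadoop fs option; the class and helper names are made
up), a client could write block y to stdout while a background thread does a
positional read of block y+1:

import java.io.OutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchCat {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);
    FileSystem fs = FileSystem.get(file.toUri(), new Configuration());
    FileStatus status = fs.getFileStatus(file);
    long blockSize = status.getBlockSize();
    long fileLen = status.getLen();

    ExecutorService pool = Executors.newSingleThreadExecutor();
    try (FSDataInputStream in = fs.open(file)) {
      OutputStream out = System.out;
      Future<byte[]> next = prefetch(pool, in, 0, blockSize, fileLen);
      for (long pos = 0; pos < fileLen; pos += blockSize) {
        byte[] current = next.get();                               // wait for block y
        next = prefetch(pool, in, pos + blockSize, blockSize, fileLen); // start y+1
        out.write(current);                                        // emit block y
      }
      next.get();   // drain the final (empty) prefetch
      out.flush();
    } finally {
      pool.shutdown();
    }
  }

  // Positional read of one block-sized chunk; returns an empty array past EOF.
  // Positional reads don't move the stream's file pointer, so the background
  // thread doesn't interfere with the rest of the loop.
  static Future<byte[]> prefetch(ExecutorService pool, FSDataInputStream in,
                                 long pos, long blockSize, long fileLen) {
    return pool.submit(() -> {
      int len = (int) Math.max(0, Math.min(blockSize, fileLen - pos));
      byte[] buf = new byte[len];
      if (len > 0) {
        in.readFully(pos, buf, 0, len);
      }
      return buf;
    });
  }
}

That sketch only keeps one extra block in memory per stream; a deeper prefetch
queue (the PREFETCH_BLOCKS=2 case above) would generalise it.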

Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Here the DFS client was given some extra data on where every copy of every 
block was, and the client itself decided which machine to fetch each block 
from. That made the best use of the entire cluster by keeping every datanode 
busy.
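
The replica information is already exposed to ordinary clients through the
public FileSystem.getFileBlockLocations() call; an illustrative starting
point for that kind of scheduler (class name made up) looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockReplicas {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);
    FileSystem fs = FileSystem.get(file.toUri(), new Configuration());
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block, each listing the hosts holding a replica.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}

With that list in hand, the client can spread the per-block fetches across 
every host holding a replica rather than pulling each block from the single 
datanode the default read path picks.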


-steve
