hadoop-common-user mailing list archives

From Rita <rmorgan...@gmail.com>
Subject Re: parallel cat
Date Thu, 07 Jul 2011 07:22:16 GMT
Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
see any example code for the implementation.
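[For reference, a minimal sketch of the prefetching "pcat" pattern described below. Block reads are simulated with a stand-in function; a real implementation would fetch HDFS blocks through the DFS client instead. All names here are illustrative, not an actual Hadoop API.]

```python
# Sketch of the proposed prefetch: while the consumer processes block y,
# up to PREFETCH_BLOCKS reads stay in flight, then are consumed in order
# and thrown away. The extra in-flight blocks cost memory, as noted below.
import os
from concurrent.futures import ThreadPoolExecutor

PREFETCH_BLOCKS = int(os.environ.get("PREFETCH_BLOCKS", "1"))

def read_block(block_id):
    # Stand-in for fetching one HDFS block from a datanode.
    return b"data-from-block-%d" % block_id

def pcat(num_blocks, consume):
    with ThreadPoolExecutor(max_workers=max(1, PREFETCH_BLOCKS)) as pool:
        pending = []
        for b in range(num_blocks):
            pending.append(pool.submit(read_block, b))
            # Keep at most PREFETCH_BLOCKS reads in flight beyond the
            # block being consumed; hand the oldest finished block over.
            if len(pending) > PREFETCH_BLOCKS:
                consume(pending.pop(0).result())
        for fut in pending:          # drain the tail in order
            consume(fut.result())

out = []
pcat(4, out.append)
```

Consumed blocks arrive strictly in order, so piping the output to a program behaves like a plain `hadoop fs -cat`.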

On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran <stevel@apache.org> wrote:

> On 06/07/11 11:08, Rita wrote:
>> I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat
>> a lot to pipe to various programs.
>> I was wondering if it's possible to prefetch the data for clients with more
>> bandwidth. Most of my clients have a 10g interface and the datanodes are 1g.
>> I was thinking: prefetch x blocks (even though it will cost extra memory)
>> while reading block y. After block y is read, read the prefetched block
>> and then throw it away.
>> It would be used like this:
>> export PREFETCH_BLOCKS=2 #default would be 1
>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>> Any thoughts?
> Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
> Here the DFS client got some extra data on where every copy of every block
> was, and the client decided which machine to fetch it from. This made the
> best use of the entire cluster, by keeping each datanode busy.
> -steve
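
[A minimal sketch of the replica-scheduling idea Steve describes: the client knows every replica location for every block and picks which datanode serves each one, spreading the reads so each datanode stays busy. Scheduling and fetching are simulated; a real client would get block locations from the namenode. Names are illustrative, not Hadoop APIs.]

```python
# Client-side scheduling: assign each block to the replica-holding
# datanode with the fewest blocks assigned so far, then fetch in
# parallel and reassemble in block order.
from concurrent.futures import ThreadPoolExecutor

def schedule(block_replicas):
    load = {}   # datanode -> blocks assigned so far
    plan = {}   # block -> chosen datanode
    for block, replicas in block_replicas.items():
        node = min(replicas, key=lambda n: load.get(n, 0))
        plan[block] = node
        load[node] = load.get(node, 0) + 1
    return plan

def fetch(block, node):
    # Stand-in for reading one block from the chosen datanode.
    return (block, node)

def parallel_cat(block_replicas):
    plan = schedule(block_replicas)
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        futures = {b: pool.submit(fetch, b, n) for b, n in plan.items()}
        # Reassemble in block order so output matches a plain -cat.
        return [futures[b].result() for b in sorted(futures)]

# Three blocks, two replicas each: the plan touches all three datanodes.
replicas = {0: ["dn1", "dn2"], 1: ["dn1", "dn3"], 2: ["dn2", "dn3"]}
plan = schedule(replicas)
```

With the example replica map, no datanode serves more than one block, which is the load-spreading property that keeps the whole cluster busy.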

-- Get your facts first, then you can distort them as you please.
