hadoop-general mailing list archives

From Patrick Angeles <patr...@cloudera.com>
Subject Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date Tue, 28 Dec 2010 23:15:37 GMT
To add to what Ted says,

"hadoop fs -copyFromLocal" assumes that the file is already present on the
local disk of the node running the command (here, the datanode). That means
the file would first have to be transferred to the node and copied to local
disk, and only then written to HDFS, so there's an extra trip to disk that
you could have avoided.
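
To illustrate skipping the local copy, here is a minimal sketch that pipes
bytes straight into HDFS. It relies on "hadoop fs -put" accepting "-" as the
source to read from stdin. The helper names (hdfs_put_from_stdin_cmd,
stream_to_hdfs) and the injectable popen argument are assumptions made for
this example, not part of Hadoop.

```python
import subprocess


def hdfs_put_from_stdin_cmd(dest_path):
    """Build the command that writes stdin to dest_path in HDFS."""
    # "-" as the source tells "hadoop fs -put" to read from stdin,
    # so nothing is staged on the local disk first.
    return ["hadoop", "fs", "-put", "-", dest_path]


def stream_to_hdfs(chunks, dest_path, popen=subprocess.Popen):
    """Pipe an iterable of byte chunks into HDFS; return the exit code.

    popen is injectable (hypothetical design choice) so the function can be
    exercised without a Hadoop installation.
    """
    proc = popen(hdfs_put_from_stdin_cmd(dest_path), stdin=subprocess.PIPE)
    for chunk in chunks:
        proc.stdin.write(chunk)
    proc.stdin.close()  # signal end of input so the put can finish
    return proc.wait()
```

The chunks could come from a socket or an HTTP response, for instance, so
the data goes from the network to HDFS without touching local disk.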

You could avoid that trip by having the MapReduce task pull the data directly
and write it straight to HDFS. But, as Ted mentions, that is not as efficient
as a source-based push, and it is also more complex.
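
As a rough sketch of the source-based push, several threads on the source
machine can each run "hadoop fs -put" for one file. The function names
(put_file, spray), the default of 4 workers, and the injectable run argument
are assumptions for illustration; the right thread count is whatever
saturates your outbound link, which you would have to find by measurement.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def put_file(local_path, dest_dir, run=subprocess.call):
    """Copy one local file into dest_dir in HDFS; return the exit code."""
    return run(["hadoop", "fs", "-put", local_path, dest_dir])


def spray(local_paths, dest_dir, workers=4, run=subprocess.call):
    """Upload files in parallel; return {local_path: exit_code}.

    workers=4 is an arbitrary starting point (assumption), not a
    recommendation.
    """
    paths = list(local_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread blocks on its own "hadoop fs -put" process.
        codes = list(pool.map(lambda p: put_file(p, dest_dir, run), paths))
    return dict(zip(paths, codes))
```

Any non-zero exit code in the returned mapping marks a file that should be
retried.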

On Tue, Dec 28, 2010 at 6:07 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> if the data is coming off of a single machine then simply running multiple
> threads on that machine spraying the data into the cluster is likely to be
> faster than a map-reduce program.  The reason is that you can run the
> spraying process continuously and can tune it to carefully saturate your
> outbound link toward the cluster.  With a map-reduce program it will be very
> easy to flatten the link.
>
> Another issue is that it is easier to push data to the cluster from a local
> disk than to pull it from nodes in the cluster because most network
> file protocols aren't as efficient as you might like.
>
>
> On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <ronald.taylor@pnl.gov> wrote:
>
>> 2) some way of parallelizing the reads
>>
>> So - I will check into network hardware, in regard to (1). But for (2), is
>> the MapReduce method that I was thinking of, a way that uses "hadoop fs
>> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
>> believe that you were saying that it is indeed OK, but I want to
>> double-check, since this will be a critical piece of our workflow.
>>
>
>
