hadoop-general mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date Tue, 28 Dec 2010 23:07:31 GMT
If the data is coming off of a single machine, then simply running multiple
threads on that machine to spray the data into the cluster is likely to be
faster than a map-reduce program.  The reason is that you can run the
spraying process continuously and can tune it to carefully saturate your
outbound link toward the cluster.  With a map-reduce program it will be very
easy to flatten the link.
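
A minimal sketch of such a spraying process, assuming the files already sit on the local disk and that the `hadoop` command is on the PATH. The function names (`build_command`, `run_command`, `spray`), the `num_threads` parameter, and the default of 4 are illustrative assumptions, not anything from this thread; only `hadoop fs -copyFromLocal` is taken from the discussion.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(local_path, hdfs_dir):
    # Command line for one upload: hadoop fs -copyFromLocal <localsrc> <dst>
    return ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_dir]


def run_command(command):
    # Run the command and return its exit code (0 means success).
    return subprocess.call(command)


def spray(local_paths, hdfs_dir, num_threads=4, runner=run_command):
    """Upload files to HDFS with up to num_threads copies in flight.

    Returns a list of (local_path, exit_code) pairs in input order.
    The runner argument is a hypothetical hook so the upload step can
    be swapped out, for example in a dry run.
    """
    def upload(path):
        return path, runner(build_command(path, hdfs_dir))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in the same order as local_paths.
        return list(pool.map(upload, local_paths))
```

The knob to tune is `num_threads`: raise it while watching throughput on the outbound link, and stop once the link is saturated and adding threads no longer helps.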

Another issue is that it is easier to push data to the cluster from a local
disk than to pull it from nodes in the cluster, because most network
file protocols aren't as efficient as you might like.

On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <ronald.taylor@pnl.gov> wrote:

> 2) some way of parallelizing the reads
> So - I will check into network hardware, in regard to (1). But for (2), is
> the MapReduce method that I was thinking of, a way that uses "hadoop fs
> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
> believe that you were saying that it is indeed OK, but I want to
> double-check, since this will be a critical piece of our work flow.
