hadoop-common-user mailing list archives

From Xavier Stevens <xstev...@mozilla.com>
Subject Re: Most Common ways to load data into Hadoop in production systems
Date Wed, 21 Jul 2010 16:42:21 GMT
 Hi Urckle,

A lot of the more "advanced" setups just record data directly to HDFS to
start with.  You have to write some custom code using the HDFS API, but
that way you don't need to import large masses of data later.  People
also use "distcp" to do large-scale imports, but if the source is
something like an NFS server, it will probably fall over when a
decent-sized cluster hits it.

In cases where writing data directly to HDFS can't be done for some
reason, you can install Hadoop as a "client" (no Hadoop processes
actually running) on the machines that will be putting the data.  Then
you can invoke "hadoop fs -put" in parallel from all of those machines.
It should go significantly faster than using a single mount point.
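
A minimal sketch of that parallel-put idea on one client machine,
assuming the "hadoop" launcher is on the PATH and already configured to
point at the cluster.  The class name ParallelPut, the pool size of 4,
and the argument layout are made up for illustration; this is not a
tested production tool.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: uploads several local files by running
// "hadoop fs -put" a few at a time.
class ParallelPut {

    // Builds the argument list for one "hadoop fs -put <local> <hdfsDir>" call.
    static List<String> putCommand(String localFile, String hdfsDir) {
        return Arrays.asList("hadoop", "fs", "-put", localFile, hdfsDir);
    }

    // Runs one put and returns the process exit code (0 means success).
    static int runPut(String localFile, String hdfsDir)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(putCommand(localFile, hdfsDir))
                .inheritIO()   // show the CLI's own output and errors
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ParallelPut <hdfsDir> <localFile>...");
            System.exit(2);
        }
        final String hdfsDir = args[0];

        // Assumed pool size; tune it to what the machine and network can take.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            final String localFile = args[i];
            Callable<Integer> task = () -> runPut(localFile, hdfsDir);
            results.add(pool.submit(task));
        }

        // Wait for every upload and count the ones that failed.
        int failures = 0;
        for (Future<Integer> r : results) {
            if (r.get() != 0) {
                failures++;
            }
        }
        pool.shutdown();
        System.exit(failures == 0 ? 0 : 1);
    }
}
```

You would run something like this (or an equivalent shell loop) on each
machine that holds data, so the uploads are spread across many sources
instead of funnelled through one.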

Hope this helps,


On 7/21/10 9:30 AM, Urckle wrote:
> Hi, I have a newbie question.
> Scenario:
> Hadoop version: 0.20.2
> MR coding will be done in java.
> Just starting out with my first Hadoop setup. I would like to know are
> there any best practice ways to load data into the dfs? I have
> (obviously) manually put data files into hdfs using the shell commands
> while playing with it at setup but going forward I will want to be
> retrieving large numbers of data feeds from remote, 3rd party
> locations and throwing them into hadoop for analysis later. What is
> the best way to automate this? Is it to gather the retrieved files
> into known locations to be mounted and then automate via script etc.
> to put the files into hdfs? Or is there some other practice? I've not
> been able to find specific use case yet... all docs cover the basic fs
> command without giving much details about more advanced setups.
> thanks for any info
> regards
