hadoop-common-user mailing list archives

From Gautam <gautamkows...@gmail.com>
Subject Re: Most Common ways to load data into Hadoop in production systems
Date Sat, 24 Jul 2010 06:51:34 GMT
On Wed, Jul 21, 2010 at 10:25 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> On Wed, Jul 21, 2010 at 12:42 PM, Xavier Stevens <xstevens@mozilla.com>
> wrote:
> >  Hi Urckle,
> >
> > A lot of the more "advanced" setups just record data directly to HDFS to
> > start with.  You have to write some custom code using the HDFS API but
> > that way you don't need to import large masses of data.  People also use
> > "distcp" to do large scale imports, but if you're hitting something like
> > an NFS server it will probably fall over if you have a descent sized
> > cluster hitting it.
> >
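A minimal sketch of the direct-to-HDFS write Xavier describes, assuming a Hadoop 0.20 client whose core-site.xml already points at the cluster; the class name and destination path below are placeholders, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);             // handle to the cluster's DFS
        Path dest = new Path("/incoming/feed.log");       // hypothetical destination path

        // Stream records straight into HDFS rather than staging them on local disk first.
        FSDataOutputStream out = fs.create(dest);
        out.write("one record per line\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}

The same output stream could just as well be fed from a socket or an HTTP response, which is what lets this approach skip a separate bulk-import step.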
> > In cases where writing data directly to HDFS can't be done for some
> > reason, you can install Hadoop as a "client" (no processes actually
> > running) on the machines that will be pushing the data. Then you can
> > invoke hadoop fs -put in parallel from all of those machines, which
> > should go significantly faster than using a single mount point.
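And a sketch of the programmatic equivalent of that parallel hadoop fs -put, assuming each client box carries the Hadoop jars and a core-site.xml pointing at the cluster; paths come from the command line and the class name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientPut {
    // Run this (or plain `hadoop fs -put`) on every client machine at once,
    // so uploads reach the cluster in parallel instead of through one mount point.
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path(args[0]),    // local file to upload
                             new Path(args[1]));   // destination path in HDFS
        fs.close();
    }
}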
> >
> > Hope this helps,
> >
> > -Xavier
> >
> > On 7/21/10 9:30 AM, Urckle wrote:
> >> Hi, I have a newbie question.
> >>
> >> Scenario:
> >> Hadoop version: 0.20.2
> >> MR coding will be done in java.
> >>
> >>
> >> Just starting out with my first Hadoop setup. I would like to know are
> >> there any best practice ways to load data into the dfs? I have
> >> (obviously) manually put data files into hdfs using the shell commands
> >> while playing with it at setup but going forward I will want to be
> >> retrieving large numbers of data feeds from remote, 3rd party
> >> locations and throwing them into hadoop for analysis later. What is
> >> the best way to automate this? Is it to gather the retrieved files
> >> into known locations to be mounted and then automate via script etc.
> >> to put the files into HDFS? Or is there some other practice? I've not
> >> been able to find a specific use case yet... all the docs cover the basic
> >> fs commands without giving much detail about more advanced setups.
> >>
> >> thanks for any info
> >>
> >> regards
> >
>
> Hadoop Mailing list presents...
> An ASF film...
> from the producer of 'dfs -copyFromLocal'...
>
> How to move logs
>
> Starring (in order of appearance)
> Chukwa
> Chukwa is an open source data collection system for monitoring large
> distributed systems. Chukwa is built on top of the Hadoop Distributed
> File System (HDFS) and Map/Reduce framework and inherits Hadoop’s
> scalability and robustness. Chukwa also includes a flexible and
> powerful toolkit for displaying, monitoring and analyzing results to
> make the best use of the collected data.
>
> Scribe
>
> Scribe is a server for aggregating log data that's streamed in real
> time from clients. It is designed to be scalable and reliable.
>
> Flume
>
> Flume is a distributed service that makes it very easy to collect and
> aggregate your data into a persistent store such as HDFS. Flume can
> read data from almost any source – log files, Syslog packets, the
> standard output of any Unix process – and can deliver it to a batch
> processing system like Hadoop or a real-time data store like HBase.
>
> Co-Starring:
>
> FSDataOutputStream
>


One could put FSDataOutputStream on steroids just by writing a map-only
Hadoop job to pull data into the cluster. Each mapper acts as a pipe that
pulls over HTTP using Java's HttpClient, or from SSH interfaces using the
Java Secure Channel API (http://www.jcraft.com/jsch/), and writes straight
to DFS.

This relieves the data loader of many a headache of download-task
management.
-G.
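
A rough sketch of the map-only loader described above, assuming the job's input is a text file with one HTTP URL per line and using plain java.net.URL where HttpClient or JSch would otherwise slot in; class names and the /incoming/ path are illustrative, not from the thread:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PullToHdfs {

    public static class FetchMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            FileSystem fs = FileSystem.get(context.getConfiguration());
            // Name the target file after the last path segment of the URL.
            Path dest = new Path("/incoming/" + url.substring(url.lastIndexOf('/') + 1));
            InputStream in = new URL(url).openStream();   // each mapper is one download "pipe"
            FSDataOutputStream out = fs.create(dest);
            IOUtils.copyBytes(in, out, 4096, true);       // stream remote bytes straight into DFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "pull-to-hdfs");
        job.setJarByClass(PullToHdfs.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);                          // map-only: no shuffle, no reduce
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // file of URLs, one per line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // side-output dir (left empty)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because each URL is a separate input record, the framework spreads the downloads across the cluster and retries failed tasks, which is the task-management headache the approach avoids.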
