hadoop-common-user mailing list archives

From Urckle <urc...@gmail.com>
Subject Re: Most Common ways to load data into Hadoop in production systems
Date Wed, 21 Jul 2010 17:00:49 GMT
On 21/07/2010 17:55, Edward Capriolo wrote:
> On Wed, Jul 21, 2010 at 12:42 PM, Xavier Stevens<xstevens@mozilla.com>  wrote:
>>   Hi Urckle,
>> A lot of the more "advanced" setups record data directly to HDFS to
>> start with.  You have to write some custom code against the HDFS API, but
>> that way you don't need to import large masses of data afterwards.  People also
>> use "distcp" to do large-scale imports, but if you're hitting something like
>> an NFS server it will probably fall over if you have a decent-sized
>> cluster hitting it.
>> In cases where writing data directly to HDFS can't be done for some
>> reason, you can install Hadoop as a "client" (no daemon processes
>> actually running) on the machines that will be pushing the data.  Then you
>> can invoke "hadoop fs -put" in parallel from all of those machines.  It
>> should go significantly faster than using a single mount point.
>> Hope this helps,
>> -Xavier
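Xavier's first suggestion, writing feeds straight into HDFS via the FileSystem API rather than staging them locally, can be sketched roughly as below. This is a minimal sketch against the 0.20-era API; the NameNode URI, feed URL, and HDFS path are hypothetical placeholders, and a real loader would add retries and error handling. The same client-side setup is what lets you run "hadoop fs -put" in parallel from many machines, as described above.

```java
// Hypothetical sketch: stream a remote feed directly into HDFS.
// Assumes a Hadoop 0.20-style "client" install with the cluster
// config on the classpath (or set explicitly, as here).
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FeedLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Open the remote feed (placeholder URL) and an HDFS output stream.
        InputStream in = new URL("http://example.com/feeds/today.log").openStream();
        FSDataOutputStream out = fs.create(new Path("/feeds/today.log"));

        // Copy with a 4 KB buffer; the final 'true' closes both streams.
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
```

Running one such loader per client machine gives you the parallel ingest Xavier describes, without funneling everything through a single NFS mount.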
>> On 7/21/10 9:30 AM, Urckle wrote:
>>> Hi, I have a newbie question.
>>> Scenario:
>>> Hadoop version: 0.20.2
>>> MR coding will be done in java.
>>> Just starting out with my first Hadoop setup. I would like to know whether
>>> there are any best-practice ways to load data into the dfs. I have
>>> (obviously) manually put data files into hdfs using the shell commands
>>> while playing with it at setup, but going forward I will want to be
>>> retrieving large numbers of data feeds from remote, third-party
>>> locations and throwing them into hadoop for later analysis. What is
>>> the best way to automate this? Is it to gather the retrieved files
>>> into known locations to be mounted, and then automate via script etc.
>>> to put the files into hdfs? Or is there some other practice? I've not
>>> been able to find a specific use case yet... all the docs cover the basic fs
>>> commands without giving much detail about more advanced setups.
>>> thanks for any info
>>> regards
> Hadoop Mailing list presents...
> An ASF film...
> from the producer of 'dfs -copyFromLocal'...
> How to move logs
> Starring (in order of appearance):
> Chukwa
> Chukwa is an open source data collection system for monitoring large
> distributed systems. Chukwa is built on top of the Hadoop Distributed
> File System (HDFS) and Map/Reduce framework and inherits Hadoop’s
> scalability and robustness. Chukwa also includes a flexible and
> powerful toolkit for displaying, monitoring and analyzing results to
> make the best use of the collected data.
> Scribe
> Scribe is a server for aggregating log data that's streamed in real
> time from clients. It is designed to be scalable and reliable.
> Flume
> Flume is a distributed service that makes it very easy to collect and
> aggregate your data into a persistent store such as HDFS. Flume can
> read data from almost any source – log files, Syslog packets, the
> standard output of any Unix process – and can deliver it to a batch
> processing system like Hadoop or a real-time data store like HBase.
> Co-starring:
> FSDataOutputStream
Sounds like a cool action movie!...
