hadoop-common-user mailing list archives

From Radhe Radhe <radhe.krishna.ra...@live.com>
Subject RE: Streaming data access in HDFS: Design Feature
Date Wed, 05 Mar 2014 08:35:11 GMT
Hi Shashwat,

This is an excerpt from Hadoop: The Definitive Guide by Tom White:
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java. Hadoop Streaming uses Unix standard streams
as the interface between Hadoop and your program, so you can use any language that
can read standard input and write to standard output to write your MapReduce program.
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can
handle binary streams, too), and when used in text mode, it has a line-oriented view of
data. Map input data is passed over standard input to your map function, which processes
it line by line and writes lines to standard output. A map output key-value pair
is written as a single tab-delimited line. Input to the reduce function is in the same
format—a tab-separated key-value pair—passed over standard input. The reduce function
reads lines from standard input, which the framework guarantees are sorted by
key, and writes its results to standard output.
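
To make the excerpt concrete, here is a minimal sketch of a Streaming-style word count in Python. The names map_lines and reduce_lines and the "map"/"reduce" command-line argument are my own illustrative choices, not part of Hadoop. The sketch only assumes what the excerpt states: lines arrive on standard input, key-value pairs are written as tab-delimited lines, and the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one tab-delimited "word<TAB>1" line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Hypothetical convention: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as wc.py, this can be tried locally without a cluster, with sort standing in for the framework's sort by key: cat input.txt | python wc.py map | sort | python wc.py reduce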

I think this is not what I am asking for.


From: dwivedishashwat@gmail.com
Date: Wed, 5 Mar 2014 13:47:09 +0530
Subject: Re: Streaming data access in HDFS: Design Feature
To: user@hadoop.apache.org
CC: radhe.krishna.radhe@live.com

Streaming means processing the data as it comes into HDFS. Hadoop Streaming, for example,
lets Hadoop receive data through executables of different types.

I hope you have already read this: http://hadoop.apache.org/docs/r0.18.1/streaming.html#Hadoop+Streaming

Warm Regards_∞_

 Shashwat Shriparv

On Wed, Mar 5, 2014 at 1:38 PM, Radhe Radhe <radhe.krishna.radhe@live.com> wrote:

Hello All,

Can anyone please explain what we mean by "Streaming data access in HDFS"?

Data is usually copied to HDFS, where it is split into blocks that are spread across DataNodes.

Say, for example, I have an input file of 10240 MB (10 GB) and a block size of 64 MB.
Then there will be 160 blocks.
These blocks will be distributed across the DataNodes.
Now the Mappers will read data from these DataNodes, keeping data locality in mind (i.e.
blocks local to a DataNode will be read by the map tasks running on that DataNode).
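
As a quick check of the arithmetic above, here is a small Python sketch. The helper name block_count is hypothetical, and it assumes that a file whose size is not an exact multiple of the block size simply ends in one smaller block, so the count is rounded up.

```python
def block_count(file_size_mb, block_size_mb):
    """Number of blocks needed for a file, rounding up for a partial last block."""
    # Integer ceiling division avoids floating-point rounding issues.
    return (file_size_mb + block_size_mb - 1) // block_size_mb


# The example from this message: a 10240 MB file with 64 MB blocks.
print(block_count(10240, 64))  # 160
```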

Can you please point out where "Streaming data access in HDFS" comes into the picture?

