hadoop-mapreduce-user mailing list archives

From Philippe Kernévez <pkerne...@octo.com>
Subject Re: Anatomy of read in hdfs
Date Mon, 10 Apr 2017 10:00:42 GMT
On Mon, Apr 10, 2017 at 11:46 AM, Sidharth Kumar <sidharthkumar2707@gmail.com>
wrote:

> Thanks Philippe,
>
> I am looking for an answer restricted to HDFS only, because we can do
> read and write operations from the CLI using commands like "hadoop fs
> -copyFromLocal /(local disk location) /(hdfs path)" and read using
> "hadoop fs -text /(hdfs file)" as well.
>
> So my questions are:
> 1) When I write data using the -copyFromLocal command, how is data from
> the data queue pushed to the data streamer? Do we have only one data
> streamer which listens to the data queue and stores data into individual
> datanodes one by one, or do we have multiple streamers which listen to the
> data queue and create a pipeline for each individual packet?
>
One stream per command. You may start several commands, one per file, but
the bottleneck will quickly be the network.
This command is only meant to import/export data from/to the Hadoop
cluster. The main reads and writes should occur inside the cluster, when
you do your processing.
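
For illustration, a minimal Java sketch of what -copyFromLocal does through
the HDFS client API (the paths here are placeholders): the client opens one
output stream per file, and a single DataStreamer drains the data queue into
the datanode pipeline, block after block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocal {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);      // handle to the cluster FS
    // One call, one stream: packets move from the data queue into a single
    // replication pipeline of datanodes, one block at a time.
    fs.copyFromLocalFile(new Path("/local/disk/file"),
                         new Path("/hdfs/path/file"));
    fs.close();
  }
}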


> 2) Similarly, when we read data, will the client receive packets one
> after another in a sequential manner (i.e., the 2nd datanode waits for
> the 1st node to send its block first), or will it be a parallel process?
>
It depends on the reader. If you use the command-line client, yes, the
reads will be sequential. If you use a Hadoop YARN processing framework
(MapReduce, Spark, Tez, etc.), then multiple readers (map tasks) will be
started to process your data in parallel.
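
As a sketch of the parallel case (the job name and paths are hypothetical),
a map-only MapReduce job: the framework creates one map task per input
split, so each split of the file is read by its own task, concurrently.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelRead {
  // One mapper instance per input split; splits are processed concurrently.
  public static class ReadMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      // process one record of this task's split
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parallel-read");
    job.setJarByClass(ParallelRead.class);
    job.setMapperClass(ReadMapper.class);
    job.setNumReduceTasks(0);                  // map-only job: pure reading
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path("/hdfs/input"));
    FileOutputFormat.setOutputPath(job, new Path("/hdfs/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}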

What do you want to do with the data that you read?

Regards,
Philippe​



>
>
> Thanks for your help in advance.
>
> Sidharth
>
>
> On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pkernevez@octo.com> wrote:
>
>> Hi Sidharth,
>>
>> As it has been explained, HDFS is not just a file system; it's a part of
>> the Hadoop platform. To take advantage of HDFS you have to understand how
>> Hadoop storage (HDFS) AND YARN processing (say, MapReduce) work together
>> to implement jobs and parallel processing.
>> This means that you will have to rethink the design of your programs to
>> take advantage of HDFS.
>>
>> You may start with this kind of tutorial
>> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>>
>> Then have a deeper read of the Hadoop documentation
>> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
>>
>> Regards,
>> Philippe
>>
>>
>>
>> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <daemeonr@gmail.com>
>> wrote:
>>
>>> Readers ARE parallel processes, one per map task. There are defaults in
>>> the map phase for how many readers there are for the input file(s). The
>>> default is one mapper task per block (or per file, where a file is
>>> smaller than the HDFS block size). There is no Java framework per se for
>>> splitting up a file (technically not so, but let's simplify, outside of
>>> your own custom code).
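>>>
>>> As a worked example with the figures mentioned earlier in this thread: a
>>> 100 GB file with a 128 MB block size spans 102,400 MB / 128 MB = 800
>>> blocks, so a default job would start 800 map tasks, each reading its own
>>> block in parallel.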
>>>
>>>
>>> Daemeon C.M. Reiydelle
>>> USA (+1) 415.501.0198
>>> London (+44) (0) 20 8144 9872
>>>
>>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <sidharthkumar2707@gmail.com>
>>> wrote:
>>>
>>>> Thanks Tariq, it really helped me to understand, but I have one more
>>>> doubt: if reading is not a parallel process, then reading a 100GB file
>>>> with an HDFS block size of 128MB would take a very long time, yet that
>>>> is not the scenario in real life. And the second question: is the write
>>>> operation a sequential process as well? Will every datanode have its
>>>> own data streamer which listens to the data queue to get the packets
>>>> and create the pipeline? So, can you kindly help me get a clear idea of
>>>> HDFS read and write operations?
>>>>
>>>> Regards
>>>> Sidharth
>>>>
>>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <dontariq@gmail.com> wrote:
>>>>
>>>> Hi Sidharth,
>>>>
>>>> When you read data from HDFS using a framework like MapReduce, blocks
>>>> of an HDFS file are read in parallel by multiple mappers created in
>>>> that particular program. Input splits, to be precise.
>>>>
>>>> On the other hand, if you have a standalone Java program, then it's
>>>> just a single-threaded process and will read the data sequentially.
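>>>>
>>>> For illustration, a minimal sketch of such a standalone sequential read
>>>> (the path is a placeholder): read() is called repeatedly on a single
>>>> stream, and at each block boundary DFSInputStream transparently
>>>> switches to the best datanode for the next block, so the client just
>>>> sees one continuous stream.
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataInputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>>
>>>> public class SequentialRead {
>>>>   public static void main(String[] args) throws Exception {
>>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>>     try (FSDataInputStream in = fs.open(new Path("/hdfs/file"))) {
>>>>       byte[] buffer = new byte[8192];
>>>>       int n;
>>>>       while ((n = in.read(buffer)) != -1) {
>>>>         // consume n bytes; blocks arrive one after another
>>>>       }
>>>>     }
>>>>   }
>>>> }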
>>>>
>>>>
>>>> On Friday, April 7, 2017, Sidharth Kumar <sidharthkumar2707@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for your response. But I didn't understand it yet; if you don't
>>>>> mind, can you tell me what you mean by "With Hadoop, the idea is to
>>>>> parallelize the readers (one per block for the mapper) with a
>>>>> processing framework like MapReduce"?
>>>>>
>>>>> And also, how will the concept of parallelizing the readers work with
>>>>> HDFS?
>>>>>
>>>>> Thanks a lot in advance for your help.
>>>>>
>>>>>
>>>>> Regards
>>>>> Sidharth
>>>>>
>>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkernevez@octo.com>
>>>>> wrote:
>>>>>
>>>>> Hi Sidharth,
>>>>>
>>>>> The reads are sequential.
>>>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>>>> the mapper) with a processing framework like MapReduce.
>>>>>
>>>>> Regards,
>>>>> Philippe
>>>>>
>>>>>
>>>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <sidharthkumar2707@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Genies,
>>>>>>
>>>>>> I have a small doubt about whether the HDFS read operation is a
>>>>>> parallel or a sequential process, because from my understanding it
>>>>>> should be parallel, but if I read "Hadoop: The Definitive Guide,
>>>>>> 4th Edition", in the anatomy of a read it says: "Data is streamed
>>>>>> from the datanode back to the client, which calls read() repeatedly
>>>>>> on the stream (step 4). When the end of the block is reached,
>>>>>> DFSInputStream will close the connection to the datanode, then find
>>>>>> the best datanode for the next block (step 5). This happens
>>>>>> transparently to the client, which from its point of view is just
>>>>>> reading a continuous stream."
>>>>>>
>>>>>> So can you kindly explain to me how the read operation exactly happens?
>>>>>>
>>>>>>
>>>>>> Thanks for your help in advance
>>>>>>
>>>>>> Sidharth
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Tariq, Mohammad
>>>> about.me/mti
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>


-- 
Philippe Kernévez



Technical Director (Switzerland),
pkernevez@octo.com
+41 79 888 33 32

Find OCTO on OCTO Talk: http://blog.octo.com
OCTO Technology http://www.octo.ch
