hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: How HDFS divides Files into block
Date Fri, 18 May 2012 10:23:01 GMT
Utkarsh,

This question has been asked several times before. I have previously
answered the same question at:
http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg04282.html

If HDFS says its block size is 64 MB, then that is what the block size is.
HDFS is a filesystem: it writes at most 64 MB of bytes per block and does
not care what the file carries (no filesystem does). The problem does not
lie on the filesystem side. Think instead: "How do I read data from HDFS
if my records may lie across two blocks? Will I be able to?"
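
To make the point concrete, here is a minimal Python sketch of what a
filesystem does when it cuts a file into blocks. It is not HDFS code:
split_into_blocks is a made-up helper, and the 21-byte block size is
chosen only so that the sample line from the question gets cut in the
same place as in the question.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size chunks without looking at its contents."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# The sample record from the question, terminated by a newline.
line = b"672364,423746273,4234234,2,342,34,2,34,234,2,34,234,2,342,342\n"

# Hypothetical tiny block size; real HDFS blocks are 64 MB or 128 MB.
blocks = split_into_blocks(line, 21)

for number, block in enumerate(blocks, start=1):
    print(number, block)
```

The first block ends at "...,4234", in the middle of the record. The
splitting never inspects the bytes, so it cannot avoid this; joining the
blocks back together always yields the original file.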

It is up to the reader of the blocks to take care of record boundaries,
which may easily lie across blocks (and generally only MR has to do this
harder block-boundary reading). The way MR's LineRecordReader (used by
TextInputFormat) does it is explained here:
http://wiki.apache.org/hadoop/HadoopMapReduce
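
The general idea can be sketched in a few lines of Python. This is a
simplified illustration, not Hadoop's actual LineRecordReader: the
function name read_split and the in-memory bytes object standing in for
the file are assumptions made for the example. Each reader owns a byte
range [start, end). A reader that does not start at offset 0 throws away
everything up to and including the first newline it sees, and every
reader keeps reading past its end offset until it has finished the last
line it started.

```python
def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines owned by the byte range [start, end) of data."""
    pos = start
    if start != 0:
        # Not the first split: skip to just after the first newline.
        # The skipped bytes belong to a line the previous reader emits.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # A line is ours if it begins at or before `end`; it may run past `end`.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


data = b"1,2,3\n672364,423746273,4234234,2,342\n7,8,9\n"
block_size = 16  # hypothetical tiny block size, for illustration only

for start in range(0, len(data), block_size):
    end = min(start + block_size, len(data))
    print((start, end), read_split(data, start, end))
```

Whatever block size is used, every line comes out exactly once and in one
piece, even when its bytes are stored in two different blocks. In a real
cluster, reading past the end of the split means fetching the remaining
bytes of the line from the next block, possibly over the network.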

So in short: don't worry, this is already taken care of for you.

On Fri, May 18, 2012 at 2:40 PM, Utkarsh Gupta <Utkarsh_Gupta@infosys.com> wrote:

> Hi,
>
> I have a doubt about HDFS which may be a very trivial thing, but I am not
> able to understand it.
>
> Since HDFS keeps files in blocks of 64/128 MB, how does HDFS split
> files?
>
> The problem I see is this. Suppose I have a long line in my input file:
>
> 672364,423746273,4234234,2,342,34,2,34,234,2,34,234,2,342,342
>
> This is to be processed in one map call. But because of the blocks, one
> part of this line is in one block and the rest is in another.
>
> Block1:
> --
> -
> -    this block goes to one mapper process
> -
> -
> 672364,423746273,4234
> <end of block1>
>
> Block2:
> 234,2,342,34,2,34,234,2,34,234,2,342,342
> -
> -
> -    this block goes to another mapper process
>
> How does HDFS avoid this scenario?
>
> Thanks and Regards
> Utkarsh Gupta
>


-- 
Harsh J
