hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: HDFS Structure
Date Wed, 29 Dec 2010 05:11:47 GMT
FileInputFormat (through its record reader) takes care of line
boundaries across splits; you don't need to worry about that.

Each mapper works on a FileSplit, which holds a starting byte offset
and a length. The splits themselves are plain byte ranges, not
line-aligned; it is the record reader that handles line boundaries at
read time. A reader whose split does not start at byte 0 skips its
first (possibly partial) line, and every reader keeps reading past its
split's end until the current line is complete (the extra bytes are
pulled from the DataNode that has them). Each line therefore gets read
exactly once.
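A minimal, self-contained sketch of that rule (it mirrors what Hadoop's LineRecordReader does, but does not use the Hadoop API; the class and method names here are made up for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {
    // Read every line "owned" by the byte range [start, start+length]:
    //  * a split that does not begin at byte 0 discards its first
    //    (possibly partial) line -- the previous split's reader owns it;
    //  * reading continues while the line START is <= the split's end,
    //    so a line straddling the boundary is read whole, exactly once.
    static List<String> readSplit(Path file, long start, long length)
            throws IOException {
        List<String> lines = new ArrayList<>();
        long end = start + length;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            if (start != 0) {
                raf.readLine(); // discard the partial first line
            }
            while (raf.getFilePointer() <= end) {
                String line = raf.readLine();
                if (line == null) break; // end of file
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("split", ".csv");
        Files.write(f, "a,1\nbb,22\nccc,333\ndddd,4444\n"
                .getBytes(StandardCharsets.UTF_8));
        long size = Files.size(f);           // 28 bytes
        long mid = size / 2;                 // split point lands mid-line
        System.out.println(readSplit(f, 0, mid));          // [a,1, bb,22, ccc,333]
        System.out.println(readSplit(f, mid, size - mid)); // [dddd,4444]
        Files.delete(f);
    }
}
```

Splitting the file at byte 14 cuts the third line in half, yet the two readers together return every line exactly once; that is the whole trick.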

Similarly, SequenceFiles handle this with a special "sync" marker
embedded between logical blocks of records, so a reader dropped at an
arbitrary byte offset can scan forward to the next marker and start
reading whole records from there.
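A toy illustration of the sync-marker idea (this is not the real SequenceFile on-disk format; the marker bytes and helper methods are invented for the sketch):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SyncScanDemo {
    // A fixed byte pattern written before every record (made up here;
    // real SequenceFiles use a randomly generated per-file marker).
    static final byte[] SYNC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

    // Index of the first sync marker at or after 'from', or -1 if none.
    static int nextSync(byte[] data, int from) {
        for (int i = from; i + SYNC.length <= data.length; i++) {
            boolean hit = true;
            for (int j = 0; j < SYNC.length; j++) {
                if (data[i + j] != SYNC[j]) { hit = false; break; }
            }
            if (hit) return i;
        }
        return -1;
    }

    // Drop a reader at an arbitrary offset: scan forward to the next
    // sync marker, then return every whole record from there on.
    static List<String> recordsFrom(byte[] data, int offset) {
        List<String> out = new ArrayList<>();
        int pos = nextSync(data, offset);
        while (pos >= 0) {
            int start = pos + SYNC.length;
            int next = nextSync(data, start);
            int stop = (next >= 0) ? next : data.length;
            out.add(new String(data, start, stop - start, StandardCharsets.UTF_8));
            pos = next;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String r : new String[]{"rec1", "rec2", "rec3"}) {
            buf.write(SYNC);
            buf.write(r.getBytes(StandardCharsets.UTF_8));
        }
        byte[] data = buf.toByteArray();
        System.out.println(recordsFrom(data, 0)); // [rec1, rec2, rec3]
        System.out.println(recordsFrom(data, 6)); // [rec2, rec3] -- offset 6 is mid-record
    }
}
```

A reader starting mid-record simply loses the partial record in front of it, which the previous split's reader picks up, by the same ownership rule as the line-oriented case.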

On Wed, Dec 29, 2010 at 10:27 AM, shanmukhan battinapati
<shanmukhan.b@gmail.com> wrote:
> Hi,
> I have a small doubt about how HDFS manages files internally.
> Assume I have a NameNode and 2 DataNodes, and I have inserted an 80MB CSV
> file into HDFS using the 'hadoop copyFromLocal' command.
> How will this file be stored in HDFS?
> Will it be split into two parts, one of 64MB (the default block size) and
> the remaining 16MB, and copied to the 2 DataNodes?
> If that is the case, when I run MapReduce on the two DataNodes, since the
> split is not line-oriented I may get unexpected results.
> How do I solve this type of issue? Please help me.
> Thanks & Regards,
> Shanmukhan.B

Harsh J
