hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: splitting of big files?
Date Tue, 27 May 2008 14:21:11 GMT
On Monday, 26.05.2008, at 11:32 +0000, koara@atlas.cz wrote:
> Hello,
> 
> I am considering using Hadoop map/reduce but have some difficulties getting around the
> basic concepts of chunk distribution.
> 
> How does the 'distributed' processing of large files account for the fact that some files
> cannot be split at the (64 MB) boundary?
> Such as large text files (many gigs) that need to be processed line by line -- splitting
> a line mid-way and processing an incomplete partial chunk on some worker may be a serious
> error, depending on the application.
> 
> Can somebody please tell me where I am wrong in my thinking here. Links to relevant
> documentation passages/tutorials are welcome too, cheers.

Well, if you are using streaming.jar, it handles this for you.

But basically the framework follows a few rules here (don't hold me to the
exact details), and it goes something like this, taking \n-delimited
lines as the "logical units":

Any map task is given an offset and length.

Any map task reads past the end of its split into the following block to finish its last line.
Any map task but the first ignores the partial first line, since the previous task has already read it to completion.
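
To make that concrete, here is a minimal standalone Java sketch of those
rules. It is not Hadoop's actual LineRecordReader, just plain java.io against
a local file; the file name and the 64 MB split size are only there to
exercise the logic:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the split rules above -- NOT Hadoop's LineRecordReader.
// It treats a local file as a series of fixed-size "splits" and shows which
// complete lines each hypothetical map task would see.
public class SplitRulesSketch {

    // Return the lines belonging to the split [start, start + length).
    static List<String> readSplit(RandomAccessFile file, long start, long length)
            throws IOException {
        long end = start + length;
        file.seek(start);

        // Rule: every task but the first ignores its (possibly partial) first
        // line -- the previous task has already read it to completion.
        if (start != 0) {
            file.readLine();
        }

        List<String> lines = new ArrayList<>();
        // Rule: keep reading as long as the line *starts* at or before the
        // split end, even if it finishes inside the next block. Using <= here
        // pairs with the unconditional skip above, so a line that begins
        // exactly on a boundary is read by exactly one task.
        while (file.getFilePointer() <= end) {
            String line = file.readLine(); // byte-oriented; fine for a sketch
            if (line == null) {
                break; // end of file
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file and split size, just to exercise the logic.
        String path = args.length > 0 ? args[0] : "big.txt";
        long splitSize = 64L * 1024 * 1024; // 64 MB

        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            for (long start = 0; start < fileLength; start += splitSize) {
                long length = Math.min(splitSize, fileLength - start);
                int count = readSplit(file, start, length).size();
                System.out.println("split at " + start + ": " + count + " complete lines");
            }
        }
    }
}

The real record readers do the equivalent bookkeeping over HDFS blocks, which
is why a mapper never sees a torn line.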

Andreas
