hadoop-common-user mailing list archives

From prasenjit mukherjee <prasen....@gmail.com>
Subject How hdfs splits blocks on record boundaries
Date Thu, 14 Jun 2012 01:11:50 GMT
I have a text file which doesn't have any newline characters. The
records are separated by a special character (e.g. $). If I push a
single 5 GB file to HDFS, how will it identify the boundaries on
which the file should be split?
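As far as I understand, HDFS itself never looks at record boundaries at all: it cuts the file into fixed-size blocks at plain byte offsets (64 MB being a common default at the time), so a record can straddle any of those cuts. A small illustrative sketch of where the cut points land for a 5 GB file (the function name is mine, not part of any Hadoop API):

```python
# HDFS splits a file into fixed-size blocks by byte offset alone; it never
# inspects the content, so record separators play no role in the cut points.
BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, a common default block size
FILE_SIZE = 5 * 1024 * 1024 * 1024     # a 5 GB file

def block_ranges(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the (start, end) byte offsets of each HDFS block."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

blocks = block_ranges(FILE_SIZE)
print(len(blocks))   # 80 blocks for 5 GB at 64 MB each
print(blocks[0])     # (0, 67108864)
```

Any of those 80 offsets can fall in the middle of a $-delimited record, which is exactly why the question of boundary handling arises.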

What are the options I have in such a scenario so that I can run MapReduce jobs:

1. Replace the record separator with a newline? (Not very convincing, as I
have newlines in the data.)

2. Create 64 MB chunks by some preprocessing? (Would love to know if
it can be avoided.)

3. I can definitely write a custom loader for my MapReduce jobs, but
even then, is it possible to read across HDFS nodes if the files
are not aligned with record boundaries?
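On option 3: yes, a reader can cross a split boundary, and this is in fact how Hadoop's own LineRecordReader handles newline-delimited text. The usual convention is that each split skips the partial record at its start (it belongs to the previous split) and reads past its end to finish its last record, fetching those trailing bytes from the next block over the network if needed. A minimal sketch of that ownership rule, assuming a `$` delimiter; the function and its signature are illustrative, not a Hadoop API:

```python
# Sketch of a record reader that handles records crossing split boundaries,
# mirroring the strategy Hadoop's LineRecordReader uses for newlines, but
# with '$' as the delimiter. Each record is emitted by exactly one split.

def read_records(data: bytes, split_start: int, split_end: int,
                 delim: bytes = b"$"):
    """Yield the records owned by the split [split_start, split_end)."""
    pos = split_start
    if split_start != 0:
        # Back up one byte and scan to the first delimiter, as Hadoop does:
        # the partial record at the head of the split belongs to the
        # previous split. Backing up one byte means a record that starts
        # exactly at split_start is kept, not skipped.
        idx = data.find(delim, split_start - 1)
        if idx == -1:
            return
        pos = idx + len(delim)
    while pos < split_end:
        idx = data.find(delim, pos)
        if idx == -1:
            # Last record of the file has no trailing delimiter; read it
            # to the end, even past split_end.
            yield data[pos:]
            return
        yield data[pos:idx]
        pos = idx + len(delim)
```

For example, over `b"aa$bbbb$cc$dd"` the splits [0, 10) and [10, 13) together yield each of the four records exactly once, even though the cut at byte 10 falls on a record boundary and the record `cc` ends past neither split cleanly.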


Sent from my mobile device
