hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: DFS block size
Date Sat, 14 Nov 2009 21:07:14 GMT
Replies inline.

On 11/14/09 9:55 PM, "Hrishikesh Agashe" <hrishikesh_agashe@persistent.co.in> wrote:


The default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS,
it will not be divided any further?

--Yes, the file will be stored in a single block per replica.
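
As a rough illustration of the answer above, the number of blocks a file occupies per replica is its size divided by the block size, rounded up. The sketch below is plain Java and does not call any Hadoop API; the class and method names are hypothetical, and the 64 MB figure is simply the default mentioned in the question.

```java
// Hypothetical helper, not part of Hadoop: block-count arithmetic only.
final class BlockCount {
    static final long MB = 1024L * 1024L;

    // Blocks needed per replica for a file of fileBytes, given blockBytes per block.
    static long blocksFor(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        // Ceiling division: any partial block still counts as one block.
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB; // default dfs.block.size discussed in this thread
        System.out.println(blocksFor(10 * MB, blockSize)); // file smaller than one block
        System.out.println(blocksFor(64 * MB, blockSize)); // file exactly one block
        System.out.println(blocksFor(65 * MB, blockSize)); // file slightly over one block
    }
}
```

With a 64 MB block size, the 10 MB and 64 MB files each fit in one block, while the 65 MB file needs two.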

I have lots and lots of XMLs and I would like to process them directly. Currently I am converting
them to sequence files (10 XMLs per sequence file) and then putting them on HDFS. However, creating
sequence files is a very time-consuming process. So if I just ensure that all XMLs are smaller
than 64 MB (or the value of dfs.block.size), they will not be split and I can safely process them
in map/reduce using a SAX parser?

--True, but having too many small files is generally not recommended, since they eat into NameNode (NN)
resources and add overhead to mapred jobs, along with other issues discussed previously in this forum.
Cloudera has a pretty detailed blog post on this. Alternatively, you can also define the split
size to be used in your map-red code using the configuration parameter mapred.min.split.size
(it doesn't work with all formats :| ). For XML, there is a streamxml or similarly named
format you may want to consider.
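
To make the effect of mapred.min.split.size more concrete, here is a minimal sketch of the split-size rule as I understand it from the old-API FileInputFormat: the split size is the larger of the minimum split size and the smaller of the "goal" size and the block size. This is plain Java with hypothetical class and method names, not Hadoop code, and formats that compute their own splits may behave differently.

```java
// Hypothetical sketch, not Hadoop code: assumed FileInputFormat-style split sizing.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // goalSize:  total input bytes divided by the requested number of map tasks
    // minSize:   value of mapred.min.split.size
    // blockSize: value of dfs.block.size for the file
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        long goalSize = 512 * MB; // e.g. 1 GB of input with 2 requested map tasks

        // Tiny minimum split size: the block size caps the split.
        System.out.println(splitSize(goalSize, 1, blockSize) / MB + " MB");

        // Raising the minimum above the block size yields larger splits.
        System.out.println(splitSize(goalSize, 128 * MB, blockSize) / MB + " MB");
    }
}
```

Note that under this rule a split never spans more than one file, so raising the minimum split size changes how large files are cut up but does not merge many small XML files into fewer map tasks.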


If this is not possible, is there a way to speed up the sequence file creation process?

