hadoop-common-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: Regarding loading a big XML file to HDFS
Date Tue, 22 Nov 2011 11:20:39 GMT
If your file is larger than the HDFS block size (typically 64 MB or 128 MB), it will be split into
more than one block. The blocks may or may not be stored on different datanodes. If you're
using a default InputFormat, the input will also be split across more than one map task. Since you said
you need the whole file in order to process it, you should use either a custom InputFormat
that doesn't split, or something like WholeFileInputFormat, which returns the whole file
as a single record.
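
To make the difference concrete, here is a rough sketch in plain Java of how a file might be carved into input splits, with and without splitting. It is a simplified model, not Hadoop code: SplitSketch, Split, and computeSplits are hypothetical names, the 200 MB file size is made up, and the real split calculation has more knobs than this. In Hadoop itself, the usual way to get the non-splitting behaviour is to override isSplitable in a FileInputFormat subclass so that it returns false.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of input-split calculation. Not Hadoop code. */
final class SplitSketch {

    /** A byte range of the file that would be handed to one map task. */
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Returns one split per block-sized range when splittable is true,
     * or a single split covering the whole file when it is false.
     */
    static List<Split> computeSplits(long fileLength, long blockSize, boolean splittable) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // empty file: nothing to process
        }
        if (!splittable) {
            // Whole file goes to a single map task, even if it spans many blocks.
            splits.add(new Split(0, fileLength));
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new Split(offset, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 200 * mb; // assumed size of the XML file
        long blockSize = 64 * mb;   // one of the typical block sizes mentioned above

        System.out.println(computeSplits(fileLength, blockSize, true).size());  // prints 4
        System.out.println(computeSplits(fileLength, blockSize, false).size()); // prints 1
    }
}
```

With splitting enabled, a record reader may start in the middle of an XML document. With splitting disabled, one map task sees the whole file, at the cost of reading the non-local blocks over the network.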


On Nov 21, 2011, at 20:20, hari708 <haridcs@gmail.com> wrote:

> Hi,
> I have a big file consisting of XML data. The XML is not represented as a
> single line in the file. If we stream this file to a Hadoop directory using the
> ./hadoop dfs -put command, how does the distribution happen?
> Basically, in my MapReduce program I am expecting a complete XML document as my
> input. I have a CustomReader (for XML) in my MapReduce job configuration. My
> main confusion is that if the namenode distributes data to the datanodes, there is a
> chance that one part of the XML goes to one datanode and the other half goes to
> another datanode. If that is the case, will my custom XMLReader in the
> MapReduce job be able to combine it (as MapReduce reads data locally only)?
> Please help me with this.
> -- 
> View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871901p32871901.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
