hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Hadoop and processing of XML files
Date Thu, 16 Nov 2006 17:30:20 GMT
grad0584@di.uoa.gr wrote:
> I am particularly concerned about the fact that Hadoop FS stores data in huge
> blocks and the scheduler &#8220;cuts&#8221; it in arbitrary byte indexes prior
to MapReduce.
> This way, many XML files will co-exist in one block and one of them will
> certainly be cut in half.

Split points are indeed usually random, to avoid creating a centralized 
i/o bottleneck when splitting.  But the convention is that split data 
begins with the first object after the split point (unless at the beginning 
of the file) and continues through the object straddling the end of the split 
(if any).  Thus a little cross-block i/o is performed while processing 
the end of the split.

To make this work, one must be able to find the start of an object from 
a random point in the file.  With XML this is possible if, for example, one 
knows the top-level element and knows that it will not occur anywhere except 
at the top level.  In that case one can simply scan for the next 
<top-level-element> start tag at the beginning of each split.
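
In case it helps to see the idea in code, here is a minimal sketch of that 
scanning convention in plain Java, deliberately not tied to any particular 
Hadoop API version.  The class and method names, and the assumption that the 
top-level element is written without attributes, are mine for illustration; 
in a real job this logic would sit inside the RecordReader returned by a 
custom InputFormat.

import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch only: hands back complete top-level elements starting from an
 * arbitrary split point.  Skips forward to the first start tag inside the
 * split, then reads each element to its end tag, even when that runs past
 * the end of the split (the "little cross-block i/o" mentioned above).
 */
public class XmlSplitScanner {

  private final InputStream in;      // positioned at splitStart by the caller
  private final byte[] startTag;
  private final byte[] endTag;
  private final long splitEnd;       // byte offset where this split ends
  private long pos;                  // current byte offset in the file

  public XmlSplitScanner(InputStream in, long splitStart, long splitEnd,
                         String topLevelElement) {
    this.in = in;
    this.pos = splitStart;
    this.splitEnd = splitEnd;
    this.startTag = ("<" + topLevelElement + ">").getBytes();
    this.endTag = ("</" + topLevelElement + ">").getBytes();
  }

  /** Returns the next complete element, or null when this split is done. */
  public String next() throws IOException {
    if (pos >= splitEnd) {
      return null;
    }
    long tagStart = skipPast(startTag, null);
    // Only claim elements whose start tag begins inside this split; the
    // element body may still run past splitEnd.
    if (tagStart < 0 || tagStart >= splitEnd) {
      return null;
    }
    StringBuilder element = new StringBuilder(new String(startTag));
    if (skipPast(endTag, element) < 0) {
      return null;                   // truncated file
    }
    return element.append(new String(endTag)).toString();
  }

  /**
   * Reads until the given tag has been consumed.  Returns the offset at
   * which the tag began, or -1 at end of file.  If buf is non-null, all
   * bytes read before the tag are appended to it.
   */
  private long skipPast(byte[] tag, StringBuilder buf) throws IOException {
    int matched = 0;
    int b;
    while ((b = in.read()) != -1) {
      pos++;
      if (b == tag[matched]) {
        matched++;
        if (matched == tag.length) {
          return pos - tag.length;
        }
      } else {
        if (buf != null) {
          buf.append(new String(tag, 0, matched));   // flush partial match
        }
        if (b == tag[0]) {
          matched = 1;                               // tag may restart here
        } else {
          if (buf != null) {
            buf.append((char) b);
          }
          matched = 0;
        }
      }
    }
    return -1;
  }
}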

> Does the Hadoop API and architecture in general give the developer of the
> MapReduce functions a chance of reliably reconstructing the original files that
> compose each block for some processing other than grep-like (in my case,
> SAX-driven parsing)?
> Any ideas on how this might be achieved or where I should start digging in the
> Javadocs?

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html
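
Once you have an InputFormat along those lines, wiring it into a job is a 
single JobConf call.  The driver and InputFormat class names below are 
hypothetical, just to show where the hook goes:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyXmlJob {                            // hypothetical driver class
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyXmlJob.class);
    conf.setInputFormat(MyXmlInputFormat.class);   // hypothetical custom InputFormat
    // ... set mapper, reducer, and input/output paths as usual ...
    JobClient.runJob(conf);
  }
}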

Cheers,

Doug
