hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: What if an XML file is across boundary of HDFS chunks?
Date Mon, 16 Nov 2009 20:05:51 GMT
Hey Steve,

Look at the mailing list archives: at least two different people have suggested a
specialized input splitter that you could use.

Brian

On Nov 16, 2009, at 2:02 PM, Steve Gao wrote:

> Thanks, but this is not a neat solution when the XML block is very large.
> Does anybody have another solution? Thanks!
> 
> --- On Thu, 10/29/09, Amandeep Khurana <amansk@gmail.com> wrote:
> 
> From: Amandeep Khurana <amansk@gmail.com>
> Subject: Re: What if an XML file is across boundary of HDFS chunks?
> To: common-user@hadoop.apache.org
> Date: Thursday, October 29, 2009, 5:12 PM
> 
> Store the entire XML in one line...
> 
> On 10/29/09, Steve Gao <steve.gao@yahoo.com> wrote:
>> Does anybody have a similar issue? If you store XML files in HDFS, how can
>> you make sure a chunk read by a mapper does not contain partial data of an
>> XML segment?
>> 
>> For example:
>> 
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>> <author>me</author>
>> <year>2009</year>
>> <book>book3</book>
>> <author>me</author>
>> <year>2009</year>
>> </title>
>> 
>> 
>> 
>> 
> 
> 
> -- 
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> 
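
The thread does not name the input splitter it refers to, so the following is only a minimal sketch of the general idea behind such splitters, not the code from the archives and not a Hadoop API. It is plain Java that simulates one reader per split over a byte array. Two assumptions are made that are not in the original messages: each record is wrapped in its own element (the hypothetical <entry>...</entry> tags), and records do not nest. The rule is that a reader owns every record whose start tag begins inside its split, and it keeps reading past the end of the split until it reaches the end tag. Hadoop's line-oriented reader treats a line that crosses a split boundary in a comparable way.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: simulates how a split-aware reader assigns XML records to splits. */
final class XmlSplitSketch {

    /** Index of the first occurrence of pattern in data at or after from, or -1 if none. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record may extend past splitEnd; the reader continues until the end tag.
     * startTag and endTag are hypothetical record delimiters chosen by the caller.
     */
    static List<String> readSplit(byte[] file, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(file, open, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // no more start tags, or the next one belongs to a later split
            }
            int finish = indexOf(file, close, begin + open.length);
            if (finish < 0) {
                break; // unterminated record at end of file: skip it
            }
            int recordEnd = finish + close.length; // may lie beyond splitEnd
            records.add(new String(file, begin, recordEnd - begin, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }
}
```

With this rule, calling readSplit once per split with adjacent, non-overlapping ranges returns each record exactly once, wherever the boundary falls, and a large record never has to fit on a single line. In a real job the same logic would live in a custom record reader that reads from the HDFS stream instead of an in-memory array.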

