hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Processing large XML file
Date Thu, 16 Oct 2008 16:52:55 GMT
Hey Holger,

Your project sounds interesting.  I would place the untarred file in  
DFS, then write the MapReduce application to use  
StreamXmlRecordReader.  This is a simple record reader which lets  
you specify beginning and end text strings (in your case, <text>  
and </text>, respectively).  The text between those strings would be  
the value Hadoop passes to your MapReduce application.

As long as you don't have nested <text> tags, this would work.
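To illustrate the mechanism, here is a simplified, in-memory sketch of what such a record reader does: scan for a begin string and an end string, and emit everything between them as one record.  (This is just an illustration in Python, not Hadoop's actual streaming implementation, which reads the file split incrementally rather than holding it in memory.)

```python
def xml_records(data, begin="<text>", end="</text>"):
    """Yield the content between each begin/end pair, in order.

    Simplified stand-in for what StreamXmlRecordReader does on a
    split of the input; here we assume the data fits in memory.
    """
    pos = 0
    while True:
        start = data.find(begin, pos)
        if start == -1:
            return
        start += len(begin)
        stop = data.find(end, start)
        if stop == -1:
            return  # unterminated record: ignore the trailing fragment
        yield data[start:stop]
        pos = stop + len(end)

# Example: two wiki-style records
dump = "<page><text>first article</text></page><page><text>second</text></page>"
print(list(xml_records(dump)))  # ['first article', 'second']
```

Each yielded record would be the value your map function receives, so counting sentences and words per article becomes a per-record operation.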


On Oct 16, 2008, at 10:28 AM, Holger Baumhaus wrote:

> Hello,
> I can't wrap my head around the best way to process a 20 GB  
> Wikipedia XML dump file [1] with Hadoop.
> The content I'm interested in is enclosed in the <text>-tags.  
> Usually a SAX parser would be the way to go, but since it is event  
> based, I don't think a MapReduce-based approach would offer much  
> benefit. An article is only around a few KBs in size.
> My other thought was to preprocess the file and split it up into  
> multiple text files of 128 MB each. That step alone takes  
> around 70 minutes on my machine. If the preprocessing step also did  
> the final processing (counting sentences and words in an  
> article), it wouldn't take much longer.
> So even if I used the split files with Hadoop, I wouldn't  
> really save any time, since I would have to upload the 157 files to  
> HDFS and then start the MR job.
> Are there other ways to handle XML files with Hadoop?
> Holger
> [1] http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2
> -- 
