hadoop-common-dev mailing list archives

From Alan Ho <karlu...@yahoo.ca>
Subject Re: Building good XML parsing library Hadoop
Date Mon, 12 Nov 2007 17:47:42 GMT
How would you use a regular SAX parser to implement the "next" method in the RecordReader?

Alan Ho

----- Original Message ----
From: Owen O'Malley <oom@yahoo-inc.com>
To: hadoop-dev@lucene.apache.org
Sent: Sunday, November 11, 2007 11:50:11 PM
Subject: Re: Building good XML parsing library Hadoop

On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:

> After looking long and hard for a good way to process XML, I've
> looked at the Streaming XML Record reader, and frankly, it doesn't
> look good.

Agreed, the Streaming XML record reader is a hack. My personal  
opinion is that the current design is broken enough to be  
problematic. I think the best approach would be to use a SAX parser  
and process each file as a single file split.

> I've been using a StAX parser (the one that comes with J2EE 5). DOM
> and SAX don't cut it because the RecordReader interface needs the
> ability to "pull" record by record.

I don't understand the problem. You should be able to implement the  
RecordReader interface with a SAX parser.
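
One way to reconcile SAX's push callbacks with a pull-style "next" is to run the parser on a background thread and hand completed records to the caller through a bounded queue. The following is only a minimal sketch of that idea and is not taken from this thread: SaxPullReader, fromString, and the "doc" record tag are hypothetical names, it uses only the JDK rather than the Hadoop RecordReader interface, and it assumes record elements are never nested inside each other.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical adapter: turns SAX push callbacks into a pull-style next(). */
class SaxPullReader {
    // End-of-input sentinel, compared by identity so it cannot clash with record text.
    private static final String EOF = new String("<eof>");
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    private volatile Exception failure;

    SaxPullReader(InputStream in, String recordTag) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                    private StringBuilder text; // non-null only while inside a record element

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        if (qName.equals(recordTag)) text = new StringBuilder();
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (text != null) text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (qName.equals(recordTag)) {
                            try {
                                queue.put(text.toString()); // blocks while the consumer lags
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                            text = null;
                        }
                    }
                });
            } catch (Exception e) {
                failure = e; // reported to the consumer once the queue drains
            } finally {
                try { queue.put(EOF); } catch (InterruptedException ignored) { }
            }
        });
        producer.setDaemon(true); // an abandoned reader must not keep the JVM alive
        producer.start();
    }

    /** Demo helper (hypothetical): builds a reader over an in-memory document. */
    static SaxPullReader fromString(String xml, String recordTag) {
        return new SaxPullReader(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), recordTag);
    }

    /** Returns the text of the next record element, or null at end of input. */
    String next() {
        try {
            String item = queue.take();
            if (item == EOF) {
                queue.put(EOF); // keep answering null on repeated calls
                if (failure != null) throw new IllegalStateException("XML parse failed", failure);
                return null;
            }
            return item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A RecordReader built this way would call next() on the adapter from its own "next" method and copy the result into the value object. The bounded queue keeps memory use flat, since the parser thread blocks whenever it gets 16 records ahead of the consumer.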

> 1. FileSplit - I'm not sure if I should even try to implement this
> capability. I'm working off the LineRecordReader example, and the
> low level manipulation of bytes seems really tricky. With StAX, I'm
> not able to track where in the file I've read up to, so I'm unable
> to figure out when to stop parsing a section of the file. The only
> way that I can see this work is to "extend" my own version of
> BufferedInputStream to track how many bytes have been read.
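
The byte-tracking idea in the quoted paragraph can be sketched with a small stream wrapper. This is an illustration under assumptions, not code from the thread: CountingInputStream and countAfterReading are hypothetical names, and the sketch extends FilterInputStream rather than BufferedInputStream to keep the counting logic short. Note that the count reflects how many bytes the parser has pulled from the stream, which can run ahead of the event it is currently reporting because parsers read in buffered chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical wrapper that counts the bytes handed to whoever reads from it. */
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++; // -1 signals end of stream and is not counted
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped; // skipped bytes also advance the file position
        return skipped;
    }

    /** Total bytes consumed so far (mark/reset is not accounted for in this sketch). */
    long getCount() {
        return count;
    }

    /** Demo helper (hypothetical): reads up to n bytes from data and reports the count. */
    static long countAfterReading(byte[] data, int n) {
        try (CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[n];
            int total = 0;
            int got;
            while (total < n && (got = in.read(buf, total, n - total)) > 0) {
                total += got;
            }
            return in.getCount();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing such a wrapper to the XML parser and polling getCount() between records gives an upper bound on the position reached, which is enough to report progress but not to locate exact record boundaries.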

With XML, you can't really start reading in the middle of the file.
So I don't see any way to handle file splits that are less than a  
full file.

> 2. Should I even bother with JAXB? If it's cumbersome, then I'd
> rather not use it. Alternatively, when calling "next", the
> application returns a single record represented by XML.

I think JAXB would be overkill. A simple SAX parser should be fine, I  

> 4. Am I re-inventing the wheel? Has someone else done this?
> Please let me know.

I don't think anyone has done it yet. If you can make a generally  
useful InputFormat, it would be nice to contribute it back.

-- Owen
