hadoop-common-user mailing list archives

From Alan Ho <karlu...@yahoo.ca>
Subject Building good XML parsing library Hadoop
Date Mon, 12 Nov 2007 07:24:13 GMT
I've been looking long and hard for a good way to process XML in
Hadoop. I've looked at the Streaming XML Record reader, and frankly,
it doesn't look good.

Here's how far I got prototyping:

I've been using a StAX parser (the one that comes with J2EE 5). DOM
and SAX don't cut it because the RecordReader interface needs the
ability to "pull" record by record.
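
To make the "pull" requirement concrete, here is a minimal sketch of
record-by-record reading with the standard javax.xml.stream API. The
class name StaxPullSketch, the element name "record", and the
assumption that each record holds only text are all hypothetical and
chosen for illustration; this is not the prototype described in this
message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: the caller asks for one record at a time,
// which is the shape a RecordReader-style next() call needs.
class StaxPullSketch {

    // Advances to the next <record> element and returns its text,
    // or null once the input is exhausted.
    public static String nextRecord(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {
                // Assumes text-only content; nested elements would throw here.
                return reader.getElementText();
            }
        }
        return null;
    }

    // Convenience driver: pulls records until nextRecord returns null.
    public static List<String> readAll(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> records = new ArrayList<>();
        for (String r = nextRecord(reader); r != null; r = nextRecord(reader)) {
            records.add(r);
        }
        reader.close();
        return records;
    }
}
```

The caller drives the loop, so parsing can stop after any record. With
SAX the parser drives the callbacks instead, which is why it is awkward
to fit behind a next()-style interface.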

I've also been using JAXB 2.0 in order to bind the XML to real Java
objects.

Here are some of my dilemmas:

1. FileSplit - I'm not sure if I should even try to implement this
capability. I'm working off the LineRecordReader example, and the
low-level manipulation of bytes seems really tricky. With StAX, I'm
not able to track how far into the file I've read, so I can't figure
out when to stop parsing a section of the file. The only way I can
see this working is to write my own extension of BufferedInputStream
that tracks how many bytes have been read.

2. Should I even bother with JAXB? If it's cumbersome, then I'd
rather not use it. Alternatively, when "next" is called, the reader
could return a single record represented as XML.

3. Is a StAX parser adequate? I'm not sure it would be fast enough.

4. Am I re-inventing the wheel? Has someone else done this? Please
let me know.
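
For dilemma 1, the byte-tracking idea might look roughly like the
sketch below. CountingInputStream is a hypothetical name, and it wraps
java.io.FilterInputStream rather than extending BufferedInputStream,
which is a substitution made here to keep the example short. One caveat
to keep in mind: a StAX parser typically reads ahead in chunks, so the
count is the number of bytes handed to the parser, which may be past
the record it has most recently returned.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: counts the bytes consumed through this stream.
class CountingInputStream extends FilterInputStream {
    private long count;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++; // -1 means end of stream, so nothing was consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    // Report mark/reset as unsupported so the count never moves backwards.
    @Override
    public boolean markSupported() {
        return false;
    }

    // Total bytes read or skipped so far.
    public long getCount() {
        return count;
    }
}
```

A reader for one split could wrap the file stream in this class before
handing it to the XML parser, then compare getCount() against the split
length after each record to decide when to stop. Because of the
read-ahead noted above, that check is only approximate.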

If someone is interested in my work, I could contribute back to the  

Alan Ho
