hadoop-common-dev mailing list archives

From Alan Ho <karlu...@yahoo.ca>
Subject Re: Building good XML parsing library Hadoop
Date Wed, 14 Nov 2007 09:54:20 GMT
Thanks everyone.

I've done a first cut at the parser. I've made the following assumptions:

1. The input is a sequence of XML elements
2. There is no "recursion" of elements
3. Fixed depth of 1

My design is fairly simple: it reads records one by one and puts
them into Strings (almost identical to TextInputFormat &

I decided to skip the whole FileSplit issue for now. The user
basically needs to specify the element name that denotes a
record. I've got a couple of questions for the community:

1. Do we see a lot of performance benefits by using FileSplit for  
text files ?
2. What StAX parser do people consider the fastest ?
3. Does it make sense to "assume" that for an xml file, the first  
"sequence" is the sequence of records ? If so, I'm thinking about  
putting in a convenience function that will "detect" what the element  
name is for records.
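
A minimal sketch of the design described above and of the "detect" idea in question 3, not the actual first cut. The class and method names (XmlRecordSketch, detectRecordElement, readRecords) are hypothetical. It assumes the three restrictions listed earlier (a sequence of elements, no recursion, depth 1) and collects records into a List for brevity, where a real RecordReader would hand them out one per call to next.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
class XmlRecordSketch {

    /** Guesses the record element: the first element nested directly under the root. */
    public static String detectRecordElement(String xml) throws XMLStreamException {
        XMLEventReader in = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        try {
            int depth = 0;
            while (in.hasNext()) {
                XMLEvent e = in.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 2) {
                        return e.asStartElement().getName().getLocalPart();
                    }
                } else if (e.isEndElement()) {
                    depth--;
                }
            }
            return null; // no nested element found
        } finally {
            in.close();
        }
    }

    /** Pulls every <recordName>...</recordName> element out as a String. */
    public static List<String> readRecords(String xml, String recordName)
            throws XMLStreamException {
        XMLEventReader in = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        List<String> records = new ArrayList<>();
        try {
            while (in.hasNext()) {
                XMLEvent e = in.nextEvent();
                if (!e.isStartElement()
                        || !e.asStartElement().getName().getLocalPart().equals(recordName)) {
                    continue; // skip everything outside a record
                }
                StringWriter buf = new StringWriter();
                XMLEventWriter out = outFactory.createXMLEventWriter(buf);
                out.add(e);
                // No recursion is assumed, so the first matching end tag closes the record.
                while (in.hasNext()) {
                    XMLEvent inner = in.nextEvent();
                    out.add(inner);
                    if (inner.isEndElement()
                            && inner.asEndElement().getName().getLocalPart().equals(recordName)) {
                        break;
                    }
                }
                out.flush();
                out.close();
                records.add(buf.toString());
            }
        } finally {
            in.close();
        }
        return records;
    }
}
```

A caller could pass the result of detectRecordElement straight into readRecords when the user has not named the record element. Whether the first nested element is always the right guess is exactly the open question above.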

I'm going to do some performance testing as well. To the yahoo guys -
is there a point of contact through whom I can get my changes
integrated into the trunk ?

Alan Ho

On Nov 12, 2007, at 11:11 AM, Arkady Borkovsky wrote:

> Alan,
> Can you tell a little more about specific needs you try to cover?
> Do you deal with full XML?  Correct XML?
> A pretty common situation is
> -- the input is a sequence of XML elements ("records"), and the  
> application does not care about the "top element" that covers the  
> whole file
> -- there is no "recursion" -- that is an element <A>...</A>  never  
> appears inside another <A> element.
> -- as a special case, the tree has fixed depth, (often 1)
> -Arkady
> On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:
>> I've been looking long and hard for a good way to process XML. I've
>> looked at the Streaming XML Record reader, and frankly - it
>> doesn't look good.
>> Here's how far I got prototyping:
>> I've been using a StAX parser (the one that comes with J2EE 5).
>> DOM and SAX don't cut it because the RecordReader interface needs
>> the ability to "pull" record by record.
>> I've also been using JAXB 2.0 in order to bind the XML to real  
>> java objects.
>> Here are some of my dilemmas:
>> 1. FileSplit - I'm not sure if I should even try to implement this
>> capability. I'm working off the LineRecordReader example, and the
>> low level manipulation of bytes seems really tricky. With StAX, I'm
>> not able to track where in the file I've read up to, so I'm unable
>> to figure out when to stop parsing a section of the file. The only
>> way that I can see this work is to "extend" my own version of
>> BufferedInputStream to track how many bytes have been read.
>> 2. Should I even bother with JAXB ? If it's cumbersome, then I'd
>> rather not use it. Alternatively, when calling "next", the  
>> application returns a single record represented by XML.
>> 3. Is a StAX parser adequate ? I'm not sure that the speed would  
>> be fast enough.
>> 4. Am I re-inventing the wheel - has someone else done this ?
>> Please let me know.
>> If someone is interested in my work, I could contribute back to  
>> the community.
>> Thanks,
>> Alan Ho
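
On the byte-tracking idea in dilemma 1 of the quoted message, a rough sketch of a stream wrapper that counts bytes as they are read. The class name ByteCountingInputStream is hypothetical, and it extends FilterInputStream rather than BufferedInputStream purely to keep the example small. One caveat to assume: a StAX parser typically reads ahead into its own buffer, so the count shows how far the parser has fetched, not the exact byte where the current record ends.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper, for illustration only; mark/reset are not accounted for.
class ByteCountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    public ByteCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++; // one byte consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // read(byte[]) in FilterInputStream delegates here, so it is counted once.
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped;
        return skipped;
    }

    /** Total bytes pulled from the underlying stream so far. */
    public long getBytesRead() {
        return bytesRead;
    }
}
```

The parser would be created on top of this wrapper, and the reader could compare getBytesRead() against the end of its split after each record to decide when to stop.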
