hadoop-common-user mailing list archives

From "Alejandro Abdelnur" <tuc...@gmail.com>
Subject Re: Building good XML parsing library Hadoop
Date Mon, 12 Nov 2007 07:44:33 GMT
We have similar requirements, but we handle them a little differently. When
uploading an XML file to DFS, we fragment it into records using a StAX
reader/writer. We read the elements that make up a record from the input XML
document and create a separate XML document for each record, which means we
have to inject all the namespace declarations from the root of the original
XML document into each record document. Then we write each record document
to DFS using a SequenceFile.Writer. After that things are easy: splitting is
taken care of by the built-in splitters, and each record can be parsed with
a regular XML parser, even a DOM parser, since there is one XML document per
record. But yes, you pay the penalty up front of parsing the XML at DFS
upload time, and this is done by a single thread. Our code is specific to
the types of XML documents we are handling; we could look at what can be
decoupled if there is interest.
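
A minimal sketch of the fragmenting step described above, using only the
StAX event API from the JDK. This is not our production code: the class
name XmlRecordSplitter, the split method, and matching records by local
element name are all assumptions made for illustration. It copies each
record element into its own document and re-declares the root element's
namespaces on it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Namespace;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: splits one XML document into standalone record documents.
public class XmlRecordSplitter {

    // Returns one serialized XML document per element whose local name is recordName.
    public static List<String> split(String xml, String recordName) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        Map<String, Namespace> rootNamespaces = null;
        List<String> records = new ArrayList<String>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()) {
                continue;
            }
            StartElement start = event.asStartElement();
            if (rootNamespaces == null) {
                // First start element is the root: remember its namespace declarations.
                rootNamespaces = collect(start, new LinkedHashMap<String, Namespace>());
            } else if (start.getName().getLocalPart().equals(recordName)) {
                records.add(copyRecord(reader, start, rootNamespaces));
            }
        }
        reader.close();
        return records;
    }

    // Adds the namespaces declared on this element to the map, keyed by prefix.
    private static Map<String, Namespace> collect(StartElement element,
                                                  Map<String, Namespace> into) {
        Iterator<?> it = element.getNamespaces();
        while (it.hasNext()) {
            Namespace ns = (Namespace) it.next();
            into.put(ns.getPrefix(), ns);
        }
        return into;
    }

    // Writes the record element and everything inside it as its own document.
    private static String copyRecord(XMLEventReader reader, StartElement start,
                                     Map<String, Namespace> rootNamespaces)
            throws XMLStreamException {
        // Root declarations first; declarations on the record itself override them.
        Map<String, Namespace> merged =
                collect(start, new LinkedHashMap<String, Namespace>(rootNamespaces));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        writer.add(XMLEventFactory.newInstance().createStartElement(
                start.getName(), start.getAttributes(), merged.values().iterator()));
        int depth = 1; // the record's start element is already open
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            writer.add(event);
        }
        writer.close();
        return out.toString();
    }
}
```

Each returned string could then be appended as a value through a
SequenceFile.Writer; that step is left out here because it needs Hadoop on
the classpath. Namespaces declared on elements between the root and the
record are not carried over in this sketch.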

Alejandro
(I'm going on vacation now, so I may not see replies until I'm back, 2 weeks
from now)

On Nov 12, 2007 12:54 PM, Alan Ho <karlunho@yahoo.ca> wrote:

> I've been looking long and hard for a good way to process XML. I've
> looked at the Streaming XML Record reader and, frankly, it doesn't
> look good.
>
> Here's how far I got prototyping:
>
> I've been using a StAX parser (the one that comes with Java EE 5). DOM
> and SAX don't cut it because the RecordReader interface needs the
> ability to "pull" record by record.
>
> I've also been using JAXB 2.0 in order to bind the XML to real Java
> objects.
>
>
> Here are some of my dilemmas:
>
> 1. FileSplit - I'm not sure if I should even try to implement this
> capability. I'm working off the LineRecordReader example, and the
> low-level manipulation of bytes seems really tricky. With StAX, I'm not
> able to track where in the file I've read up to, so I'm unable to
> figure out when to stop parsing a section of the file. The only way
> that I can see this working is to write my own extension of
> BufferedInputStream that tracks how many bytes have been read.
>
> 2. Should I even bother with JAXB? If it's cumbersome, then I'd
> rather not use it. Alternatively, each call to "next" could return
> a single record represented as XML.
>
> 3. Is a StAX parser adequate? I'm not sure it would be fast
> enough.
>
> 4. Am I reinventing the wheel? Has someone else done this? Please
> let me know.
>
> If someone is interested in my work, I could contribute it back to
> the community.
>
> Thanks,
> Alan Ho
>
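
The byte-tracking idea from item 1 of the quoted message could look roughly
like the sketch below. The class name ByteCountingInputStream is
hypothetical, and it wraps FilterInputStream rather than
BufferedInputStream, which is an assumption on my part. One caveat: a StAX
parser reads ahead into its own buffer, so the count is the number of bytes
the parser has pulled from the stream, which may be past the element it is
currently reporting.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper that counts how many bytes have been consumed so far.
public class ByteCountingInputStream extends FilterInputStream {

    private long count;

    public ByteCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            count++; // -1 means end of stream, nothing consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        count += skipped;
        return skipped;
    }

    // mark/reset would make the count ambiguous, so this sketch disables them.
    @Override
    public boolean markSupported() {
        return false;
    }

    // Total bytes read or skipped through this wrapper.
    public long getCount() {
        return count;
    }
}
```

A record reader could hand this stream to the StAX parser and compare
getCount() against the end offset of its split, keeping the read-ahead
caveat in mind.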
