hadoop-common-dev mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: Building good XML parsing library Hadoop
Date Wed, 14 Nov 2007 20:33:39 GMT
On Wed, Nov 14, 2007 at 12:25:05PM -0800, Alan Ho wrote:
>OK, I'll be finished with a tutorial and tests by end of this week or next. I've decided
to stuff each record into a String, so that it's up to the person who writes the map class
to decide how to parse it. My intention is that "power users" will use JAXB to bind the String
to Java classes. Less sophisticated users can parse the String into a DOM tree and manipulate
that instead. I'll provide sample code and a tutorial for those who want to use JAXB.
>I'll also make the "key" for each record be the "fully qualified element name" +
index. I'm not sure about the internals of hadoop, but I assume that has some sort of usefulness,
such as reporting which records cause map tasks to fail, etc.
>Since I'm using the StAX parser that comes with J2EE 5, I don't think I'll run into character
encoding issues or CDATA issues.
>Can someone hand-hold me through the process of contributing this code ?



>Alan Ho
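
The design described above (one String per record, keyed by element name plus index) can be sketched roughly as follows. This is only an illustration and not the code Alan contributed: the class name XmlRecordSplitter, the method split, and the "#" separator in the key are assumptions made for the example, and the sketch reads from an in-memory String rather than a Hadoop RecordReader.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: pulls each <recordTag> element out of a document
// and stores it as a String, keyed by "<element name>#<index>".
final class XmlRecordSplitter {

    static Map<String, String> split(String xml, String recordTag) {
        Map<String, String> records = new LinkedHashMap<>();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            int index = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // skip everything outside a record
                }
                QName name = event.asStartElement().getName();
                if (!name.getLocalPart().equals(recordTag)) {
                    continue; // e.g. the top element that wraps the whole file
                }
                // Copy events into a buffer until the record's own end tag.
                StringWriter buffer = new StringWriter();
                XMLEventWriter writer = outFactory.createXMLEventWriter(buffer);
                int depth = 0;
                while (true) {
                    writer.add(event);
                    if (event.isStartElement()) {
                        depth++;
                    } else if (event.isEndElement() && --depth == 0) {
                        break;
                    }
                    event = reader.nextEvent();
                }
                writer.close();
                // QName.toString() yields "{namespace}local", or just "local"
                // when the element has no namespace.
                records.put(name + "#" + index++, buffer.toString());
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return records;
    }
}
```

A map class would then receive each value as a String and could hand it to JAXB or a DOM parser, as proposed above.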
>----- Original Message ----
>From: Arkady Borkovsky <arkady@yahoo-inc.com>
>To: hadoop-dev@lucene.apache.org
>Sent: Wednesday, November 14, 2007 9:06:45 AM
>Subject: Re: Building good XML parsing library Hadoop
>The "restricted XML" described by Alan is very common.
>It would be good to have an InputFormat that
>-- given a top level tag T splits the input into chunks with one T  
>element in each chunk.
>-- takes care of encoding, character-entity references, and CDATA
>-- puts a record into a hashtable (Dictionary) with keys that
>correspond to tags and tag-attribute-name pairs and values that are
>strings representing either element content or attribute values.
>-- a config for this InputFormat will say which of these strings are  
>put in which columns when fed to the streaming command.  Other  
>conventions are possible, too.
>This probably was the intent of the original "Streaming XML record  
>-- ab
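
The hashtable idea in the message above (keys for tags and tag-attribute-name pairs, values for element content or attribute values) might look something like the sketch below for a single depth-1 record. The class name XmlRecordFlattener and the "tag@attribute" key convention are assumptions chosen for illustration; the thread does not specify a key syntax. Setting IS_COALESCING lets the StAX parser resolve character-entity references and CDATA sections into plain text.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: flattens one record element into a dictionary.
final class XmlRecordFlattener {

    static Map<String, String> flatten(String recordXml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text, entity references, and CDATA into one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(recordXml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        text.setLength(0);
                        // One key per tag-attribute-name pair.
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            fields.put(
                                reader.getLocalName() + "@" + reader.getAttributeLocalName(i),
                                reader.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        // One key per tag that carried non-blank content.
                        String value = text.toString().trim();
                        if (!value.isEmpty()) {
                            fields.put(reader.getLocalName(), value);
                        }
                        text.setLength(0);
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML record", e);
        }
        return fields;
    }
}
```

A config for the InputFormat could then name which of these keys go into which columns. This sketch relies on the "no recursion, fixed depth" restriction discussed in this thread; repeated tags within one record would overwrite each other.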
>On Nov 14, 2007, at 1:54 AM, Alan Ho wrote:
>> Thanks everyone.
>> I've done a first cut at the parser. I've made the following  
>> assumptions:
>> 1. The input is a sequence of XML elements
>> 2. There is no "recursion" of elements
>> 3. Fixed depth of 1
>> My design is fairly simple - it simply reads records 1 by 1 and  
>> puts them into Strings (almost identical to TextInputFormat &  
>> LineRecordReader).
>> I decided to skip the whole FileSplit issue for now. The user  
>> basically needs to specify the element name that denotes a  
>> record. I've got a couple of questions for the community:
>On Nov 11, 2007, at 11:50 PM, Owen O'Malley wrote:
>> On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:
>>> After looking long and hard for a good way to process XML, I've  
>>> looked at the Streaming XML Record reader, and frankly - it  
>>> doesn't look good.
>> Agreed, the Streaming XML record reader is a hack. My personal  
>> opinion is that the current design is broken enough to be problematic.
>> 1. Do we see a lot of performance benefits by using FileSplit for  
>> text files ?
>> 2. What StAX parser do people consider the fastest ?
>> 3. Does it make sense to "assume" that for an xml file, the first  
>> "sequence" is the sequence of records ? If so, I'm thinking about  
>> putting in a convenience function that will "detect" what the  
>> element name is for records.
>> I'm going to do some performance testing as well. To the yahoo guys  
>> - Is there a point of contact that I can get my changes integrated  
>> into the trunk ?
>> Thanks,
>> Alan Ho
>> On Nov 12, 2007, at 11:11 AM, Arkady Borkovsky wrote:
>>> Alan,
>>> Can you tell a little more about specific needs you try to cover?
>>> Do you deal with full XML?  Correct XML?
>>> A pretty common situation is
>>> -- the input is a sequence of XML elements ("records"), and the  
>>> application does not care about the "top element" that covers the  
>>> whole file
>>> -- there is no "recursion" -- that is an element <A>...</A>  never
>>> appears inside another <A> element.
>>> -- as a special case, the tree has a fixed depth (often 1)
>>> -Arkady
>>> On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:
>>>> After looking long and hard for a good way to process XML, I've  
>>>> looked at the Streaming XML Record reader, and frankly - it  
>>>> doesn't look good.
>>>> Here's how far I got prototyping:
>>>> I've been using a StAX parser (the one that comes with J2EE 5).  
>>>> DOM and SAX don't cut it because the RecordReader interface needs  
>>>> the ability to "pull" record by record.
>>>> I've also been using JAXB 2.0 in order to bind the XML to real  
>>>> java objects.
>>>> Here are some of my dilemmas:
>>>> 1. FileSplit - I'm not sure if I should even try to implement  
>>>> this capability. I'm working off the LineRecordReader example,  
>>>> and the low level manipulation of bytes seems really tricky. With  
>>>> StAX, I'm not able to track where in the file I've read up to, so  
>>>> I'm unable to figure out when to stop parsing a section of the  
>>>> file. The only way that I can see this work is to "extend" my own  
>>>> version of BufferInputStream to track how many bytes have been read.
>>>> 2. Should I even bother with JAXB ? If it's cumbersome, then I'd  
>>>> rather not use it. Alternatively, when calling "next", the  
>>>> application returns a single record represented by XML.
>>>> 3. Is a StAX parser adequate ? I'm not sure that the speed would  
>>>> be fast enough.
>>>> 4. Am I reinventing the wheel? Has someone else done this?  
>>>> Please let me know.
>>>> If someone is interested in my work, I could contribute back to  
>>>> the community.
>>>> Thanks,
>>>> Alan Ho
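
The byte-tracking idea from dilemma 1 in the original message (a stream wrapper that records how many bytes have been read) could be sketched as below. The class name CountingInputStream and the accessor getCount are assumptions for this example, not an existing Hadoop class. One caveat worth stating: a StAX parser typically reads ahead in blocks, so the count is an upper bound on the parser's position in the file rather than the exact offset of the current record.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper that counts the bytes pulled from the underlying stream.
final class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++; // -1 signals end of stream and is not counted
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would make the count ambiguous
    }

    // Total bytes consumed from the wrapped stream so far.
    long getCount() {
        return count;
    }
}
```

A record reader could wrap the file stream in this class, pass it to the parser, and stop once getCount() moves past the end of its split, accepting that it may overshoot by up to one parser buffer.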