hadoop-common-dev mailing list archives

From "Alan Ho (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2439) Hadoop needs a better XML Input
Date Sat, 14 Feb 2009 01:17:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673445#action_12673445 ]

Alan Ho commented on HADOOP-2439:
---------------------------------

The approach described doesn't use a StAX parser, and it probably isn't going to be as robust
to failure or as extensible as one that does. If you look at the code, my patch allows
you to specify the XML element name, "namespace_prefix", and namespace_URI when identifying
the correct tag. My patch also makes it easier to massage the XML while reading in the data.
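
To illustrate what matching a tag by element name and namespace URI looks like with a StAX
pull-parser, here is a minimal sketch using the JDK's javax.xml.stream API. It is not the code
from the attached patch: the class name StaxElementScan, the method extract, and the
urn:example namespace are hypothetical, and the sketch assumes the matched elements contain
only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Hadoop or of the attached patch.
final class StaxElementScan {

    // Returns the text of every element whose namespace URI and local name
    // both match. Assumes matched elements hold text only: getElementText()
    // fails if it meets a nested element.
    static List<String> extract(String xml, String namespaceUri, String localName) {
        List<String> records = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Pull the next event and keep only matching start tags.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())
                            && namespaceUri.equals(reader.getNamespaceURI())) {
                        records.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed input surfaces here instead of being silently mis-split.
            throw new IllegalStateException("Malformed XML input", e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:w=\"urn:example\">"
                + "<w:page>first</w:page><page>other</page><w:page>second</w:page>"
                + "</root>";
        System.out.println(extract(xml, "urn:example", "page"));
    }
}
```

Because the match is on the namespace URI rather than on the raw tag text, the unqualified
page element in the sample input is skipped even though its local name is the same.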

Initially, when I tried to create an XML parser, I tried to hack something up like the
approach described previously. But after trying to parse real-world data (e.g. a dump of Wikipedia),
I threw up my hands and decided to use a proper pull-parser.



> Hadoop needs a better XML Input
> -------------------------------
>
>                 Key: HADOOP-2439
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2439
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.1
>            Reporter: Alan Ho
>            Priority: Minor
>         Attachments: HADOOP-2439Patch.patch
>
>
> Hadoop does not have a good XML parser for XML input. The XML parser in the streaming
> class is fairly difficult to work with and doesn't have proper test cases around it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

