hadoop-common-dev mailing list archives

From "Alan Ho (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2439) Hadoop needs a better XML Input
Date Sat, 14 Feb 2009 01:17:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673445#action_12673445 ]

Alan Ho commented on HADOOP-2439:

The approach described doesn't use a StAX parser, and probably isn't going to be as robust
to failure or as extensible as one that does. If you look at the code, my patch lets you
specify the XML element name, namespace prefix, and namespace URI when identifying the
correct tag. My patch also makes it easier to massage the XML when reading in data.
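
To illustrate the kind of matching described above, here is a minimal sketch of picking out records with a StAX pull parser by element name and namespace URI. It is not the code from the attached patch: the class name StaxRecordSketch, the method extractRecords, and the example namespace are made up for illustration. It assumes the matched elements contain only text, and it ignores the namespace prefix, since the URI is what actually identifies a namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Hadoop or of the attached patch.
class StaxRecordSketch {

    // Returns the text of every element whose local name and namespace URI
    // both match. Pass "" as namespaceUri to match elements in no namespace.
    static List<String> extractRecords(String xml, String localName, String namespaceUri) {
        List<String> records = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    // getNamespaceURI() may return null for an element in no namespace.
                    String uri = reader.getNamespaceURI();
                    if (uri == null) {
                        uri = "";
                    }
                    if (localName.equals(reader.getLocalName()) && namespaceUri.equals(uri)) {
                        // Assumption: matched elements are text-only;
                        // getElementText() throws if it meets a child element.
                        records.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<dump xmlns:w=\"http://example.org/wiki\">"
                + "<w:title>Hadoop</w:title>"
                + "<title>not in the namespace</title>"
                + "<w:title>StAX</w:title>"
                + "</dump>";
        // Prints [Hadoop, StAX]: the unqualified <title> is skipped.
        System.out.println(extractRecords(xml, "title", "http://example.org/wiki"));
    }
}
```

Because the parser resolves prefixes to URIs itself, the same call keeps working if the input switches to a different prefix or a default namespace, which is hard to get right with plain string matching on tags.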

When I first tried to write an XML parser, I hacked something up along the lines of the
approach described earlier. But after trying to parse real-world data (e.g. a dump of Wikipedia),
I threw up my hands and decided to use a proper pull parser.

> Hadoop needs a better XML Input
> -------------------------------
>                 Key: HADOOP-2439
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2439
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.1
>            Reporter: Alan Ho
>            Priority: Minor
>         Attachments: HADOOP-2439Patch.patch
> Hadoop does not have a good XML parser for XML input. The XML parser in the streaming
> class is fairly difficult to work with and doesn't have proper test cases around it.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
