crunch-dev mailing list archives

From "Christian Tzolov (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-491) Add an Xml File Source
Date Tue, 27 Jan 2015 10:24:35 GMT


Christian Tzolov commented on CRUNCH-491:

XmlSource implementation (re)uses the XmlInputFormat code from the Mahout project:

From what I can see, the XmlInputFormat is designed to process XML files with a large number
of repetitive XML chunks, where each chunk is small enough to fit in memory.

Some observations regarding XmlInputFormat:

1. Performs exact string matching (no regular expressions allowed)
2. Manages XML elements split across multiple HDFS splits.
3. Retains the last inner chunk/element in memory, which could lead to OOM.
4. Can't handle empty (self-closing) XML elements: <boza>…</boza> … <BOZA attr="value1"
/> … <boza> … </boza>
5. Can't handle nested elements with the same name: <record> … <record> …
</record> … </record>
6. Will not handle mixed XML namespace syntaxes (for example, elements from the default NS
that are used with and without prefixes in the same document)
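
To make observations 1, 2, 4 and 5 concrete, here is a minimal, self-contained sketch of
exact-tag chunking. It is not the Mahout code: the class ExactTagChunker and its methods are
hypothetical, it works on an in-memory String rather than an HDFS stream, and it assumes a
chunk belongs to the split in which its start tag begins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the Mahout XmlInputFormat.
final class ExactTagChunker {

    // Emits every chunk whose start tag begins inside [splitStart, splitEnd).
    // The end tag may lie beyond splitEnd, so a chunk that straddles two
    // splits is emitted once, by the split that owns its start tag.
    static List<String> chunksInSplit(String xml, String startTag, String endTag,
                                      int splitStart, int splitEnd) {
        List<String> out = new ArrayList<>();
        int pos = splitStart;
        while (pos < splitEnd) {
            // Exact, case-sensitive string match; no regular expressions.
            int start = xml.indexOf(startTag, pos);
            if (start < 0 || start >= splitEnd) {
                break; // no more start tags owned by this split
            }
            // The first end tag after the start tag closes the chunk,
            // regardless of nesting depth.
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated chunk, nothing to emit
            }
            end += endTag.length();
            out.add(xml.substring(start, end)); // the whole chunk is held in memory
            pos = end;
        }
        return out;
    }

    // Convenience overload: treat the whole document as a single split.
    static List<String> chunks(String xml, String startTag, String endTag) {
        return chunksInSplit(xml, startTag, endTag, 0, xml.length());
    }

    public static void main(String[] args) {
        // Flat, repetitive elements: the supported case.
        System.out.println(chunks("<r>a</r><r>b</r>", "<r>", "</r>"));
        // Nested same-name elements: the chunk is cut at the inner end tag.
        System.out.println(chunks("<r>x<r>y</r>z</r>", "<r>", "</r>"));
        // Self-closing element with a prefix start tag: it swallows the next element.
        System.out.println(chunks("<r a=\"1\"/><r>b</r>", "<r", "</r>"));
    }
}
```

With "<r>" as the start tag the self-closing element is skipped silently, and with the
prefix "<r" it is merged into the following element. Either way the result is wrong, which
is why handling 4 and 5 properly would need more than plain string matching.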

I can think of solutions for 4 and 5, but I am not convinced they are generic enough to be
worth complicating the existing implementation.

My (scrum-minded) approach would be to commit the current solution and address the other use
cases only if a real issue pops up.

> Add an Xml File Source
> ----------------------
>                 Key: CRUNCH-491
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491.patch, CRUNCH-491b.patch
> Large XML documents that are composed of repetitive XML elements can be broken into
chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract out the XML between the pre-configured
start / end tags.

This message was sent by Atlassian JIRA
