crunch-dev mailing list archives

From "mac champion (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-491) Add an Xml File Source
Date Thu, 05 Feb 2015 14:51:36 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307323#comment-14307323 ]

mac champion commented on CRUNCH-491:
-------------------------------------

To calculate splits in the CSV input format, I jumped a set number of bytes into the file
(the desired split size) and then read character by character until I reached the end of
the current record. I noted that location as the split point. While reading the characters,
I kept track of their raw byte size with this method:

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVLineReader.java#L417

This uses a java.nio.charset.CharsetEncoder (configured with the file's encoding in the class's
constructor) to figure out how many raw bytes each character would have actually taken up
in the file. In this way, the reader can keep track of how many bytes of the file it has consumed.
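
To make the idea concrete, here is a minimal sketch of that byte-counting technique. It is
not the code at the link above: the class name EncodedByteCounter and its methods are
hypothetical, and it assumes a charset that does not emit a byte-order mark on every encode
call (UTF-16, for example, would need extra care).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not part of Crunch.
final class EncodedByteCounter {
  private final CharsetEncoder encoder;
  private long bytesConsumed;

  // The encoder is configured once with the file's encoding.
  EncodedByteCounter(Charset charset) {
    this.encoder = charset.newEncoder();
  }

  // Re-encodes one code point to learn how many raw bytes it occupied
  // in the file, and adds that size to the running total.
  int count(int codePoint) {
    CharBuffer chars = CharBuffer.wrap(Character.toChars(codePoint));
    try {
      ByteBuffer encoded = encoder.encode(chars);
      int size = encoded.remaining();
      bytesConsumed += size;
      return size;
    } catch (CharacterCodingException e) {
      // Raised when the charset cannot represent the character.
      throw new IllegalArgumentException("Cannot encode code point " + codePoint, e);
    }
  }

  // Total number of raw bytes accounted for so far.
  long bytesConsumed() {
    return bytesConsumed;
  }
}
```

With a UTF-8 counter, feeding 'a', 'é' and '€' adds 1, 2 and 3 bytes respectively, so a
reader that calls count() for every character it consumes can report the byte offset at
which the current record ends.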


Can you do something similar?

> Add an Xml File Source
> ----------------------
>
>                 Key: CRUNCH-491
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-491
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491-1.patch, CRUNCH-491.patch, CRUNCH-491b.patch, CRUNCH-491c.patch, CRUNCH-491d.patch
>
>
> Large XML documents that are composed of repetitive XML elements can be broken into chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract the XML between the pre-configured start / end tags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
