crunch-dev mailing list archives

From "Christian Tzolov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-491) Add an Xml File Source
Date Thu, 05 Feb 2015 16:14:37 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307478#comment-14307478 ]

Christian Tzolov commented on CRUNCH-491:
-----------------------------------------

I am not sure I understand the position jump/note. But I know that you can't change the
split boundaries from inside the RecordReader, as the latter runs within the Map task;
otherwise you would have to synchronize this change with all the running mappers.
 
The CharsetEncoder, though, seems to be able to recompute the original byte size for a given
char array (something I had assumed was impossible). Using it I was able to resolve the problem
(or at least to make the tests pass) with minimal intervention (see CRUNCH-491f). Thanks Mac!
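
As a rough illustration of the CharsetEncoder idea (a minimal sketch, not the code from the
patch; the class and method names EncodedLength and byteLength are made up here), the byte
size of a char range can be recomputed by encoding it and counting the resulting bytes:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Crunch or of the attached patch.
final class EncodedLength {

    // Returns how many bytes chars[offset .. offset+length) occupy once encoded.
    static int byteLength(CharsetEncoder encoder, char[] chars, int offset, int length) {
        try {
            // encode(CharBuffer) resets the encoder and returns a buffer whose
            // position is 0 and whose limit follows the last byte written.
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(chars, offset, length));
            return bytes.remaining();
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input: surface it as an unchecked exception.
            throw new IllegalArgumentException("Cannot encode input", e);
        }
    }

    public static void main(String[] args) {
        char[] chars = "h\u00e9llo".toCharArray(); // 5 chars, one of them non-ASCII
        System.out.println(byteLength(StandardCharsets.UTF_8.newEncoder(), chars, 0, chars.length));
        System.out.println(byteLength(StandardCharsets.ISO_8859_1.newEncoder(), chars, 0, chars.length));
    }
}
```

The encoder has to match the charset the file was actually read with; otherwise the computed
byte count will not line up with the positions in the underlying stream.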

> Add an Xml File Source
> ----------------------
>
>                 Key: CRUNCH-491
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-491
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491-1.patch, CRUNCH-491.patch, CRUNCH-491b.patch, CRUNCH-491c.patch,
>                      CRUNCH-491d.patch, CRUNCH-491f.patch
>
>
> Large XML documents that are composed of repetitive XML elements can be broken into
> chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract the XML between the pre-configured
> start / end tags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
