crunch-dev mailing list archives

From "Christian Tzolov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CRUNCH-491) Add an Xml File Source
Date Thu, 05 Feb 2015 14:12:36 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307241#comment-14307241 ]

Christian Tzolov edited comment on CRUNCH-491 at 2/5/15 2:12 PM:
-----------------------------------------------------------------

It is challenging to track the raw bytes read from the input file while using the high-level
InputStreamReader to read the decoded characters. The raw byte count is needed to manage
the FileSplit boundary logic (e.g. when in the middle of an XML element, keep reading
until the end tag even if this crosses the split end limit).

Some observations:
- The number of decoded characters read from the InputStreamReader cannot be used to compute
the number of raw bytes read from the input file, because in a variable-width encoding such
as UTF-8 a single character can occupy more than one byte.
- The InputStreamReader and the underlying StreamDecoder don’t expose the number of raw bytes
read from the input data.
- The InputStreamReader wraps the FSDataInputStream, which provides a getPos() method. Unfortunately
this method won't help because the InputStreamReader reads the input data (FSDataInputStream)
in chunks (8KB). The FSDataInputStream's position advances by 8KB before
a single character is returned by the InputStreamReader.
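
To make the first and third observations concrete, here is a minimal sketch in plain JDK code. It assumes a hypothetical PositionedStream class as a stand-in for FSDataInputStream (it only mimics getPos() by counting the bytes pulled from it); ReadAheadDemo and the helper method names are likewise made up for illustration and are not part of Crunch or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReadAheadDemo {

    /** Hypothetical stand-in for FSDataInputStream: tracks how many bytes were pulled from it. */
    static final class PositionedStream extends FilterInputStream {
        private long pos = 0;

        PositionedStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) pos++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) pos += n;
            return n;
        }

        long getPos() {
            return pos;
        }
    }

    /** Decodes exactly one character and reports how far the underlying stream has moved. */
    static long positionAfterFirstChar(int fileSize) {
        byte[] data = new byte[fileSize];
        Arrays.fill(data, (byte) 'x');
        PositionedStream raw = new PositionedStream(new ByteArrayInputStream(data));
        try (InputStreamReader reader = new InputStreamReader(raw, StandardCharsets.UTF_8)) {
            reader.read();       // consume a single decoded character
            return raw.getPos(); // the reader has already buffered far more than one byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of raw bytes the text occupies in the given charset. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String xml = "<a>\u00e9\u20ac</a>"; // same characters, different byte lengths per encoding
        System.out.println("chars:          " + xml.length());
        System.out.println("UTF-8 bytes:    " + byteLength(xml, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes: " + byteLength(xml, StandardCharsets.UTF_16BE));
        System.out.println("position after one char: " + positionAfterFirstChar(20000));
    }
}
```

The same string has one character count but a different byte length in each encoding, and the reported position after a single read() is well past the first byte, so neither value can be used to locate the split boundary.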

It seems that Pig’s XMLLoader implementation doesn't support encodings either.

To resolve this I've hacked the JDK InputStreamReader and StreamDecoder by exposing the count
of the bytes processed (see CrunchInputStreamReader#readBytesCount(), CrunchStreamDecoder#readBytesCount()).
(Does this violate any JDK license agreements?)
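
For readers who want to see what such a byte count has to report, below is a rough sketch. It is not the CrunchInputStreamReader from the patch: it swaps in a deliberately slow, byte-at-a-time decoder built on the public java.nio.charset.CharsetDecoder API, so that the count always matches the bytes actually consumed for the characters returned so far. The class names ByteCountingReaderSketch and ByteCountingReader are hypothetical; only the readBytesCount() name is borrowed from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCountingReaderSketch {

    /** Hypothetical reader that decodes one byte at a time and counts the bytes it consumed. */
    static final class ByteCountingReader {
        private final InputStream in;
        private final CharsetDecoder decoder;
        private final ByteBuffer undecoded = ByteBuffer.allocate(16);
        private final CharBuffer pending = CharBuffer.allocate(2);
        private long bytesRead = 0;

        ByteCountingReader(InputStream in, Charset charset) {
            this.in = in;
            this.decoder = charset.newDecoder();
            pending.flip(); // start with no pending characters
        }

        /** Returns the next decoded char, or -1 at end of input. */
        int read() throws IOException {
            if (pending.hasRemaining()) {
                return pending.get(); // second half of a surrogate pair
            }
            pending.clear();
            while (true) {
                int b = in.read();
                if (b < 0) {
                    pending.flip(); // leave the buffer empty for later calls
                    return -1;      // sketch: a truncated trailing sequence is dropped
                }
                bytesRead++;
                undecoded.put((byte) b);
                undecoded.flip();
                CoderResult result = decoder.decode(undecoded, pending, false);
                undecoded.compact(); // keep bytes of an incomplete sequence for the next round
                if (result.isError()) {
                    result.throwException();
                }
                if (pending.position() > 0) {
                    pending.flip();
                    return pending.get();
                }
            }
        }

        long readBytesCount() {
            return bytesRead;
        }
    }

    /** Raw byte offset reached after each character of the text has been read. */
    static long[] byteOffsetsAfterEachChar(String text, Charset charset) {
        ByteCountingReader reader =
                new ByteCountingReader(new ByteArrayInputStream(text.getBytes(charset)), charset);
        long[] offsets = new long[text.length()];
        try {
            for (int i = 0; i < offsets.length; i++) {
                reader.read();
                offsets[i] = reader.readBytesCount();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "a\u00e9\u20ac";
        System.out.println(Arrays.toString(byteOffsetsAfterEachChar(text, StandardCharsets.UTF_8)));
    }
}
```

Reading one byte per call is only reasonable over an already buffered stream, and even then it is much slower than a chunked decoder; the sketch trades speed for an exact position and avoids copying JDK sources.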

The new XmlRecordReaderTest verifies the correctness of the raw byte position and the FileSplit
range cases (e.g. reading through the split end when in the middle of an XML element).
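
The split rule being tested can be sketched as follows. This is a simplified illustration working directly on a byte array with plain byte matching for the tags, not the XmlRecordReader from the patch; SplitBoundarySketch, readSplit and indexOf are hypothetical names, and nested elements with the same tag name are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    /** Index of the first occurrence of pattern in data at or after from, or -1 if absent. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the elements owned by the split [splitStart, splitEnd): every element whose
     * start tag begins inside the split, read to its end tag even past splitEnd.
     */
    static List<String> readSplit(byte[] file, String startTag, String endTag,
                                  int splitStart, int splitEnd) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(file, open, pos);
            // An element belongs to this split only if its start tag begins before the split end.
            if (begin < 0 || begin >= splitEnd) {
                break;
            }
            // The search for the end tag is deliberately NOT bounded by splitEnd.
            int end = indexOf(file, close, begin + open.length);
            if (end < 0) {
                break; // unterminated element: nothing more to emit
            }
            pos = end + close.length;
            records.add(new String(file, begin, pos - begin, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "<r><e>1</e><e>22</e><e>333</e></r>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, "<e>", "</e>", 0, 15));
        System.out.println(readSplit(file, "<e>", "</e>", 15, file.length));
    }
}
```

With the split boundary placed in the middle of the second element, the first split still returns that element whole and the second split skips it, so every element is emitted exactly once. All comparisons are on byte offsets, which is why a character-based reader has to report its raw byte position.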

[~jwills], [~champgm] I'd appreciate it if you could review those changes.
Do you think the encoding capabilities justify the additional complexity? Or should we revert
to the original Mahout XmlInputFormat (i.e. no encoding support)?





> Add an Xml File Source
> ----------------------
>
>                 Key: CRUNCH-491
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-491
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491-1.patch, CRUNCH-491.patch, CRUNCH-491b.patch, CRUNCH-491c.patch, CRUNCH-491d.patch
>
>
> Large XML documents that are composed of repetitive XML elements can be broken into chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract the XML between the pre-configured start / end tags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
