beam-commits mailing list archives

From "Damien GOUYETTE (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-2060) XmlSource use harcoded Charset
Date Mon, 24 Apr 2017 13:36:04 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Damien GOUYETTE updated BEAM-2060:
----------------------------------
    Description: 
When I read a file encoded in ISO-8859-1 that contains the character *é*, I get an exception like:

{code}
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
	at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
	at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
	... 19 more
{code}
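
The failure can be reproduced outside Beam. The following is a minimal sketch, assuming the *é* in the file is followed by an ASCII letter such as *d* (0x64, the byte named in the message above). In ISO-8859-1, *é* is the single byte 0xE9, which a UTF-8 decoder reads as the start of a three-byte sequence, so the next byte is rejected. The class and method names are illustrative only, and the sketch uses the JDK's strict decoder rather than the Woodstox reader from the stack trace.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    /** Returns true only if the bytes form well-formed UTF-8 (strict decoding, no replacement). */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Raised when a lead byte such as 0xE9 is not followed by continuation bytes.
            return false;
        }
    }

    public static void main(String[] args) {
        // "c", "é", "d" encoded as ISO-8859-1: bytes 0x63 0xE9 0x64.
        byte[] latin1 = "c\u00e9d".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("decoded as ISO-8859-1: "
                + new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

The same bytes decode cleanly once the decoder is told they are ISO-8859-1, which is why a configurable charset would resolve the problem.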

The encoding is hardcoded in XmlSource at these locations:

https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342


It would be great if I could specify it like this:
{code}
XmlSource.from[MyClass](input)
      .withRootElement("ROOT_ELEMENT")
      .withRecordElement("MyClass")
      .withRecordClass(classOf[MyClass])
      .withCharset(StandardCharsets.ISO_8859_1)
{code}
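
As a rough illustration of what honoring such an option could involve, the sketch below decodes the input through a `Reader` built with a caller-supplied `Charset` before handing it to a StAX parser. It is not XmlSource's actual code: `ConfigurableCharsetXmlDemo` and `readText` are hypothetical names, the parser is the JDK's built-in StAX implementation rather than Woodstox, and `withCharset` above remains a proposed method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ConfigurableCharsetXmlDemo {

    /**
     * Hypothetical helper: parses the XML bytes using the given charset and
     * returns the concatenated character data of all elements.
     */
    static String readText(byte[] xmlBytes, Charset charset) {
        // The charset is a parameter here instead of a hardcoded constant.
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset)) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(reader);
            StringBuilder text = new StringBuilder();
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(xml.getText());
                }
            }
            xml.close();
            return text.toString();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample document with an "é", encoded as ISO-8859-1.
        String doc = "<ROOT_ELEMENT><MyClass>caf\u00e9</MyClass></ROOT_ELEMENT>";
        byte[] latin1 = doc.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(readText(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Passing an already-decoding `Reader` keeps the charset choice with the caller, which is one possible way a `withCharset` option could be threaded through to the parser.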

I can provide a pull request if you want.

  was:
When I read a file encoded in ISO-8859-1 that contains the character *é*, I get an exception like:

{code}
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
	at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
	at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
	... 19 more
{code}

The encoding is hardcoded in XmlSource at these locations:

https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342


It would be great if I could specify it like this:
{code}
XmlSource.from[MyClass](input)
      .withRootElement("ROOT_ELEMENT")
      .withRecordElement("MyClass")
      .withRecordClass(classOf[MyClass])
      .withCharset(StandardCharsets.ISO_8859_1)
{code}


> XmlSource use harcoded Charset
> ------------------------------
>
>                 Key: BEAM-2060
>                 URL: https://issues.apache.org/jira/browse/BEAM-2060
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 0.6.0
>            Reporter: Damien GOUYETTE
>            Assignee: Davor Bonaci



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
