flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10134) UTF-16 support for TextInputFormat
Date Wed, 26 Sep 2018 02:43:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628183#comment-16628183

ASF GitHub Bot commented on FLINK-10134:

XuQianJin-Stars commented on issue #6710: [FLINK-10134] UTF-16 support for TextInputFormat
bug fixed
URL: https://github.com/apache/flink/pull/6710#issuecomment-424565714
   @StephanEwen Hello, regarding the two questions you raised yesterday, I have some thoughts of my own, though I am not sure they are right.
   1. Where should the BOM be read? I think we still need logic that handles the BOM at the very beginning of the file when it is opened. We could add an attribute that records the BOM encoding detected for the file, for example in the `createInputSplits` function.
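The BOM sniffing described above could look roughly like the following (my own sketch, not Flink code; `BomSniffer` and `detectBomCharset` are hypothetical names):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical helper: peeks at the first bytes of a stream and maps a
 *  recognized byte order mark to a concrete charset name (null if none). */
public class BomSniffer {
    public static String detectBomCharset(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = in.read(head, 0, 4);
        String charset = null;
        int bomLen = 0;
        // UTF-32 BOMs must be tested before UTF-16, since FF FE is a prefix of FF FE 00 00.
        if (n >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == 0x00 && head[3] == 0x00) {
            charset = "UTF-32LE"; bomLen = 4;
        } else if (n >= 4 && head[0] == 0x00 && head[1] == 0x00
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) {
            charset = "UTF-32BE"; bomLen = 4;
        } else if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            charset = "UTF-8"; bomLen = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = "UTF-16LE"; bomLen = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = "UTF-16BE"; bomLen = 2;
        }
        // Push back the bytes that were not part of a BOM so the caller
        // starts reading at the first real payload byte.
        if (n > 0) {
            in.unread(head, bomLen, n - bomLen);
        }
        return charset;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // UTF-16LE BOM + "hi"
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 4);
        System.out.println(detectBomCharset(in)); // prints UTF-16LE
    }
}
```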
   2. Regarding the second, performance-related question: we can use the previously detected BOM to distinguish UTF-8 with BOM, UTF-16 with BOM, and UTF-32 with BOM, and use the code-unit size in bytes to handle the end of each line correctly. I found that the earlier garbled-output bug was really an encoding problem, caused in part by improper handling of the trailing bytes of each line. I have done the following for this problem:
```java
// Derive the code-unit size (in bytes) from the configured charset.
String utf8 = "UTF-8";
String utf16 = "UTF-16";
String utf32 = "UTF-32";
int stepSize = 0;
String charsetName = this.getCharsetName();
if (charsetName.contains(utf8)) {
    stepSize = 1;
} else if (charsetName.contains(utf16)) {
    stepSize = 2;
} else if (charsetName.contains(utf32)) {
    stepSize = 4;
}

// Check if \n is used as delimiter and the end of this line is a \r,
// then remove the \r (stepSize bytes wide) from the line.
if (this.getDelimiter() != null && this.getDelimiter().length == 1
        && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
        && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
    numBytes -= stepSize;
}
return new String(bytes, offset, numBytes, this.getCharsetName());
```
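As a quick sanity check of the stepSize idea (my own illustration, not part of the PR), the byte width of `\r` really does track the code-unit size of the encoding:

```java
import java.nio.charset.Charset;

public class CarriageReturnWidth {
    public static void main(String[] args) {
        // '\r' occupies one byte in UTF-8, two in UTF-16, four in UTF-32,
        // matching the stepSize values used to strip a trailing \r.
        System.out.println("\r".getBytes(Charset.forName("UTF-8")).length);    // 1
        System.out.println("\r".getBytes(Charset.forName("UTF-16BE")).length); // 2
        System.out.println("\r".getBytes(Charset.forName("UTF-32BE")).length); // 4
    }
}
```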
   These are some of my own ideas. I hope you can offer better suggestions so this JIRA issue can be handled well. Thank you.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> UTF-16 support for TextInputFormat
> ----------------------------------
>                 Key: FLINK-10134
>                 URL: https://issues.apache.org/jira/browse/FLINK-10134
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.4.2
>            Reporter: David Dreyfus
>            Priority: Blocker
>              Labels: pull-request-available
> It does not appear that Flink supports a charset encoding of "UTF-16". In particular,
it doesn't appear that Flink consumes the Byte Order Mark (BOM) to establish whether a UTF-16
file is UTF-16LE or UTF-16BE.
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(), which sets
TextInputFormat.charsetName and then modifies the previously set delimiterString to construct
the proper byte-string encoding of the delimiter. This same charsetName is also used in
TextInputFormat.readRecord() to interpret the bytes read from the file.
> There are two problems that this implementation would seem to have when using UTF-16.
>  # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will return a
Big Endian byte sequence including the Byte Order Mark (BOM). The actual text file will not
contain a BOM at each line ending, so the delimiter will never be read. Moreover, if the actual
byte encoding of the file is Little Endian, the bytes will be interpreted incorrectly.
>  # TextInputFormat.readRecord() will not see a BOM each time it decodes a byte sequence
with the String(bytes, offset, numBytes, charset) call. Therefore, it will assume Big Endian,
which may not always be correct. [1] [https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]
> While there are likely many solutions, I would think that all of them would have to start
by reading the BOM from the file when a Split is opened and then using that BOM to modify
the specified encoding to a BOM specific one when the caller doesn't specify one, and to overwrite
the caller's specification if the BOM is in conflict with the caller's specification. That
is, if the BOM indicates Little Endian and the caller indicates UTF-16BE, Flink should rewrite
the charsetName as UTF-16LE.
>  I hope this makes sense and that I haven't been testing incorrectly or misreading the code.
> I've verified the problem on version 1.4.2. I believe the problem exists on all versions. 
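
The delimiter-encoding pitfall described in point 1 of the quoted report is easy to reproduce with the JDK alone (my own illustration, not Flink code):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DelimiterBytesDemo {
    public static void main(String[] args) {
        // "UTF-16" without an explicit byte order: the JDK encoder emits a
        // big-endian BOM (0xFE 0xFF) before the payload bytes, so the
        // delimiter byte sequence will never match what is in the file.
        byte[] ambiguous = "\n".getBytes(Charset.forName("UTF-16"));
        System.out.println(Arrays.toString(ambiguous)); // [-2, -1, 0, 10]

        // An explicit byte order produces just the code unit, no BOM.
        byte[] be = "\n".getBytes(Charset.forName("UTF-16BE"));
        byte[] le = "\n".getBytes(Charset.forName("UTF-16LE"));
        System.out.println(Arrays.toString(be)); // [0, 10]
        System.out.println(Arrays.toString(le)); // [10, 0]
    }
}
```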

This message was sent by Atlassian JIRA
