drill-issues mailing list archives

From "Steven Phillips (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3712) Drill does not recognize UTF-16-LE encoding
Date Thu, 08 Oct 2015 21:50:26 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949453#comment-14949453 ]

Steven Phillips commented on DRILL-3712:

What convert_from function did you use?

> Drill does not recognize UTF-16-LE encoding
> -------------------------------------------
>                 Key: DRILL-3712
>                 URL: https://issues.apache.org/jira/browse/DRILL-3712
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.1.0
>         Environment: OSX, likely Linux. 
>            Reporter: Edmon Begoli
>             Fix For: Future
> We are unable to process files that OSX identifies as character set UTF16LE. After
unzipping and converting to UTF8, we are able to process one without problems. There are
CONVERT_TO and CONVERT_FROM commands that appear to address the issue, but we were unable to
make them work on a gzipped or unzipped version of the UTF16 file. We were able to use
CONVERT_FROM OK, but when we tried to wrap its result in a cast to a date, or anything else,
it failed. Working with the data natively exposed its double-byte nature (a substring 1,4
returns only the first two characters).
> I cannot post the data because it is proprietary in nature, but I am posting this code
that might be useful in re-creating the issue:
> {noformat}
> #!/usr/bin/env python
> """ Generates a test psv file with some text fields encoded as UTF-16-LE. """
> def write_utf16le_encoded_psv():
> 	total_lines = 10
> 	# Bytes literals keep this working on both Python 2.7 and Python 3.
> 	encoded = u"Encoded B".encode("utf-16-le")
> 	with open("test.psv", "wb") as csv_file:
> 		csv_file.write(b"header 1|header 2|header 3\n")
> 		for i in range(total_lines):
> 			index = str(i).encode("ascii")
> 			csv_file.write(b"value A" + index + b"|" + encoded + b"|value C" + index + b"\n")
> if __name__ == "__main__":
> 	write_utf16le_encoded_psv()
> {noformat}
> then:
> tar zcvf test.psv.tar.gz test.psv
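
The double-byte symptom described in the report (a substring 1,4 returning only two visible characters) can be illustrated outside Drill. The following is a minimal Python sketch, not Drill's actual code path: it assumes a reader that treats every byte as one character, modeled here with latin-1. The helper name and the sample value are hypothetical and do not come from the reporter's data.

```python
def visible_prefix_single_byte(raw, n):
    """Take the first n 'characters' the way a byte-per-character reader would,
    then drop the NUL padding bytes so only the visible characters remain."""
    prefix = raw[:n].decode("latin-1")  # latin-1 maps each byte to exactly one character
    return prefix.replace("\x00", "")


sample = "2015-10-08"             # hypothetical value, not from the reporter's data
raw = sample.encode("utf-16-le")  # each ASCII character becomes two bytes: char, then 0x00

print(len(sample), len(raw))               # 10 20
print(visible_prefix_single_byte(raw, 4))  # 20
print(raw.decode("utf-16-le")[:4])         # 2015
```

Under that assumption, the first four bytes hold only two real characters, which matches the behaviour in the report. Decoding as UTF-16-LE before taking the substring gives the expected four characters.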

This message was sent by Atlassian JIRA
