drill-issues mailing list archives

From "Edmon Begoli (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-3712) Drill does not recognize UTF-16-LE encoding
Date Wed, 26 Aug 2015 04:43:45 GMT
Edmon Begoli created DRILL-3712:
-----------------------------------

             Summary: Drill does not recognize UTF-16-LE encoding
                 Key: DRILL-3712
                 URL: https://issues.apache.org/jira/browse/DRILL-3712
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Text & CSV
    Affects Versions: 1.1.0
         Environment: OSX, likely Linux. 
            Reporter: Edmon Begoli
            Assignee: Steven Phillips


We are unable to process files that OSX identifies as having the UTF-16LE character set.  After
unzipping and converting to UTF-8, we can process such a file without problems.  There are
CONVERT_TO and CONVERT_FROM functions that appear to address the issue, but we were unable to
make them work on a gzipped or unzipped version of the UTF-16 file.  We were able to use
CONVERT_FROM by itself, but when we tried to wrap its result in a cast to a date, or anything
else, it failed.  Trying to work with the data natively exposed its double-byte nature (a
substring 1,4 returns only the first two characters).
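
The substring symptom can be sketched in a few lines of Python. This is only an illustration
of the byte arithmetic, assuming the reader treats each byte as one character; that is a guess
at the cause, not confirmed Drill behavior, and the field value is hypothetical:

```python
# Illustrative only: models a reader that treats every byte as one character.
text = u"2015-08-26"                      # hypothetical field value, not real data
raw = text.encode("utf-16-le")            # bytes as they would sit in a UTF-16-LE file

# A byte-oriented substring(1, 4) covers 4 bytes, which is only 2 real characters.
first_four_bytes = raw[:4]
visible = first_four_bytes.decode("utf-16-le")

# Decoding first and re-encoding as UTF-8 gives one byte per ASCII character again.
as_utf8 = raw.decode("utf-16-le").encode("utf-8")
```

Here each ASCII character occupies two bytes in UTF-16-LE, so a four-byte window holds two
characters, which matches what we saw; after conversion to UTF-8 the same window holds four.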

I cannot post the data because it is proprietary, but I am posting this code, which
might be useful in reproducing the issue:


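#!/usr/bin/env python
""" Generates a test psv file with some text fields encoded as UTF-16-LE. """
def write_utf16le_encoded_psv():
    total_lines = 10
    # Only the middle field is UTF-16-LE; headers and other fields are plain ASCII.
    encoded = u"Encoded B".encode("utf-16-le")
    # Binary mode with bytes throughout, so this runs on both Python 2 and Python 3.
    with open("test.psv", "wb") as csv_file:
        csv_file.write(b"header 1|header 2|header 3\n")
        for i in range(total_lines):
            prefix = ("value A%d|" % i).encode("ascii")
            suffix = ("|value C%d\n" % i).encode("ascii")
            csv_file.write(prefix + encoded + suffix)

if __name__ == "__main__":
    write_utf16le_encoded_psv()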


then:

tar zcvf test.psv.tar.gz test.psv

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
