hive-dev mailing list archives

From "Ashutosh Chauhan (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-2859) STRING data corruption in internationalized data -- based on LANG env variable
Date Fri, 20 Apr 2012 19:40:37 GMT

     [ https://issues.apache.org/jira/browse/HIVE-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HIVE-2859:
-----------------------------------

    Affects Version/s: 0.9.0
                       0.8.0
                       0.8.1
        Fix Version/s:     (was: 0.9.0)
                           (was: 0.7.1)

Unlinking from 0.9
                
> STRING data corruption in internationalized data -- based on LANG env variable
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-2859
>                 URL: https://issues.apache.org/jira/browse/HIVE-2859
>             Project: Hive
>          Issue Type: Bug
>          Components: Configuration, Import/Export, Serializers/Deserializers, Types
>    Affects Versions: 0.7.1, 0.8.0, 0.8.1, 0.9.0
>         Environment: Windows / RHEL5 with LANG = en_US.CP1252
>            Reporter: John Gordon
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> This is a bug in Hive that is exacerbated by porting it to Windows without Cygwin.
 Hive assumes that the JVM's default file.encoding is UTF-8.  Roughly 6-7 getBytes() and
write() calls don't specify an encoding and therefore use the JVM default.  The rest specify UTF-8
explicitly, which blocks auto-detection of UTF-16 data in files with a BOM present.  Because of this mix
of explicit encodings and default-encoding assumptions, Hive only works correctly in a JVM
whose default encoding is UTF-8.
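A minimal Java sketch of the mismatch described above; it is not taken from the Hive source. The JVM default cannot be changed at runtime, so the hypothetical helper encodeWithAssumedDefault passes windows-1252 explicitly to stand in for what a no-argument getBytes() call would do in a JVM started with LANG=en_US.CP1252. The class and method names are illustrative only, and StandardCharsets assumes Java 7 or later (on older JVMs, Charset.forName("UTF-8") does the same job).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultEncodingMismatch {
    // Assumption: stands in for the JVM default under LANG=en_US.CP1252.
    static final Charset ASSUMED_DEFAULT = Charset.forName("windows-1252");

    // What a call site that names its encoding produces.
    static byte[] encodeExplicitUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // What s.getBytes() would produce if ASSUMED_DEFAULT were the JVM default.
    static byte[] encodeWithAssumedDefault(String s) {
        return s.getBytes(ASSUMED_DEFAULT);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent, escaped so the source file's own encoding is irrelevant.
        String s = "caf\u00e9";
        byte[] utf8 = encodeExplicitUtf8(s);
        byte[] cp1252 = encodeWithAssumedDefault(s);
        System.out.println(utf8.length);   // 5: U+00E9 takes two bytes in UTF-8
        System.out.println(cp1252.length); // 4: U+00E9 is the single byte 0xE9 in CP1252
        System.out.println(Arrays.equals(utf8, cp1252)); // false
    }
}
```

The same string yields different bytes depending on which call site wrote it, which is why a mix of explicit and default encodings is only safe when the default happens to be UTF-8.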
>  
> When the JVM starts up, it derives the default encoding from the C runtime setlocale()
call.  On Linux/Unix, this uses the LANG env variable (which is almost always <locale>.UTF-8
for machines handling internationalized data, but is not guaranteed to be).  On Windows, it
is derived from the user's language settings, which currently cannot yield a UTF-8 encoding.
 So there is no environment setting on Windows that would reliably cause the JVM
to set the default encoding to UTF-8 on startup without additional
options.
> However, there are two feasible options:
> 1.) The JVM has a startup option, -Dfile.encoding=UTF-8, which should explicitly override
the JVM's default encoding detection and make it always UTF-8 regardless of
the environment configuration.  This would make all deployments on all OS/environment configs
behave consistently.  I don't know where Hive sets the JVM options we use when it starts the
service.
> 2.) We could add "UTF-8" explicitly to all the remaining getBytes() calls that need it,
and make all the string I/O explicitly UTF-8 encoded.  This is probably being changed right
now as part of HIVE-1505, so we would duplicate effort and maybe make that change harder.
 It seems easier to make the JVM behave as if it were on a well-configured machine with respect to default
encoding than to set explicit encodings everywhere.
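The two options above can be sketched in Java as follows. This is an illustration under stated assumptions, not Hive code: the class Utf8Options and its methods are hypothetical names. For option 1, main prints the effective default so that one can confirm whether -Dfile.encoding=UTF-8 took effect; for option 2, writeUtf8 and readUtf8 name the encoding on the stream wrappers instead of relying on the default. The sketch assumes Java 8 or later (try-with-resources and UncheckedIOException).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Options {
    // Option 1 check: true if the given charset is UTF-8.
    static boolean isUtf8(Charset cs) {
        return StandardCharsets.UTF_8.equals(cs);
    }

    // Option 2: write text with an explicit encoding instead of the JVM default.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Option 2: read text back with the same explicit encoding.
    static String readUtf8(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + def.name());
        if (!isUtf8(def)) {
            System.err.println("Warning: default charset is not UTF-8; calls without an explicit encoding may corrupt data");
        }
        String s = "\u65e5\u672c\u8a9e"; // three Japanese characters, escaped for portability
        System.out.println(readUtf8(writeUtf8(s)).equals(s)); // true
    }
}
```

Running main with and without -Dfile.encoding=UTF-8 shows what the JVM actually picked up, while the explicit-encoding helpers behave the same either way.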
>  
> So:
> - Pretty much any globalized strings other than Western European ones are going to be corrupted
in the current Hive service on Windows with this bug present, because there really isn't a
way to have the JVM read the environment and determine that UTF-8 should be the
default encoding.
> - Anyone can repro this on Linux fairly easily: add "export LANG=en_US.CP1252" to /etc/profile
to set the global LANG default encoding to CP1252 explicitly, then restart the service
and run a query over internationalized UTF-8 data.
> - We shouldn't rely on the JVM's default codepage selection if we want to support UTF-8 consistently
and reliably as the default encoding.
> - The estimate can vary widely, but adding an explicit default encoding on startup
should, in theory, only take a little while if you know where to do it.
> - I don't know where to update the JVM start arguments when the service
is started; I am just getting into the code for the first time with this bug investigation.
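The kind of corruption described in the list above can be reproduced in isolation, without Hive or a changed LANG setting. The following Java sketch is an assumption-laden illustration: windows-1252 is passed explicitly to play the role of the JVM default, and the class and method names are hypothetical. It shows both failure modes: UTF-8 bytes decoded as CP1252 turn into mojibake, and text with no CP1252 mapping is replaced by '?' when encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCorruptionRepro {
    // Assumption: stands in for the JVM default under LANG=en_US.CP1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // UTF-8 bytes read by code that assumes the CP1252 default.
    static String decodeUtf8BytesAsCp1252(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Text pushed through a CP1252 encode/decode pair, as default-encoding calls would do.
    static String roundTripThroughCp1252(String original) {
        return new String(original.getBytes(CP1252), CP1252);
    }

    public static void main(String[] args) {
        String western = "\u00e9";              // e with acute accent
        String japanese = "\u65e5\u672c\u8a9e"; // three Japanese characters

        // Bytes 0xC3 0xA9 become the two characters U+00C3 and U+00A9.
        System.out.println(decodeUtf8BytesAsCp1252(western).equals("\u00c3\u00a9")); // true

        // Western European text survives a pure CP1252 round trip.
        System.out.println(roundTripThroughCp1252(western).equals(western)); // true

        // Characters with no CP1252 mapping are replaced by '?' and cannot be recovered.
        System.out.println(roundTripThroughCp1252(japanese).equals("???")); // true
    }
}
```

The booleans are printed instead of the strings themselves so that the console's own encoding does not affect what is shown.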

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
