hive-dev mailing list archives

From "Ashutosh Chauhan (Updated) (JIRA)" <>
Subject [jira] [Updated] (HIVE-2859) STRING data corruption in internationalized data -- based on LANG env variable
Date Fri, 20 Apr 2012 19:40:37 GMT


Ashutosh Chauhan updated HIVE-2859:

    Affects Version/s: 0.9.0
        Fix Version/s:     (was: 0.9.0)
                           (was: 0.7.1)

Unlinking from 0.9
> STRING data corruption in internationalized data -- based on LANG env variable
> ------------------------------------------------------------------------------
>                 Key: HIVE-2859
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Configuration, Import/Export, Serializers/Deserializers, Types
>    Affects Versions: 0.7.1, 0.8.0, 0.8.1, 0.9.0
>         Environment: Windows / RHEL5 with LANG = en_US.CP1252
>            Reporter: John Gordon
>   Original Estimate: 6h
>  Remaining Estimate: 6h
> This is a bug in Hive that is exacerbated by replatforming it to Windows without Cygwin.
Basically, Hive assumes that the default file.encoding is UTF-8. Roughly 6-7 getBytes() and
write() calls do not specify an encoding. The rest specify UTF-8 explicitly, which blocks
auto-detection of UTF-16 data in files with a BOM present. The mix of explicit encodings and
default-encoding assumptions means that Hive must be run in a JVM whose default encoding is
UTF-8 and only UTF-8.
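
To illustrate the failure mode described above, here is a minimal sketch (not taken from the
Hive code base) of what happens when UTF-8 bytes are decoded with a different charset, as a
JVM whose default encoding is CP1252 would do. The class name EncodingMismatchDemo and the
helper decodeUtf8BytesAs are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Hypothetical helper: encode text as UTF-8 (what the data file holds),
    // then decode those bytes with the charset the reader assumes.
    public static String decodeUtf8BytesAs(String text, Charset assumed) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, assumed);
    }

    public static void main(String[] args) {
        String original = "Gr\u00fc\u00dfe"; // German "Gruesse" with u-umlaut and sharp s
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // Matching charsets: the text survives the round trip.
        System.out.println("decoded as UTF-8:  "
                + decodeUtf8BytesAs(original, StandardCharsets.UTF_8));
        // Mismatched charsets: each two-byte UTF-8 sequence becomes two wrong characters.
        System.out.println("decoded as CP1252: "
                + decodeUtf8BytesAs(original, cp1252));
    }
}
```

A call such as new String(bytes) or text.getBytes() without a charset argument behaves like
the mismatched case whenever the JVM default is not UTF-8.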
> When the JVM starts up, it derives the default encoding from the C runtime setlocale()
call. On Linux/Unix, this uses the LANG env variable (which is almost always <locale>.UTF8
for machines handling internationalized data, but is not guaranteed to be so). On Windows, it
is derived from the user's language settings, and currently cannot return a UTF-8 encoding.
So there isn't an environment setting for Windows that would reliably provide the JVM with
a set of inputs to cause it to set the default encoding to UTF-8 on startup without additional
> However, there are 2 feasible options: 
> 1.) The JVM has a startup option, -Dfile.encoding=UTF-8, which should explicitly override
the default encoding detection behavior in the JVM to make it always UTF-8 regardless of
the environmental configuration. This would make all deployments on all OS/environment configs
behave consistently. I don't know where Hive sets the JVM options we use when it starts the
> 2.) We could add "UTF8" explicitly to all the remaining getBytes() calls that need it,
and make all the string I/O explicitly UTF-8 encoded. This is probably being changed right
now as part of HIVE-1505, so we would duplicate effort and maybe make that change harder.
It seems easier to trick the JVM into behaving as if it were on a well-configured machine with
respect to default encoding than to set explicit encodings everywhere.
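
As a rough sketch of what option 2 would look like in code, the example below passes the
charset explicitly to getBytes() and to the writer instead of relying on the JVM default.
It is not taken from Hive; ExplicitUtf8Io, toBytes and writeText are hypothetical names, and
StandardCharsets assumes Java 7 or later (on older JVMs the string form getBytes("UTF-8")
would be the equivalent).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitUtf8Io {
    // Explicit charset: the result does not depend on file.encoding or LANG.
    public static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Same idea for write() calls: wrap the stream in a writer with a fixed charset.
    public static byte[] writeText(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c"; // two CJK characters, three UTF-8 bytes each
        System.out.println("getBytes length: " + toBytes(text).length);
        System.out.println("write length:    " + writeText(text).length);
    }
}
```

Both paths produce the same bytes under any default encoding, which is the consistency that
option 2 is after, at the cost of touching every call site.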
> So:
> -	Pretty much any globalized strings other than Western European are going to be corrupted
in the current Hive service on Windows with this bug present, because there really isn't a
way to have the JVM read the environment and determine by default that UTF-8 should be the
default encoding.
> -	Anyone can repro this on Linux fairly easily -- Add "export LANG=en_US.CP1252" to /etc/profile
to modify the global LANG default encoding to CP1252 explicitly, then restart the service
and do a query over internationalized UTF-8 data.
> -	We shouldn't rely on JVM default codepage selection if we want to support UTF-8 consistently
and reliably as the default encoding.
> -	The estimate can range wildly, but adding an explicit default encoding on startup
should, in theory, only take a little while if you know where to do it.
> -	I don't know where to update the start arguments of the JVM when the service
is started; I am just getting into the code for the first time with this bug investigation.
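
When trying the repro above, or checking whether -Dfile.encoding=UTF-8 took effect, it may
help to print what the JVM actually picked. The following is a small standalone sketch, not
part of Hive; DefaultCharsetCheck and isUtf8 are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {
    // True only when the given charset is UTF-8.
    public static boolean isUtf8(Charset cs) {
        return StandardCharsets.UTF_8.equals(cs);
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset:        " + def);
        if (!isUtf8(def)) {
            // This is the situation in which the unqualified getBytes()/write() calls misbehave.
            System.err.println("WARNING: default charset is not UTF-8");
        }
    }
}
```

Running it once as "java DefaultCharsetCheck" and once as
"java -Dfile.encoding=UTF-8 DefaultCharsetCheck" under LANG=en_US.CP1252 should show whether
the startup option overrides the locale-derived default on a given JVM.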
