hive-user mailing list archives

From Ujjwal Wadhawan <uwadha...@gmail.com>
Subject Re: binary column data consistency in hive table copy
Date Wed, 30 Sep 2015 13:06:51 GMT
Great! Thank you all for your inputs.

-Ujjwal

On Tue, Sep 15, 2015 at 8:08 AM, Gabriel Balan <gabriel.balan@oracle.com>
wrote:

> Hi
>
> You see "1w==" when you do a CTAS into a table stored as text files with
> LazySimpleSerDe, because in that case binary columns are stored as base64.
>
> That also means LazySimpleSerDe expects the values in your 'binin' text
> file to be base64-encoded.
> The strange characters you see when you select from binsource must be the
> base64 decoding of '10000101', etc.
>
> You can read about base64 here: https://en.wikipedia.org/wiki/Base64
>
> Also, I find the intended use for binary very interesting. According to
> https://cwiki.apache.org/confluence/display/Hive/Binary+DataType+Proposal:
>
> "Often [...] a row in a data might be very wide with hundreds of columns.
> Sometimes, user is just interested in few of those columns and doesn't want
> to bother about exact type information for rest of columns. In such cases,
> he may just declare the types of those columns as binary and Hive will not
> try to interpret those columns."
>
> hth
> Gabriel Balan
>
> The statements and opinions expressed here are my own and do not
> necessarily represent those of Oracle Corporation.
>
>
> ----- Original Message -----
> From: xihuyu2000@126.com
> To: user@hive.apache.org
> Sent: Monday, September 14, 2015 7:17:24 PM GMT -05:00 US/Canada Eastern
> Subject: Re:  Re: binary column data consistency in hive table copy
>
> If you use CTAS, an MR job runs. Maybe the problem is in the MR job.
> 2015-09-15
> ------------------------------
> xihuyu2000
> ------------------------------
>
> *From:* Jason Dere <jdere@hortonworks.com>
> *Sent:* 2015-09-15 06:00
> *Subject:* Re: binary column data consistency in hive table copy
> *To:* "user@hive.apache.org"<user@hive.apache.org>
> *Cc:*
>
>
> Looks like your table is using the text storage format. Binary data needs
> to be stored as base64 with TextInputFormat, so those values are probably
> being interpreted as base64 strings.
>
>
> ------------------------------
> *From:* Ujjwal Wadhawan <uwadhawan@gmail.com>
> *Sent:* Monday, September 14, 2015 2:32 PM
> *To:* user@hive.apache.org
> *Subject:* binary column data consistency in hive table copy
>
> Hi all,
>
>
>
> I recently observed a behavior in Hive that I’d like to share and get your
> input on.
>
>
>
> *Scenario:*
>
>
>
> Say you have a hive table with a binary column.
>
>
>
> create table binsource (bincol binary);
>
>
>
> and some input data
>
>
>
> $ cat /nis3/home/ujjwal2/test2/binin
>
> 10000101
>
> 121
>
> 10
>
> 1011
>
> Asfs
>
>
>
>
>
> Let’s load the data into the table
>
>
>
> LOAD DATA LOCAL INPATH '/home/ujjwal2/test2/binin' OVERWRITE INTO TABLE
> binsource;
>
>
>
> When I do a select * in the Hive CLI, I see the following characters (see image)
>
>
> [image: http://puu.sh/k6HBw/877367d595.png]
>
>
>
> The underlying HDFS file still has the actual input though.
>
>
>
>
>
> Now I make a copy of this table using the command "create table
> ujjwal2.bintarget as select * from ujjwal2.binsource;".
>
>
> [image: http://puu.sh/k6HEj/b34a8bd4a0.png]
>
>
>
>
> *ISSUE:*
>
>
> Now when I look at the underlying file created on HDFS for bintarget, I see
> some extra characters.
>
>
>
>
>
> In the many combinations I have tried, the extra characters are “=”, “w” and
> “A”.
>
>
> 10000101
>
> 120=
>
> 1w==
>
> 1011
>
> Asfs
>
>
> Does anyone know what these characters signify?
>
>
>
> Best,
>
> Ujjwal
>
>
>

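For anyone who wants to check the base64 explanation in the quoted replies, here is a small Python sketch. It is an illustration, not Hive's actual code path. It assumes that the text reader decodes each value as base64 while tolerating missing '=' padding and discarding leftover bits, and that the CTAS writer then re-encodes the resulting bytes. The function names are hypothetical.

```python
import base64


def text_table_roundtrip(value: str) -> str:
    """Decode a text-file value as base64, then re-encode the bytes.

    Assumption: the reader ignores missing '=' padding and drops any
    leftover bits, which is what the output in this thread suggests.
    """
    # Python's decoder requires padding, so add it explicitly here.
    padded = value + "=" * (-len(value) % 4)
    raw = base64.b64decode(padded)
    # Re-encoding yields the canonical form, including '=' padding.
    return base64.b64encode(raw).decode("ascii")


def encode_for_text_table(raw: bytes) -> str:
    """Base64-encode raw bytes so a text-backed binary column can read them."""
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # The five input lines from the 'binin' file in the original post.
    for value in ["10000101", "121", "10", "1011", "Asfs"]:
        print(f"{value!r:>12} -> {text_table_roundtrip(value)!r}")

    # A single byte 0x85 (binary 10000101) written the way the text format expects.
    print(encode_for_text_table(b"\x85"))
```

Under those assumptions, '121' comes back as '120=' and '10' comes back as '1w==', matching the bintarget file, while the values whose length is a multiple of four ('10000101', '1011', 'Asfs') round-trip unchanged. If the intent was to store the byte 10000101, the input line would need to be its base64 form rather than the digits themselves.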