hive-user mailing list archives

From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject RE: "Create external table" nulling data from source table
Date Thu, 28 Jan 2016 20:19:16 GMT
Thanks Nick,

I don't think there should be any newline chars - the columns that are turning null are mostly
Decimal, and none of my strings are longer than one word (no spaces).

And again: the same row is correct if I export a small set of data, and incorrect if I export
a large set - so I think that file/data size has something to do with this.

That said, I think you have the right idea in general.

I just tested this: if I take the ORC table, copy it to a SequenceFile-backed table, and then
copy that to the CSV "file", everything looks good.
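
For illustration, roughly what I tried (table names are placeholders, not my real schema;
my_table_csv is the CSV-backed external table from my original message below):

  CREATE TABLE my_table_seq STORED AS SEQUENCEFILE
    AS SELECT * FROM my_table_orc;

  INSERT OVERWRITE TABLE my_table_csv SELECT * FROM my_table_seq;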

Is that going to be my best workaround for this?

-----Original Message-----
From: Nicholas Hakobian [mailto:nicholas.hakobian@rallyhealth.com] 
Sent: Thursday, January 28, 2016 3:15 PM
To: user@hive.apache.org
Subject: Re: "Create external table" nulling data from source table

Do you have any fields with embedded newline characters? If so, certain Hive output formats
will parse the newline character as the end of a row, and when importing, chances are the missing
fields (now part of the next row) will be padded with nulls. This happens in Hive as well if
you are using a TextFile intermediate format (SequenceFile does not have this problem).
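
A quick way to check for embedded newlines is a query like this (table and column names are
placeholders):

  SELECT COUNT(*)
  FROM   source_table
  WHERE  instr(some_string_col, '\n') > 0;

If that returns zero for every string column, embedded newlines probably aren't the cause.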

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com



On Thu, Jan 28, 2016 at 12:05 PM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:
> First time posting to this list. Please forgive me if I break 
> etiquette. I’m looking for some help with getting data from Hive to HBase.
>
>
>
> I’m using HDP 2.2.8.
>
>
>
> I have a compressed (zlib), ORC-based Hive table with 12 columns and 
> billions of rows.
>
> In order to get the data into HBase, I have to create a copy of the 
> table as an "external" table, backed by CSV files (unless someone 
> knows a better way).
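>
> For illustration, the copy step looks roughly like this (table and column
> names are placeholders, not my real 12-column schema):
>
>   CREATE EXTERNAL TABLE my_table_csv (meter_id STRING, reading DECIMAL(18,6))
>     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>     STORED AS TEXTFILE
>     LOCATION '/tmp/my_table_csv';
>
>   INSERT OVERWRITE TABLE my_table_csv SELECT * FROM my_table_orc;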
>
>
>
> Then I use the CsvBulkLoad MapReduce job to create HFiles from the 
> CSV files backing the external table.
>
> I’ve been doing this for almost a year, and MOST of the data ends up 
> correct, but if I export a large amount of data, I end up with nulls 
> where I shouldn't.
>
>
>
> If I run the exact same query on the source table (compressed ORC) and 
> destination table (external text), I get null values in the results of 
> the latter, but not the former.
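>
> For example, a quick way to quantify the difference (column name is a
> placeholder):
>
>   SELECT COUNT(*)                                         AS total_rows,
>          SUM(CASE WHEN reading IS NULL THEN 1 ELSE 0 END) AS null_readings
>   FROM   my_table_orc;
>
> Running the same query against the external text copy is where the extra
> nulls show up.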
>
>
>
> However, if I only copy a small subset of the data to the text-based 
> table, all the data is correct.
>
>
>
> I also noticed that if I use an uncompressed source table, and then 
> copy to an external text-based table, it happens much more often.
>
>
>
> So, my (not-very-educated) guess is that this has to do with ORC files.
>
>
>
> I know that there are alternatives to ORC, but Hortonworks strongly 
> encourages us to use ORC for everything. I’m not even sure whether 
> Parquet works with HDP.
>
>
>
> Anyway, is this a known bug?
>
> Any ideas on how I can get around it without chopping up my data into 
> multiple tables?