hive-user mailing list archives

From "Brotanek, Jan" <Jan.Brota...@adastragrp.com>
Subject RE: Hive Stored Textfile to Stored ORC taking long time
Date Fri, 09 Dec 2016 22:29:27 GMT
I have this problem as well. It takes forever to insert into the ORC table. My original table's text files are gzipped. I have 4 nodes, each with 64 GB and 16 cores.

From: Joaquin Alzola [mailto:Joaquin.Alzola@lebara.com]
Sent: Friday, 9 December 2016 12:34
To: user@hive.apache.org
Subject: RE: Hive Stored Textfile to Stored ORC taking long time

Hi Jorn

Yes, I will do that test: same file size but with fewer columns.

I created a table with simple columns (all strings), nothing nested, and I do not do any transformations.
Both table schemas are attached.

By default, hive.vectorized.execution.enabled is set to false.
I have not enabled it.
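For reference, vectorized execution can be switched on per session; a minimal sketch, assuming a Hive version with vectorized ORC support (note that vectorization mainly speeds up *reading* ORC, so it may not cure a slow insert):

```sql
-- Enable vectorized execution for this session: rows are processed
-- in batches of 1024 instead of one at a time; ORC benefits most.
SET hive.vectorized.execution.enabled = true;
-- Newer Hive releases have a separate switch for the reduce side:
SET hive.vectorized.execution.reduce.enabled = true;
```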

Just an example; the insert below took about an hour:
0: jdbc:hive2://localhost:10000> insert into table ret_rec_cdrs_orc PARTITION (country='DE',year='2016',month='12')
select * from ret_rec_cdrs where country='DE' and year='2016' and month='12';
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (3837.457 seconds)
0: jdbc:hive2://localhost:10000> select count(*) from ret_rec_cdrs where country='DE' and
year='2016' and month='12';
+----------+--+
|   _c0    |
+----------+--+
| 3900155  |
+----------+--+
1 row selected (24.722 seconds)
0: jdbc:hive2://localhost:10000> select count(*) from ret_rec_cdrs_orc where country='DE'
and year='2016' and month='12';
+----------+--+
|   _c0    |
+----------+--+
| 3900155  |
+----------+--+
1 row selected (82.071 seconds)

From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 09 December 2016 10:22
To: user@hive.apache.org
Subject: Re: Hive Stored Textfile to Stored ORC taking long time

Ok.
No, do not split into smaller files; this is done automatically. The behavior you see looks strange.
For that file size I would expect it to take under one minute.
Maybe you hit a bug in the Hive-on-Spark engine. You could try a file with fewer columns
but the same size. I assume that this is a Hive table with simple columns (nothing deeply
nested) and that you do not do any transformations.
What is the CTAS query?
Did you enable vectorization in Hive?

If you just need a simple mapping from CSV to ORC you can use any framework (MR, Tez, Spark,
etc.), because performance does not differ much in these cases, especially for the small
amount of data you process.

On 9 Dec 2016, at 11:02, Joaquin Alzola <Joaquin.Alzola@lebara.com> wrote:
Hi Jorn

The file is about 1.5 GB, with 1.5 million records and about 550 fields in each row.

The ORC is compressed with Zlib.

I am using a standalone solution before expanding it, so everything is on the same node.
Hive 2.0.1 --> Spark 1.6.3 --> HDFS 2.6.5

The configuration is mostly standard; I have not changed much.

It cannot be a network issue because all the apps are on the same node.

Since I am doing all of this conversion at the Hive level (from textfile to ORC), I wanted
to know if I could do it quicker at the Spark or HDFS level (doing the file conversion some
other way), rather than at the top of the stack.
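One way to sidestep HiveServer2 is to run the same conversion from the spark-sql shell against the shared metastore, so the write is executed as a plain Spark job. A sketch, reusing the table names from this thread (whether it is actually faster depends on where the bottleneck is):

```sql
-- Run from the spark-sql shell (HiveContext in Spark 1.6) against the
-- same metastore; the insert is then planned and executed by Spark
-- directly instead of being submitted through HiveServer2.
INSERT OVERWRITE TABLE ret_rec_cdrs_orc
  PARTITION (country = 'DE', year = '2016', month = '12')
SELECT * FROM ret_rec_cdrs
WHERE country = 'DE' AND year = '2016' AND month = '12';
```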

We receive the files once a day, so if I load them as textfile and then convert to ORC it will
take me almost half a day just to make the data available.

It is basically a time-consuming task, and I want to do it much quicker. A better solution, of
course, would be to ingest smaller files with Flume, but I will do that in the future.

From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 09 December 2016 09:48
To: user@hive.apache.org
Subject: Re: Hive Stored Textfile to Stored ORC taking long time

How large is the file? Might I/O be an issue? How many disks do you have on the single node?

Do you compress the ORC (Snappy)?

What is the Hadoop distribution? Configuration baseline? Hive version?

Not sure if I understood your setup, but might the network be an issue?

On 9 Dec 2016, at 02:08, Joaquin Alzola <Joaquin.Alzola@lebara.com> wrote:
Hi List

The transformation from a textfile table to a stored-ORC table takes quite a long time.

Steps follow:


1. Create one normal table using the textfile format.

2. Load the data normally into this table.

3. Create one table with the schema of the expected results of your normal Hive table, stored
as ORC.

4. Run an INSERT OVERWRITE query to copy the data from the textfile table to the ORC table.
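The four steps above can be sketched in HiveQL roughly as follows; the table names, column names, and paths are placeholders, not the actual schema from this thread:

```sql
-- 1. Staging table over the raw delimited files (placeholder schema;
--    the real table has ~550 columns).
CREATE TABLE cdrs_text (id STRING, msisdn STRING /* ... more columns ... */)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- 2. Load the raw files: a move of the files into the table's
--    directory, no format conversion happens yet.
LOAD DATA INPATH '/landing/cdrs/' INTO TABLE cdrs_text;

-- 3. Target table with the same schema, stored as ORC.
CREATE TABLE cdrs_orc (id STRING, msisdn STRING /* ... more columns ... */)
STORED AS ORC TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- 4. The conversion itself: a full read of the text data and an
--    ORC-encoded rewrite; this is the step that takes the time.
INSERT OVERWRITE TABLE cdrs_orc SELECT * FROM cdrs_text;
```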

I have about 1.5 million records with about 550 fields in each row.

Doing step 4 takes about 30 minutes (moving from one format to the other).

I have Spark with only one worker (same for HDFS), so I am currently running a standalone
server, but with 25 GB and 14 cores on that worker.

BR

Joaquin
This email is confidential and may be subject to privilege. If you are not the intended recipient,
please do not copy or disclose its content but contact the sender immediately upon receipt.