impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Russell <jruss...@cloudera.com>
Subject Re: Impala compute incremental stats and insert speed becomes slowly when the partitions and the amount of data is larger
Date Fri, 01 Apr 2016 18:16:19 GMT
I'm wondering if you could run COMPUTE INCREMENTAL STATS less frequently, because with 50 thousand
partitions and 100 billion rows, that's only about 2 million rows per partition.  Impala already
has the information that this is a huge table, it just does not know if these new partitions
are large or small.  By default, it would treat them as small.  Perhaps with 2 million rows
per partition, those partitions really are small in comparison to any other tables you might
be using in a join query.  Have you observed whether query performance suffers if you add
several new partitions before running COMPUTE INCREMENTAL STATS, rather than doing it for
each new partition?

I have not worked with data on that scale in a partitioned table, but the technique I typically
use for inserting into a big table is to insert into a small table, run a set of fast queries
to sanity check the new data, then LOAD DATA the resulting HDFS files into the appropriate
directory of the destination table.  (Currently, this technique requires manually deleting
the _impala_insert_staging temporary directory created in the source table by the INSERT statement.
 Also, a REFRESH is required on the source table for Impala to recognize that it no longer
contains any data files.)

John

> On Mar 31, 2016, at 5:56 PM, Qinggang Wang <qinggangwang7@gmail.com> wrote:
> 
>  The cluster has 12 impala nodes and use about 1T memory, and the table has 30 columns
> 
> 
> 在 2016年3月31日星期四 UTC+8下午8:41:06,Qinggang Wang写道:
> Hi All,
>         There is a table has about a hundred billion data and fifty thousand partitions
in the impala.  It becomes  troublesome that when we insert new partitions and execute compute
incremental stats , the speed of  insert as well as compute stats either becomes very slowly
compared with the condition that the number of partitions and the amount of data is small.
The time of insert and compute stats either more that 80 seconds now, while neither of the
time of insert and compute stats more than 2 seconds when the data is small.  As there are
68 partitions one day, it is really cost much time in insert and compute. Is there any way
to solve that?
>      
> 
> Thanks
> 
> -- 
> You received this message because you are subscribed to the Google Groups "Impala User"
group.
> To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org
<mailto:impala-user+unsubscribe@cloudera.org>.


Mime
View raw message