impala-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Speed up refresh table partitions in batch
Date Wed, 23 Aug 2017 03:34:34 GMT
Previously I found that running any command that touches a partition,
such as adding properties, caused a refresh of that partition.

On Tue, Aug 22, 2017 at 10:40 PM, yu feng <olaptestyu@gmail.com> wrote:

> Hi, community:
>
>    I have made an improvement to Impala in our environment, and I would
> like to contribute it to the Impala community. This is our scenario:
>
>     We have a table with three or four partition keys, and almost 1K new
> partitions are added to it each day. A Spark Streaming job writes new data
> to existing partitions every 15 minutes (into the most recent 7 days), so
> we have to refresh the partitions for the most recent 7 days, about 7K
> partitions.
>
>    However, the whole table has 10W (100,000) partitions and is growing.
> We have two choices: refresh the whole table or refresh the 7K partitions.
> Refreshing the table is the obvious choice, but it takes 5 minutes to
> finish. I checked the code (before 2.8.0) and found that refreshing a table
> eventually calls the function:
>
> HdfsTable.load(true,  client, msTbl, true, true, null);
>
> which tries to reload the metadata, checks every partition that exists in
> the table, and loads every file to check whether it was updated or newly
> created by comparing its last modification time and file length.
>
> Our table has about 100W (1,000,000) files, so the refresh table operation
> is slow.
>
> Hence, we created a new usage: REFRESH TABLE xxx PARTITION (day = ('xx1',
> 'xx2', 'xx3')); this operation refreshes only the partitions whose day
> matches xx1, xx2, or xx3. This way, we load only the files and partitions
> from the last 7 days.
>
> In our tests, this made the operation 2x faster.
>
> Do you have any suggestions about it? Thanks a lot.
>
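
The selective refresh described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself and not Impala's actual code: `PartitionRefreshSketch`, `FileMeta`, `Partition`, `needsReload`, `selectPartitions`, and `changedFiles` are hypothetical names, and the file listing is passed in rather than fetched from HDFS. It shows the two ideas from the message: restrict the work to partitions whose `day` value was requested, and treat a file as changed when it is new or when its modification time or length differs from the cached copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionRefreshSketch {
    /** File metadata, cached or freshly listed (hypothetical stand-in, not an Impala class). */
    record FileMeta(String path, long modificationTime, long length) {}

    /** A partition identified by its "day" key value, with cached files keyed by path. */
    record Partition(String day, Map<String, FileMeta> cachedFiles) {}

    /** A file needs reloading if it is new, or if its modification time or length changed. */
    static boolean needsReload(FileMeta cached, FileMeta current) {
        return cached == null
            || cached.modificationTime() != current.modificationTime()
            || cached.length() != current.length();
    }

    /** Keep only the partitions whose day value is in the requested set. */
    static List<Partition> selectPartitions(List<Partition> all, Set<String> days) {
        List<Partition> selected = new ArrayList<>();
        for (Partition p : all) {
            if (days.contains(p.day())) {
                selected.add(p);
            }
        }
        return selected;
    }

    /** Return the paths in a fresh listing that differ from the partition's cached metadata. */
    static List<String> changedFiles(Partition p, List<FileMeta> listing) {
        List<String> changed = new ArrayList<>();
        for (FileMeta f : listing) {
            if (needsReload(p.cachedFiles().get(f.path()), f)) {
                changed.add(f.path());
            }
        }
        return changed;
    }
}
```

With this shape, the cost of a refresh scales with the number of files in the selected partitions instead of with every file in the table, which is the effect the quoted message is after.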
