impala-dev mailing list archives

From yu feng <olaptes...@gmail.com>
Subject Speed up refresh table partitions in batch
Date Wed, 23 Aug 2017 02:40:27 GMT
Hi community,

I have made an improvement to Impala in our environment and would like to
contribute it to the Impala community. This is our scenario:

We have a table with three or four partition keys, and almost 1K new
partitions get added to it. A Spark Streaming job writes new data to
existing partitions every 15 minutes (within the most recent 7 days), so we
have to refresh the partitions of the most recent 7 days, about 7K
partitions.

However, the whole table has about 100,000 (10W) partitions and is still
growing. We have two choices: refresh the whole table, or refresh the 7K
partitions. The obvious choice is to refresh the whole table, but that takes
about 5 minutes to finish. I checked the code (before 2.8.0) and found that
refreshing a table eventually calls:

HdfsTable.load(true,  client, msTbl, true, true, null);

which reloads the metadata and checks every partition in the table, and
loads every file to check whether it has been updated or newly created, by
comparing its last modification time and file length.

Our table has about 1,000,000 (100W) files, so refreshing the whole table is
slow.
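
To make the per-file check concrete, here is a minimal sketch of the idea.
It is not Impala's actual code: FileChangeCheck, FileMeta and needsReload
are hypothetical names, and the cache is assumed to be a plain map from
file path to the last seen modification time and length.

```java
import java.util.Map;

// Hypothetical sketch of the per-file check; not taken from Impala.
final class FileChangeCheck {

    // Cached metadata for one file (hypothetical type).
    static final class FileMeta {
        final long modificationTime;
        final long length;

        FileMeta(long modificationTime, long length) {
            this.modificationTime = modificationTime;
            this.length = length;
        }
    }

    // Returns true when the file is new (not cached yet) or when its
    // modification time or length differs from the cached values.
    static boolean needsReload(Map<String, FileMeta> cache, String path,
                               long modificationTime, long length) {
        FileMeta cached = cache.get(path);
        return cached == null
                || cached.modificationTime != modificationTime
                || cached.length != length;
    }
}
```

The check itself is cheap, but it has to run once for every file, so the
cost of a full refresh grows with the total number of files in the table.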

Hence, we added a new syntax: REFRESH TABLE xxx PARTITION (day = ('xx1',
'xx2', 'xx3')); this refreshes only the partitions whose day matches xx1,
xx2 or xx3. In this way, we only load the files and partitions of the last
7 days.
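
The partition selection can be sketched roughly as follows. This is only
an illustration, not our actual patch: PartitionFilter and select are
hypothetical names, and each partition is assumed to be a simple map from
partition key to value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of choosing which partitions to refresh.
final class PartitionFilter {

    // Keeps only the partitions whose value for `key` (e.g. "day") is in
    // `wanted`; all other partitions are skipped and never reloaded.
    static List<Map<String, String>> select(List<Map<String, String>> partitions,
                                            String key, Set<String> wanted) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (Map<String, String> partition : partitions) {
            String value = partition.get(key);
            if (value != null && wanted.contains(value)) {
                selected.add(partition);
            }
        }
        return selected;
    }
}
```

Only the selected partitions then go through the file check, so the work
depends on the number of recent partitions rather than on the whole table.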

In our tests, this makes the operation about 2x faster.

Do you have any suggestions about it? Thanks a lot.
