kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: swap data in Kudu table
Date Fri, 23 Feb 2018 21:32:42 GMT
A couple other ideas from the Impala side:

- could you use a view and alter the view to point to a different table?
Then all readers would be pointed at the view, and security permissions
could be on that view rather than the underlying tables?

- I think if you use an external table in Impala you could use an ALTER
TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
different table. Then issue a 'refresh' on the impalads so that they load
the new metadata. Subsequent queries would hit the new underlying Kudu
table, but permissions and stats would be unchanged.
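
A rough sketch of what the two ideas above might look like in Impala SQL.
All object names here (prod_view, events_v1, events_v2, events_ext, and the
Kudu table name 'events_v2_kudu') are hypothetical placeholders, not names
from this thread, and the statements are untested against any particular
Impala/Kudu version. Depending on the version, INVALIDATE METADATA may be
needed instead of REFRESH.

```sql
-- Idea 1: readers query a view; repoint the view once the new table is ready.
-- (prod_view, events_v1, events_v2 are hypothetical names.)
CREATE VIEW prod_view AS SELECT * FROM events_v1;

-- Later, after events_v2 has been fully loaded and validated:
ALTER VIEW prod_view AS SELECT * FROM events_v2;

-- Idea 2: an external Impala table mapped onto a Kudu table.
-- Change the mapping to the freshly loaded Kudu table, then reload metadata.
-- (events_ext and 'events_v2_kudu' are hypothetical names.)
ALTER TABLE events_ext
  SET TBLPROPERTIES ('kudu.table_name' = 'events_v2_kudu');

REFRESH events_ext;
```

In both cases the object that users query (the view or the external table)
keeps its name, so grants on it would not need to be reissued. Neither
statement is a Kudu-level atomic swap; queries that are already running
would presumably finish against the old table.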


On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mpercy@apache.org> wrote:

> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> load capabilities or staging abilities. Theoretically renaming a partition
> atomically shouldn't be that hard to implement, since it's just a master
> metadata operation which can be done atomically, but it's not yet
> implemented.
> There is a JIRA to track a generic bulk load API here:
> https://issues.apache.org/jira/browse/KUDU-1370
> Since I couldn't find anything to track the specific features you
> mentioned, I just filed the following improvement JIRAs so we can track them:
>    - KUDU-2326: Support atomic bulk load operation
>    <https://issues.apache.org/jira/browse/KUDU-2326>
>    - KUDU-2327: Support atomic swap of tables or partitions
>    <https://issues.apache.org/jira/browse/KUDU-2327>
> Mike
> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <boris@boristyukin.com>
> wrote:
>> Hello,
>> I am trying to figure out the best and safest way to swap data in a
>> production Kudu table with data from a staging table.
>> Basically, once in a while we need to perform a full reload of some
>> tables (once in a few months). These tables are pretty large with billions
>> of rows and we want to minimize the risk and downtime for users if
>> something bad happens in the middle of that process.
>> With Hive and Impala on HDFS, we can use a very handy command, LOAD
>> DATA INPATH. We can prepare the data for reload in a staging table up front,
>> and this process might take many hours. Once the staging table is ready, we
>> can issue a LOAD DATA INPATH command, which moves the underlying HDFS files
>> to the production table. This operation is almost instant and is the very
>> last step in our pipeline.
>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
>> PARTITION command.
>> Now with Kudu, I cannot seem to find a good strategy. The only thing that
>> came to mind is to drop the production table and rename the staging table
>> to the production table name as the last step of the job, but in that case
>> we would lose statistics and security permissions.
>> Any other ideas?
>> Thanks!
>> Boris

Todd Lipcon
Software Engineer, Cloudera
