kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: swap data in Kudu table
Date Fri, 23 Feb 2018 21:32:42 GMT
A couple other ideas from the Impala side:

- could you use a view and alter the view to point to a different table?
Then all readers would be pointed at the view, and security permissions
could be on that view rather than the underlying tables?

- I think if you use an external table in Impala you could use an ALTER
TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
different table. Then issue a 'refresh' on the impalads so that they load
the new metadata. Subsequent queries would hit the new underlying Kudu
table, but permissions and stats would be unchanged.
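
As a rough sketch of the two ideas above, the snippet below only builds the Impala statements as strings; it does not connect to a cluster. All table and view names (events, events_ext, events_staging_kudu) are hypothetical placeholders, and the exact statements should be checked against the Impala version in use.

```python
from typing import List


def view_swap_statement(view: str, new_table: str) -> str:
    """Idea 1: repoint a view at a different underlying table."""
    return f"ALTER VIEW {view} AS SELECT * FROM {new_table}"


def external_table_swap_statements(impala_table: str, new_kudu_table: str) -> List[str]:
    """Idea 2: repoint an external Impala table at a different Kudu table,
    then refresh so the impalads load the new metadata."""
    return [
        f"ALTER TABLE {impala_table} SET TBLPROPERTIES "
        f"('kudu.table_name' = '{new_kudu_table}')",
        f"REFRESH {impala_table}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    statements = [view_swap_statement("events", "events_staging")]
    statements += external_table_swap_statements("events_ext", "events_staging_kudu")
    for stmt in statements:
        print(stmt + ";")
```

In both cases readers keep querying the same name (the view or the external table), so the swap would be the last, quick step after the staging table has been fully loaded.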

-Todd

On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mpercy@apache.org> wrote:

> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> load capabilities or staging abilities. Theoretically renaming a partition
> atomically shouldn't be that hard to implement, since it's just a master
> metadata operation which can be done atomically, but it's not yet
> implemented.
>
> There is a JIRA to track a generic bulk load API here:
> https://issues.apache.org/jira/browse/KUDU-1370
>
> Since I couldn't find anything to track the specific features you
> mentioned, I just filed the following improvement JIRAs so we can track it:
>
>    - KUDU-2326: Support atomic bulk load operation
>    <https://issues.apache.org/jira/browse/KUDU-2326>
>    - KUDU-2327: Support atomic swap of tables or partitions
>    <https://issues.apache.org/jira/browse/KUDU-2327>
>
> Mike
>
> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <boris@boristyukin.com>
> wrote:
>
>> Hello,
>>
>> I am trying to figure out the best and safest way to swap data in a
>> production Kudu table with data from a staging table.
>>
>> Basically, once in a while we need to perform a full reload of some
>> tables (once in a few months). These tables are pretty large with billions
>> of rows and we want to minimize the risk and downtime for users if
>> something bad happens in the middle of that process.
>>
>> With Hive and Impala on HDFS, we can use a very cool handy command LOAD
>> DATA INPATH. We can prepare data for reload in a staging table upfront and
>> this process might take many hours. Once the staging table is ready, we can
>> issue a LOAD DATA INPATH command which will move underlying HDFS files to a
>> production table - this operation is almost instant and the very last step
>> in our pipeline.
>>
>> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE
>> PARTITION command.
>>
>> Now with Kudu, I cannot seem to find a good strategy. The only thing that
>> came to my mind is to drop the production table and rename a staging table to
>> production table as the last step of the job, but in this case we are going
>> to lose statistics and security permissions.
>>
>> Any other ideas?
>>
>> Thanks!
>> Boris
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
