kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: performance issue involving "insert as select"
Date Mon, 12 Dec 2016 05:51:14 GMT
Hi Rotem,

On Thu, Dec 8, 2016 at 3:25 AM, Rotem Gabay <rotemgabay82@gmail.com> wrote:

> Hi, I have a small cluster on which I tried to run some performance tests
> on Kudu. In order to populate some data I ran a simple "insert as
> select" from a simple HDFS table, which took 10 minutes to finish. I then tried
> to duplicate the same data by doing another insert as select from the Kudu
> table to itself (insert into kudu_tbl select * from kudu_tbl). This insert
> took more than 2 hours to complete. Is there any reasonable explanation?

One interesting aspect of current releases of Kudu is that Impala queries
don't operate with snapshot consistency. In the case that you are writing
into the same table that you are reading from, it's actually possible that
the query reads its own results.

Put another way, one fragment of the query may be writing into a tablet
while another fragment is still reading that tablet. Without snapshot
consistency, the reading fragment can pick up rows that were just inserted
and feed them back in to be inserted again, creating a sort of "infinite
loop" of inserts. While usually not infinite, it can end up producing far
more rows than you expected.
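
To make the effect concrete, here is a toy Python sketch. It is not Kudu or
Impala code: the table is just a list, the function names and the max_rows
cap are made up for illustration, and it ignores tablets, primary keys and
timing. It only contrasts a scan that sees its own writes with a scan over a
fixed snapshot.

```python
def self_insert_live(table, max_rows):
    """Copy rows back into the table while scanning it without a snapshot.

    The scan position chases the end of the list, but every row read
    appends another row, so the scan never catches up. max_rows is a
    hypothetical safety cap so the toy terminates.
    """
    i = 0
    while i < len(table) and len(table) < max_rows:
        table.append(table[i])  # the write is immediately visible to the scan
        i += 1
    return len(table)


def self_insert_snapshot(table):
    """Copy rows back into the table after fixing the set of rows to read."""
    snapshot = list(table)  # the scan only sees rows that existed at the start
    table.extend(snapshot)
    return len(table)


if __name__ == "__main__":
    print(self_insert_live([1, 2, 3], max_rows=1000))  # grows until the cap
    print(self_insert_snapshot([1, 2, 3]))             # exactly doubles
```

In this worst-case model the live scan runs until the cap, whereas the
snapshot version exactly doubles the table. In a real cluster the reader only
sees some of the new rows, depending on how the fragments interleave, so the
blow-up is finite but can still be large.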

We're working on addressing this in upcoming releases. In the meantime,
it's probably best to generate your data in a different fashion rather than
inserting into the same table that you're reading from.

Hope that helps. Let us know if the explanation doesn't seem to match up
with what you're seeing.

Todd Lipcon
Software Engineer, Cloudera
