impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mostafa Mokhtar <mmokh...@cloudera.com>
Subject Re: Sorting during insert
Date Sun, 07 Jan 2018 19:29:50 GMT
Hi Burno,

The local sort is intentional as we have found that a full sort of the data
can be very expensive, so it is a tradeoff between very long ETL and Query
performance.

How are the queries against the partitioned and sorted performing? can you
share a query profile and DDL for the underlying table?

Thanks
Mostafa

On Sun, Jan 7, 2018 at 4:11 AM, Bruno Quinart <bquinart@icloud.com> wrote:

> Hello
>
> We leverage the new filtering on Parquet stats for a few selective queries
> (refer also to http://blog.cloudera.com/blog/2017/12/faster-performance-
> for-selective-queries/ <http://blog.cloudera.com/blog/2017/12/faster-
> performance-for-selective-queries/>). We run Impala 2.9 on Cloudera
> 5.12.0.
>
> However we notice that when inserting data into a non-partitioned table,
> the data is not optimally sorted. If we check the query plan (for a simple
> insert with only filtering on source table), there is no exchange of data
> between Impala daemons. So whatever a specific daemon reads gets sorted
> ‘locally’ and written away. That means that for selective queries we
> typically still scan as much Parquet files as the number of daemons in the
> cluster.
>
> Is this expected?
> Some kind of distributed merge sort could limited the IO needed.
>
> For partitioned tables, the insert already typically happens with an
> implicit SHUFFLE. So in that case all data to be sorted goes to one daemon
> and sorting happens as expected. In case of a NOSHUFFLE hint, the sorting
> again only happens in scope of individual daemons (which is kind of what
> what is requested by the hint - but still somewhat conflicting with the
> table definition). I did not yet try the behaviour with the CLUSTERED hint
> (my understanding is that it would be similar as SHUFFLE).
> In case of the SHUFFLE, the insert takes a lot longer however, because a
> single daemon is sorting all the data. Some more optimal distributed merge
> sorting could maybe reduce this time.
>
> I had a quick look at JIRA, but didn’t find much about this. In
> IMPALA-4969, sorting during insert in partitioned table is discussed, but
> that was before  the filtering on Parquet stats was implemented.
>
> Hoping to get more insights on what thoughts were given to this and
> possible trade-offs involved.
>
> Thanks
> Bruno
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message