impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) PREVIEW: Basic column-wise slot materialization in Parquet scanner.
Date Fri, 15 Apr 2016 06:54:58 GMT
Alex Behm has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner.
......................................................................


Patch Set 1:

(5 comments)

Thanks, Tim!

As discussed in person, let's stick to materializing in a row format for now because we need
to output results in rows. It is not clear that materializing to columns then copying to rows
is worth the extra complexity without also immediately reaping the column-wise benefits (e.g.
SIMD predicate eval). The biggest benefit is expected from late materialization which is possible
in both designs and probably easier in the current design.

The new patch set contains several bugfixes.

http://gerrit.cloudera.org:8080/#/c/2779/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 325:         continue_execution = ReadValue(pool, tuple);
> I think you could make these non-virtual calls without duplicating code by 
Good idea. As discussed I will give this a try. I'll make this change in its own patch set
so one can compare the before/after easily without also looking at extra differences.


Line 1732:   // and return an output batch with relatively few rows.
> The TODO describes the current intended behaviour, so that sounds right. I 
Yea, I think so, too. Do you think it's obvious enough for me to remove the TODO?


Line 1737:       // Optimization for scans with selective filters/conjuncts: None of the
> Is this factoring in accumulated disk buffers?
As you've correctly pointed out, it's not clear whether this optimization is even useful by
itself.
I removed this optimization until we have hard data that suggests that we need something like
it.


Line 1740:       scratch_batch->pool->Clear();
> Are you seeing the MemPool show up in the profile? The current allocation a
This was a "premature" optimization. I was just assuming that reusing the memory would be
better. Removed for now pending further investigation.


Line 1829:         col_reader->ReadValueBatch(scratch_batch.pool.get(), batch_size,
> Ignoring return value?
Correct, that was a bug. Fixed.


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message