flink-user mailing list archives

From: Urs Schoenenberger <urs.schoenenber...@tngtech.com>
Subject: DataSet: partitionByHash without materializing/spilling the entire partition?
Date: Tue, 05 Sep 2017 10:30:03 GMT
Hi all,

We have a DataSet pipeline which reads CSV input data and then
essentially does a combinable GroupReduce via first(n).

In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) ->
first(n)), we got a job graph like this:

source --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce
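
For concreteness, a minimal sketch of this first pipeline in the Java
DataSet API (the input path, the Tuple2 field types, and n are
placeholders, not our actual job):

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
int n = 10; // placeholder for our actual first(n) parameter

// Read CSV into Tuple2 records; field 0 is the grouping key.
DataSet<Tuple2<Long, String>> input = env
        .readCsvFile("hdfs:///path/to/input")
        .types(Long.class, String.class);

// first(n) is executed as a combinable GroupReduce: a combine pass
// runs before the shuffle, the final reduce after it.
DataSet<Tuple2<Long, String>> firstN = input
        .groupBy(0)
        .sortGroup(0, Order.ASCENDING)
        .first(n);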

This works, but we found the combine phase to be inefficient because not
enough combinable elements fit into a single sorter. My idea was to
pre-partition the DataSet so that more combinable elements end up in the
same sorter (readCsvFile -> partitionByHash(0) -> groupBy(0) ->
sortGroup(0) -> first(n)), as sketched below.
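
The pre-partitioned variant only inserts the partitioning step (reusing
input and n from the sketch above):

// Hash-partition on field 0 first, so that each downstream sorter
// only sees elements of "its" keys and can combine more of them.
DataSet<Tuple2<Long, String>> firstN2 = input
        .partitionByHash(0)
        .groupBy(0)
        .sortGroup(0, Order.ASCENDING)
        .first(n);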

To my surprise, I found that this changed the job graph to

source --[Hash Partition on 0]--> partition(noop) --[Forward]--> combine
--[Hash Partition on 0, Sort]--> reduce

with the partition(noop) operator materializing and spilling each
partition in its entirety!

Is there any way I can partition the data on the way from source to
combine without spilling? That is, can I get a job graph that looks like

source --[Hash Partition on 0]--> combine --[Hash Partition on 0, Sort]--> reduce

instead?

Thanks,
Urs

-- 
Urs Schönenberger - urs.schoenenberger@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
