flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aljoscha Krettek <aljos...@apache.org>
Subject Re: Question about Re-Partition Operator
Date Fri, 26 Sep 2014 11:04:25 GMT
I did a pull request that contains my suggestions. This also fixes the
problem that you have multiple repartition operators if you do several
map operators on a repartition. Each one of them would get their own
private repartition.

What do you guys think?

On Thu, Sep 25, 2014 at 10:44 AM, Fabian Hueske <fhueske@apache.org> wrote:
> I wanted the change to be as less invasive as possible and there were a few
> tricky details.
> For example the DOP of a partition operator must be the same as the
> following operator (otherwise data is reshuffled again).
> Also multiple consumers of a repartition are a nasty detail (esp. if they
> have different DOPs).
> These things were a bit tricky to integrate into the current translation
> mechanism without major changes.
> Right now, each Map op injects its own partitioning operator before itself
> and sets the DOP to its own DOP.
> That's why the logic for that was put into the Map ops. A drawback of this
> approach is, that there is no reuse of repartitionings even if it would be
> possible...
> There might be smarter ways to implement this, but I think my (limited)
> solution works correct in all cornercase that came to my mind.
> 2014-09-25 10:27 GMT+02:00 Aljoscha Krettek <aljoscha@apache.org>:
>> Yes, I see your points on why you limited it to the map-style
>> operators. My gripe with it is that it moves knowledge about the
>> Partition Operator into some operators that now need special-case
>> handling and that these operations are duplicated on
>> PartitionedDataSet. If we have Partition as just another operator
>> there would be no special-case handling and the integration would be
>> scalable (in a sense that we don't have to touch other operators, or
>> have to add partition support if we add another map-style operator).
>> My view is that users should know what they are doing if the add a
>> repartition before a reduce. In the future the optimizer could catch
>> those cases and merge a custom repartition with a reduce if there are
>> no other consumers of the repartitioning.
>> Cheers,
>> Aljoscha
>> On Thu, Sep 25, 2014 at 10:13 AM, Fabian Hueske <fhueske@apache.org>
>> wrote:
>> > Hi,
>> >
>> > I limited explicit repartitioning to map-like operatators because Map
>> > operations should be by far the most common use cases (rebalancing for
>> > expensive mappers, repartition for partitionMap, ...) and I doubt the
>> > benefits of repartitioning before other operators.
>> > Right now, partitioning is implemented as an own runtime operator with
>> > noop-driver. Hence, there would be a serialization overhead if subsequent
>> > operators cannot be chained (i.e., join).
>> > Also having a partitioning in front of a combinable reduce would make
>> > things complicated, because the partitioning should be moved between
>> > combiner and reducer.
>> > Overall, I think, there are not many benefits of offering repartitioning
>> > for all operators and if you still want it, you can use an identity
>> mapper.
>> >
>> > Does that make sense?
>> > If not, we can also change the implementation without breaking the API
>> and
>> > offer repartitioning for all operators. ;-)
>> >
>> > Anyway, I'm with you that we should think about how to integrate physical
>> > operators (partition, sort, combine) into the API.
>> >
>> > Cheers, Fabian
>> >
>> >
>> > 2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <aljoscha@apache.org>:
>> >
>> >> Hi,
>> >> (mostly @fabian) why is the re-partition operator special-cased like
>> >> it is now? You can only do map/filter on partitioned data. Wouldn't it
>> >> be nice if we had a general re-partition operator. Operators that
>> >> normally do their own repartitioning would notice that the data is
>> >> already partitioned and use that. The way it is now, the
>> >> implementation relies heavily on special-case handling.
>> >>
>> >> In the long run we could even introduce combine and sort as special
>> >> operators that users could insert themselves. The optimiser would then
>> >> also insert these before operations when required. This would
>> >> simplify/generalise things a bit.
>> >>
>> >> What do you think?
>> >>
>> >> Cheers,
>> >> Aljoscha
>> >>

View raw message