spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <>
Subject Re: repeated unioning of dataframes take worse than O(N^2) time
Date Fri, 30 Dec 2016 07:59:50 GMT

Actually, as you use Dataset's union API, unlike RDD's union API, it will
break the nested structure. So that should not be the issue.

The additional time introduced when the number of dataframes grows, is spent
on analysis stage. I can think that as the Union has a long children list,
the analyzer needs more time to traverse the tree.

When the dataset of Union(Range1, Range2) is created, the Analyzer needs to
go through 2 Range(s). When the next union happens, i.e., Union(Range1,
Range2, Range3), the Analyzer needs to go through 3 Range(s), except for the
first 2 Range(s). The two Range plans are overlapped. But the Analyzer still
goes through them.

If there is an Union with 5 Range logical plans, the Analyzer goes through:

2 + 3 + 4 + 5 = 14 Range(s) under the Union

When you increase the Range plans to 10. It becomes:

2 + 3 + 4 + 5 + ... + 10 = 54 Range(s)

So if an Union of 100 Range plans, there are 5049 Range(s) needed to go
through. For 200 Range plans, it becomes 20099.

You can see it is not linear relation.

Liang-Chi Hsieh | @viirya 
Spark Technology Center 
View this message in context:
Sent from the Apache Spark Developers List mailing list archive at

To unsubscribe e-mail:

View raw message