flink-dev mailing list archives

From Martin Junghanns <m.jungha...@mailbox.org>
Subject Re: Join hints for the Gelly functions
Date Sat, 22 Aug 2015 09:28:22 GMT

I guess enforcing a join strategy by default is not the best option, 
since you can't assume what the user did before calling the Gelly 
functions or what the data looks like (maybe it's one of the 1% of 
graphs where the relation is the other way around, or the vertex data 
set is very large); the data sets may also already be sorted or 
partitioned. Another solution could be to overload the Gelly functions 
that use joins and let users decide whether to give hints.
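
To make the overloading idea concrete, here is a minimal, self-contained 
sketch in plain Java. It does not use the Flink or Gelly API: the 
JoinHint enum below is a toy stand-in that only borrows two value names 
from JoinOperatorBase.JoinHint, and the class name, the edgesWithSources 
method and the lastHint field are hypothetical names made up for 
illustration. The point is only the shape of the API: one overload 
without a hint that lets the system decide, and one that takes a hint 
from the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinHintOverloadSketch {

    // Toy stand-in for Flink's JoinOperatorBase.JoinHint (assumption: only two values modeled).
    enum JoinHint { OPTIMIZER_CHOOSES, BROADCAST_HASH_SECOND }

    // Records the hint seen by the last call, so the effect of each overload is observable.
    static JoinHint lastHint;

    // Overload without a hint: delegates with the default, i.e. the system decides.
    static List<String> edgesWithSources(List<long[]> edges, Map<Long, String> vertices) {
        return edgesWithSources(edges, vertices, JoinHint.OPTIMIZER_CHOOSES);
    }

    // Overload with an explicit hint supplied by the user.
    static List<String> edgesWithSources(List<long[]> edges, Map<Long, String> vertices,
                                         JoinHint hint) {
        // A real implementation would hand the hint to the join operator;
        // this sketch only records it and does a simple in-memory lookup join.
        lastHint = hint;
        List<String> result = new ArrayList<>();
        for (long[] edge : edges) {
            String sourceValue = vertices.get(edge[0]); // match edge source id to vertex id
            if (sourceValue != null) {
                result.add(edge[0] + "->" + edge[1] + ":" + sourceValue);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> edges = Arrays.asList(new long[]{1L, 2L}, new long[]{3L, 1L});
        Map<Long, String> vertices = new HashMap<>();
        vertices.put(1L, "a");

        // Default overload: no hint given.
        System.out.println(edgesWithSources(edges, vertices) + " with " + lastHint);

        // Explicit overload: the caller knows the vertex set is small.
        System.out.println(
            edgesWithSources(edges, vertices, JoinHint.BROADCAST_HASH_SECOND) + " with " + lastHint);
    }
}
```

With this shape, existing callers keep the optimizer's choice, and only 
users who know their data (for example, a small vertex set) opt in to a 
specific strategy.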

As an example, I am currently benchmarking graphs with up to 700M 
vertices and 3B edges on a YARN cluster, and at one point in the job I 
need to join vertices and edges. I tried giving the 
broadcast-hash-second (vertices) hint, and the job ran significantly 
slower than when I let the system decide.


On 22.08.2015 09:51, Andra Lungu wrote:
> Hey everyone,
> When coding for my thesis, I observed that half of the current Gelly
> functions (the ones that use join operators) fail on a cluster environment
> with the following exception:
> java.lang.IllegalArgumentException: Too few memory segments provided. Hash Join
> needs at least 33 memory segments.
> This is because, in 99% of the cases, the vertex data set is significantly
> smaller than the edge data set. What I did to get rid of the error was the
> following:
> DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
>        .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
>        .where(0).equalTo(0)
> In short, I added join hints. I believe this should also be in Gelly, in
> case someone bumps into the same problem somewhere in the future.
> What do you think?
