flink-dev mailing list archives

From "Li, Chengxiang" <chengxiang...@intel.com>
Subject Use bloom filter to improve hybrid hash join performance
Date Thu, 18 Jun 2015 09:37:19 GMT
Hi, Flink developers,

I have read the Flink hybrid hash join documentation and implementation, very nice job. For the case where the small table does not fit entirely into memory, I think we may be able to improve performance further. Currently, in hybrid hash join, when the small (build-side) table does not fit into memory, part of it is spilled to disk, and the corresponding partitions of the big (probe-side) table are spilled to disk during the probe phase as well. You can find a detailed description here:
http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html . If we build a bloom filter while spilling the small table to disk during the build phase, and use it to filter the big-table records that would otherwise be spilled, we could greatly reduce the size of the spilled big-table files and save the disk I/O cost of writing them and reading them back. I have created
FLINK-2240<https://issues.apache.org/jira/browse/FLINK-2240> for this, and I would like to contribute this optimization if someone can assign the JIRA to me. But before that, I would like to hear your comments on it.
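To illustrate the idea, here is a minimal sketch (all class and method names are hypothetical, not Flink's actual internals): keys of build-side records that get spilled are added to a bloom filter, and in the probe phase a big-table record whose key definitely is not in the filter can be dropped instead of spilled, since it cannot find a match. A bloom filter never produces false negatives, so no matching record is ever lost; false positives only mean a record is spilled unnecessarily, as it would be today.

```java
import java.util.BitSet;

// Hypothetical sketch of a bloom filter guarding spilled partitions.
// Build phase: add the hash of every spilled build-side key.
// Probe phase: if mightContain() is false, skip spilling the probe record.
public class SpillBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SpillBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Derive the i-th bit position from one base hash
    // (double-hashing scheme: h1 + i * h2).
    private int position(int keyHash, int i) {
        int h1 = keyHash;
        int h2 = (keyHash >>> 16) | (keyHash << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Called while spilling a build-side record to disk.
    public void add(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(keyHash, i));
        }
    }

    // false = key is definitely not in the spilled build partition,
    // so the probe-side record can be discarded instead of spilled.
    // true = key may be present; spill the record as before.
    public boolean mightContain(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(keyHash, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The filter itself is small (a bit set sized per spilled partition), so the memory overhead during the build phase is modest compared with the disk I/O it can avoid on the probe side.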

