hama-dev mailing list archives

From "Behroz Sikander (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-990) GSoC'16: Apache Hama benchmark against Spark and Flink
Date Thu, 19 May 2016 23:55:12 GMT

    [ https://issues.apache.org/jira/browse/HAMA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292377#comment-15292377 ]

Behroz Sikander commented on HAMA-990:

>> I personally recommend you don't spend much time for other trivial bug fixes.
Okay. I am close to fully understanding it, but I will shift my focus to the
main goal, as you suggested.

Regarding the main goal, I think we should benchmark Hama on the following types of algorithms:
1- Batch
2- Iterative
3- Graph
4- Query Processing

The proposed algorithm for each type is:
1- Batch - Word Count
2- Iterative/ML - K-Means
3- Graph - Page Rank
4- Query Processing - We can use MRQL for this and run a scan/join on a dataset [2].

According to [1] and [3], Apache Flink is faster than Spark on K-Means, Page Rank and Query
Processing, whereas Spark is faster on Word Count. We can reproduce these results on our cluster
and then measure the same workloads on Hama. Once we have all the results, we can compare the
three systems.
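
To make the comparison step concrete, here is a minimal sketch of how the run times could be collected and compared. It is only an illustration, not an agreed design: the function names (time_command, summarize, fastest) are hypothetical, and it assumes each benchmark can be started as a single command-line job. It runs a command several times, records wall-clock time, and takes the median per system so that one slow run does not dominate the result.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=3):
    """Run `cmd` (a list of arguments) `runs` times; return wall-clock seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the job exits with a non-zero status
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(results):
    """Map each system name to the median of its recorded run times."""
    return {system: statistics.median(times) for system, times in results.items()}


def fastest(summary):
    """Return the name of the system with the smallest median run time."""
    return min(summary, key=summary.get)
```

Usage would be along the lines of results["hama"] = time_command([...]) with the actual job submission command for each system filled in, followed by fastest(summarize(results)). Wall-clock time of the submitting process is only a rough measure; the resource monitoring discussed below would complement it.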

1- For monitoring memory, CPU, hard drive and network usage we can use [4]. What do you
think about this?
2- Karamel can be used for easy installation of Spark and Flink [5]. I am also okay with manual
installation. Any suggestions?
3- Spark and Flink also have a TeraSort benchmark in which Flink is apparently faster [6]. Should
we also run a TeraSort benchmark?
4- Should we run all the systems (Flink/Spark/Hama) with their default configurations, or should
we tune each one for best performance on each algorithm?

[1] - http://www.slideshare.net/sbaltagi/overview-of-apacheflinkbyslimbaltagi     - See slide
[2] - http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
[3] - http://link.springer.com/chapter/10.1007/978-3-319-19027-3_3
[4] - https://github.com/shelan/collectl-monitoring
[5] - http://karamel.readthedocs.io/en/latest/text/overview.html
[6] - http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/

> GSoC'16: Apache Hama benchmark against Spark and Flink
> ------------------------------------------------------
>                 Key: HAMA-990
>                 URL: https://issues.apache.org/jira/browse/HAMA-990
>             Project: Hama
>          Issue Type: Documentation
>            Reporter: Behroz Sikander
>            Priority: Minor

This message was sent by Atlassian JIRA
