flink-dev mailing list archives

From Martin Junghanns <m.jungha...@mailbox.org>
Subject Re: neo4j - Flink connector
Date Sat, 31 Oct 2015 16:51:13 GMT

I wanted to give you a little update. I created a non-parallel
InputFormat which reads Cypher results from Neo4j into Tuples [1].
It can be used like the JDBCInputFormat:

String q = "MATCH (p1:Page)-[:Link]->(p2) RETURN id(p1), id(p2)";

Neo4jInputFormat<Tuple2<Integer, Integer>> neoInput = 
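
For context, the core job of such an InputFormat is to turn each Cypher
result row into a Flink Tuple. A minimal plain-Java sketch of that mapping
(the `parseRow` helper and the comma-delimited row format are hypothetical
illustrations, not the connector's actual code):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class RowToTupleSketch {
    // Hypothetical helper: parse one delimited Cypher result row,
    // e.g. "42, 7" for "RETURN id(p1), id(p2)", into an (int, int) pair.
    // The real connector would fill a Tuple2<Integer, Integer> instead.
    static Map.Entry<Integer, Integer> parseRow(String row) {
        String[] fields = row.split(",");
        return new SimpleEntry<>(Integer.parseInt(fields[0].trim()),
                                 Integer.parseInt(fields[1].trim()));
    }

    public static void main(String[] args) {
        Map.Entry<Integer, Integer> edge = parseRow("42, 7");
        System.out.println(edge.getKey() + " -> " + edge.getValue()); // 42 -> 7
    }
}
```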

At the moment, running the tests requires a Neo4j instance to be up and
running. I tried to get neo4j-harness [2] into the project, but there are
some dependency conflicts that I still need to resolve.

I ran a first benchmark on my laptop (4 cores, 12GB RAM, SSD) with Neo4j
running on the same machine. My dataset is the Polish wiki dump [3],
which consists of 430,602 pages and 2,727,302 links. The protocol is:

1) Read edge ids from cold Neo4j into Tuple2<Integer, Integer>
2) Convert Tuple2 into Tuple3<Integer, Integer, Double> for edge value
3) Create Gelly graph from Tuple3, init vertex values to 1.0
4) Run PageRank with beta=0.85 and 5 iterations

This takes about 22 seconds on my machine, which is very promising.
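
For reference, the PageRank update from step 4 can be sketched in plain
Java without Gelly. This is only an illustration of the computation, not
the Gelly implementation: the tiny hard-coded graph is made up, and ranks
are initialized uniformly to 1/n here for simplicity (step 3 above uses
1.0 per vertex), with beta=0.85 and 5 iterations as in the benchmark:

```java
import java.util.Arrays;

public class PageRankSketch {
    // Plain-Java sketch of the PageRank iteration:
    // rank'(v) = (1 - beta)/n + beta * sum over in-edges (u,v) of rank(u)/outDegree(u)
    static double[] pageRank(int n, int[][] edges, double beta, int iterations) {
        int[] outDegree = new int[n];
        for (int[] e : edges) outDegree[e[0]]++;

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);          // uniform initial ranks

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - beta) / n);   // teleport term
            for (int[] e : edges) {
                next[e[1]] += beta * rank[e[0]] / outDegree[e[0]];
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny illustrative graph, not the wiki dump.
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {0, 2}};
        System.out.println(Arrays.toString(pageRank(3, edges, 0.85, 5)));
    }
}
```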

Next steps are:
- OutputFormat
- Better unit tests (integrate neo4j-harness)
- Bigger graphs :)
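
On the OutputFormat: one plausible approach (purely a sketch; the helper
name, `Page`/`Link` labels and literal-list batching are my assumptions)
would be to batch tuples into a single UNWIND-based Cypher statement, so a
whole batch of edges goes back to Neo4j in one round trip instead of one
statement per edge:

```java
import java.util.List;

public class CypherBatchSketch {
    // Hypothetical helper for a future OutputFormat: render a batch of
    // (source, target) id pairs as one UNWIND-based Cypher statement.
    static String edgeBatchStatement(List<int[]> edges) {
        StringBuilder sb = new StringBuilder("UNWIND [");
        for (int i = 0; i < edges.size(); i++) {
            int[] e = edges.get(i);
            if (i > 0) sb.append(", ");
            sb.append("{src: ").append(e[0]).append(", dst: ").append(e[1]).append("}");
        }
        sb.append("] AS row ")
          .append("MATCH (a:Page), (b:Page) WHERE id(a) = row.src AND id(b) = row.dst ")
          .append("CREATE (a)-[:Link]->(b)");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> edges = List.of(new int[]{1, 2}, new int[]{3, 4});
        System.out.println(edgeBatchStatement(edges));
    }
}
```

In a real implementation one would pass the list as a query parameter
rather than splicing literals into the string.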

Any ideas and suggestions are of course highly appreciated :)


[1] https://github.com/s1ck/flink-neo4j
[2] http://neo4j.com/docs/stable/server-unmanaged-extensions-testing.html
[3] plwiktionary-20151002-pages-articles-multistream.xml.bz2

On 29.10.2015 14:51, Vasiliki Kalavri wrote:
> Hello everyone,
>
> Martin, Martin, Alex (cc'ed) and myself have started discussing the
> implementation of a neo4j-Flink connector. I've opened a corresponding JIRA
> (FLINK-2941) containing an initial document [1], but we'd also like to
> share our ideas here to engage the community and get your feedback.
>
> We had a Skype call today and I will try to summarize some of the key
> points here. The main use-cases we see are the following:
> - Use Flink for ETL / creating the graph and then insert it to a graph
> database, like neo4j, for querying and search.
> - Query neo4j on some topic or search the graph for patterns and extract a
> subgraph, on which we'd then like to run some iterative graph analysis
> task. This is where Flink/Gelly can help, by complementing the querying
> (neo4j) with efficient iterative computation.
>
> We all agreed that the main challenge is efficiently getting the data out
> of neo4j and into Flink. There have been some attempts to do similar things
> with neo4j and Spark, but the performance results are not very promising:
> - Mazerunner [2] uses HDFS for communication. We don't think it's worth
> going in this direction, as dumping the neo4j database to HDFS and then
> reading it back into Flink would probably be terribly slow.
> - In [3], you can see Michael Hunger's findings on using neo4j's HTTP
> interface to import data into Spark, run PageRank and then write the data
> back into neo4j. It seems this took > 2h for a 125m-edge graph. The main
> bottlenecks appear to be (1) reading the data as an RDD, which had to be
> done in small batches to avoid OOM errors, and (2) the PageRank
> computation itself, which seems weird to me.
>
> We decided to experiment with neo4j HTTP and Flink and we'll report back
> when we have some results.
> In the meantime, if you have any ideas on how we could speed up reading
> from neo4j, or any suggestions on approaches I haven't mentioned, please
> feel free to reply to this e-mail or add a comment in the shared
> document.
>
> Cheers,
> -Vasia.
> [1]:
> https://docs.google.com/document/d/13qT_e-y8aTNWQnD43jRBq1074Y1LggPNDsic_Obwc28/edit?usp=sharing
> [2]: https://github.com/kbastani/neo4j-mazerunner
> [3]: https://gist.github.com/jexp/0dfad34d49a16000e804
