flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Martin Junghanns <m.jungha...@mailbox.org>
Subject Re: neo4j - Flink connector
Date Fri, 06 Nov 2015 15:09:46 GMT
Hi,

I could need your input on testing the input format with Flink.

As I already mentioned, Neo4j offers a dedicated module (neo4j-harness)
for unit testing server extensions / REST applications. The problem here
is that the dependencies of Flink conflict with the dependencies of
neo4j-harness (e.g. jetty, scala-library). I tried to figure out what
combination could run using the maven exclude mechanism, but no success.

So I thought about workarounds:

(1) Michael Hunger (neo4j) started a project and invited me to
contribute [1]. What it does during tests is:
- download a neo4j-<version>.tar.gz into a temp folder
- extract and start a neo4j instance
- run tests
- stop and discard neo4j

I like the concept, but I guess the problem is that it runs outside of
maven and I guess downloading from external resources (especially in
travis-ci) could lead to problems.

(2) I had a look into the other input formats. flink-hbase uses examples
instead of unit tests. This could be an option as long as there is no
clean solution for "real" unit testing.

What do you think?

Cheers, Martin


[1] https://github.com/jexp/neo4j-starter


On 03.11.2015 01:18, Stephan Ewen wrote:
> Wow, very nice results :-)
> 
> This input format alone is probably a very useful contribution, so I would
> open a contribution there once you manage to get a few tests running.
> 
> I know little about neo4j, is there a way to read cypher query results in
> parallel? (most systems do not expose such an interface, but maybe neo4j is
> special there).
> 
> I recall at some point in time Martin Neumann asking about a way to create
> dense contiguous unique IDs for creating graphs that can be bulk-imported
> into neo4j. There is code for that in the data set utils, this may be
> valuable for an output format.
> 
> On Sat, Oct 31, 2015 at 9:51 AM, Martin Junghanns <m.junghanns@mailbox.org>
> wrote:
> 
>> Hi,
>>
>> I wanted to give you a little update. I created a non-parallel
>> InputFormat which reads Cypher results from Neo4j into Tuples [1].
>> It can be used like the JDBCInputFormat:
>>
>> String q = "MATCH (p1:Page)-[:Link]->(p2) RETURN id(p1), id(p2)";
>>
>> Neo4jInputFormat<Tuple2<Integer, Integer>> neoInput =
>> Neo4jInputFormat.buildNeo4jInputFormat()
>>       .setRestURI(restURI)
>>       .setCypherQuery(q)
>>       .setUsername("neo4j")
>>       .setPassword("test")
>>       .setConnectTimeout(1000)
>>       .setReadTimeout(1000)
>>       .finish();
>>
>> Atm, to run the tests, a Neo4j instance needs to be up and running.
>> I tried to get neo4j-harness [2] into the project, but there are some
>> dependency conflicts which I need to figure out.
>>
>> I ran a first benchmark on my Laptop (4 cores, 12GB, SSD) with Neo4j
>> running on the same machine. My dataset is the polish wiki dump [2]
>> which consists of 430,602 pages and 2,727,302 links. The protocol is:
>>
>> 1) Read edge ids from cold Neo4j into Tuple2<Integer, Integer>
>> 2) Convert Tuple2 into Tuple3<Integer, Integer, Double> for edge value
>> 3) Create Gelly graph from Tuple3, init vertex values to 1.0
>> 4) Run PageRank with beta=0.85 and 5 iterations
>>
>> This takes about 22 seconds on my machine which is very promising.
>>
>> Next steps are:
>> - OutputFormat
>> - Better Unit tests (integrate neo4j-harness)
>> - Bigger graphs :)
>>
>> Any ideas and suggestions are of course highly appreciated :)
>>
>> Best,
>> Martin
>>
>>
>> [1] https://github.com/s1ck/flink-neo4j
>> [2] http://neo4j.com/docs/stable/server-unmanaged-extensions-testing.html
>> [3] plwiktionary-20151002-pages-articles-multistream.xml.bz2
>>
>>
>>
>> On 29.10.2015 14:51, Vasiliki Kalavri wrote:
>>
>>> Hello everyone,
>>>
>>> Martin, Martin, Alex (cc'ed) and myself have started discussing about
>>> implementing a neo4j-Flink connector. I've opened a corresponding JIRA
>>> (FLINK-2941) containing an initial document [1], but we'd also like to
>>> share our ideas here to engage the community and get your feedback.
>>>
>>> We've had a skype call today and I will try to summarize some of the key
>>> points here. The main use-cases we see are the following:
>>>
>>> - Use Flink for ETL / creating the graph and then insert it to a graph
>>> database, like neo4j, for querying and search.
>>> - Query neo4j on some topic or search the graph for patterns and extract a
>>> subgraph, on which we'd then like to run some iterative graph analysis
>>> task. This is where Flink/Gelly can help, by complementing the querying
>>> (neo4j) with efficient iterative computation.
>>>
>>> We all agreed that the main challenge is efficiently getting the data out
>>> of neo4j and into Flink. There have been some attempts to do similar
>>> things
>>> with neo4j and Spark, but the performance results are not very promising:
>>>
>>> - Mazerunner [2] is using HDFS for communication. We think that's it's not
>>> worth it going towards this direction, as dumping the neo4j database to
>>> HDFS and then reading it back to Flink would probably be terribly slow.
>>> - In [3], you can see Michael Hunger's findings on using neo's HTTP
>>> interface to import data into Spark, run PageRank and then put data back
>>> into neo4j. It seems that this took > 2h for a 125m edge graph. The main
>>> bottlenecks appear to be (1) reading the data as an RDD => this had to be
>>> performed into small batches to avoid OOM errors and (2) PageRank
>>> computation itself, which seems weird to me.
>>>
>>> We decided to experiment with neo4j HTTP and Flink and we'll report back
>>> when we have some results.
>>>
>>> In the meantime, if you have any ideas on how we could speed up reading
>>> from neo4j or any suggestion on approaches that I haven't mentioned,
>>> please
>>> feel free to reply to this e-mail or add your comment in the shared
>>> document.
>>>
>>> Cheers,
>>> -Vasia.
>>>
>>> [1]:
>>>
>>> https://docs.google.com/document/d/13qT_e-y8aTNWQnD43jRBq1074Y1LggPNDsic_Obwc28/edit?usp=sharing
>>> [2]: https://github.com/kbastani/neo4j-mazerunner
>>> [3]: https://gist.github.com/jexp/0dfad34d49a16000e804
>>>
>>>
> 

Mime
View raw message