flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Metzger <rmetz...@apache.org>
Subject Re: neo4j - Flink connector
Date Fri, 06 Nov 2015 15:17:55 GMT
Hi Martin,

what exactly were the issues you were facing with the dependency conflicts?

There is a way around these issues, and its called shading:
https://maven.apache.org/plugins/maven-shade-plugin/
In Flink we have several shaded modules (curator, hadoop) .. we could add a
neo4j-harness-shaded module which relocates conflicting dependencies into a
different namespace. That way, you can execute different versions of the
same library (jetty, scala) at the same time.
Since the module contains module would only contain dependencies needed at
test time, we could exclude it from releases.

Regarding Scala, it would be fine to execute the neo4j tests only with
scala 2.11 builds. Its not hard to control this using maven build profiles.
Do you really need jetty? Is neo4j starting the web interface also for the
tests?

Regards,
Robert


On Fri, Nov 6, 2015 at 4:09 PM, Martin Junghanns <m.junghanns@mailbox.org>
wrote:

> Hi,
>
> I could need your input on testing the input format with Flink.
>
> As I already mentioned, Neo4j offers a dedicated module (neo4j-harness)
> for unit testing server extensions / REST applications. The problem here
> is that the dependencies of Flink conflict with the dependencies of
> neo4j-harness (e.g. jetty, scala-library). I tried to figure out what
> combination could run using the maven exclude mechanism, but no success.
>
> So I thought about workarounds:
>
> (1) Michael Hunger (neo4j) started a project and invited me to
> contribute [1]. What it does during tests is:
> - download a neo4j-<version>.tar.gz into a temp folder
> - extract and start a neo4j instance
> - run tests
> - stop and discard neo4j
>
> I like the concept, but I guess the problem is that it runs outside of
> maven and I guess downloading from external resources (especially in
> travis-ci) could lead to problems.
>
> (2) I had a look into the other input formats. flink-hbase uses examples
> instead of unit tests. This could be an option as long as there is no
> clean solution for "real" unit testing.
>
> What do you think?
>
> Cheers, Martin
>
>
> [1] https://github.com/jexp/neo4j-starter
>
>
> On 03.11.2015 01:18, Stephan Ewen wrote:
> > Wow, very nice results :-)
> >
> > This input format alone is probably a very useful contribution, so I
> would
> > open a contribution there once you manage to get a few tests running.
> >
> > I know little about neo4j, is there a way to read cypher query results in
> > parallel? (most systems do not expose such an interface, but maybe neo4j
> is
> > special there).
> >
> > I recall at some point in time Martin Neumann asking about a way to
> create
> > dense contiguous unique IDs for creating graphs that can be bulk-imported
> > into neo4j. There is code for that in the data set utils, this may be
> > valuable for an output format.
> >
> > On Sat, Oct 31, 2015 at 9:51 AM, Martin Junghanns <
> m.junghanns@mailbox.org>
> > wrote:
> >
> >> Hi,
> >>
> >> I wanted to give you a little update. I created a non-parallel
> >> InputFormat which reads Cypher results from Neo4j into Tuples [1].
> >> It can be used like the JDBCInputFormat:
> >>
> >> String q = "MATCH (p1:Page)-[:Link]->(p2) RETURN id(p1), id(p2)";
> >>
> >> Neo4jInputFormat<Tuple2<Integer, Integer>> neoInput =
> >> Neo4jInputFormat.buildNeo4jInputFormat()
> >>       .setRestURI(restURI)
> >>       .setCypherQuery(q)
> >>       .setUsername("neo4j")
> >>       .setPassword("test")
> >>       .setConnectTimeout(1000)
> >>       .setReadTimeout(1000)
> >>       .finish();
> >>
> >> Atm, to run the tests, a Neo4j instance needs to be up and running.
> >> I tried to get neo4j-harness [2] into the project, but there are some
> >> dependency conflicts which I need to figure out.
> >>
> >> I ran a first benchmark on my Laptop (4 cores, 12GB, SSD) with Neo4j
> >> running on the same machine. My dataset is the polish wiki dump [2]
> >> which consists of 430,602 pages and 2,727,302 links. The protocol is:
> >>
> >> 1) Read edge ids from cold Neo4j into Tuple2<Integer, Integer>
> >> 2) Convert Tuple2 into Tuple3<Integer, Integer, Double> for edge value
> >> 3) Create Gelly graph from Tuple3, init vertex values to 1.0
> >> 4) Run PageRank with beta=0.85 and 5 iterations
> >>
> >> This takes about 22 seconds on my machine which is very promising.
> >>
> >> Next steps are:
> >> - OutputFormat
> >> - Better Unit tests (integrate neo4j-harness)
> >> - Bigger graphs :)
> >>
> >> Any ideas and suggestions are of course highly appreciated :)
> >>
> >> Best,
> >> Martin
> >>
> >>
> >> [1] https://github.com/s1ck/flink-neo4j
> >> [2]
> http://neo4j.com/docs/stable/server-unmanaged-extensions-testing.html
> >> [3] plwiktionary-20151002-pages-articles-multistream.xml.bz2
> >>
> >>
> >>
> >> On 29.10.2015 14:51, Vasiliki Kalavri wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> Martin, Martin, Alex (cc'ed) and myself have started discussing about
> >>> implementing a neo4j-Flink connector. I've opened a corresponding JIRA
> >>> (FLINK-2941) containing an initial document [1], but we'd also like to
> >>> share our ideas here to engage the community and get your feedback.
> >>>
> >>> We've had a skype call today and I will try to summarize some of the
> key
> >>> points here. The main use-cases we see are the following:
> >>>
> >>> - Use Flink for ETL / creating the graph and then insert it to a graph
> >>> database, like neo4j, for querying and search.
> >>> - Query neo4j on some topic or search the graph for patterns and
> extract a
> >>> subgraph, on which we'd then like to run some iterative graph analysis
> >>> task. This is where Flink/Gelly can help, by complementing the querying
> >>> (neo4j) with efficient iterative computation.
> >>>
> >>> We all agreed that the main challenge is efficiently getting the data
> out
> >>> of neo4j and into Flink. There have been some attempts to do similar
> >>> things
> >>> with neo4j and Spark, but the performance results are not very
> promising:
> >>>
> >>> - Mazerunner [2] is using HDFS for communication. We think that's it's
> not
> >>> worth it going towards this direction, as dumping the neo4j database to
> >>> HDFS and then reading it back to Flink would probably be terribly slow.
> >>> - In [3], you can see Michael Hunger's findings on using neo's HTTP
> >>> interface to import data into Spark, run PageRank and then put data
> back
> >>> into neo4j. It seems that this took > 2h for a 125m edge graph. The
> main
> >>> bottlenecks appear to be (1) reading the data as an RDD => this had to
> be
> >>> performed into small batches to avoid OOM errors and (2) PageRank
> >>> computation itself, which seems weird to me.
> >>>
> >>> We decided to experiment with neo4j HTTP and Flink and we'll report
> back
> >>> when we have some results.
> >>>
> >>> In the meantime, if you have any ideas on how we could speed up reading
> >>> from neo4j or any suggestion on approaches that I haven't mentioned,
> >>> please
> >>> feel free to reply to this e-mail or add your comment in the shared
> >>> document.
> >>>
> >>> Cheers,
> >>> -Vasia.
> >>>
> >>> [1]:
> >>>
> >>>
> https://docs.google.com/document/d/13qT_e-y8aTNWQnD43jRBq1074Y1LggPNDsic_Obwc28/edit?usp=sharing
> >>> [2]: https://github.com/kbastani/neo4j-mazerunner
> >>> [3]: https://gist.github.com/jexp/0dfad34d49a16000e804
> >>>
> >>>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message