flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Accessing RDF triples using Flink
Date Thu, 07 Apr 2016 08:14:31 GMT
Hi Ritesh,
Jena offers an NQuadsInputFormat, which is a Hadoop InputFormat, so you can
read the data efficiently with Flink. Unfortunately, I remember having some
problems using it, so I just export my Jena model as N-Quads and then parse
it efficiently with Flink as a text file. However, parsing with Sesame 4 is
more efficient in terms of speed and garbage collection.

What I do is convert every quad into a Tuple5, group the triples/quads by
subject, and then apply some logic. The quads grouped by subject are what we
call an "entiton atom", and combining them leads to an "entiton molecule"
(i.e. a graph rooted in some entiton atom).
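
The following is a minimal plain-Java sketch of the "parse each quad into five
fields, then group by subject" step described above. It is not the code used
in the project mentioned here. The class name QuadGrouping, the five-field
layout (subject, predicate, object, objectType, graph) and the regular
expression are all assumptions made for illustration; the regex covers only
simple N-Quads lines and is no substitute for a real RDF parser such as
Sesame. In a Flink batch job the same grouping would presumably be a
groupBy(0) on a DataSet of Tuple5 produced from readTextFile.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuadGrouping {

    /** Five-field record standing in for a Flink Tuple5 (field layout is an assumption). */
    static final class Quad {
        final String subject, predicate, object, objectType, graph;

        Quad(String subject, String predicate, String object, String objectType, String graph) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectType = objectType;
            this.graph = graph;
        }
    }

    // Simplified N-Quads pattern: subject, predicate, then either a resource
    // (IRI or blank node) or a quoted literal with an optional ^^<datatype>
    // or @lang suffix, then an optional graph label and the closing dot.
    private static final Pattern QUAD = Pattern.compile(
        "^\\s*(<[^>]*>|_:\\S+)\\s+(<[^>]*>)\\s+"
        + "(?:(<[^>]*>|_:\\S+)|\"((?:[^\"\\\\]|\\\\.)*)\"(\\^\\^<[^>]*>|@[A-Za-z0-9-]+)?)"
        + "\\s*(<[^>]*>|_:\\S+)?\\s*\\.\\s*$");

    /** Returns null for comments, blank lines, and anything the simplified pattern rejects. */
    static Quad parse(String line) {
        Matcher m = QUAD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        boolean literal = m.group(3) == null;
        String object = literal ? m.group(4) : m.group(3);
        String objectType;
        if (!literal) {
            objectType = "resource";            // IRI or blank node
        } else if (m.group(5) == null) {
            objectType = "plain";               // literal without datatype or language tag
        } else {
            objectType = m.group(5);            // "^^<datatype>" or "@lang"
        }
        String graph = m.group(6) == null ? "" : m.group(6);
        return new Quad(m.group(1), m.group(2), object, objectType, graph);
    }

    /** Groups all parseable lines by subject, i.e. one list of quads per subject. */
    static Map<String, List<Quad>> groupBySubject(List<String> lines) {
        return lines.stream()
            .map(QuadGrouping::parse)
            .filter(q -> q != null)
            .collect(Collectors.groupingBy(q -> q.subject, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<http://ex.org/a> <http://ex.org/name> \"Alice\" <http://ex.org/g1> .",
            "<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> <http://ex.org/g1> .",
            "<http://ex.org/b> <http://ex.org/age> \"42\"^^<http://www.w3.org/2001/XMLSchema#integer> <http://ex.org/g2> .",
            "# a comment line, skipped");
        groupBySubject(lines).forEach((subject, quads) ->
            System.out.println(subject + " -> " + quads.size() + " quad(s)"));
    }
}
```

Each per-subject list corresponds to what the message calls an "entiton
atom"; any further logic (such as combining atoms into molecules) would
operate on those lists.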

We presented our work at FlinkForward 2015 in Berlin:
If you need code that reads N-Quads with Flink, I can share some; just write
to me privately!


On Wed, Apr 6, 2016 at 3:57 PM, Ritesh Kumar Singh <
riteshoneinamillion@gmail.com> wrote:

> Hi Flavio,
>    1. How do you access your RDF dataset via Flink? Are you reading it as
>    a normal input file and splitting the records, or do you have some
>    wrappers in place to convert the RDF data into triples? Can you please
>    share some code samples if possible?
>    2. I am using the Jena TDB command line utilities to run queries against
>    the dataset in order to avoid Java garbage collection issues. I am also
>    using the Jena Java APIs as a dependency, but the command line utils are
>    way faster (though this comes with the extra requirement of having the
>    Jena command line utils installed on the system). The main reason for
>    this approach is being able to pass the string output from the command
>    line to Flink as part of my pipeline. Can you tell me your approach to
>    this?
>    3. Should I dump my query output to a file and then consume it as a
>    normal input source for Flink?
> Basically, any help with this would be appreciated.
> Regards,
> Ritesh
> Ritesh Kumar Singh
> https://about.me/riteshoneinamillion
> On Wed, Apr 6, 2016 at 2:45 PM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>> Hi Ritesh,
>> I have some experience with RDF and Flink. What do you mean by accessing
>> a Jena model? How do you create it?
>> In my experience, reading triples from Jena models is evil because it
>> has some problems with garbage collection.
>> On 6 Apr 2016 00:51, "Ritesh Kumar Singh" <riteshoneinamillion@gmail.com>
>> wrote:
>>> Hi,
>>> I need some suggestions regarding accessing RDF triples from Flink. I'm
>>> trying to integrate Flink into a pipeline where the input for Flink comes
>>> from a SPARQL query on a Jena model. After modifying the triples using
>>> Flink, I will perform a SPARQL update using Jena to save my changes.
>>>    - Is there a recommended input format for loading the triples into
>>>    Flink?
>>>    - Will this use case be classified as a Flink streaming job or a
>>>    batch processing job?
>>>    - How will loading of the dataset vary with the input size?
>>>    - Are there any recommended packages/projects for this type of
>>>    project?
>>> Any suggestion will be of great help.
>>> Regards,
>>> Ritesh
>>> https://riteshtoday.wordpress.com/
