incubator-clerezza-dev mailing list archives

From "Daniel Spicar (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLEREZZA-643) Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serialzer/Parser using Platform encoding instead of UTF-8
Date Wed, 26 Oct 2011 12:35:32 GMT

    [ https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135900#comment-13135900 ]

Daniel Spicar commented on CLEREZZA-643:
----------------------------------------

Thank you for your contribution, Rupert.

I inspected the patch you submitted as well as the current version of the code. An improved RDF-JSON
serializer is something we would really like. In the real-world applications we use Clerezza for,
one major problem is poor performance and/or excessive memory consumption, since we are dealing
with very large graphs there. Your contribution therefore sounds really promising. But as I am not
the author of the original code, I reviewed the original code as well with respect to the above
scenario. My focus was on determining whether the implementations scale to large graphs.

Comments on the Serializer:
- As I understand the RDF-JSON specification, sorting of the output is not required, only grouping
by subject and predicate. Therefore I don't think the more expensive subject-predicate sort
(which you commented out but still included) is necessary. Or am I missing something? Can this
part be safely removed?

- The original (unpatched) code does NOT properly stream the serialization. This is a concern
when the source graph contains many unique subjects/predicates/objects, because all the
generated JSONObjects/JSONArrays are held in memory before being written to the output stream.
This is especially concerning when many BLOBs are stored in the graph.

- The patch does correctly stream the serialization, but it loads the entire source graph
into memory for sorting (the toArray call at line 99). Again, this may easily exceed available
memory. The original code does not load the entire source graph into memory, as it uses filter
(when the underlying graph is backed by a triple store). The iterators returned by filter only
access the data in the graph one triple at a time, upon each call to next().
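To make the memory point concrete, here is a minimal sketch (with a hypothetical stand-in Triple type, not the real Clerezza API) of the streaming shape I have in mind: each subject's JSON object is written to the Writer as soon as that subject is finished, so only one subject's predicate/object map is buffered at a time. It assumes the incoming iterator already delivers triples grouped by subject; escaping and non-literal objects are deliberately elided.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch with a stand-in Triple type (NOT the real Clerezza API).
// Assumes the input iterator is already grouped by subject; JSON string
// escaping and typed/bnode objects are elided for brevity.
public class StreamingRdfJsonSketch {

    static class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    static void serialize(Iterable<Triple> triples, Writer out) throws IOException {
        out.write("{");
        String currentSubject = null;
        // predicate -> objects for the CURRENT subject only
        Map<String, List<String>> preds = new LinkedHashMap<>();
        boolean firstSubject = true;
        for (Triple t : triples) {
            if (currentSubject != null && !currentSubject.equals(t.s)) {
                // subject finished: flush it to the stream and free its buffer
                firstSubject = writeSubject(out, currentSubject, preds, firstSubject);
                preds.clear();
            }
            currentSubject = t.s;
            preds.computeIfAbsent(t.p, k -> new ArrayList<>()).add(t.o);
        }
        if (currentSubject != null) {
            writeSubject(out, currentSubject, preds, firstSubject);
        }
        out.write("}");
    }

    private static boolean writeSubject(Writer out, String subject,
            Map<String, List<String>> preds, boolean first) throws IOException {
        if (!first) out.write(",");
        out.write("\"" + subject + "\":{");
        boolean firstPred = true;
        for (Map.Entry<String, List<String>> e : preds.entrySet()) {
            if (!firstPred) out.write(",");
            firstPred = false;
            out.write("\"" + e.getKey() + "\":[");
            boolean firstObj = true;
            for (String o : e.getValue()) {
                if (!firstObj) out.write(",");
                firstObj = false;
                out.write("{\"type\":\"literal\",\"value\":\"" + o + "\"}");
            }
            out.write("]");
        }
        out.write("}");
        return false; // subsequent subjects need a leading comma
    }
}
```

With such a shape, peak memory is bounded by the largest single subject rather than by the whole graph.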

Conclusion:
I think neither solution can support graphs that exceed memory size. I assume the unpatched
version can deal with slightly larger graphs than your solution, but that is beside the point: we
need a solution that works reliably with graphs larger than available memory. As you mentioned,
an optimal solution would exploit a sorted (or at least grouped) iterator provided by the
underlying TripleCollection. I think that is the approach we need to take to solve this issue
in a scalable manner.
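As a rough illustration of what such a grouped iterator could look like (a hypothetical SpoIndexedStore, not an existing Clerezza interface): a store that maintains a subject-predicate-object ordered index can hand its triples out already grouped, so the serializer never needs to sort or buffer the graph.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch, not an existing Clerezza interface: a store
// backed by an SPO-ordered index can return triples already grouped
// by subject and predicate, so a streaming serializer needs no sort.
public class SpoIndexedStore {

    // subject -> predicate -> objects; nested TreeMaps mimic an SPO index
    private final TreeMap<String, TreeMap<String, List<String>>> spo = new TreeMap<>();

    public void add(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new TreeMap<>())
           .computeIfAbsent(p, k -> new ArrayList<>())
           .add(o);
    }

    // For brevity this materializes a list; a real store would walk
    // the index lazily, with O(1) extra memory.
    public Iterator<String[]> groupedIterator() {
        List<String[]> ordered = new ArrayList<>();
        spo.forEach((s, ps) -> ps.forEach((p, os) ->
                os.forEach(o -> ordered.add(new String[] { s, p, o }))));
        return ordered.iterator();
    }
}
```

Triples added in any order come back grouped by subject, then predicate, which is exactly the order a streaming RDF-JSON serializer wants.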

Now there is the question of whether to accept your patch for Clerezza until we implement
a better solution. I am not sure. Your solution is a significant improvement in terms of
serialization speed, but the original code is easier to quick-fix so that the results are streamed
properly to the output stream (I think exploiting the json-simple streaming interface may do
the trick). So the question seems to be what is more important: a solution that, while possibly
very slow, will not exceed available memory, or one that significantly improves serialization
performance.
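Whichever route we take, the encoding half of the issue has the same small fix on both the serializer and parser side: pass an explicit charset instead of relying on the platform default. A minimal sketch of the serializer side:

```java
import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// The encoding fix: new OutputStreamWriter(out) picks up the platform
// default encoding, so output differs between machines. Passing an
// explicit UTF-8 charset makes the bytes identical everywhere.
public class Utf8Writers {
    public static Writer utf8Writer(OutputStream out) {
        return new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }
}
```

The parser side is symmetric: wrap the InputStream in an InputStreamReader constructed with the same explicit charset.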

My opinion is that, since we have so far lived with a solution that cannot deal with very
large graphs anyway, the speed improvement may be more valuable. However, we need to start
working on a better solution as described above.

I think we should raise this issue on the mailing list for discussion.
                
> Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serialzer/Parser using Platform encoding instead of UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLEREZZA-643
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-643
>             Project: Clerezza
>          Issue Type: Improvement
>            Reporter: Rupert Westenthaler
>            Assignee: Daniel Spicar
>         Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform-specific encodings instead of UTF-8.
> In addition, the serializer suffers from very poor performance on big graphs (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is because of the use of multiple TripleCollection.filter(..) calls, first to filter all predicates for a subject and then all objects for each subject/predicate combination. An attempt to serialize a graph with 50k triples ended in several minutes at 100% CPU.
> With the next comment I will provide a patch with an implementation based on a sorted array of the triples. With this method one can serialize graphs with 100k triples in about 1 sec. This patch also changes the encoding to UTF-8.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
