clerezza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rupert Westenthaler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLEREZZA-683) Indexed in-memory graph
Date Fri, 07 Jun 2013 10:06:19 GMT

    [ https://issues.apache.org/jira/browse/CLEREZZA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677933#comment-13677933
] 

Rupert Westenthaler commented on CLEREZZA-683:
----------------------------------------------

The current IndexedTripleCollection does not work when used with different Resource implementation.


Here is an example of such a case

(1) create an IndexedMGraph
(2) add some default Triples (new TripleImpl(new UriRef("http://example.com/"), RDFS.label,
new PlainLiteral("test")));
(3) add some Triples from a wrapped Jena Graph

and the IndexedMGraph will no longer work as expected

The reason for that is the Comparator implementation of the IndexedTripleCollection that uses
(1) hashCode and (2) toString for comparation. While this works as long as all the Resources
within the triple collection do share the same implementation it breaks as soon as Resource
implementations that douse different hasCode and/or toString implementations.

As a consequence the assumption to have compatible hasCode / toString implementations is not
valid. This implies that neither hasCode, toString can be used for the Comparator implementation
of the IndexedMGraph.

Because of that the implementation will need to be completely changed to the following algorithm:

    1. URIs are sorted by the UriRef#getUnicodeString()
    2. Literals 
        a. sort by the Literal#getLexicalForm()
        b. sort by PlainLiteral#getLanguage() (<code>null</code> value first)
        c. sort by TypedLiteral#getDataType() (<code>null</code> value fist)
    3. BNode 
        a. sorted by their System#identityHashCode(Object)
        b. on hasCode conflicts (same hasCode but not equals) a random order is chosen and
the decision is kept in the conflictsMap

This algorithm ensures:

    * parsed Resources are not required to correctly implement hasCode and equals
    * parsed UriRef , BNode and Literal instances MUST NOT implement multiple of those interfaces.
This means e.g. that an UriRef MUST NOT also implement BNode nor Literal. If this is the case
sorting based on resource type might get broken.
     * parsed Literals MAY implement both PlainLiteral AND TypedLiteral. This allows wrappers
over frameworks that do not distinguish between those two literal types to be used with the
Indexed in-memory graph. As an example Sesame dose not distinguish between Plain- and TypedLiterals.
Therefore Clerezza implementations will most likely create a single wrapper class that does
implement both the PlainLiteral and TypedLiteral interface.
                
> Indexed in-memory graph
> -----------------------
>
>                 Key: CLEREZZA-683
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-683
>             Project: Clerezza
>          Issue Type: New Feature
>          Components: rdf.core
>            Reporter: Rupert Westenthaler
>
> # Indexed in-memory graph
> Implementation of a TripleCollection that internally manages SPO, POS, OSP indexes for
fast filtered iterators. The current state of development is hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/.
However the intention is that this module becomes direct part of clerezza. 
> ## Background:
> For Apache Stanbol having fast filtered iterators over in-memory graphs is really important,
because Stanbol uses in-memory graph to store extracted metadata for parsed ContentItems.
> When enhancing longer texts with EnhancementChain configurations that produce a lot of
enhancements (e.g. keyword extraction based on dbpedia) such in-memory graphs can get bigger
than 100k triples. Especially if also triples for suggested entities are included within the
result.
> ## Implementation:
> Because of that I started to implement an TripleCollection that used TreeMaps to manage
SPO, POS, OSP indexes. 
> For fast sorting (comparator) I use the same Resource#hashCode Resource#toString based
solution as used in the rdf.rdfjson serializer. I hope this is also sufficient for Literals
(someone should check that).
> The implementation of the "filter(..)" method is purely based on "NavigableSet.subSet(..).iterator()".
I only need to wrap the iterator to ensure that by calls to Iterator.remove():
> 1) Triples are removed from all three indexes
> 2)  GraphEvents are dispatched correctly
> Note also the trick with the two static fields UriRef MIN and UriRef MAX used to generate
lower/upper bound triples as parsed to  "NavigableSet.subSet(..)".
> The implementation is currently hosted on http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/
> It has no dependencies to Apache Stanbol. However users that do not want to check-out
Stanbol as a whole will need to edit the pom.xml file and provide information usually imported
from the parent poms.
> ## Tests:
> This implementation passes all MGraphTest UnitTests.
> In addition I have copied the tests define for SimpleTripleCollection
> To compare the performance I also implemented code that
> * allows to create a random Graph with n Triples
> * create a TestCase with configurable numbers of Subjects, Predicates and Objects
> * performs than m calls to #filter(...)
> This performance test runs also as UnitTest
> 1. by using the SimpleMGraph implementation
> 2. by using the IndexedMGraph implementation
> NOTE: While implementing this I recognized that the SimpleTripleCollectionTest does not
extend MGraphTest and therefore the SimpleTripleCollection class is not checked against the
tests defined by MGraphTest. This might actually an Issue!
> ## Performance
> This is a copy from a run of the above described PerformanceTest
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - Filter Performance
Test (graph size 100000 triples, iterations 1000)
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST
SimpleMGraph with 100000 triples ---
> 10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,P,O] in 8321ms with 2 results
> 18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,P,n] in 7358ms with 734 results
> 25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,n,O] in 7266ms with 100 results
> 31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,P,O] in 6519ms with 232 results
> 39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,n,n] in 7398ms with 8030 results
> 45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,P,n] in 5934ms with 8318000 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,n,O] in 10666ms with 2260 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST
completed in 53463ms
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST
IndexedMGraph 100000 triples ---
> 55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,P,O] in 20ms with 2 results
> 55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,P,n] in 19ms with 734 results
> 55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,n,O] in 33ms with 100 results
> 55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,P,O] in 28ms with 232 results
> 55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[S,n,n] in 21ms with 8030 results
> 57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,P,n] in 1065ms with 8318000 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run
[n,n,O] in 8ms with 2260 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST
completed in 1194ms
> best
> Rupert

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message