incubator-clerezza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rupert Westenthaler (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CLEREZZA-683) Indexed in-memory graph
Date Fri, 03 Feb 2012 11:21:53 GMT
Indexed in-memory graph
-----------------------

                 Key: CLEREZZA-683
                 URL: https://issues.apache.org/jira/browse/CLEREZZA-683
             Project: Clerezza
          Issue Type: New Feature
          Components: rdf.core
            Reporter: Rupert Westenthaler


# Indexed in-memory graph

Implementation of a TripleCollection that internally manages SPO, POS, OSP indexes for fast
filtered iterators. The current state of development is hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/.
However the intention is that this module becomes direct part of clerezza. 

## Background:

For Apache Stanbol having fast filtered iterators over in-memory graphs is really important,
because Stanbol uses in-memory graph to store extracted metadata for parsed ContentItems.
When enhancing longer texts with EnhancementChain configurations that produce a lot of enhancements
(e.g. keyword extraction based on dbpedia) such in-memory graphs can get bigger than 100k
triples. Especially if also triples for suggested entities are included within the result.

## Implementation:

Because of that I started to implement an TripleCollection that used TreeMaps to manage SPO,
POS, OSP indexes. 

For fast sorting (comparator) I use the same Resource#hashCode Resource#toString based solution
as used in the rdf.rdfjson serializer. I hope this is also sufficient for Literals (someone
should check that).

The implementation of the "filter(..)" method is purely based on "NavigableSet.subSet(..).iterator()".
I only need to wrap the iterator to ensure that by calls to Iterator.remove():

1) Triples are removed from all three indexes
2)  GraphEvents are dispatched correctly

Note also the trick with the two static fields UriRef MIN and UriRef MAX used to generate
lower/upper bound triples as parsed to  "NavigableSet.subSet(..)".

The implementation is currently hosted on http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/

It has no dependencies to Apache Stanbol. However users that do not want to check-out Stanbol
as a whole will need to edit the pom.xml file and provide information usually imported from
the parent poms.

## Tests:

This implementation passes all MGraphTest UnitTests.
In addition I have copied the tests define for SimpleTripleCollection

To compare the performance I also implemented code that

* allows to create a random Graph with n Triples
* create a TestCase with configurable numbers of Subjects, Predicates and Objects
* performs than m calls to #filter(...)

This performance test runs also as UnitTest

1. by using the SimpleMGraph implementation
2. by using the IndexedMGraph implementation

NOTE: While implementing this I recognized that the SimpleTripleCollectionTest does not extend
MGraphTest and therefore the SimpleTripleCollection class is not checked against the tests
defined by MGraphTest. This might actually an Issue!

## Performance

This is a copy from a run of the above described PerformanceTest

2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - Filter Performance
Test (graph size 100000 triples, iterations 1000)
2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST SimpleMGraph
with 100000 triples ---
10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O]
in 8321ms with 2 results
18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n]
in 7358ms with 734 results
25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O]
in 7266ms with 100 results
31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O]
in 6519ms with 232 results
39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n]
in 7398ms with 8030 results
45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n]
in 5934ms with 8318000 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O]
in 10666ms with 2260 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST completed
in 53463ms
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST IndexedMGraph
100000 triples ---
55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O]
in 20ms with 2 results
55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n]
in 19ms with 734 results
55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O]
in 33ms with 100 results
55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O]
in 28ms with 232 results
55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n]
in 21ms with 8030 results
57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n]
in 1065ms with 8318000 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O]
in 8ms with 2260 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  --- TEST completed
in 1194ms

best
Rupert

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message