Date: Fri, 3 Feb 2012 11:21:53 +0000 (UTC)
From: "Rupert Westenthaler (Created) (JIRA)"
To: clerezza-dev@incubator.apache.org
Message-ID: <1150355025.7050.1328268113825.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (CLEREZZA-683) Indexed in-memory graph

Indexed in-memory graph
-----------------------

Key: CLEREZZA-683
URL:
https://issues.apache.org/jira/browse/CLEREZZA-683
Project: Clerezza
Issue Type: New Feature
Components: rdf.core
Reporter: Rupert Westenthaler

# Indexed in-memory graph

Implementation of a TripleCollection that internally manages SPO, POS and OSP indexes for fast filtered iterators. The current state of development is hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/. However, the intention is for this module to become a direct part of Clerezza.

## Background:

For Apache Stanbol, fast filtered iterators over in-memory graphs are really important, because Stanbol uses in-memory graphs to store the metadata extracted for parsed ContentItems. When enhancing longer texts with EnhancementChain configurations that produce a lot of enhancements (e.g. keyword extraction based on dbpedia), such in-memory graphs can grow beyond 100k triples, especially if triples for suggested entities are also included in the result.

## Implementation:

Because of that I started to implement a TripleCollection that uses TreeMaps to manage SPO, POS and OSP indexes. For fast sorting (the comparator) I use the same Resource#hashCode / Resource#toString based solution as used in the rdf.rdfjson serializer. I hope this is also sufficient for Literals (someone should check that).

The implementation of the "filter(..)" method is purely based on "NavigableSet.subSet(..).iterator()". I only need to wrap the iterator to ensure that on calls to Iterator.remove():

1) Triples are removed from all three indexes
2) GraphEvents are dispatched correctly

Note also the trick with the two static fields UriRef MIN and UriRef MAX, which are used to generate the lower/upper bound triples passed to "NavigableSet.subSet(..)".

The implementation is currently hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/ and has no dependencies on Apache Stanbol.
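To illustrate the "filter(..)" approach described above, here is a minimal, self-contained sketch. It is not the code from the Stanbol repository: plain Strings stand in for Clerezza Resources, the names IndexedGraphSketch, Triple, MIN and MAX are hypothetical, a TreeSet is assumed as the NavigableSet, and GraphEvent dispatch is only marked by a comment.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical stand-in for a Clerezza Triple; Strings replace Resources. */
final class Triple {
    final String s, p, o;
    Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

/** Sketch only: three sorted indexes answering filter(s, p, o) via subSet(..). */
class IndexedGraphSketch {
    // Assumed sentinels: MIN sorts before and MAX after every real value
    // (holds for non-empty strings that do not start with U+FFFF).
    static final String MIN = "";
    static final String MAX = "\uffff";

    private final NavigableSet<Triple> spo = new TreeSet<>(order(t -> t.s, t -> t.p, t -> t.o));
    private final NavigableSet<Triple> pos = new TreeSet<>(order(t -> t.p, t -> t.o, t -> t.s));
    private final NavigableSet<Triple> osp = new TreeSet<>(order(t -> t.o, t -> t.s, t -> t.p));

    private static Comparator<Triple> order(Function<Triple, String> first,
            Function<Triple, String> second, Function<Triple, String> third) {
        return Comparator.comparing(first).thenComparing(second).thenComparing(third);
    }

    boolean add(String s, String p, String o) {
        Triple t = new Triple(s, p, o);
        boolean added = spo.add(t);
        pos.add(t);
        osp.add(t);
        return added;
    }

    int size() { return spo.size(); }

    /** null acts as a wildcard for the corresponding position. */
    Iterator<Triple> filter(String s, String p, String o) {
        // Pick the index in which the given components form a key prefix.
        final NavigableSet<Triple> index;
        if (s != null && (p != null || o == null)) index = spo; // S??, SP?, SPO
        else if (p != null) index = pos;                        // ?P?, ?PO
        else if (o != null) index = osp;                        // ??O, S?O
        else index = spo;                                       // ???: everything
        // Wildcards become MIN in the lower and MAX in the upper bound triple.
        Triple lower = new Triple(s == null ? MIN : s, p == null ? MIN : p, o == null ? MIN : o);
        Triple upper = new Triple(s == null ? MAX : s, p == null ? MAX : p, o == null ? MAX : o);
        final Iterator<Triple> it = index.subSet(lower, true, upper, true).iterator();
        return new Iterator<Triple>() {
            private Triple last;
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Triple next() { return last = it.next(); }
            @Override public void remove() {
                it.remove(); // removes from the index being iterated
                for (NavigableSet<Triple> other : List.of(spo, pos, osp)) {
                    if (other != index) other.remove(last); // keep the other two in sync
                }
                // A real implementation would dispatch a GraphEvent here.
            }
        };
    }
}
```

In this sketch a call such as filter("a", null, "c") is answered from the OSP index with the bounds (a, MIN, c) and (a, MAX, c), so only the matching range of the sorted set is visited rather than the whole graph.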
Note that users who do not want to check out Stanbol as a whole will need to edit the pom.xml file and provide the information normally inherited from the parent poms.

## Tests:

This implementation passes all MGraphTest unit tests. In addition, I have copied the tests defined for SimpleTripleCollection.

To compare the performance I also implemented code that

* creates a random graph with n triples
* creates a test case with configurable numbers of subjects, predicates and objects
* then performs m calls to #filter(...)

This performance test also runs as a unit test:

1. using the SimpleMGraph implementation
2. using the IndexedMGraph implementation

NOTE: While implementing this I noticed that SimpleTripleCollectionTest does not extend MGraphTest, and therefore the SimpleTripleCollection class is not checked against the tests defined by MGraphTest. This might actually be an issue!

## Performance

This is a copy of the output from a run of the performance test described above:

2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - Filter Performance Test (graph size 100000 triples, iterations 1000)
2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST SimpleMGraph with 100000 triples ---
10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 8321ms with 2 results
18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 7358ms with 734 results
25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 7266ms with 100 results
31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 6519ms with 232 results
39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 7398ms with 8030 results
45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 5934ms with 8318000 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 10666ms with 2260 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST completed in 53463ms
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST IndexedMGraph 100000 triples ---
55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 20ms with 2 results
55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 19ms with 734 results
55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 33ms with 100 results
55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 28ms with 232 results
55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 21ms with 8030 results
57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 1065ms with 8318000 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 8ms with 2260 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST completed in 1194ms

best
Rupert