Date: Mon, 6 Feb 2012 12:53:59 +0000 (UTC)
From: "Daniel Spicar (Commented) (JIRA)"
To: clerezza-dev@incubator.apache.org
Reply-To: clerezza-dev@incubator.apache.org
Message-ID: <1288668436.1871.1328532839180.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1150355025.7050.1328268113825.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CLEREZZA-683) Indexed in-memory graph
Mailing-List: contact clerezza-dev-help@incubator.apache.org; run by ezmlm

    [
https://issues.apache.org/jira/browse/CLEREZZA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201251#comment-13201251 ]

Daniel Spicar commented on CLEREZZA-683:
----------------------------------------

Hi Rupert,

This looks very interesting. I support the idea that this becomes part of Clerezza. I am very busy with other things at the moment, but I intend to check this out soon.

> Indexed in-memory graph
> -----------------------
>
>                 Key: CLEREZZA-683
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-683
>             Project: Clerezza
>          Issue Type: New Feature
>          Components: rdf.core
>            Reporter: Rupert Westenthaler
>
> # Indexed in-memory graph
> Implementation of a TripleCollection that internally manages SPO, POS, OSP indexes for fast filtered iterators. The current state of development is hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/. However, the intention is that this module becomes a direct part of Clerezza.
> ## Background:
> For Apache Stanbol, having fast filtered iterators over in-memory graphs is really important, because Stanbol uses in-memory graphs to store extracted metadata for parsed ContentItems.
> When enhancing longer texts with EnhancementChain configurations that produce a lot of enhancements (e.g. keyword extraction based on dbpedia), such in-memory graphs can grow beyond 100k triples, especially if triples for suggested entities are also included in the result.
> ## Implementation:
> Because of that I started to implement a TripleCollection that uses TreeMaps to manage SPO, POS, OSP indexes.
> For fast sorting (comparator) I use the same Resource#hashCode / Resource#toString based solution as used in the rdf.rdfjson serializer. I hope this is also sufficient for Literals (someone should check that).
> The implementation of the "filter(..)" method is purely based on "NavigableSet.subSet(..).iterator()".
> I only need to wrap the iterator to ensure that on calls to Iterator.remove():
> 1) Triples are removed from all three indexes
> 2) GraphEvents are dispatched correctly
> Note also the trick with the two static fields UriRef MIN and UriRef MAX, used to generate the lower/upper bound triples passed to "NavigableSet.subSet(..)".
> The implementation is currently hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/
> It has no dependencies on Apache Stanbol. However, users who do not want to check out Stanbol as a whole will need to edit the pom.xml file and provide the information usually inherited from the parent poms.
> ## Tests:
> This implementation passes all MGraphTest unit tests.
> In addition I have copied the tests defined for SimpleTripleCollection.
> To compare the performance I also implemented code that
> * allows creating a random Graph with n Triples
> * creates a TestCase with configurable numbers of Subjects, Predicates and Objects
> * then performs m calls to #filter(...)
> This performance test also runs as a unit test
> 1. by using the SimpleMGraph implementation
> 2. by using the IndexedMGraph implementation
> NOTE: While implementing this I noticed that the SimpleTripleCollectionTest does not extend MGraphTest, and therefore the SimpleTripleCollection class is not checked against the tests defined by MGraphTest. This might actually be an issue!
> ## Performance
> This is a copy from a run of the PerformanceTest described above
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - Filter Performance Test (graph size 100000 triples, iterations 1000)
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST SimpleMGraph with 100000 triples ---
> 10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 8321ms with 2 results
> 18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 7358ms with 734 results
> 25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 7266ms with 100 results
> 31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 6519ms with 232 results
> 39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 7398ms with 8030 results
> 45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 5934ms with 8318000 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 10666ms with 2260 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST completed in 53463ms
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST IndexedMGraph 100000 triples ---
> 55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 20ms with 2 results
> 55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 19ms with 734 results
> 55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 33ms with 100 results
> 55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 28ms with 232 results
> 55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 21ms with 8030 results
> 57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 1065ms with 8318000 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 8ms with 2260 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - --- TEST completed in 1194ms
> best
> Rupert

--
This message is automatically generated by JIRA.
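
For readers who want to picture the index layout described in the quoted issue, here is a minimal sketch. It is not the Stanbol code. It assumes plain String terms instead of Clerezza Resource objects, uses hypothetical sentinel strings MIN and MAX in place of the UriRef MIN / UriRef MAX fields, and the names IndexedGraphSketch, Triple and match are invented for illustration. GraphEvent dispatching is left out entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Illustrative sketch only: three sorted indexes over String-based triples. */
class IndexedGraphSketch {
    record Triple(String s, String p, String o) {}

    // Hypothetical sentinels used to build lower/upper bound triples.
    // Assumption: no real term is empty or starts with '\uffff'.
    static final String MIN = "";
    static final String MAX = "\uffff";

    private static final Comparator<Triple> SPO_ORDER =
            Comparator.comparing(Triple::s).thenComparing(Triple::p).thenComparing(Triple::o);
    private static final Comparator<Triple> POS_ORDER =
            Comparator.comparing(Triple::p).thenComparing(Triple::o).thenComparing(Triple::s);
    private static final Comparator<Triple> OSP_ORDER =
            Comparator.comparing(Triple::o).thenComparing(Triple::s).thenComparing(Triple::p);

    private final NavigableSet<Triple> spo = new TreeSet<>(SPO_ORDER);
    private final NavigableSet<Triple> pos = new TreeSet<>(POS_ORDER);
    private final NavigableSet<Triple> osp = new TreeSet<>(OSP_ORDER);

    /** Adds the triple to all three indexes; returns false if it was already present. */
    boolean add(Triple t) {
        if (!spo.add(t)) {
            return false;
        }
        pos.add(t);
        osp.add(t);
        return true;
    }

    int size() {
        return spo.size();
    }

    /** Returns matching triples; a null argument acts as a wildcard. */
    Iterator<Triple> filter(String s, String p, String o) {
        // Pick the index in which the bound terms form a prefix of the sort order.
        NavigableSet<Triple> index;
        if (s != null && (p != null || o == null)) {
            index = spo;            // [S,P,O], [S,P,n], [S,n,n]
        } else if (s != null) {
            index = osp;            // [S,n,O]
        } else if (p != null) {
            index = pos;            // [n,P,O], [n,P,n]
        } else if (o != null) {
            index = osp;            // [n,n,O]
        } else {
            index = spo;            // [n,n,n]: full scan
        }
        // Wildcards become MIN in the lower bound and MAX in the upper bound.
        Triple lower = new Triple(s == null ? MIN : s, p == null ? MIN : p, o == null ? MIN : o);
        Triple upper = new Triple(s == null ? MAX : s, p == null ? MAX : p, o == null ? MAX : o);
        Iterator<Triple> it = index.subSet(lower, true, upper, true).iterator();

        // Wrap the iterator so that remove() keeps all three indexes in sync.
        return new Iterator<>() {
            private Triple last;

            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public Triple next() {
                last = it.next();
                return last;
            }

            @Override
            public void remove() {
                it.remove();        // removes from the index being iterated
                spo.remove(last);   // no-op on the index already updated above
                pos.remove(last);
                osp.remove(last);
            }
        };
    }

    /** Convenience helper: collects the filter result into a list. */
    List<Triple> match(String s, String p, String o) {
        List<Triple> result = new ArrayList<>();
        filter(s, p, o).forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        IndexedGraphSketch g = new IndexedGraphSketch();
        g.add(new Triple("ex:a", "ex:knows", "ex:b"));
        g.add(new Triple("ex:a", "ex:name", "Alice"));
        g.add(new Triple("ex:b", "ex:knows", "ex:c"));
        g.match(null, "ex:knows", null).forEach(System.out::println);
    }
}
```

Each lookup walks only the matching range of one sorted set, which would be consistent with the gap between the SimpleMGraph and IndexedMGraph timings in the log above. The cost is that every triple is stored three times.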