Date: Fri, 2 Mar 2012 16:07:59 +0000 (UTC)
From: "Sarven Capadisli (Commented) (JIRA)"
To: jena-dev@incubator.apache.org
Subject: [jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

    [ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221016#comment-13221016 ]

Sarven Capadisli commented on JENA-117:
---------------------------------------

I was wondering if you could dumb these options down for me.
I don't understand how they work exactly:

--compression        Use compression for intermediate files

I've tried this:

$ java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.tar.gz
INFO  Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
ERROR [line: 1, col: 13] Unknown char: (0)
Exception in thread "main" org.openjena.riot.RiotException: [line: 1, col: 13] Unknown char: (0)
	at org.openjena.riot.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:125)
	at org.openjena.riot.lang.LangEngine.raiseException(LangEngine.java:169)
	at org.openjena.riot.lang.LangEngine.nextToken(LangEngine.java:116)
	at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:50)
	at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:34)
	at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:69)
	at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
	at cmd.tdbloader3.exec(tdbloader3.java:233)
	at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
	at cmd.tdbloader3.main(tdbloader3.java:129)

/tmp/indicators.tar.gz contains multiple .nt files.

--buffer-size        The size of buffers for IO in bytes
What's the default for this? How would I determine the optimal value for what I'm trying to import (whether it is a compressed file or a directory with multiple N-Triples files)?

--gzip-outside       GZIP...(Buffered...())
No idea (see the stream-nesting sketch at the end of this message).

--spill-size         The size of spillable segments in tuples|records
--spill-size-auto    Automatically set the size of spillable segments
No idea. Again, how can I determine the optimal value?

--no-stats           Do not generate the stats file
How much does this affect performance?

--no-buffer          Do not use Buffered{Input|Output}Stream
When should I?

--max-merge-files    Specify the maximum number of files to merge at the same time (default: 100)
This is not clear to me.

I've managed to get it going with this:

$ java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/*.nt
INFO  Load: /tmp/countries.nt -- 2012/03/02 10:55:34 EST
INFO  Load: /tmp/incomeLevels.nt -- 2012/03/02 10:55:35 EST
INFO  Load: /tmp/indicators.nt -- 2012/03/02 10:55:35 EST
INFO  Add: 50,000 tuples (Batch: 29,940 / Avg: 29,940)
INFO  Load: /tmp/lendingTypes.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/regions.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/sources.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/topics.nt -- 2012/03/02 10:55:36 EST
INFO  Node Table (1/3): building nodes.dat and sorting hash|id ...
INFO  Add: 50,000 records for node table (1/3) phase (Batch: 24,789 / Avg: 24,789)
INFO  Add: 100,000 records for node table (1/3) phase (Batch: 204,081 / Avg: 44,208)
INFO  Add: 150,000 records for node table (1/3) phase (Batch: 274,725 / Avg: 61,374)
INFO  Total: 166,728 tuples : 2.50 seconds : 66,664.54 tuples/sec [2012/03/02 10:55:39 EST]
INFO  Node Table (2/3): generating input data using node ids...
INFO  Add: 50,000 records for node table (2/3) phase (Batch: 70,721 / Avg: 70,721)
INFO  Total: 55,560 tuples : 0.74 seconds : 75,081.08 tuples/sec [2012/03/02 10:55:39 EST]
INFO  Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
INFO  Total: 26,120 tuples : 0.20 seconds : 129,306.93 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating SPO index...
INFO  Add: 50,000 records to SPO (Batch: 131,233 / Avg: 131,233)
INFO  Total: 55,561 tuples : 0.53 seconds : 105,629.27 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating GSPO index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: sorting data for POS index...
INFO  Add: 50,000 records to POS (Batch: 684,931 / Avg: 684,931)
INFO  Total: 55,561 tuples : 0.08 seconds : 731,065.81 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating POS index...
INFO  Add: 50,000 records to POS (Batch: 200,000 / Avg: 200,000)
INFO  Total: 55,561 tuples : 0.40 seconds : 139,600.50 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for OSP index...
INFO  Add: 50,000 records to OSP (Batch: 2,083,333 / Avg: 2,083,333)
INFO  Total: 55,561 tuples : 0.03 seconds : 1,792,290.38 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating OSP index...
INFO  Add: 50,000 records to OSP (Batch: 181,818 / Avg: 181,818)
INFO  Total: 55,561 tuples : 0.43 seconds : 130,731.76 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for GPOS index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating GPOS index...
INFO  Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for GOSP index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating GOSP index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for POSG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating POSG index...
INFO  Total: 0 tuples : 0.09 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for OSPG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating OSPG index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Index: sorting data for SPOG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Index: creating SPOG index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Total: 55,576 tuples : 7.33 seconds : 7,580.96 tuples/sec [2012/03/02 10:55:42 EST]

However, I face this:

$ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE { ?s ?p ?o . } LIMIT 100'
10:56:30 WARN  ModTDBDataset :: Unexpected: Not a TDB dataset for type DatasetTDB
-------------
| s | p | o |
=============
-------------

One final thing I'd like to know is how to assign graph names. --graph is not available as it was in tdbloader. This is fairly important for me because my dataset is close to 500m triples (I think).

I'd appreciate it if you could help me clarify these issues.

-Sarven
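P.S. Regarding --gzip-outside: I read the help text "GZIP...(Buffered...())" as describing whether the GZIP stream wraps the buffered stream or sits inside it. I don't know which order tdbloader3 actually uses where; the snippet below is only a plain java.io / java.util.zip sketch of the two nesting orders (the class and file names are made up):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNestingSketch {
    public static void main(String[] args) throws Exception {
        // "GZIP outside": GZIP(Buffered(file)) -- every write is compressed straight away,
        // and the compressed bytes are buffered before they reach the file.
        OutputStream gzipOutside = new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream("gzip-outside.gz")));

        // The other order: Buffered(GZIP(file)) -- uncompressed bytes are buffered first,
        // so the compressor sees larger chunks, and its output goes straight to the file.
        OutputStream gzipInside = new BufferedOutputStream(
                new GZIPOutputStream(new FileOutputStream("gzip-inside.gz")));

        byte[] data = "some intermediate data".getBytes("UTF-8");
        gzipOutside.write(data);
        gzipInside.write(data);
        gzipOutside.close();
        gzipInside.close();
    }
}

Which order works better presumably depends on how the loader writes its intermediate files, which is part of what I'm unsure about.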
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> -----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in replacing the UNIX sort over text files with an external sorting pure Java implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that:
> ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are wrappers around DataInputStream|DataOutputStream. TupleComparator is trivial.
> Preliminary results seem promising and show that the Java implementation can be faster than UNIX sort, since it uses smaller binary files (instead of text files) and it compares long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is that we could stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then reading them back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary to establish whether it is worthwhile or not.
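To make the quoted description a bit more concrete: the point about "smaller binary files" and "comparisons of long values rather than strings" amounts to writing each tuple as fixed-width longs via DataOutputStream and comparing those longs directly. The sketch below is not the actual TupleSerializationFactory/TupleComparator code from tdbloader3; the class names Tuple3 and BinaryTupleSortSketch are made up here, and only standard java.io is used:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class BinaryTupleSortSketch {

    // Hypothetical stand-in for a tuple of three node ids (not the real Tuple class).
    static class Tuple3 {
        final long s, p, o;
        Tuple3(long s, long p, long o) { this.s = s; this.p = p; this.o = o; }
    }

    // Fixed-width binary form: 3 x 8 bytes per tuple, instead of the full textual terms.
    static void write(Tuple3 t, DataOutputStream out) throws IOException {
        out.writeLong(t.s); out.writeLong(t.p); out.writeLong(t.o);
    }

    static Tuple3 read(DataInputStream in) throws IOException {
        return new Tuple3(in.readLong(), in.readLong(), in.readLong());
    }

    // Comparison over long values only, no string parsing involved.
    static final Comparator<Tuple3> COMPARATOR = new Comparator<Tuple3>() {
        public int compare(Tuple3 a, Tuple3 b) {
            if (a.s != b.s) return a.s < b.s ? -1 : 1;
            if (a.p != b.p) return a.p < b.p ? -1 : 1;
            if (a.o != b.o) return a.o < b.o ? -1 : 1;
            return 0;
        }
    };

    public static void main(String[] args) throws IOException {
        // Round-trip a couple of tuples through the binary form, then sort them.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        write(new Tuple3(42L, 7L, 99L), out);
        write(new Tuple3(1L, 2L, 3L), out);
        out.close();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        List<Tuple3> tuples = new ArrayList<Tuple3>();
        tuples.add(read(in));
        tuples.add(read(in));
        in.close();

        Collections.sort(tuples, COMPARATOR);
        System.out.println(tuples.get(0).s + " " + tuples.get(0).p + " " + tuples.get(0).o);
    }
}

As I understand the description, an external sort spills sorted runs of this kind of binary record to temporary files and merges them, which is what the SortedDataBag from JENA-99 is said to do.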