Date: Mon, 12 Mar 2012 08:29:39 +0000 (UTC)
From: "Paolo Castagna (Closed) (JIRA)"
To: jena-dev@incubator.apache.org
Reply-To: jena-dev@incubator.apache.org
Message-ID: <1838715850.1359.1331540979172.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1540463998.42005.1316438349494.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Closed] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

     [ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paolo Castagna closed JENA-117.
-------------------------------

> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>             Fix For: TDB 0.9.1
>
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> tdbloader2 could probably be made significantly faster by replacing the UNIX sort over text files with a pure Java external sort.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream and TupleOutputStream, which are wrappers around DataInputStream and DataOutputStream. TupleComparator is trivial.
> Preliminary results seem promising and show that the Java implementation can be faster than UNIX sort, since it uses smaller binary files (instead of text files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort is available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of sorting in Java rather than with UNIX sort is that we could stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then reading them back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary to establish whether it is worthwhile.
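
The idea described in the issue (spill sorted runs of long tuples to binary files once a count threshold is reached, then merge the runs and stream the result to a consumer) can be sketched with the JDK alone. This is an illustrative sketch only, not Jena's SortedDataBag and not the code in the attached patch: the class name ExternalTupleSort, the fixed tuple WIDTH of 3, the threshold parameter and the sink callback are all assumptions made for the example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

/** Illustrative external merge sort over fixed-width tuples of longs (hypothetical, JDK only). */
final class ExternalTupleSort {
    static final int WIDTH = 3; // assumption: three node ids per tuple

    /** Compares tuples slot by slot as long values, never as strings. */
    static final Comparator<long[]> CMP = (a, b) -> {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    /** Buffers up to `threshold` tuples, spills each full buffer as a sorted run, then merges into `sink`. */
    public static void sort(Iterable<long[]> input, int threshold, Consumer<long[]> sink) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<long[]> buffer = new ArrayList<>();
        for (long[] t : input) {
            buffer.add(t);
            if (buffer.size() >= threshold) { runs.add(spill(buffer)); buffer.clear(); }
        }
        if (!buffer.isEmpty()) runs.add(spill(buffer));
        merge(runs, sink);
    }

    /** Sorts the buffer in memory and writes it as raw 8-byte longs to a temporary file. */
    static Path spill(List<long[]> buffer) throws IOException {
        buffer.sort(CMP);
        Path file = Files.createTempFile("run", ".bin");
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (long[] t : buffer) for (long v : t) out.writeLong(v);
        }
        return file;
    }

    /** Reads one tuple, or returns null when the run is exhausted. */
    static long[] read(DataInputStream in) throws IOException {
        long[] t = new long[WIDTH];
        try { t[0] = in.readLong(); } catch (EOFException e) { return null; }
        for (int i = 1; i < WIDTH; i++) t[i] = in.readLong();
        return t;
    }

    /** K-way merge: the heap holds the current head tuple of every run, tagged with its run index. */
    static void merge(List<Path> runs, Consumer<long[]> sink) throws IOException {
        List<DataInputStream> ins = new ArrayList<>();
        PriorityQueue<Map.Entry<long[], Integer>> heap =
            new PriorityQueue<>((x, y) -> CMP.compare(x.getKey(), y.getKey()));
        try {
            for (Path p : runs) {
                DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
                ins.add(in);
                long[] head = read(in);
                if (head != null) heap.add(new AbstractMap.SimpleEntry<>(head, ins.size() - 1));
            }
            while (!heap.isEmpty()) {
                Map.Entry<long[], Integer> e = heap.poll();
                sink.accept(e.getKey()); // streamed out; nothing is written back to disk
                long[] next = read(ins.get(e.getValue()));
                if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, e.getValue()));
            }
        } finally {
            for (DataInputStream in : ins) in.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }

    public static void main(String[] args) throws IOException {
        List<long[]> input = Arrays.asList(
            new long[]{3, 1, 2}, new long[]{1, 9, 9}, new long[]{2, 5, 0},
            new long[]{1, 2, 3}, new long[]{3, 1, 1});
        // A tiny threshold forces several runs so the merge path is exercised.
        sort(input, 2, t -> System.out.println(Arrays.toString(t)));
    }
}
```

Passing the merged tuples to a callback rather than writing a final sorted file mirrors the streaming advantage mentioned in the issue: a consumer such as an index builder could take the tuples in sorted order without an extra round trip through disk. Whether that is a measurable gain in practice is, as the issue says, still to be tested.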