Message-ID: <4F58A43D.5080709@googlemail.com>
Date: Thu, 08 Mar 2012 12:21:17 +0000
From: Paolo Castagna
To: jena-dev@incubator.apache.org
Subject: Re: Strategies for loading large (>500m triples) datasets

FYI

Paolo Castagna wrote:
> This time, UNIX sort filled /tmp...
> I'll try specifying --temporary-directory=DIR or, better, the $TMPDIR
> env variable (this way there is no need to change the tdbloader2 script).

This time, I was able to load the Freebase data dump (converted into RDF)
using tdbloader2 (which is included in TDB).

This is how I ran tdbloader2 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

    export JVM_ARGS="-Xmx4096m -server"
    export TMPDIR=/mnt/data/tmp
    tdbloader2 --loc /mnt/data/freebase /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples: ~12 hours
(i.e. ~14,000 triples/s overall speed)

This is the log:

Mar 7 13:11:37 ip-10-54-167-166 build: 13:11:37 -- TDB Bulk Loader Start
Mar 7 13:11:37 ip-10-54-167-166 build: 13:11:37 Data phase
Mar 7 13:11:39 ip-10-54-167-166 build: Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/07 13:11:38 UTC
Mar 7 13:11:42 ip-10-54-167-166 build: Add: 50,000 Data (Batch: 16,550 / Avg: 16,550)
Mar 7 13:11:43 ip-10-54-167-166 build: Add: 100,000 Data (Batch: 39,184 / Avg: 23,272)
[...]
Mar 7 19:13:51 ip-10-54-167-166 build: Add: 618,450,000 Data (Batch: 53,078 / Avg: 28,457)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session opened for user root by (uid=0)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7727]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session closed for user root
Mar 7 19:24:44 ip-10-54-167-166 build: Total: 618,465,279 tuples : 22,385.15 seconds : 27,628.37 tuples/sec [2012/03/07 19:24:44 UTC]
Mar 7 19:24:45 ip-10-54-167-166 build: 19:24:44 Index phase
Mar 7 19:24:45 ip-10-54-167-166 build: 19:24:45 Index SPO
Mar 7 21:03:18 ip-10-54-167-166 build: 21:03:18 Build SPO
Mar 7 21:14:24 ip-10-54-167-166 build: 21:14:24 Index POS
Mar 7 23:38:28 ip-10-54-167-166 build: 23:38:28 Build POS
Mar 7 23:49:03 ip-10-54-167-166 build: 23:49:03 Index OSP
Mar 8 00:56:13 ip-10-54-167-166 build: 00:56:13 Build OSP
Mar 8 01:08:17 ip-10-54-167-166 build: 01:08:17 Index phase end
Mar 8 01:08:59 ip-10-54-167-166 build: 01:08:59 -- TDB Bulk Loader Finish
Mar 8 01:08:59 ip-10-54-167-166 build: 01:08:59 -- 43000 seconds

Paolo
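
The per-step timings can be read off the timestamps in the log. What follows is a minimal sketch in POSIX shell, not part of tdbloader2: the helper names `to_secs` and `elapsed` are illustrative. It assumes that each "Index"/"Build" line marks the start of that step and that the run crosses midnight at most once.

```shell
#!/bin/sh
# Convert an HH:MM:SS clock time into seconds since midnight.
to_secs() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so values such as "08" are not read as octal.
  h=${h#0}; m=${m#0}; s=${s#0}
  echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

# Elapsed seconds between two clock times (at most one midnight rollover).
elapsed() {
  start=$(to_secs "$1")
  end=$(to_secs "$2")
  if [ "$end" -lt "$start" ]; then
    end=$(( end + 86400 ))
  fi
  echo $(( end - start ))
}

# Timestamps copied from the log in this message.
echo "Data phase: $(elapsed 13:11:37 19:24:44) s"   # 22387
echo "Index SPO:  $(elapsed 19:24:45 21:03:18) s"   # 5913
echo "Build SPO:  $(elapsed 21:03:18 21:14:24) s"   # 666
echo "Index POS:  $(elapsed 21:14:24 23:38:28) s"   # 8644
echo "Build POS:  $(elapsed 23:38:28 23:49:03) s"   # 635
echo "Index OSP:  $(elapsed 23:49:03 00:56:13) s"   # 4030
echo "Build OSP:  $(elapsed 00:56:13 01:08:17) s"   # 724

# Overall rate from the totals the loader reports (integer division).
echo "Overall:    $(( 618465279 / 43000 )) triples/s"   # 14382
```

On these timestamps the three "Index" steps take about 18,600 seconds in total, against about 2,000 seconds for the three "Build" steps, so the index phase as a whole (about 5.7 hours) is only slightly shorter than the data phase (about 6.2 hours).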