Message-ID: <4F55C039.7000705@googlemail.com>
Date: Tue, 06 Mar 2012 07:43:53 +0000
From: Paolo Castagna
To: jena-dev@incubator.apache.org
Subject: Re: Strategies for loading large (>500m triples) datasets
In-Reply-To: <4F54663F.5060306@googlemail.com>

Hi

Paolo Castagna wrote:
> Paolo Castagna wrote:
>> I have some code to convert Freebase dumps in RDF, it's ~600 million
>> triples, I'll use that to gather some numbers. Ideally, comparing
>> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in terms of
>> time and costs).
>
> FYI
>
> Code to convert Freebase dumps in RDF is here:
> https://github.com/castagna/freebase2rdf
>
> I have been using Amazon EC2 instances to run a few experiments during
> the last couple of days with m1.xlarge instances (i.e. 15 GB memory).
>
> tdbloader didn't complete, it was just getting slower and slower...
>
>
> With tdbloader2 I had a java.lang.OutOfMemoryError:
>
> Mar 5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
> Mar 5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at java.util.HashMap.<init>(HashMap.java:209)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
> Mar 5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
>
> I'll try giving the JVM more RAM.

I tried with -Xmx2048m, but I had the same problem.
I'll try with -Xmx4096m.

> tdbloader3 ran out of disk space (because it is writing temporary files
> in /tmp and the available instance disk space is mounted on /mnt :-()
> I'll see how to change/fix this and re-run.

This time it ran almost to completion, but I was using the
--spill-size-auto policy, which clearly needs improvement.

...
Mar 6 04:28:11 ip-10-54-171-206 build: INFO Add: 77,550,000 records to POS (Batch: 605 / Avg: 144,190)
Mar 6 04:29:15 ip-10-54-171-206 build: INFO Add: 77,600,000 records to POS (Batch: 777 / Avg: 128,869)
Mar 6 04:30:20 ip-10-54-171-206 build: INFO Add: 77,650,000 records to POS (Batch: 776 / Avg: 116,492)
Mar 6 04:47:11 ip-10-54-171-206 build: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Mar 6 04:47:11 ip-10-54-171-206 build: #011at java.lang.Long.valueOf(Long.java:557)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3$2.convert(tdbloader3.java:367)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3$2.convert(tdbloader3.java:363)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at org.openjena.atlas.iterator.Iter$4.next(Iter.java:293)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at org.openjena.atlas.data.AbstractDataBag.addAll(AbstractDataBag.java:76)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.createBPlusTreeIndex(tdbloader3.java:378)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.exec(tdbloader3.java:252)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar 6 04:47:11 ip-10-54-171-206 build: #011at cmd.tdbloader3.main(tdbloader3.java:129)

I'll try with a fixed --spill-size 10000000.

Paolo

>
> Paolo
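
P.S. To make the "fixed spill size" idea concrete, here is a minimal
sketch of a buffer that keeps at most a fixed number of records in
memory and writes each full batch, sorted, to a run file in a directory
the caller chooses (so it need not be /tmp). This is NOT the tdbloader3
code: the class SpillingBuffer, its methods and the run-file format are
all hypothetical, and the final merge of the run files is left out.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: in-memory buffer that spills sorted runs to disk. */
final class SpillingBuffer {
    private final int spillSize;   // fixed threshold: max records held in memory
    private final Path spillDir;   // where run files are written (caller's choice)
    private final List<Long> buffer = new ArrayList<>();
    private final List<Path> runs = new ArrayList<>();

    SpillingBuffer(int spillSize, Path spillDir) {
        if (spillSize <= 0) {
            throw new IllegalArgumentException("spillSize must be positive");
        }
        this.spillSize = spillSize;
        this.spillDir = spillDir;
    }

    /** Adds one record; spills as soon as the fixed threshold is reached. */
    void add(long record) throws IOException {
        buffer.add(record);
        if (buffer.size() >= spillSize) {
            spill();
        }
    }

    /** Sorts the buffered records, writes them one per line, clears the buffer. */
    private void spill() throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile(spillDir, "run-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (long v : buffer) {
                w.write(Long.toString(v));
                w.newLine();
            }
        }
        runs.add(run);
        buffer.clear();
    }

    int inMemory() { return buffer.size(); }

    List<Path> runFiles() { return Collections.unmodifiableList(runs); }

    public static void main(String[] args) throws IOException {
        // Demo only: a throwaway directory and a tiny threshold of 3 records.
        Path dir = Files.createTempDirectory("spill-demo");
        SpillingBuffer b = new SpillingBuffer(3, dir);
        for (long v : new long[] {5, 1, 4, 2, 3, 9, 7}) {
            b.add(v);
        }
        System.out.println("runs on disk: " + b.runFiles().size()
                + ", in memory: " + b.inMemory());
    }
}
```

With a fixed threshold the memory used by the buffer is bounded by the
record count rather than by a guess about free heap, which is the
reason for trying a fixed value instead of the automatic policy.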