From: Paolo Castagna
Date: Fri, 16 Sep 2011 10:14:46 +0100
To: jena-dev@incubator.apache.org
Subject: Re: tdbloader3: getting closer... help welcome!

Paolo Castagna wrote:
> TODO:
>
>  - Add MiniMRCluster so that it is easy for developers to run tests with
>    multiple reducers on a laptop.

Done.

>  - Split the first MapReduce job into two: one to produce offset values
>    for each partition, the other to generate data files with correct ids
>    for subsequent jobs.

Done.

>  - Build the node table concatenating output files from the MapReduce
>    jobs above.

Done.

All the changes are in a branch, here:
https://github.com/castagna/tdbloader3/tree/hadoop-0.20.203.0

There is only one final step which is currently not done using MapReduce:
the node2id.dat|idn files (i.e. the B+Tree index to map RDF node hashes of
128 bits to RDF node ids (68 bits)) are built from the nodes.dat file at
the end of all MapReduce jobs.
    Iterator<Pair<Long, ByteBuffer>> iter = objects.all() ;
    while ( iter.hasNext() ) {
        Pair<Long, ByteBuffer> pair = iter.next() ;
        long id = pair.getLeft() ;
        Node node = NodeLib.fetchDecode(id, objects) ;   // decode the RDF node stored at this id
        Hash hash = new Hash(recordFactory.keyLength()) ;
        setHash(hash, node) ;                            // compute the 128-bit hash of the node
        byte k[] = hash.getBytes() ;
        Record record = recordFactory.create(k) ;        // record keyed by the hash
        Bytes.setLong(id, record.getValue(), 0) ;        // value slot holds the node id
        nodeToId.add(record) ;                           // no find() needed: every node is unique
    }

I need to run a few experiments, but this saves a find() to check whether a
record is already in the index: we know the objects file contains only
unique RDF node values.

Indeed, while I was doing this I looked back at tdbloader2 and I think we
could use the BPlusTreeRewriter 'trick' for the node table as well. I cannot
reuse BPlusTreeRewriter as it is, since it has been written for the SPO,
GSPO, etc. indexes, where records have 3 or 4 slots of constant size (64
bits each). In the case of the node table the records have only two slots:
128 bits for the hash and 68 bits for the node id. (A minimal, illustrative
sketch of this two-slot record layout is at the end of this message.)

I am keen to try to improve the first phase of tdbloader2, since I expect it
could further improve performance and scalability (in particular when the
node table indexes no longer fit in RAM).

@Andy, does this idea make sense?

> - Test on a cluster with a large (> 1B) dataset.

Soon...

Paolo
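
A minimal, self-contained sketch of the two-slot record layout described
above: a 16-byte key holding the 128-bit node hash, followed by an 8-byte
value holding the node id (stored in a long, as in the loop above, even
though the ids are described as 68-bit). The class and method names are made
up for this sketch and are not the TDB or tdbloader3 API; MD5 merely stands
in here for the 128-bit node hash.

    import java.nio.ByteBuffer ;
    import java.security.MessageDigest ;

    public class NodeTableRecordSketch {
        static final int KEY_LEN   = 16 ;  // 128-bit hash
        static final int VALUE_LEN = 8 ;   // node id, stored in a long here

        // Pack (hash, id) into the fixed-width record a BPlusTreeRewriter-style
        // bulk build would consume in hash-sorted order.
        static byte[] record(byte[] hash, long id) {
            ByteBuffer bb = ByteBuffer.allocate(KEY_LEN + VALUE_LEN) ;
            bb.put(hash, 0, KEY_LEN) ;
            bb.putLong(id) ;
            return bb.array() ;
        }

        public static void main(String[] args) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5")
                                       .digest("<http://example/s>".getBytes("UTF-8")) ;
            byte[] rec  = record(hash, 42L) ;
            System.out.println("record length = " + rec.length + " bytes") ;  // 24
        }
    }

Streaming records of this shape, sorted by hash, into a leaf-packing
rewriter is the part that would need a variant of BPlusTreeRewriter able to
handle two-slot records of this size.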