Date: Mon, 2 Sep 2013 09:12:10 -0500
Subject: accessing accumulo row in mapper setup method?
From: Marc Reichman
To: user@accumulo.apache.org

Hello,

I am running a search job that compares a single piece of query data against potential targets in an Accumulo table, using AccumuloRowInputFormat. In most cases, the query data itself is also in the same Accumulo table.

To date, my client program has pulled the query data from Accumulo using a basic scanner, stored the data into HDFS, and added the file(s) in question to distributed cache. My mapper then pulls the data from distributed cache into a private class member in its setup method and uses it in all of the map calls.

I had a thought that maybe I'm spending too much overhead on the client side doing this, and that my job submission is slow because of all the HDFS I/O and distributed cache handling for arguably small files, in the 100-200k range at most.

Does it seem like a reasonable idea to skip the preparation on the client side, and have the mapper pull the data directly from Accumulo in its setup method instead?

Questions related to this:
1. Does this put a lot of pressure on the tablet server that holds the data, with many mappers hitting it at once during setup for the first wave?
2. Is there any way whatsoever for the mapper to use the existing client connection already being made? Or would I have to do the usual setup with my own ZooKeeper connection, and if so, does that make for a much worse performance impact?
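
For concreteness, the setup-time lookup being asked about has roughly the shape below. This is only a plain-Java sketch of the lifecycle, not real job code: `FakeTable` and `SearchMapper` are hypothetical stand-ins that use no Hadoop or Accumulo classes, the byte comparison in `map` is a toy placeholder for the actual search, and the lookup counter exists only to show that each mapper reads the query row once.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for an Accumulo table: a sorted map of row ID to bytes.
// A real job would read the query row through a scanner instead.
final class FakeTable {
    private final SortedMap<String, byte[]> rows = new TreeMap<>();
    private int lookups = 0; // number of single-row reads served so far

    void put(String rowId, byte[] data) {
        rows.put(rowId, data.clone());
    }

    byte[] lookup(String rowId) {
        lookups++;
        byte[] data = rows.get(rowId);
        if (data == null) {
            throw new IllegalArgumentException("no such row: " + rowId);
        }
        return data.clone();
    }

    int lookupCount() {
        return lookups;
    }
}

// Hypothetical mapper-like class: same setup/map lifecycle as a Hadoop Mapper,
// but it does not extend or use any Hadoop type.
final class SearchMapper {
    private final FakeTable table;
    private final String queryRowId;
    private byte[] queryData; // loaded once in setup(), reused by every map() call

    SearchMapper(FakeTable table, String queryRowId) {
        this.table = table;
        this.queryRowId = queryRowId;
    }

    // One read per mapper instance, instead of HDFS plus distributed cache.
    void setup() {
        queryData = table.lookup(queryRowId);
    }

    // Toy comparison: counts positions where query and target bytes are equal.
    int map(String rowId, byte[] target) {
        if (queryData == null) {
            throw new IllegalStateException("setup() was not called");
        }
        int n = Math.min(queryData.length, target.length);
        int same = 0;
        for (int i = 0; i < n; i++) {
            if (queryData[i] == target[i]) {
                same++;
            }
        }
        return same;
    }
}

final class SetupLookupSketch {
    public static void main(String[] args) {
        FakeTable table = new FakeTable();
        table.put("query", new byte[] {1, 2, 3});
        table.put("target-a", new byte[] {1, 9, 3});

        SearchMapper mapper = new SearchMapper(table, "query");
        mapper.setup();
        System.out.println("score: " + mapper.map("target-a", new byte[] {1, 9, 3}));
        System.out.println("lookups: " + table.lookupCount());
    }
}
```

With this shape, the number of reads against the query row grows with the number of mapper instances rather than with the number of map calls, which is what question 1 is about.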

Thanks,
Marc