Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (nike.apache.org: domain of mreichman@pixelforensics.com
 designates 209.85.212.51 as permitted sender)
MIME-Version: 1.0
Date: Mon, 2 Sep 2013 09:12:10 -0500
Message-ID: 
 <CADDp_G_wz-wFd_oeG4Co7eNtuHc0NNRofLSwLRu_AqSQCeWvtQ@mail.gmail.com>
Subject: accessing accumulo row in mapper setup method?
From: Marc Reichman <mreichman@pixelforensics.com>
To: user@accumulo.apache.org
Content-Type: multipart/alternative; boundary=047d7bd75d5c7a7db604e5672830

--047d7bd75d5c7a7db604e5672830
Content-Type: text/plain; charset=ISO-8859-1

Hello,

I am running a MapReduce search job that compares a single piece of query
data against potential targets in an Accumulo table, using
AccumuloRowInputFormat. In most cases, the query data itself is stored in
the same Accumulo table.

To date, my client program has pulled the query data from Accumulo using a
basic scanner, written it to HDFS, and added the resulting file(s) to the
distributed cache. My mapper then loads the data from the distributed cache
into a private class member in its setup method and uses it in every map
call.

I suspect I am spending too much overhead on the client side doing this, and
that my job submission is slow because of all the HDFS I/O and distributed
cache handling for arguably small files, 100-200 KB at most.

Does it seem reasonable to skip the client-side preparation and have the
mapper pull the data directly from Accumulo in its setup method instead?

Questions related to this:
1. Does this put a lot of pressure on the tablet server that hosts the
data, since many mappers would hit it at once during setup for the first
wave?
2. Is there any way for the mapper to reuse the client connection that the
input format is already making? Or would I have to do the usual setup with
my own ZooKeeper connection, and if so, would that hurt performance
significantly?
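
To make the idea concrete, here is a rough sketch of the pattern I have in
mind: each mapper task fetches the query row once in setup and reuses it in
every map call. It is only an illustration. QueryRowSource, CountingSource,
SearchTask and SetupLookupSketch are hypothetical names, not Accumulo or
Hadoop classes; the real lookup would be a scanner restricted to the query
row, and the comparison in map is a placeholder. I left out the Accumulo and
Hadoop dependencies so the sketch runs on its own.

```java
/** Hypothetical stand-in for "read one row from the table"; not an Accumulo API. */
interface QueryRowSource {
    byte[] fetch(String rowId);
}

/** Counts lookups so the load each task puts on the backing store is visible. */
class CountingSource implements QueryRowSource {
    private final byte[] payload;
    int lookups = 0;

    CountingSource(byte[] payload) {
        this.payload = payload;
    }

    @Override
    public byte[] fetch(String rowId) {
        lookups++;
        return payload.clone();
    }
}

/** Mirrors the shape of a mapper (setup once, map many times) without Hadoop. */
class SearchTask {
    private final QueryRowSource source;
    private final String queryRowId;
    private byte[] queryData; // the private member filled in during setup

    SearchTask(QueryRowSource source, String queryRowId) {
        this.source = source;
        this.queryRowId = queryRowId;
    }

    /** One lookup per task, regardless of how many records the task maps. */
    void setup() {
        queryData = source.fetch(queryRowId);
    }

    /** Placeholder comparison: true when target and query have equal length. */
    boolean map(byte[] target) {
        return target.length == queryData.length;
    }
}

class SetupLookupSketch {
    /** Runs the given number of simulated tasks and returns the lookup count. */
    static int lookupsFor(int tasks, int recordsPerTask) {
        // Assumed query size of about 150 KB, inside the range mentioned above.
        CountingSource source = new CountingSource(new byte[150 * 1024]);
        for (int t = 0; t < tasks; t++) {
            SearchTask task = new SearchTask(source, "query-row");
            task.setup();
            for (int r = 0; r < recordsPerTask; r++) {
                task.map(new byte[16]);
            }
        }
        return source.lookups;
    }

    public static void main(String[] args) {
        System.out.println(lookupsFor(40, 1000)); // prints 40
    }
}
```

The point of the counter is that the number of lookups equals the number of
mapper tasks, not the number of records. The sketch runs the tasks one after
another, though, whereas a real first wave would issue those lookups at
roughly the same time against whichever tablet server hosts the query row,
which is what question 1 is about.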

Thanks,
Marc

--047d7bd75d5c7a7db604e5672830--