Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (athena.apache.org: domain of dhutchis@stevens.edu
 designates 74.125.149.18 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAPMpPc5apD=5LLakBznyGZ7WSgUw7SF4PRjwecb+Gc4ticdr9w@mail.gmail.com>
References: 
 <CAPx=JkbY5Gh4kdhcA_xXLEnF9BsuUCxRa=67tbD7i+myUXFeXQ@mail.gmail.com>
 <54300548.2070708@gmail.com>
 <CAL5zq9ZoNR+BKCdPSuV5aHCSo6T=sidhsBPCjujMKP3enFpGig@mail.gmail.com>
 <CAPx=Jkah9XS-ptNn4osX6NQttpj6hxZUDYhG+ai48MP2bNVxdA@mail.gmail.com>
 <CAOiJXP7pLT9OZvZsm6eT=CfqU81Uei+sUOxEwEK0_pCK2PQz7A@mail.gmail.com>
 <CAPx=JkYLSXmwZrrHSCcYvEWA7onsK0Hr2gLWMeg0K6f2tWqOUw@mail.gmail.com>
 <54309DA0.5090008@gmail.com>
 <CAPMpPc5apD=5LLakBznyGZ7WSgUw7SF4PRjwecb+Gc4ticdr9w@mail.gmail.com>
From: Dylan Hutchison <dhutchis@stevens.edu>
Date: Mon, 6 Oct 2014 11:43:47 -0400
Message-ID: 
 <CAPx=JkZEW11JK_ZH0PbJgxO16y+ph0JBiP7qOvKqa8NrGkSnwg@mail.gmail.com>
Subject: Re: Determining tablets assigned to table splits, and the number of
 rows in each tablet
To: user@accumulo.apache.org
Content-Type: multipart/alternative; boundary=20cf30363c0b1539d40504c2f4c4

--20cf30363c0b1539d40504c2f4c4
Content-Type: text/plain; charset=UTF-8

Yep, ticket here: ACCUMULO-3206
<https://issues.apache.org/jira/browse/ACCUMULO-3206>

There is a related effort at ACCUMULO-3005
<https://issues.apache.org/jira/browse/ACCUMULO-3005> to expose the number
of entries and the number of bytes per tablet / tablet server per table
through a RESTful web server, as an extension of the monitor.  The extra
operations you suggest, number of keys in a range and median key in a
range, we would want to keep at the API level so that we can introduce
authorizations.  Sounds great!

Could you lay out a list of all the stats that Accumulo already tracks,
either here or on JIRA, so that we know what to implement?  This will form
the basis for extending the API.
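
To make the two suggested operations concrete, here is a rough sketch in
plain Java of how they might be answered from per-tablet entry counts alone.
Everything in it is an assumption for illustration: TabletStat, StatsSketch,
estimateEntries and approximateMedianRow are hypothetical names, not
existing Accumulo classes or methods, and the sketch ignores authorizations
entirely.  Rows are compared as Java Strings, which is a simplification.

```java
import java.util.Arrays;
import java.util.List;

// All names here are hypothetical; none of this is an existing Accumulo class.
final class TabletStat {
    final String prevEndRow; // exclusive lower bound; null means "first tablet"
    final String endRow;     // inclusive upper bound; null means "last tablet"
    final long entries;      // approximate (key,value) count for the tablet

    TabletStat(String prevEndRow, String endRow, long entries) {
        this.prevEndRow = prevEndRow;
        this.endRow = endRow;
        this.entries = entries;
    }

    // Does the tablet (prevEndRow, endRow] overlap the row range [startRow, stopRow]?
    boolean overlaps(String startRow, String stopRow) {
        boolean beginsBeforeStop = prevEndRow == null || prevEndRow.compareTo(stopRow) < 0;
        boolean endsAtOrAfterStart = endRow == null || endRow.compareTo(startRow) >= 0;
        return beginsBeforeStop && endsAtOrAfterStart;
    }
}

final class StatsSketch {
    // Sums whole overlapping tablets, so partially covered tablets are over-counted.
    static long estimateEntries(List<TabletStat> tablets, String startRow, String stopRow) {
        long total = 0;
        for (TabletStat t : tablets) {
            if (t.overlaps(startRow, stopRow)) {
                total += t.entries;
            }
        }
        return total;
    }

    // End row of the tablet in which the running count first reaches half of
    // the estimate. Tablets must be given in row order. Returns null if no
    // entries overlap the range or if that tablet is the last one (no end row).
    static String approximateMedianRow(List<TabletStat> tablets, String startRow, String stopRow) {
        long total = estimateEntries(tablets, startRow, stopRow);
        long running = 0;
        for (TabletStat t : tablets) {
            if (!t.overlaps(startRow, stopRow)) {
                continue;
            }
            running += t.entries;
            if (total > 0 && 2 * running >= total) {
                return t.endRow;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up split points and counts, purely for illustration.
        List<TabletStat> tablets = Arrays.asList(
            new TabletStat(null, "g", 100),
            new TabletStat("g", "m", 300),
            new TabletStat("m", "t", 200),
            new TabletStat("t", null, 50));
        System.out.println(estimateEntries(tablets, "h", "p"));
        System.out.println(approximateMedianRow(tablets, "h", "p"));
    }
}
```

Both answers are only as fine-grained as the tablet boundaries, which fits
the point made earlier in the thread that entry counts are approximations.
A real version would also have to decide what a user without the right
authorizations is allowed to learn from these numbers.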

~Dylan


On Mon, Oct 6, 2014 at 10:31 AM, Adam Fuchs <afuchs@apache.org> wrote:

> A few years ago we hashed out a rough idea of creating a stats API
> that would allow users to ask a variety of questions that leverage
> information that is already present in the system. Those questions
> would include things like:
>  * Estimate of number of keys in a range. This would satisfy the "key
> count per tablet" request, but could also be used for things like
> predicting query result sizes.
>  * Find the median key in a range. This is useful for doing things
> like parallelizing processing by ranges and predicting sizes of
> intersections.
>
> I think these would best be exposed in both the iterator API and as
> client operations. We never got around to building this before, mostly
> due to prioritization with other features. However, it seems to be
> coming up in conversation frequently these days. There are going to be
> a few tricky parts around cell-level security (information leakage)
> and accuracy of estimates. Is somebody working on creating this ticket
> already?
>
> Adam
>
>
> On Sat, Oct 4, 2014 at 9:23 PM, Josh Elser <josh.elser@gmail.com> wrote:
> > I'll re-state it: I'd be happy to work with you to figure out some Java
> > APIs for clients to consume for these kinds of metrics. A JIRA issue is
> > the best way to encapsulate this. Would also love to help you provide a
> > patch for it, too :)
> >
> > The biggest concern (at least for creating an API for entries in a
> > table -- by tablet/tabletserver/otherwise) is going to be that the
> > number of entries is an approximation, not definitive. This is not
> > prohibitive, though, as long as we're clear that it is an approximation
> > and not an exact metric.
> >
> > Dylan Hutchison wrote:
> >>
> >> It should suffice to list the number of entries for a table, tablet and
> >> tablet server.  No need to worry about number of unique rows, number of
> >> unique column families, etc.  By entry I mean number of (key,value)s.
> >>
> >> For load balancing, we care about how much physical data is on each
> >> tablet / tablet server.  This is directly proportional to the number
> >> of entries, assuming that the key size and value size in bytes do not
> >> differ too drastically.  If they do (say for raw documents of vastly
> >> different sizes), the best measure is the /size of the data in bytes/
> >> for each tablet / tablet server.  I didn't suggest it because it
> >> doesn't look like Accumulo tracks it so it would involve a lot of new
> >> implementation and book-keeping, which could hamper performance.
> >>
> >> Accumulo does already track the number of entries for tables, tablets
> >> and tablet server.  It's just hard to get to, relying on the format of
> >> the metadata table and accessing the non-public Monitor classes.
> >> Bringing it to the public API just looks like a matter of reworking
> >> the API and letting the client gather the information that the Monitor
> >> already does by connecting to each tablet server.  Does that sound
> >> reasonable?
> >>
> >> Regards, Dylan
> >>
> >> On Sat, Oct 4, 2014 at 4:11 PM, David Medinets
> >> <david.medinets@gmail.com> wrote:
> >>
> >>     Adding this functionality into Accumulo's API would reduce its
> >>     efficiency for users that don't need this level of tracking. Let
> >>     ingest procedures take the performance hit. There are
> >>     synchronization issues that degrade performance. Also what
> >>     would be the appropriate level of tracking - at the row,
> >>     column-family, or every level? Whatever answer you give, someone
> >>     else will ask for something different. And then there are the
> >>     aggregation questions. Not to mention the additional storage
> >>     requirements.
> >>
> >>
> >>
> >> --
> >> www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>
>


-- 
www.cs.stevens.edu/~dhutchis

--20cf30363c0b1539d40504c2f4c4--