From: Dylan Hutchison
Date: Mon, 6 Oct 2014 11:43:47 -0400
Subject: Re: Determining tablets assigned to table splits, and the number of rows in each tablet
To: user@accumulo.apache.org
Reply-To: user@accumulo.apache.org
Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
References: <54300548.2070708@gmail.com> <54309DA0.5090008@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf30363c0b1539d40504c2f4c4

--20cf30363c0b1539d40504c2f4c4
Content-Type: text/plain; charset=UTF-8

Yep, ticket here: ACCUMULO-3206

There is a related movement at ACCUMULO-3005 to make the number of
entries and number of bytes per tablet / tablet server per table
available via a RESTful web server as an extension of the monitor.
With the extra operations you suggest, number of keys in a range and
median key in a range, we would want to keep that at the API level so
that we can introduce authorizations. Sounds great!

Could you lay out a list of all the stats that Accumulo tracks already
so that we know what to implement, either here or on JIRA? This will
form the basis for extending the API.

~Dylan

On Mon, Oct 6, 2014 at 10:31 AM, Adam Fuchs <afuchs@apache.org> wrote:

> A few years ago we hashed out a rough idea of creating a stats API
> that would allow users to ask a variety of questions that leverage
> information that is already present in the system. Those questions
> would include things like:
>  * Estimate of number of keys in a range. This would satisfy the "key
> count per tablet" request, but could also be used for things like
> predicting query result sizes.
>  * Find the median key in a range. This is useful for doing things
> like parallelizing processing by ranges and predicting sizes of
> intersections.
>
> I think these would best be exposed in both the iterator API and as
> client operations. We never got around to building this before, mostly
> due to prioritization with other features. However, it seems to be
> coming up in conversation frequently these days. There are going to be
> a few tricky parts around cell-level security (information leakage)
> and accuracy of estimates. Is somebody working on creating this ticket
> already?
>
> Adam
>
>
> On Sat, Oct 4, 2014 at 9:23 PM, Josh Elser <josh.elser@gmail.com> wrote:
> > I'll re-state it: I'd be happy to work with you to figure out some
> > Java APIs for clients to consume for these kinds of metrics. A JIRA
> > issue is the best way to encapsulate this. Would also love to help
> > you provide a patch for it, too :)
> >
> > The biggest concern (at least for creating an API for entries in a
> > table -- by tablet/tabletserver/otherwise) is going to be that the
> > number of entries is an approximation, not definitive. This is not
> > prohibitive, though, as long as we're clear that it is an
> > approximation and not an exact metric.
> >
> > Dylan Hutchison wrote:
> >>
> >> It should suffice to list the number of entries for a table, tablet
> >> and tablet server. No need to worry about number of unique rows,
> >> number of unique column families, etc. By entry I mean number of
> >> (key,value)s.
> >>
> >> For load balancing, we care about how much physical data is on each
> >> tablet / tablet server. This is directly proportional to the number
> >> of entries, assuming that the key size and value size in bytes do
> >> not differ too drastically. If they do (say for raw documents of
> >> vastly different sizes), the best measure is the /size of the data
> >> in bytes/ for each tablet / tablet server. I didn't suggest it
> >> because it doesn't look like Accumulo tracks it, so it would involve
> >> a lot of new implementation and book-keeping, which could hamper
> >> performance.
> >>
> >> Accumulo does already track the number of entries for tables,
> >> tablets and tablet servers. It's just hard to get to, relying on the
> >> format of the metadata table and accessing the non-public Monitor
> >> classes. Bringing it to the public API just looks like a matter of
> >> reworking the API and letting the client gather the information that
> >> the Monitor already does by connecting to each tablet server. Does
> >> that sound reasonable?
> >>
> >> Regards, Dylan
> >>
> >> On Sat, Oct 4, 2014 at 4:11 PM, David Medinets
> >> <david.medinets@gmail.com> wrote:
> >>
> >>     Adding this functionality into Accumulo's API would reduce its
> >>     efficiency for users that don't need this level of tracking. Let
> >>     ingest procedures take the performance hit. There are
> >>     synchronization issues that degrade performance. Also what
> >>     would be the appropriate level of tracking - at the row,
> >>     column-family, or every level? Whatever answer you give, someone
> >>     else will ask for something different. And then there are the
> >>     aggregation questions. Not to mention the additional storage
> >>     requirements.
> >>
> >>
> >>
> >> --
> >> www.cs.stevens.edu/~dhutchis
>

--
www.cs.stevens.edu/~dhutchis

--20cf30363c0b1539d40504c2f4c4
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--20cf30363c0b1539d40504c2f4c4--
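
As a rough illustration of the two stats operations Adam lists in the thread (an estimate of the number of keys in a range, and the median key in a range), here is a minimal Python sketch. It assumes per-tablet entry counts have already been gathered as (end_row, count) pairs; the tablet boundaries, the counts, and the function names are all hypothetical and are not part of any Accumulo API. The results are tablet-granularity approximations, in line with Josh's point that these numbers are estimates and not exact metrics.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical per-tablet statistics: (end_row, entry_count), sorted by end_row.
# A tablet covers (previous end_row, end_row]; end_row=None marks the last
# tablet, which extends to the end of the table.
TABLETS = [("g", 1200), ("n", 800), ("t", 1500), (None, 500)]


def tablet_index(tablets, row):
    """Return the index of the tablet whose range contains `row`."""
    ends = [end for end, _ in tablets[:-1]]  # last tablet has no end row
    return bisect_left(ends, row)            # end rows are inclusive


def estimate_entries(tablets, start_row, end_row):
    """Upper-bound estimate of the entries in [start_row, end_row].

    Sums the counts of every tablet that overlaps the range, so partially
    covered tablets are counted in full.
    """
    lo = tablet_index(tablets, start_row)
    hi = tablet_index(tablets, end_row)
    return sum(count for _, count in tablets[lo:hi + 1])


def approx_median_split(tablets):
    """Return the end row of the tablet where the running entry count
    first reaches half of the total (None if that is the last tablet)."""
    counts = [count for _, count in tablets]
    total = sum(counts)
    for (end, _), running in zip(tablets, accumulate(counts)):
        if 2 * running >= total:
            return end


if __name__ == "__main__":
    print(estimate_entries(TABLETS, "h", "p"))  # tablets (g, n] and (n, t]
    print(approx_median_split(TABLETS))
```

This sketch ignores cell-level visibility entirely; a real API would have to decide how authorizations limit what such counts reveal, which is the information-leakage concern raised above.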