From: Dylan Hutchison
Date: Sat, 4 Oct 2014 17:52:01 -0400
Subject: Re: Determining tablets assigned to table splits, and the number of rows in each tablet
To: user@accumulo.apache.org

It should suffice to list the number of entries for a table, tablet, and tablet server. There is no need to worry about the number of unique rows, the number of unique column families, etc. By "entry" I mean a (key, value) pair.

For load balancing, we care about how much physical data is on each tablet / tablet server. This is directly proportional to the number of entries, assuming that the key size and value size in bytes do not differ too drastically. If they do (say, for raw documents of vastly different sizes), the best measure is the *size of the data in bytes* for each tablet / tablet server. I didn't suggest it because it doesn't look like Accumulo tracks it, so it would involve a lot of new implementation and book-keeping, which could hamper performance.

Accumulo does already track the number of entries for tables, tablets, and tablet servers. The information is just hard to get to: doing so relies on the format of the metadata table and on accessing the non-public Monitor classes. Bringing it to the public API looks like a matter of reworking the API and letting the client gather the information that the Monitor already gathers by connecting to each tablet server. Does that sound reasonable?

Regards,
Dylan

On Sat, Oct 4, 2014 at 4:11 PM, David Medinets <david.medinets@gmail.com> wrote:

> Adding this functionality into Accumulo's API would reduce its efficiency
> for users that don't need this level of tracking. Let ingest procedures
> take the performance hit. There are synchronization issues that degrade
> performance. Also, what would be the appropriate level of tracking: at the
> row, column-family, or every level? Whatever answer you give, someone else
> will ask for something different. And then there are the aggregation
> questions. Not to mention the additional storage requirements.

--
www.cs.stevens.edu/~dhutchis