accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Determining tablets assigned to table splits, and the number of rows in each tablet
Date Sat, 04 Oct 2014 20:11:24 GMT
Adding this functionality to Accumulo's API would reduce its efficiency
for users who don't need this level of tracking. Let ingest procedures
take the performance hit. There are synchronization issues that
degrade performance. Also, what would be the appropriate level of tracking:
at the row, the column family, or every level? Whatever answer you give,
someone else will ask for something different. And then there are the
aggregation questions, not to mention the additional storage requirements.
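
To illustrate what tracking on the ingest side could look like, here is a
minimal sketch. It assumes the ingest client already knows the table's split
points and that row IDs are ASCII strings, so Java string order agrees with
Accumulo's byte order. IngestEntryCounter and its methods are hypothetical
names, not part of the Accumulo API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical client-side helper; nothing here calls Accumulo.
final class IngestEntryCounter {
    // Sorted split points. A tablet covers (previous split, split],
    // following Accumulo's convention that a tablet's end row is inclusive.
    private final TreeSet<String> splits = new TreeSet<>();
    // Entry counts keyed by the end row of the tablet the entry falls in.
    private final Map<String, Long> countsByEndRow = new TreeMap<>();
    // Entries past the last split land in the final tablet, which has no end row.
    private long lastTabletCount = 0;

    IngestEntryCounter(Iterable<String> splitPoints) {
        for (String s : splitPoints) {
            splits.add(s);
        }
    }

    // Call once for every entry the ingest code writes.
    void record(String row) {
        String endRow = splits.ceiling(row); // smallest split >= row, or null
        if (endRow == null) {
            lastTabletCount++;
        } else {
            countsByEndRow.merge(endRow, 1L, Long::sum);
        }
    }

    long count(String endRow) {
        return countsByEndRow.getOrDefault(endRow, 0L);
    }

    long countLastTablet() {
        return lastTabletCount;
    }
}
```

This version is not thread-safe, and the counts go stale as soon as a tablet
splits or entries are deleted, which is the kind of synchronization cost
mentioned above.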

On Sat, Oct 4, 2014 at 3:57 PM, Dylan Hutchison <dhutchis@stevens.edu>
wrote:

> David, thanks for the pointer to the articles.  I read them a few months
> ago but forgot.  Will need to read the HyperLogLog paper
> <https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf>.
>
> *The number of unique rows within a tablet are not explicitly tracked.*
>
>
> Right Josh, I misspoke.  For load balancing, we're interested in the *number
> of entries in each tablet*, not the number of unique rows.  Only counting
> the number of unique rows doesn't distinguish between really big rows and
> singleton rows, and as David pointed out, we need client-controlled means
> of doing unique row counting/estimation.
>
> We can see the number of entries in a table, and the number of entries a
> table has on a particular tablet server, because these are listed in the
> monitor.
> [image: Inline image 2]
>
> David, you may recognize the name of this tablet server.  Just got Accumulo
> Vagrant <https://github.com/medined/Accumulo_1_5_0_By_Vagrant> working
> last week, thanks ;)
>
> [image: Inline image 1]
>
> However, there could be multiple Tablets assigned to the same Tablet
> Server.  Here is an outline of the procedure I followed to read the
> *TabletStats.numEntries*
> <https://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html#numEntries>
> for the correct Tablet that holds a split range.
>
> Given a table name:
>
>    - Get a list of all tablet servers by connecting to the Master and
>      referencing the MasterMonitorInfo
>      <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/master/thrift/MasterClientService.Client.html#getMasterStats(org.apache.accumulo.trace.thrift.TInfo,%20org.apache.accumulo.core.security.thrift.TCredentials)>
>    - Get the internal table ID via Tables.getNameToIdMap
>      <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/impl/Tables.html#getNameToIdMap(org.apache.accumulo.core.client.Instance)>
>    - Connect to each tablet server and get the TabletStats
>      <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html>
>      of the tablets on that tablet server under the given internal table ID.
>    - Scan the Metadata table, starting at the {tableName converted to
>      internal table ID} and ending at {internal table ID}’<’ (the last
>      entry for this table in the metadata table).
>       - Example row: 1<  (if the internal table ID is 1 and this is the
>         last tablet of the table)
>    - Look at the column holding the previous end row:  ~tab:~pr
>       - Example row-col-val:   1< ~tab:~pr []    \x00
>       - (this table has no splits: no end row and no previous end row)
>    - Create an extent for the value using KeyExtent
>      <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/data/KeyExtent.html>
>       - (a shortcut for parsing the metadata table and getting the
>         previous and current end row)
>    - Among the list of TabletStats, find the one whose previous end row
>      and end row match the result from the Metadata table.
>
> Take that tabletStat.numEntries
> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html#numEntries>
> to get the number of entries in this table split range.
>
> Later this information is combined into a method that returns an array of
> triples
>
> (tablet_split_range, tablet_num_entries,
> tablet_server_list_for_this_tablet)
>
>
> I recommend adding the ability to get the number of entries for tables,
> tablet servers and tablets to the public API.  It would be nice to
> reference any of the data from the Accumulo monitor programmatically; in
> this case we cross-reference monitor data with the Metadata table.  Josh,
> is JIRA the place to file those kinds of suggestions?
>
> Regards,
> Dylan
>
> --
> www.cs.stevens.edu/~dhutchis
>
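
For reference, the final matching step of the quoted procedure, together with
the triples it returns, can be sketched without any Accumulo dependency. This
is only an illustration: Extent, Stat and TabletInfo are hypothetical
stand-ins for KeyExtent, TabletStats and the result triple, and the sketch
assumes the metadata extents and per-server stats have already been fetched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; the field names are illustrative only.
final class TabletMatchSketch {
    // A tablet's range is (prevEndRow, endRow]; null means unbounded on that side.
    record Extent(String tableId, String prevEndRow, String endRow) {}

    // What one tablet server reports for one tablet it hosts.
    record Stat(String server, Extent extent, long numEntries) {}

    // The result triple: (split range, number of entries, servers for the tablet).
    record TabletInfo(Extent range, long numEntries, List<String> servers) {}

    // For each extent read from the metadata table, find the reported stats
    // whose table ID, previous end row and end row all match.
    static List<TabletInfo> match(List<Extent> fromMetadata, List<Stat> fromServers) {
        List<TabletInfo> result = new ArrayList<>();
        for (Extent extent : fromMetadata) {
            long entries = 0; // stays 0 if no server reported this tablet
            List<String> servers = new ArrayList<>();
            for (Stat stat : fromServers) {
                // Record equality compares all three fields and treats null == null.
                if (stat.extent().equals(extent)) {
                    entries = stat.numEntries(); // last matching report wins
                    servers.add(stat.server());
                }
            }
            result.add(new TabletInfo(extent, entries, servers));
        }
        return result;
    }
}
```

A tablet that no server reports comes back with zero entries and an empty
server list, so callers may want to treat that case as "unknown" rather than
"empty".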
