kudu-user mailing list archives

From "Weber, Richard" <riwe...@akamai.com>
Subject Re: Good way to find "Real" size of the tables
Date Thu, 01 Dec 2016 15:49:24 GMT
Comments below

> On Nov 30, 2016, at 4:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
> On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riweber@akamai.com> wrote:
> Hi All,
> I'm trying to figure out the right/best/easiest way to find out how much space a
given table is taking up on the various tablet servers.  I'm really looking to find:
> * Physical space taken on all disks
> * Logical space taken on all disks
> * Sizing of Indices/Bloom Filters, etc.
> * Sizing with and without replication.
> I'm trying to run an apples-to-apples comparison of how big the data is when stored in Kudu
compared to storing it in its native format (gzipped CSV) as well as in Parquet format on
HDFS.  Ultimately, I'd like to be able to report on the different tables and say that Table
X is taking up Y TB, where Y consists of A physical size, B index, C bloom, etc.
> Looking through the Web UI I don't really see any good summary of how much space the
entire table is taking.  It seems like I'd need to walk through each Tablet server, connect
to the metrics page and generate the summary information myself.
> Yea, unfortunately we do not expose much of this information in a useful way at the moment.
The metrics page is the best source of info for the various sizes, and even those are often
estimates rather than fully accurate numbers.
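For what it's worth, walking each tablet server's metrics page can be scripted. Below is a rough sketch of the aggregation step only; the JSON shape and the on_disk_size metric name are assumptions for illustration, not confirmed Kudu API, and in practice you'd fetch each server's /metrics endpoint rather than use inline sample data:

```python
import json
from collections import defaultdict

# Hypothetical sample of what a tablet server's metrics endpoint might
# return: one entity per tablet, tagged with its table name. In practice
# you would fetch and parse http://<tserver>:<port>/metrics instead.
sample_metrics = json.loads("""
[
  {"type": "tablet", "attributes": {"table_name": "events"},
   "metrics": [{"name": "on_disk_size", "value": 1048576}]},
  {"type": "tablet", "attributes": {"table_name": "events"},
   "metrics": [{"name": "on_disk_size", "value": 524288}]},
  {"type": "tablet", "attributes": {"table_name": "users"},
   "metrics": [{"name": "on_disk_size", "value": 2097152}]}
]
""")

def table_sizes(entities):
    """Sum the on-disk size metric per table across tablet entities."""
    totals = defaultdict(int)
    for entity in entities:
        if entity.get("type") != "tablet":
            continue
        table = entity["attributes"]["table_name"]
        for metric in entity["metrics"]:
            if metric["name"] == "on_disk_size":
                totals[table] += metric["value"]
    return dict(totals)

print(table_sizes(sample_metrics))
# -> {'events': 1572864, 'users': 2097152}
```

Running this per tablet server and summing the results would give a replicated total; dividing by the replication factor approximates the pre-replication size.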


> In terms of cross-server metrics aggregation, it's been our philosophy so far that we
should try to avoid doing a poor job of things that other systems are likely to do better
-- metrics aggregation being one such thing. It's likely we'll add simple aggregation of table
sizes, since that info is very useful for SQL engines to do JOIN ordering, but I don't think
we'd start adding the more granular breakdowns like indexes, blooms, etc.

Definitely understand that.  Index sizes (and the sizes of other related data) are mainly of
interest to me to compare what the performance improvements of Kudu vs. Parquet vs. CSV
"cost" in terms of storage.

> If your use case is a one-time experiment to understand the data volumes, it would be
pretty straightforward to write a tool to do this kind of summary against the on-disk metadata
of a tablet server. For example, you can load the tablet metadata, group the blocks by type/column,
and then aggregate as you prefer. Unfortunately this would give you only the physical size
and not the logical, since you'd have to scan the actual data to know its uncompressed sizes.
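The group-and-aggregate step Todd describes might look something like this sketch. The flat (block_type, column, size) view of the metadata is entirely hypothetical, since the real per-tablet metadata is protobuf that a tool would have to parse first:

```python
from collections import defaultdict

# Hypothetical flattened view of one tablet's block metadata:
# (block_type, column_name_or_None, size_bytes). The block type and
# column names here are made up for illustration.
blocks = [
    ("column_data", "ts", 4_000_000),
    ("column_data", "host", 9_000_000),
    ("bloom", None, 600_000),
    ("ad_hoc_index", None, 250_000),
    ("column_data", "ts", 3_500_000),
]

def physical_size_by_group(blocks):
    """Aggregate physical size by (block type, column)."""
    sizes = defaultdict(int)
    for block_type, column, size in blocks:
        sizes[(block_type, column)] += size
    return dict(sizes)

for (btype, col), size in sorted(physical_size_by_group(blocks).items(), key=str):
    label = f"{btype}/{col}" if col else btype
    print(f"{label}: {size} bytes")
```

As Todd notes, this only yields physical (post-compression) sizes; the logical sizes would require scanning the data itself.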

I'm really looking for the sizes for two purposes:
1) As mentioned above, to help assess the "costs" of Kudu vs. other systems we already have in
place, especially in terms of storage.
2) To perform longer-term monitoring of the different tables: how big they are, how they're
growing, how many resources they're using, and so on.

For one particular use case we have, our data comes in as Protobuf and is imported into a
Hive table as ORC.  Looking at Parquet vs. ORC, the data sizes are about 3x larger.
 Kudu seems like it will be a much more performant and natural fit for our dataset, but
if it's again 2x larger than Parquet, that really increases the storage costs.

So on that note, I'm not looking for an exact number for the size.  If it's off by, say, ±5%,
that's certainly close enough.

> If you have any interest in helping to build such a tool I'd be happy to point you in
the right direction. Otherwise let's file a JIRA to add this as a new feature in a future

Let me poke and ponder a bit on that first and see what I can get via hack & kludge.
We need to publish our metrics in CSV format for the monitoring piece, so I don't know how
useful our solution would necessarily be to the larger community.
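As a sketch of that CSV-publishing piece, assuming the per-table totals have already been aggregated somehow (the column names here are just an illustration):

```python
import csv
import io

# Hypothetical aggregated result: table name -> total on-disk bytes.
table_totals = {"events": 1572864, "users": 2097152}

# Write one row per table with a header, using the stdlib csv module.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["table", "on_disk_bytes"])
for table, size in sorted(table_totals.items()):
    writer.writerow([table, size])

print(buf.getvalue())
```

In a real monitoring pipeline the buffer would be a file (or stdout) refreshed on whatever cadence the collector expects.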



> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
