kudu-user mailing list archives

From "Weber, Richard" <riwe...@akamai.com>
Subject Re: Good way to find "Real" size of the tables
Date Mon, 12 Dec 2016 15:24:33 GMT
Sorry, wound up setting this part of my project aside to complete the remainder of my evaluation
of Kudu.  I do hope I'll be able to swing back around as I'd like to compare sizing of different
knobs we've twisted in Kudu, as well as against HDFS-based file formats.


I'll definitely post an update/script once I get something together.


-- Rick Weber



From: Todd Lipcon <todd@cloudera.com>
Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
Date: Monday, December 12, 2016 at 1:22 AM
To: "user@kudu.apache.org" <user@kudu.apache.org>
Subject: Re: Good way to find "Real" size of the tables


Hey Rick, 


Just wanted to check and see if you were able to make any progress on the experiments you
were running. Would be great to share your findings or any issues you encountered.




On Thu, Dec 1, 2016 at 10:49 PM, Weber, Richard <riweber@akamai.com> wrote:

Comments below 



On Nov 30, 2016, at 4:29 PM, Todd Lipcon <todd@cloudera.com> wrote:


On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riweber@akamai.com> wrote:

Hi All, 


I'm trying to figure out the right/best/easiest way to find out how much space that a given
table is taking up on the various tablet servers.  I'm looking really at finding:

* Physical space taken on all disks

* Logical space taken on all disks

* Sizing of Indices/Bloom Filters, etc.

* Sizing with and without replication.


I'm trying to run an apples vs apples comparison of how big data is when stored in Kudu compared
to storing it in its native format (Gzipped CSV) as well as in Parquet format on HDFS.  Ultimately,
I'd like to be able to do reporting on the different tables to say Table X is taking up Y
TB, where Y consists of A physical size, B Index, C Bloom, etc.


Looking through the Web UI I don't really see any good summary of how much space the entire
table is taking.  It seems like I'd need to walk through each Tablet server, connect to the
metrics page and generate the summary information myself.



Yea, unfortunately we do not expose much of this information in a useful way at the moment.
The metrics page is the best source of info for the various sizes, and even those are often
estimates rather than always being accurate at the moment.
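
A minimal sketch of the "walk each tablet server and summarize" approach described above. It assumes each tablet server serves a JSON document at /metrics on its web UI port (8050 by default), shaped as a list of entities that each carry a "type", an "attributes" map containing "table_name", and a "metrics" list of name/value pairs. The metric name is passed in by the caller because which size metrics are available depends on the Kudu version; "on_disk_size" in the usage note is only an illustrative assumption, and the helper names are hypothetical.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_metrics(host, port=8050):
    """Fetch the JSON metrics document from one tablet server's web UI."""
    with urlopen(f"http://{host}:{port}/metrics") as resp:
        return json.load(resp)


def sum_metric_by_table(entities, metric_name):
    """Sum one tablet-level metric per table for a single server's entities."""
    totals = defaultdict(int)
    for entity in entities:
        # Skip server-level entities; only tablets belong to a table.
        if entity.get("type") != "tablet":
            continue
        table = entity.get("attributes", {}).get("table_name", "unknown")
        for metric in entity.get("metrics", []):
            if metric.get("name") == metric_name:
                # Histogram-style metrics have no "value" key; count them as 0.
                totals[table] += metric.get("value", 0)
    return dict(totals)


def merge_totals(per_server_totals):
    """Combine per-server {table: value} dicts into one cluster-wide dict."""
    merged = defaultdict(int)
    for totals in per_server_totals:
        for table, value in totals.items():
            merged[table] += value
    return dict(merged)
```

Usage would look roughly like merge_totals(sum_metric_by_table(fetch_metrics(h), "on_disk_size") for h in hosts), where hosts is your own list of tablet server hostnames. Because every replica reports its own tablets, the merged total includes replication; dividing by the table's replication factor gives an approximate single-copy size. As noted above, the underlying metrics are often estimates.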




In terms of cross-server metrics aggregation, it's been our philosophy so far that we should
try to avoid doing a poor job of things that other systems are likely to do better -- metrics
aggregation being one such thing. It's likely we'll add simple aggregation of table sizes,
since that info is very useful for SQL engines to do JOIN ordering, but I don't think we'd
start adding the more granular breakdowns like indexes, blooms, etc.


Definitely understand on that.  Index sizes (and sizes of other related data) are mainly of
interest to me just to compare what the performance improvements of Kudu vs Parquet vs CSV
"cost" in terms of storage.


If your use case is a one-time experiment to understand the data volumes, it would be pretty
straightforward to write a tool to do this kind of summary against the on-disk metadata of
a tablet server. For example, you can load the tablet metadata, group the blocks by type/column,
and then aggregate as you prefer. Unfortunately this would give you only the physical size
and not the logical, since you'd have to scan the actual data to know its uncompressed sizes.
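
The grouping step suggested above could be sketched as follows. This is not an existing Kudu tool: it assumes the block information has already been extracted from the tablet metadata into simple records, and the BlockRecord shape, its field names, and the kind labels are all hypothetical. It covers only the "group and aggregate" part, and only physical sizes.

```python
from collections import defaultdict
from typing import NamedTuple


class BlockRecord(NamedTuple):
    """Hypothetical record for one on-disk block taken from tablet metadata."""
    table: str
    column: str      # column name, or "" for blocks not tied to a column
    kind: str        # illustrative labels such as "data", "bloom", "index"
    size_bytes: int  # physical (on-disk) size of the block


def aggregate_blocks(records, key=lambda r: (r.table, r.kind)):
    """Sum physical block sizes, grouped by whatever key the caller prefers."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec.size_bytes
    return dict(totals)
```

The default key groups by table and block type; passing key=lambda r: (r.table, r.column) gives a per-column breakdown instead.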


I'm looking for the sizings really for two purposes.  

1) As mentioned above, to help assess the "costs" Kudu vs other systems we already have in
place, especially in terms of Storage

2) Perform longer-term monitoring of the sizes of different tables, how they're growing,
how many resources they're using and so on.


For one particular use case we have, our data comes in as Protobuf data, and is imported as
ORC data into a Hive table.  Looking at Parquet vs ORC, the data sizes are about 3x larger.
Kudu seems like it will give us a much more performant and natural fit to our dataset, but
if it's 2x larger than Parquet again, that really increases the costs of storage.


So on that note, I'm not looking for an exact number on the size.  If it's off say +-5% (for
a number), that's certainly close enough in the ballpark.



If you have any interest in helping to build such a tool I'd be happy to point you in the
right direction. Otherwise let's file a JIRA to add this as a new feature in a future release.


Let me poke and ponder a bit on that first and see what I can get via hack & kludge. 
We need to publish our metrics in a CSV format for the monitoring bit, so I don't know how
useful our solution would necessarily be to the larger community.
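
For the CSV publishing mentioned above, a small sketch using Python's standard csv module. The column layout (table, component, bytes) and the function name are assumptions made for illustration, not a format anyone in this thread specified.

```python
import csv
import io


def write_size_report(rows, out):
    """Write (table, component, size_in_bytes) tuples to `out` as CSV."""
    writer = csv.writer(out)
    writer.writerow(["table", "component", "bytes"])
    # Sort so repeated runs produce stable, diff-friendly output.
    for table, component, size in sorted(rows):
        writer.writerow([table, component, size])


if __name__ == "__main__":
    buf = io.StringIO()
    write_size_report([("users", "data", 70), ("events", "bloom", 50)], buf)
    print(buf.getvalue(), end="")
```

Any file-like object opened in text mode works as `out`; when writing to a real file, the csv module documentation recommends opening it with newline="".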








Todd Lipcon
Software Engineer, Cloudera




