kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Good way to find "Real" size of the tables
Date Mon, 12 Dec 2016 06:22:34 GMT
Hey Rick,

Just wanted to check and see if you were able to make any progress on the
experiments you were running. Would be great to share your findings or any
issues you encountered.


On Thu, Dec 1, 2016 at 10:49 PM, Weber, Richard <riweber@akamai.com> wrote:

> Comments below
> On Nov 30, 2016, at 4:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
> On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riweber@akamai.com> wrote:
>> Hi All,
>> I'm trying to figure out the right/best/easiest way to find out how much
>> space a given table is taking up on the various tablet servers.  I'm
>> looking really at finding:
>> * Physical space taken on all disks
>> * Logical space taken on all disks
>> * Sizing of Indices/Bloom Filters, etc.
>> * Sizing with and without replication.
>> I'm trying to run an apples vs apples comparison of how big data is when
>> stored in Kudu compared to storing it in its native format (Gzipped CSV)
>> as well as in Parquet format on HDFS.  Ultimately, I'd like to be able to
>> do reporting on the different tables to say Table X is taking up Y TB,
>> where Y consists of A physical size, B Index, C Bloom, etc.
>> Looking through the Web UI I don't really see any good summary of how
>> much space the entire table is taking.  It seems like I'd need to walk
>> through each Tablet server, connect to the metrics page and generate the
>> summary information myself.
> Yea, unfortunately we do not expose much of this information in a useful
> way at the moment. The metrics page is the best source of info for the
> various sizes, and even those are often estimates rather than exact
> values.
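
A minimal sketch of the "walk each tablet server and summarize" approach described above. It assumes each tablet server's metrics page has already been fetched and parsed as JSON into a list of entities, where tablet entities carry a `table_name` attribute and a list of name/value metrics. The metric name `on_disk_size` is an assumption; check the metrics list of your Kudu version for what is actually exposed.

```python
from collections import defaultdict


def table_sizes(metrics_payloads, metric_name="on_disk_size"):
    """Sum one tablet-level metric per table across several tablet servers.

    metrics_payloads: one parsed JSON document per tablet server, each
    assumed to be a list of entities shaped like
    {"type": "tablet", "attributes": {"table_name": ...},
     "metrics": [{"name": ..., "value": ...}]}.
    metric_name is an assumed metric name, not a confirmed one.
    """
    totals = defaultdict(int)
    for payload in metrics_payloads:
        for entity in payload:
            # Skip server-level entities; only tablets belong to a table.
            if entity.get("type") != "tablet":
                continue
            table = entity.get("attributes", {}).get("table_name", "<unknown>")
            for metric in entity.get("metrics", []):
                if metric.get("name") == metric_name:
                    totals[table] += metric.get("value", 0)
    return dict(totals)
```

Because every replica of a tablet reports its own metrics, summing over all tablet servers gives the size with replication; dividing by the table's replication factor gives a rough size without it.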
> Ok
> In terms of cross-server metrics aggregation, it's been our philosophy so
> far that we should try to avoid doing a poor job of things that other
> systems are likely to do better -- metrics aggregation being one such
> thing. It's likely we'll add simple aggregation of table sizes, since that
> info is very useful for SQL engines to do JOIN ordering, but I don't think
> we'd start adding the more granular breakdowns like indexes, blooms, etc.
> Definitely understand on that.  Index sizes (and sizes of other related
> data) are mainly of interest to me just to compare what the performance
> improvements of Kudu vs Parquet vs CSV "cost" in terms of storage.
> If your use case is a one-time experiment to understand the data volumes,
> it would be pretty straightforward to write a tool to do this kind of
> summary against the on-disk metadata of a tablet server. For example, you
> can load the tablet metadata, group the blocks by type/column, and then
> aggregate as you prefer. Unfortunately this would give you only the
> physical size and not the logical, since you'd have to scan the actual data
> to know its uncompressed sizes.
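
A rough sketch of the group-and-aggregate step described above. It assumes the block records have already been extracted from the tablet metadata into `(column, block_type, size_bytes)` tuples; that tuple layout and the block type names are hypothetical, and the extraction itself is not shown.

```python
from collections import defaultdict


def summarize_blocks(blocks):
    """Aggregate physical block sizes by block type and by column.

    blocks: iterable of (column, block_type, size_bytes) tuples. This
    layout is a hypothetical intermediate format, not a Kudu structure.
    Returns (bytes_by_type, bytes_by_column) as plain dicts.
    """
    by_type = defaultdict(int)
    by_column = defaultdict(int)
    for column, block_type, size_bytes in blocks:
        by_type[block_type] += size_bytes
        by_column[column] += size_bytes
    return dict(by_type), dict(by_column)
```

As noted above, the totals are physical (on-disk, compressed) sizes only.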
> I'm looking for the sizings really for two purposes.
> 1) As mentioned above, to help assess the "costs" of Kudu vs other systems
> we already have in place, especially in terms of storage
> 2) Perform longer-term monitoring of the sizes of different tables: how
> they're growing, how many resources they're using, and so on.
> For one particular use case we have, our data comes in as Protobuf data,
> and is imported as ORC data into a Hive table.  Looking at Parquet vs ORC,
> the data sizes are about 3x larger.  Kudu seems like it will give us a much
> more performant and natural fit to our dataset, but if it's 2x larger than
> Parquet again, that really increases the costs of storage.
> So on that note, I'm not looking for an exact number on the size.  If it's
> off say +-5% (for a number), that's certainly close enough in the ballpark.
> If you have any interest in helping to build such a tool I'd be happy to
> point you in the right direction. Otherwise let's file a JIRA to add this
> as a new feature in a future release.
> Let me poke and ponder a bit on that first and see what I can get via hack
> & kludge.  We need to publish our metrics in a CSV format for the
> monitoring bit, so I don't know how useful our solution would necessarily
> be to the larger community.
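
For the CSV publishing mentioned above, a small sketch using only the Python standard library. The column names `table` and `bytes` are illustrative assumptions, as is the input format (a mapping of table name to size in bytes).

```python
import csv
import io


def sizes_to_csv(sizes):
    """Render a {table_name: size_in_bytes} mapping as CSV text.

    Rows are sorted by table name so successive reports diff cleanly.
    The header names are illustrative, not a required schema.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["table", "bytes"])
    for table in sorted(sizes):
        writer.writerow([table, sizes[table]])
    return buf.getvalue()
```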
> Thanks
> --Rick
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

Todd Lipcon
Software Engineer, Cloudera
