kudu-user mailing list archives

From Benjamin Kim <bbuil...@gmail.com>
Subject Re: Schema Normalization
Date Mon, 10 Oct 2016 23:44:07 GMT
Todd,

We are not going crazy with normalization. Actually, we are only normalizing where necessary.
For example, we have a table for profiles and behaviors. They are joined together by a behavior
status table. Each of these tables is denormalized when it comes to basic attributes.
That’s the extent of it. From the sound of it, it looks like we are good for now.

Thanks,
Ben


> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <todd@cloudera.com> wrote:
> 
> Hey Ben,
> 
> Yea, we currently don't do great with very wide tables. For example, on flushes, we'll
separately write and fsync each of the underlying columns, so if you have hundreds, it can
get very expensive. Another factor is that currently every 'Write' RPC actually contains the
full schema information for all columns, regardless of whether you've set them for a particular
row.
> 
> I'm sure we'll make improvements in these areas in the coming months/years, but for now,
the recommendation is to stick with a schema that looks more like an RDBMS schema than an
HBase one.
> 
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't bother normalizing
out a 'date' column into a 'date_id' and separate 'dates' table, as one might have done in
a fully normalized RDBMS table in days of yore. Kudu's columnar layout, in conjunction with
encodings like dictionary encoding, makes that kind of normalization ineffective or even counter-productive,
since it introduces extra joins and query-time complexity.
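
To illustrate the dictionary-encoding point, here is a minimal Python sketch. It is not Kudu's actual implementation; the `dictionary_encode` helper and the sample data are hypothetical. It shows that a repetitive 'date' column already reduces to a small dictionary plus one integer code per row, which is roughly what a hand-built 'date_id' and 'dates' table would give you, but without the join.

```python
from datetime import date, timedelta


def dictionary_encode(values):
    """Return (dictionary, codes).

    dictionary: distinct values in first-seen order.
    codes: one integer per input row, indexing into dictionary.
    """
    dictionary = []
    index = {}
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


# Hypothetical column: 1,000 rows spread over 10 distinct dates.
start = date(2016, 10, 1)
date_column = [(start + timedelta(days=i % 10)).isoformat() for i in range(1000)]

dictionary, codes = dictionary_encode(date_column)

# Decoding is a plain lookup, so no separate 'dates' table or join is needed.
decoded = [dictionary[c] for c in codes]
```

Here `dictionary` plays the role of the separate 'dates' table and `codes` the role of the 'date_id' column, except that the storage layer maintains both transparently.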
> 
> One other item to note is that more normalized schemas demand more of your
query engine's planning capabilities. If you aren't doing joins, a very dumb query planner
is fine. If you're doing complex joins across 10+ tables, then the quality of plans makes
an enormous difference in query performance. To speak in concrete terms, I would guess that
with more heavily normalized schemas, Impala's query planner would do a much better job than
Spark's, given that we don't currently expose information on table sizes to Spark and thus
it's likely to do a poor job of join ordering.
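
A toy sketch of why table sizes matter for join ordering follows. It is not how Impala or Spark actually plan queries: the `plan_join_order` function, its smallest-table-first heuristic, and the row counts are all assumptions made for illustration (the table names are borrowed from the profiles/behaviors example in this thread).

```python
def plan_join_order(tables, row_counts=None):
    """Return the order in which to join the given tables.

    With row counts available, join the smallest tables first so that
    intermediate results tend to stay small (a simplified heuristic).
    Without them, the planner has nothing to go on and keeps the order
    in which the tables were written in the query.
    """
    if row_counts is None:
        return list(tables)
    return sorted(tables, key=lambda name: row_counts[name])


# Hypothetical tables and sizes.
tables = ["behaviors", "behavior_status", "profiles"]
row_counts = {
    "behaviors": 50_000_000,
    "behavior_status": 20,
    "profiles": 1_000_000,
}

without_stats = plan_join_order(tables)            # order as written
with_stats = plan_join_order(tables, row_counts)   # smallest table first
```

With statistics, the tiny status table is joined first and the largest table last; without them, the largest table leads the plan simply because it was listed first.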
> 
> Hope that helps
> 
> -Todd
> 
> 
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
> I would like to know whether normalization techniques are necessary when
modeling table schemas in Kudu. I read that a table with around 50 columns is ideal. This
would mean a very wide table should be avoided.
> 
> Thanks,
> Ben
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

