kudu-user mailing list archives

From Benjamin Kim <bbuil...@gmail.com>
Subject Re: Schema Normalization
Date Mon, 10 Oct 2016 23:51:43 GMT
Todd,

Our usage is very basic right now, but if we do expand to doing more in the area of analytics,
then we will consider using Impala too. Right now, we want to prove the power of Kudu to the
coders, who despise SQL, and then give the analysts a go at it. They will need a JDBC interface,
which is where Impala would help.

Thanks,
Ben


> On Oct 10, 2016, at 4:46 PM, Todd Lipcon <todd@cloudera.com> wrote:
> 
> On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
> Todd,
> 
> We are not going crazy with normalization. Actually, we are only normalizing where necessary. For example, we have tables for profiles and behaviors. They are joined together by a behavior status table. Each one of these tables is de-normalized when it comes to basic attributes. That’s the extent of it. From the sound of it, it looks like we are good for now.
> 
> Yea, sounds good.
> 
> One thing to keep an eye on is https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala; this should help a lot with joins where one side of the join has selective predicates on a large table.
> 
> -Todd
>  
> 
>> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <todd@cloudera.com> wrote:
>> 
>> Hey Ben,
>> 
>> Yea, we currently don't do great with very wide tables. For example, on flushes, we'll separately write and fsync each of the underlying columns, so if you have hundreds of columns, it can get very expensive. Another factor is that currently every 'Write' RPC actually contains the full schema information for all columns, regardless of whether you've set them for a particular row.
>> 
>> I'm sure we'll make improvements in these areas in the coming months/years, but for now, the recommendation is to stick with a schema that looks more like an RDBMS schema than an HBase one.
>> 
>> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't bother normalizing out a 'date' column into a 'date_id' and separate 'dates' table, as one might have done in a fully normalized RDBMS table in days of yore. Kudu's columnar layout, in conjunction with encodings like dictionary encoding, makes that kind of normalization ineffective or even counter-productive, since it introduces extra joins and query-time complexity.
>> 
>> One other item to note is that more normalized schemas demand more of your query engine's planning capabilities. If you aren't doing joins, a very dumb query planner is fine. If you're doing complex joins across 10+ tables, then the quality of plans makes an enormous difference in query performance. To speak in concrete terms, I would guess that with more heavily normalized schemas, Impala's query planner would do a much better job than Spark's, given that we don't currently expose information on table sizes to Spark and thus it's likely to do a poor job of join ordering.
>> 
>> Hope that helps
>> 
>> -Todd
>> 
>> 
>> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>> I would like to know whether normalization techniques are necessary when modeling table schemas in Kudu. I read that a table with around 50 columns is ideal. This would mean a very wide table should be avoided.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
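
Todd's remark about the 'date' column can be illustrated with a small sketch of dictionary encoding. This is a toy model in plain Python, not Kudu's storage code; the function names and sample values are made up for illustration. It shows that a dictionary-encoded column already stores each distinct value once plus a small integer per row, which is roughly what a hand-built 'date_id' column and 'dates' table would give you, minus the join.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Toy dictionary encoder (hypothetical helper, not a Kudu API).

    Returns (dictionary, codes): each distinct value appears once in
    `dictionary`, and `codes` holds one integer per input row that
    indexes into it.
    """
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            # First time we see this value: assign the next code.
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical 'date' column with many repeated values.
dates = ["2016-10-07", "2016-10-07", "2016-10-10", "2016-10-07", "2016-10-10"]
dictionary, codes = dictionary_encode(dates)

# `dictionary` plays the role of the separate 'dates' table and
# `codes` the role of the 'date_id' column, with no join needed to read it.
restored = dictionary_decode(dictionary, codes)
```

Under these assumptions, normalizing the column by hand duplicates work the encoding already does, and adds a join on top.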

