madlib-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jim Nasby <>
Subject Re: Sessionize function - create table or view by default?
Date Wed, 20 Jul 2016 13:10:56 GMT
On 7/19/16 7:36 PM, Frank McQuillan wrote:
> "create_view (optional)
> BOOLEAN default: TRUE. Determines whether to create a view or materialize a
> table as output. If you only needed session info once, creating a view
> could be significantly faster than materializing as a table."
> Question is:  should it really default TRUE (view) or is it better to
> default FALSE (table)? i.e., Or does it really not matter?

tl;dr: it depends, but more importantly I think MADlib should promote 
views as at least an option (if not the default) across the board.

It's going to depend heavily on what you're doing.

If you're building a "pipeline" of operations where sessionization is 
just the first step of several, and the session data is only referred to 
once, creating a view gives the planner a lot more flexibility on how to 
produce output.

When you use a temp table, the planner has no choice: it must not only 
materialize the complete result set immediately, but it also has to 
modify the catalog to record the temp table.

A pythonic analogy would be that a view is like using a generator (where 
data only needs to be brought forth as it's consumed) while a temp table 
is like using a list. Except there's an even larger difference in SQL: a 
view means the optimizer has a chance to change the execution plan of 
the final query that's using the sessionize output, taking into account 
everything you're doing with the data.

If there are multiple steps in a pipeline, this difference can become 
very large. I've worked on query chains where switching temp tables to 
views has had a 5-10x impact.

The two cases where a temp table would be better are if you need to 
refer to the sessionized data many times (sometimes you'd need to hit it 
more than twice for the temp table to be a win), or if the optimizer 
ends up picking a bad plan when everything is combined into a single 
query. In the case of a bad plan, it would still be better to either try 
and tweak the optimizer settings, or to insert an "optimization fence", 
typically done by sticking an OFFSET 0 clause in.

To me, the bigger picture is promoting the option of views across all 
MADlib set returning operations, because when multiple operations are 
chained together you can see a very large benefit. I suspect that it's 
more common to chain things together, making views a better default... 
but that's just a guess. So if there was a standard default for this 
across the board, perhaps views would be best.
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble!
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461

View raw message