madlib-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joe Hellerstein <>
Subject Re: Sessionize function - create table or view by default?
Date Wed, 20 Jul 2016 21:17:32 GMT
I heartily concur with this assessment FWIW.  I generally use views or WITH clauses myself
when building up these kinds of pipelines. 

In my experience, the PostgreSQL optimizer is not great at collapsing views though.  So caveat
emptor.  Users should be made very aware of EXPLAIN!


On July 20, 2016 at 12:46:18 PM, Frank McQuillan ( wrote:

Hey Jim,  

Thank you for the thoughtful response.  

Given your comments, I think we ought to stick with a view as the default  
for sessionize. Looking ahead to MADlib 2.0, one thing we want to better  
support is workflows since chaining operations together is such a common  
data science thing to do. That means looking across all existing MADlib  
functions to determine what changes to returns sets we need to make, such  
as standardizing on views.  


On Wed, Jul 20, 2016 at 6:10 AM, Jim Nasby <> wrote:  

> On 7/19/16 7:36 PM, Frank McQuillan wrote:  
>> "create_view (optional)  
>> BOOLEAN default: TRUE. Determines whether to create a view or materialize  
>> a  
>> table as output. If you only needed session info once, creating a view  
>> could be significantly faster than materializing as a table."  
>> Question is: should it really default TRUE (view) or is it better to  
>> default FALSE (table)? i.e., Or does it really not matter?  
> tl;dr: it depends, but more importantly I think MADlib should promote  
> views as at least an option (if not the default) across the board.  
> It's going to depend heavily on what you're doing.  
> If you're building a "pipeline" of operations where sessionization is just  
> the first step of several, and the session data is only referred to once,  
> creating a view gives the planner a lot more flexibility on how to produce  
> output.  
> When you use a temp table, the planner has no choice: it must not only  
> materialize the complete result set immediately, but it also has to modify  
> the catalog to record the temp table.  
> A pythonic analogy would be that a view is like using a generator (where  
> data only needs to be brought forth as it's consumed) while a temp table is  
> like using a list. Except there's an even larger difference in SQL: a view  
> means the optimizer has a chance to change the execution plan of the final  
> query that's using the sessionize output, taking into account everything  
> you're doing with the data.  
> If there are multiple steps in a pipeline, this difference can become very  
> large. I've worked on query chains where switching temp tables to views has  
> had a 5-10x impact.  
> The two cases where a temp table would be better are if you need to refer  
> to the sessionized data many times (sometimes you'd need to hit it more  
> than twice for the temp table to be a win), or if the optimizer ends up  
> picking a bad plan when everything is combined into a single query. In the  
> case of a bad plan, it would still be better to either try and tweak the  
> optimizer settings, or to insert an "optimization fence", typically done by  
> sticking an OFFSET 0 clause in.  
> To me, the bigger picture is promoting the option of views across all  
> MADlib set returning operations, because when multiple operations are  
> chained together you can see a very large benefit. I suspect that it's more  
> common to chain things together, making views a better default... but  
> that's just a guess. So if there was a standard default for this across the  
> board, perhaps views would be best.  
> --  
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX  
> Experts in Analytics, Data Architecture and PostgreSQL  
> Data in Trouble? Get it in Treble!  
> 855-TREBLE2 (855-873-2532) mobile: 512-569-9461  

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message