madlib-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [madlib] reductionista edited a comment on pull request #525: DL: Model Hopper Refactor
Date Mon, 23 Nov 2020 23:53:17 GMT

reductionista edited a comment on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520


   Thanks for testing it out Frank!
   
   You're the first to test this on with tensorflow 1.13, and based on one of the errors you're
seeing, I'm guessing you're also the first to test it with gpdb5.  I only tested it on gpdb6
and postgres.
   
   > WARNING:  This version of tensorflow does not support XLA auto-cluster JIT optimization.
 HINT:  upgrading tensorflow may improve performance.  (seg1 slice1 10.128.0.41:40001 pid=6271)
   > CONTEXT:  PL/Python function "fit_transition"
   > ```
   > 
   > What does user need to do to enable XLA? I am on TF 1.13.1 currently.
   
   Nothing, probably just need to upgrade to TF 1.14.  I should add the specific minimum version
requirement in the warning message.  I think it was added in 1.14 but I've only used 1.10
and 1.14, so I'll need to confirm that.  Could also be that it's been in there for a while,
and they just changed how you access it.  I'll look both of those up, and also run some more
comprehensive performance tests to see how much of a difference it actually makes and in what
kind of setups.  If it looks like it's not helping much anyway (even though the TF docs say
it should) then we can get rid of this message... or maybe even not bother to enable it. 
I figure if it does give better performance consistently across models, then we may as well
let the user know there's an easy way to get that, without having to tune any parameters or
upgrade hardware.  Would also be important to know if there are some cases where it could
hurt performance.
    
   > ERROR:  plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts.
(plpython.c:5038)
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
   >     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
   >   PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
   >   PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > so it looks like `use gpus=NULL` is now defaulting to `TRUE` but it should default
to `FALSE` i.e., CPUs. It used to default to CPUs.
   
   Good catch!  For some reason, I get a different error when I tried to reproduce this (
about using BOOLEAN::None in a query).  But I think I see the cause of it--fixing.  I thought
we had some dev-check tests for this, will add one if we don't.
   
   > ```
   > ERROR:  plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
   > HINT:  When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED
BY clause must be equal to or a left-subset of the PRIMARY KEY
   > CONTEXT:  Traceback (most recent call last):
   >   PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
   >     fit_obj.fit_multiple_model()
   >   PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
   >   PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
   > PL/Python function "madlib_keras_fit_multiple_model"
   > ```
   > 
   > I have also seen this error when running on a single segment.
   
   Interesting!  Looks like this was due to an unnecessarily strict check in the gpdb5 server
code, where it required a specific order for the keys in the PRIMARY KEY constraint.  The
server team has fixed that in gpdb6:
   https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9
   
   I can't recall for sure if we decided we care about supporting deep learning in gpdb5 for
this release.  But since it's an easy fix, I'll just reverse the order of the keys so that
it works on both.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message