From: GitBox <dev@madlib.apache.org>
To: dev@madlib.apache.org
Subject: [GitHub] [madlib] reductionista commented on pull request #525: DL: Model Hopper Refactor
Date: Mon, 23 Nov 2020 23:51:57 -0000
Message-ID: <160617551748.15632.16462854376851932960.asfpy@gitbox.apache.org>

reductionista commented on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520

Thanks for testing it out, Frank! You're the first to test this with tensorflow 1.13, and based on one of the errors you're seeing, I'm guessing you're also the first to test it with gpdb5. I only tested it on gpdb6 and postgres.

> ```
> WARNING: This version of tensorflow does not support XLA auto-cluster JIT optimization.
> HINT: upgrading tensorflow may improve performance. (seg1 slice1 10.128.0.41:40001 pid=6271)
> CONTEXT: PL/Python function "fit_transition"
> ```
>
> What does the user need to do to enable XLA? I am on TF 1.13.1 currently.

Nothing; probably you just need to upgrade to TF 1.14. I should add the specific minimum version requirement to the warning message. I think it was added in 1.14, but I've only used 1.10 and 1.14, so I'll need to confirm that. It could also be that it's been in there for a while and they just changed how you access it. I'll look both of those up, and also run some more comprehensive performance tests to see how much of a difference it actually makes and in what kinds of setups. If it looks like it's not helping much anyway (even though the TF docs say it should), then we can get rid of this message... or maybe not even bother to enable it.

I figure if it does give better performance consistently across models, then we may as well let the user know there's an easy way to get that, without having to tune any parameters or upgrade hardware. It would also be important to know if there are some cases where it could hurt performance.

> ```
> ERROR: plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
>   fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
> PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
> PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
> PL/Python function "madlib_keras_fit_multiple_model"
> ```
>
> So it looks like `use_gpus=NULL` is now defaulting to `TRUE`, but it should default to `FALSE`, i.e. CPUs. It used to default to CPUs.

Good catch! For some reason I get a different error when I try to reproduce this (about using BOOLEAN::None in a query). But I think I see the cause of it--fixing.
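The fix I have in mind is roughly the following (a hypothetical sketch, not the actual MADlib code; the helper name `normalize_use_gpus` is made up here, but a SQL NULL really does arrive in PL/Python as `None`, which is falsy but must not leak into a query as a bare `None`):

```python
def normalize_use_gpus(use_gpus):
    """Coerce the SQL use_gpus argument to a plain bool.

    A SQL NULL comes through PL/Python as None; it should mean
    "default to CPUs" (False), never True, and it should never be
    interpolated into a query as BOOLEAN::None.
    """
    return bool(use_gpus) if use_gpus is not None else False

# NULL / unset -> CPUs
assert normalize_use_gpus(None) is False
assert normalize_use_gpus(False) is False
assert normalize_use_gpus(True) is True
```

With a guard like this at the top of `__init__`, the GPU-discovery path would only run when the user explicitly passes TRUE.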
I thought we had some dev-check tests for this; I'll add one if we don't.

> ```
> ERROR: plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
> HINT: When there is both a PRIMARY KEY and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
> CONTEXT: Traceback (most recent call last):
> PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
>   fit_obj.fit_multiple_model()
> PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
> PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
> PL/Python function "madlib_keras_fit_multiple_model"
> ```
>
> I have also seen this error when running on a single segment.

Interesting! It looks like this was due to an unreasonably strict check in the gpdb5 server code, which required a specific order for the keys in the PRIMARY KEY constraint. The server team fixed that in gpdb6: https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9

I can't recall for sure whether we decided we care about supporting deep learning in gpdb5 for this release. But since it's an easy fix, I'll just reverse the order of the keys so that it works on both.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
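For reference, the order-sensitive rule that gpdb5 enforces in the error quoted earlier amounts to a left-prefix check, sketched here in Python (hypothetical helper, not Greenplum's actual implementation; column names are illustrative):

```python
def dist_keys_compatible(primary_key, distributed_by):
    """Return True if the DISTRIBUTED BY columns are equal to, or a
    left-subset (prefix) of, the PRIMARY KEY columns -- the check that
    gpdb5 applies in order-sensitive form."""
    return list(distributed_by) == list(primary_key)[:len(distributed_by)]

# DISTRIBUTED BY the first PK column: accepted
assert dist_keys_compatible(["k1", "k2"], ["k1"])
# DISTRIBUTED BY the full PK in order: accepted
assert dist_keys_compatible(["k1", "k2"], ["k1", "k2"])
# Same columns, different order: rejected by gpdb5's strict check
assert not dist_keys_compatible(["k1", "k2"], ["k2", "k1"])
```

Reversing the key order in the PRIMARY KEY constraint, as described above, makes the distribution column a left-prefix and so satisfies both the gpdb5 and gpdb6 checks.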