Date: Wed, 10 May 2017 23:53:04 +0000 (UTC)
From: "Xiao Li (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Resolved] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

     [ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20685.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
                   2.2.1
                   2.1.2

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-20685
>                 URL: https://issues.apache.org/jira/browse/SPARK-20685
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation: executing a stage with a single UDF that takes more than one argument _where that argument is repeated_ will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import IntegerType
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
>     process()
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 107, in <lambda>
>     func = lambda _, it: map(mapper, it)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 93, in <lambda>
>     mapper = lambda a: udf(*a)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
>     return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path for handling a batch UDF evaluation consisting of a single Python UDF, but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.
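
The failure described in the report can be modeled without a Spark installation. The sketch below is a simplified, hypothetical model and not Spark's actual code: it assumes that repeated argument expressions are deduplicated into a single input column and that each UDF receives a list of offsets into that row. The names plan_inputs, fast_path_mapper, and unpacking_mapper are invented for illustration.

```python
def plan_inputs(arg_exprs):
    """Deduplicate argument expressions (assumed model of the planner).

    Returns the unique inputs plus, for each original argument,
    its offset into the deduplicated input row.
    """
    unique_inputs = []
    offsets = []
    for expr in arg_exprs:
        if expr not in unique_inputs:
            unique_inputs.append(expr)
        offsets.append(unique_inputs.index(expr))
    return unique_inputs, offsets


def fast_path_mapper(udf):
    # Shortcut modeled on the report: pass the input row straight to the
    # UDF, assuming the row layout already matches the UDF's parameters.
    return lambda row: udf(*row)


def unpacking_mapper(udf, offsets):
    # General path: rebuild the argument list from the offsets, so a
    # repeated argument is read from the same column more than once.
    return lambda row: udf(*[row[o] for o in offsets])


add = lambda x, y: x + y

# add(1, 1): both arguments are the same expression, so only one
# column ends up in the input row.
unique_inputs, offsets = plan_inputs(["1", "1"])
row = tuple(int(expr) for expr in unique_inputs)

try:
    fast_path_mapper(add)(row)
    fast_path_failed = False
except TypeError:
    # The UDF expects two arguments but the row holds only one value.
    fast_path_failed = True

result = unpacking_mapper(add, offsets)(row)
```

In this model the shortcut raises a TypeError of the same kind as in the traceback, because the one-column row is splatted into a two-parameter function, while the offset-based path reads column 0 twice and calls the UDF with both arguments.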