From: Wes McKinney
Date: Sat, 30 May 2020 10:01:45 -0500
Subject: Re: Cast string array to number/boolean with invalid values
To: user@arrow.apache.org

It's https://issues.apache.org/jira/browse/ARROW-1489

On Sat, May 30, 2020 at 9:56 AM Neal Richardson wrote:
>
> Sounds reasonable, could you please open a JIRA issue?
>
> Neal
>
> On Sat, May 30, 2020 at 1:01 AM Yue Ni wrote:
>>
>> Hi there,
>>
>> I see that Arrow compute provides a Cast API that lets users cast string values to number/boolean values, but sometimes the strings include invalid values that cannot be cast to a number/boolean (sorry, the data is really messy), for example in a string array like ["1", "2", "3", "None", ""]. I wonder if there is any way to handle those invalid values during casting.
>>
>> From the code I have read (cast.h/cast.cc), it seems the cast currently fails and returns as soon as it hits an invalid value. I wonder if there is any way to ask the Cast API to return NULL for invalid values instead, so that these NULL values are easier to process later.
>>
>> Since it is rarely possible to guarantee that all string values in an array are valid, **any** invalid value in an array or in the entire data set makes the whole cast fail. Users of the Cast API then have to figure out for themselves which value in the array is invalid, which is not easy to do programmatically (only an error status message is set in the context). IMHO the following could be a better default strategy when casting from string to number/boolean:
>> 1) when an invalid value is found, set its output value to NULL
>> 2) set an error status indicating that the cast of this array encountered invalid values
>> 3) continue casting the remaining elements in the array
>> But I believe there are also users who prefer bailing out as soon as possible, so it would be great if we could provide different cast options to make both strategies possible.
>>
>> Thanks so much.
>>
>> Regards,
>> Yue
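
The three-step strategy proposed in the quoted message, together with the "bail out early" alternative, can be sketched in plain Python. This is only an illustration of the requested behaviour, not the Arrow Cast API: the function name `cast_to_int`, the `on_invalid` parameter, and the returned list of invalid indices are all hypothetical, and Python's built-in `int()` stands in for the real string parser.

```python
from typing import List, Optional, Tuple


def cast_to_int(
    values: List[Optional[str]], on_invalid: str = "null"
) -> Tuple[List[Optional[int]], List[int]]:
    """Hypothetical lenient cast from strings to ints (not an Arrow function).

    on_invalid="null":  replace unparseable entries with None and keep going.
    on_invalid="raise": fail on the first unparseable entry.

    Returns the converted values and the indices of the invalid inputs.
    """
    if on_invalid not in ("null", "raise"):
        raise ValueError(f"unknown on_invalid option: {on_invalid!r}")

    result: List[Optional[int]] = []
    invalid_indices: List[int] = []
    for i, value in enumerate(values):
        if value is None:
            # An input that is already null is passed through, not flagged.
            result.append(None)
            continue
        try:
            result.append(int(value))
        except ValueError:
            if on_invalid == "raise":
                # Strict mode: bail out as soon as possible.
                raise ValueError(f"invalid value {value!r} at index {i}")
            # Lenient mode: emit a null (step 1), remember where the
            # problem was (step 2), and continue with the rest (step 3).
            result.append(None)
            invalid_indices.append(i)
    return result, invalid_indices


converted, invalid = cast_to_int(["1", "2", "3", "None", ""])
```

Returning the invalid indices alongside the data is one possible way to report "this cast had problems" without discarding the rows that did parse; an actual implementation might surface this through a status object or a validity bitmap instead.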