From: Yue Ni <niyue.com@gmail.com>
Date: Sun, 31 May 2020 22:12:43 +0800
Subject: Re: Cast string array to number/boolean with invalid values
To: user@arrow.apache.org
Reply-To: user@arrow.apache.org
Mailing-List: contact user-help@arrow.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000be8bde05a6f2475c"

--000000000000be8bde05a6f2475c
Content-Type: text/plain; charset="UTF-8"

Thanks Neal and Wes. https://issues.apache.org/jira/browse/ARROW-1489 is
exactly what I am searching for.

On Sat, May 30, 2020 at 11:02 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> It's https://issues.apache.org/jira/browse/ARROW-1489
>
> On Sat, May 30, 2020 at 9:56 AM Neal Richardson
> <neal.p.richardson@gmail.com> wrote:
> >
> > Sounds reasonable, could you please open a JIRA issue?
> >
> > Neal
> >
> > On Sat, May 30, 2020 at 1:01 AM Yue Ni <niyue.com@gmail.com> wrote:
> >>
> >> Hi there,
> >>
> >> I find that Arrow compute provides a Cast API that allows users to
> >> cast from string to number/boolean values, but sometimes the strings
> >> contain invalid values that cannot be cast to a number/boolean (sorry,
> >> the data is really messy), for example in a string array like
> >> ["1", "2", "3", "None", ""]. I wonder if there is any way to handle
> >> those invalid values during casting.
> >>
> >> From the code I have read (cast.h/cast.cc), it seems the cast fails
> >> and returns as soon as it encounters an invalid value. Is there any
> >> way to ask the Cast API to return NULL for invalid values, so that
> >> they are easier to process later?
> >>
> >> Since it is rarely possible to guarantee that all string values in an
> >> array are valid, **any** invalid value in an array or an entire data
> >> set will make the cast fail. Users of the Cast API then have to figure
> >> out by themselves which value in the array is invalid, which is not
> >> easy to do programmatically (only an error status message is set in
> >> the context). IMHO the following could be a better default strategy
> >> when casting from string to number/boolean:
> >> 1) when an invalid value is found, set its result to NULL
> >> 2) set an error status indicating that the array contained some
> >>    invalid values
> >> 3) continue casting the remaining elements in the array
> >> But I believe there are also users who prefer bailing out as soon as
> >> possible, so it would be great if we could provide cast options that
> >> make both strategies possible.
> >>
> >> Thanks so much.
> >>
> >> Regards,
> >> Yue

--000000000000be8bde05a6f2475c
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

--000000000000be8bde05a6f2475c--
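
For readers of the archive, the three-step strategy proposed in the quoted message can be sketched in plain Python. This is only an illustration and does not use Arrow's Cast API: the helper name `cast_strings_to_int` and its `strict` flag are hypothetical, and Python's `int()` parsing rules are assumed to stand in for whatever validity rules a real cast kernel would apply.

```python
from typing import List, Optional, Tuple


def cast_strings_to_int(
    values: List[Optional[str]], strict: bool = False
) -> Tuple[List[Optional[int]], List[int]]:
    """Hypothetical helper (not an Arrow function).

    Lenient mode (strict=False): invalid strings become None and their
    positions are collected, so the caller can inspect them afterwards.
    Strict mode (strict=True): bail out on the first invalid string.
    """
    result: List[Optional[int]] = []
    invalid_positions: List[int] = []
    for i, value in enumerate(values):
        if value is None:
            # Values that are already null pass through unchanged.
            result.append(None)
            continue
        try:
            result.append(int(value))
        except ValueError:
            if strict:
                raise ValueError(f"invalid value at index {i}: {value!r}")
            # 1) set NULL for the invalid value
            result.append(None)
            # 2) record that this array had an invalid value at index i
            invalid_positions.append(i)
            # 3) the loop then continues with the remaining elements
    return result, invalid_positions


converted, bad = cast_strings_to_int(["1", "2", "3", "None", ""])
print(converted)  # [1, 2, 3, None, None]
print(bad)        # [3, 4]
```

Returning the invalid positions alongside the converted values is one possible way to report "some values were invalid" without failing the whole cast. Passing `strict=True` gives the fail-fast behaviour that the message says other users may prefer.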