From dev-return-11566-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Wed Apr 3 03:53:37 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id B62FA180668 for ; Wed, 3 Apr 2019 05:53:36 +0200 (CEST) Received: (qmail 62511 invoked by uid 500); 3 Apr 2019 03:53:34 -0000 Mailing-List: contact dev-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@arrow.apache.org Delivered-To: mailing list dev@arrow.apache.org Received: (qmail 62499 invoked by uid 99); 3 Apr 2019 03:53:33 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Apr 2019 03:53:33 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 530AC183940 for ; Wed, 3 Apr 2019 03:53:33 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.8 X-Spam-Level: * X-Spam-Status: No, score=1.8 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id EOk0nO4iHqPG for ; Wed, 3 Apr 2019 03:53:31 +0000 (UTC) Received: from mail-vs1-f53.google.com (mail-vs1-f53.google.com [209.85.217.53]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 4A38A624C9 for ; Wed, 3 Apr 2019 03:53:31 +0000 (UTC) Received: by mail-vs1-f53.google.com with SMTP id t78so9153748vsc.1 for ; Tue, 02 Apr 2019 20:53:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc; bh=UZZR06MAkLLTIpBe/jGIpjXtnrAK5PuDaGkvrI1ceAU=; b=rrYO4XI9Wsg9N1nQxB0HpVMZgjsnxzIL0GBolwHGJZTUe0Bf/YbsVWv1IKGuaerSgd 4p+pdT8dcl5GTiw4bR6VlFuI7v/t3Ws2g/FePjeg6wGY2iQbwXSUyQ3aLVmOEvmS5roD xcY1EhosfU+HSOWDw7l9ZNtx78lRepneG+w4j1mPreNuDvpVCZMK2xlKb0I9NdUHd7t5 IN2DM3cQ0S8iEkqHw9r8UGfF1Bmn7zinbyjnEGf7+QlFyrwAyQ4yEOYR4Seh7A/T2PrC ED5mjtukx4NtuWbTwwqb7F1UPDVGEa79uFVbQNEO8iWjn1n0l9JO4vIgzjnlrgBFPjhr 3Z8A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc; bh=UZZR06MAkLLTIpBe/jGIpjXtnrAK5PuDaGkvrI1ceAU=; b=F2X0oQGTvJbTs824pUfOS0Ia/sWtDvG2PInnBBZigcAD2ssjnxFfYlWQk5D0HQbjCa hJbXAGnAQG1c2oaPMLqEqeS65canbaypAjbHnqzPs9/7JBhGekzfH2uPCyHm3SOBzLxq amnkrneeJyggkWiiqiJZUoVSW5UavaD7RYh6WFbiQLw6UZtk0yMA+9x9tmBi6GmdS/+9 Ecb5jhyzjmrLpe8TJEd/sZB21rNyuLekfOYALALxtSUomV8rSI3u4BYsSxsTx3FdmI5t q1x/9Rg79iRHxKJH5rRkUnESq89VxRISfB5bt+XUrwdowQpM1EVRKPy7ehAfGTfUpAtn 7aGg== X-Gm-Message-State: APjAAAX6hzs5sf97h+DsZnEfdl3XJPqp5tyiNnj0hSPwBY1WLl6EEgDr qLclyRYHAueEqlL+OHj/0mePrhRz/do7NuKvLCY= X-Google-Smtp-Source: APXvYqz+RByJLNMND20w3s1eYLk150j4Ga0iXl8/AdHD+YJjVsD/YL6qyJ5zoxu2WuUnR50T3OsPqPzypTZa88Cb4u8= X-Received: by 2002:a67:fa95:: with SMTP id f21mr18817960vsq.180.1554263605487; Tue, 02 Apr 2019 20:53:25 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Reply-To: emkornfield@gmail.com From: Micah Kornfield Date: Tue, 2 Apr 2019 20:53:13 -0700 Message-ID: Subject: Re: [DISCUSS][Format] Time Interval Changes To: Wes McKinney Cc: dev@arrow.apache.org, Jacques Nadeau Content-Type: multipart/alternative; boundary="00000000000017af7905859832cd" --00000000000017af7905859832cd Content-Type: text/plain; charset="UTF-8" Based on the discussion so far, my attempt at concrete Schema proposals below. Jacques I think summarizes what we've discussed, apologies if I've misunderstood. Wes would Option 1 work to support the Pandas Time Delta use-case? I'm leaning towards Option 1 if it satisfies everyone (but happy to implement whatever we come to a consensus on). ** Option 1: New Type: ** /// An absolute length of time unrelated to any calendar artifacts. For the purposes /// of Arrow Implementations, adding this value to a Timestamp ("t1") naively (i.e. simply summing /// the two number) is acceptable even though in some cases the resulting Timestamp (t2) would /// not account for leap-seconds during the elapsed time between "t1" and "t2". Similarly, representing /// the difference between two Unix timestamp is acceptable, but would yield a value that is possibly a few seconds /// off from the true elapsed time. /// /// The resolution defaults to /// millisecond, but can be any of the other supported TimeUnit values as /// with Timestamp and Time types. This type is always represented as /// an 8-byte integer. table DurationInterval { unit: TimeUnit = MILLISECOND; } ** Option 2: New TimeDelta enum on Interval Unit (strong definition around leap-seconds): ** enum IntervalUnit: short { YEAR_MONTH, DAY_TIME, TIME_DELTA} // A "calendar" interval which models types that don't necessarily // have a precise duration without the context of a base timestamp (e.g. // days can differ in length during day light savings time transitions). In the case // of TimeDelta it is possible no precise definition is possible if the base timestamp occurs // at an instant when a leap second was added (but would only differ by at most 1 second). // YEAR_MONTH - Indicates the number of elapsed whole months, stored as // 4-byte integers. // DAY_TIME - Indicates the number of elapsed days and milliseconds, // stored as 2 contiguous 32-bit integers (8-bytes in total). Support // of this IntervalUnit is not required for full arrow compatibility. // TIME_DELTA - Indicates absolute time difference between Unix Timstamps (i.e. excluding leap seconds). This value is always represented as an 8-byte integer. table Interval { unit: IntervalUnit; resolution: TimeUnit // Only relevant for TIME_DELTA } On Tue, Apr 2, 2019 at 10:03 AM Wes McKinney wrote: > Since there were some mentions of leap seconds: > > I think the intent of the timedelta/duration type should be to express > the difference between UNIX timestamps (from second to nanosecond > resolution), which don't include leap seconds. We use the > timedelta64[ns] type in pandas for example, which is a > nanosecond-resolution difference of UNIX timestamps. > > On Tue, Apr 2, 2019 at 10:05 AM Jacques Nadeau wrote: > > > > > > > > I could go either way, it has some benefits for forward compatibility I > > > suppose, but on the other hand YAGNI, if you feel strongly, I'm ok > > > including it. However, the more optional fields we have for a specific > > > enum value, makes me lean more towards a new type instead of just an > enum. > > > > > I'm okay with skipping for now. Appreciate the focus on only what we > > actually need. > > > > > > > > > Could you elaborate on defining standard arithmetic conversions between > > > time-delta/duration in seconds and other time unit (days, months, > years) as > > > part of the standard/format, I'm still not sure I understand what the > > > use-case is here. > > > > > > > Here goes nothing... > > > > Seems like there are two options for durations: > > 1) they aren't related to any other type > > 2) they have a relationship to timestamps and dates. > > > > If 1, then the only thing I could understand is real world duration how > > seconds are defined (and fractions thereof). E.g. [1] :D. In this > > situation, there is no way to express any unit of time of higher > > granularity than a second (e.g. days) since it is up to application > > implementer to define the relationship. This severely limits the > > expressiveness of the concept. (I can't ever use something TimeUnit.DAYS) > > and stops the ability to cover the existing interval YEAR_MONTH type I > > believe (since it has a resolution of months). > > > > If 2, then we must define the canonical value of ts + duration, otherwise > > duration are somewhat meaningless, thus the proposed translation chart > > (which causes its own oddities depending on the resolution of the time > type > > you are adding to). > > > > That being said, having started to remember previous discussions on this, > > I'm most inclined to simply pick #1 and ignore the need for anything > more. > > The curiousness of interval math in database systems underscores the fact > > that it apparently doesn't matter that much. In most cases, today + 3 > > months is close enough to today + 90 days for government work. > > > > Let's +2 a patch and get it merged quickly so we never have to think > about > > this again :) > > > > [1] "the duration of 9,192,631,770 periods > > of the radiation > corresponding to > > the transition between the two hyperfine levels > > of the ground state > of > > the caesium-133 atom" (at a > > temperature of 0 K ) > > > > > > --00000000000017af7905859832cd--