From: Julien Le Dem
Date: Mon, 3 Oct 2016 15:23:25 -0700
Subject: Re: Timestamps with different precision / Timedeltas
To: dev@arrow.apache.org

I created a JIRA for the Timestamp type if you want to comment in it:
https://issues.apache.org/jira/browse/ARROW-315

On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem wrote:

> Consistency with Parquet is a plus.
> Parquet supports timestamp millis and micros (no nanos):
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>
> Currently Arrow timestamps have a timezone field:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
> Wes: regarding your suggestion, do we want to change timestamp as follows?
> - remove the "timezone" field and say it's UTC
> - add a unit field (MICROS | MILLIS)
>
> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss wrote:
>
>> +1 for nano or milli, or something else?
>>
>> TL;DR;
>>
>> epochMilli++
>>
>> --
>>
>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>> Regarding your aside, I am also a fan of the
>> http://speleotrove.com/decimal/decarith.html specification, though I
>> must admit I am biased simply because it addresses the Rexx Lost Digits
>> condition.
>>
>> The most commonly used timestamps I see are stored as epoch milliseconds,
>> or epochMillis. It may not be canonical, however there are many billions
>> of devices and software applications utilizing it.
>>
>> To support extremely fine grained DateTime representations, particularly
>> in common scientific applications, I'm for _epochNano_, with logical
>> casting to work with existing datasets that are in epochMilli instead. We
>> can deal with the rollover in 300k years.
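The rollover horizon depends strongly on the unit chosen, so it is worth checking the arithmetic. A minimal sketch, assuming a signed 64-bit count from the UNIX epoch and a mean Gregorian year of 365.2425 days; `int64_range_years` is a hypothetical helper name, not part of Arrow or Parquet. Under these assumptions a horizon near 300k years corresponds to microsecond resolution, while nanoseconds reach roughly 292 years on either side of 1970.

```python
# Sketch: how far a signed 64-bit epoch counter reaches at each resolution.
# `int64_range_years` is a hypothetical helper, not an Arrow or Parquet API.

INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 86_400  # mean Gregorian year (assumption)


def int64_range_years(ticks_per_second: int) -> float:
    """Years representable on either side of the epoch for a given unit."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


for name, ticks in [("seconds", 1), ("millis", 10**3),
                    ("micros", 10**6), ("nanos", 10**9)]:
    print(f"{name:>7}: about {int64_range_years(ticks):,.0f} years")
```

The choice of unit therefore trades range for resolution by a factor of 1000 per step.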
>>
>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
>> doubt it will ever happen. No, I'm not a millennial.
>>
>> My only concern is for use of 64-bit logical DateTime at the small
>> Physics level. For that use case, UT2 is more appropriate; measurements
>> are frequently in fractions of nanoseconds. Perhaps there could be a way
>> to logically cast a signed int96, which is supported by Parquet.
>>
>> Timestamp [logical type]
>> extends FixedDecimal [logical type] (int64)
>> extends FixedWidth [physical type] byteArray[8]
>>
>> Timestamp96 [logical type]
>> extends FixedDecimal [logical type] (int96)
>> extends FixedWidth [physical type] byteArray[12]
>>
>> --
>>
>> Although inappurtenant to this specific discussion, I would like to see a
>> standardized DateTime specification that uses a signed int64 as the decimal
>> epochSecond and an unsigned int96 as the fractional representation of a
>> second.
>>
>> TimestampHiggs [logical type]
>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
>> columns, the fixed decimal epochSecond and the fractional second as
>> (n/2^96).
>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>
>> --Donald
>>
>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau wrote:
>> >
>> > +1
>> >
>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney wrote:
>> >
>> >> hello,
>> >>
>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>> >> timestamps with a particular resolution (second through nanosecond),
>> >> as these are reasonably common representations? We can look to expand
>> >> later if it is needed.
>> >>
>> >> Thanks
>> >> Wes
>> >>
>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney wrote:
>> >>> Bumping this discussion.
>> >>> As part of finalizing a v1 Arrow spec (for
>> >>> purposes of moving data between systems, at minimum) we should propose
>> >>> timestamp metadata and physical memory representation that maximizes
>> >>> interoperability with other systems. It seems like a fixed decimal
>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>> >>> could pass unmodified with appropriate metadata.
>> >>>
>> >>> We will also need decimal types in Arrow (at least to accommodate
>> >>> common database representations and file formats like Parquet), so
>> >>> this seems like a reasonable potential hierarchy of types:
>> >>>
>> >>> Timestamp [logical type]
>> >>> extends FixedDecimal [logical type]
>> >>> extends FixedWidth [physical type]
>> >>>
>> >>> I did a bit of internet searching but did not find a canonical
>> >>> reference or implementation of fixed decimals; that would be helpful.
>> >>>
>> >>> As an aside: for floating decimal numbers for numerical data we could
>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>> >>> which implements the spec described at
>> >>> http://speleotrove.com/decimal/decarith.html
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>> >>>> general fixed-denominator rational representation, for times and other
>> >>>> purposes? Powers of ten are convenient for humans, but powers of two more
>> >>>> efficient. For some applications, the efficiency of bit operations over
>> >>>> divmod is more useful than an exact representation of integral
>> >>>> nanoseconds.
>> >>>>
>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> >>>> basically working), which may provide ideas or useful code.
>> >>>> It was intended
>> >>>> for precisely this sort of application.
>> >>>>
>> >>>> Regards,
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn wrote:
>> >>>>
>> >>>>> I agree that having a Decimal type for timestamps is a nice
>> >>>>> definition. Having your time encoded as seconds or nanoseconds should
>> >>>>> be the same as having a scale of the respective amount. But I would
>> >>>>> rather avoid having a separate decimal physical type. Therefore I'd
>> >>>>> prefer the Parquet approach where decimal is only a logical type and
>> >>>>> backed by either a bytearray, int32 or int64.
>> >>>>>
>> >>>>> Thus a more general timestamp could look like:
>> >>>>>
>> >>>>> * Decimals are logical types, physical types are the same as defined
>> >>>>> in Parquet [1]
>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>> >>>>> are all powers of ten, thus matching the specification of decimal
>> >>>>> scale really well.)
>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>> >>>>> (and optionally may have a timezone) and signalling that we have a
>> >>>>> Time and not just a "simple" decimal.
>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>> >>>>> include a metadata field. Once we're sure the implementation works, we
>> >>>>> can add metadata about it.
>> >>>>>
>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>> >>>>> for a timezone.
>> >>>>>
>> >>>>> For my usages, I don't have the use-case for a larger than int64
>> >>>>> timestamp and would like to have it exactly as such in my computation,
>> >>>>> thus my preference for the Parquet way.
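The scheme in the list above (an int64 payload, seconds as the base unit, and a decimal scale selecting the resolution) can be sketched as follows. This is an illustration under stated assumptions, not Arrow's or Parquet's actual type definition: the class name `ScaledTimestamp` and its methods are hypothetical, the value is assumed to be UTC, and the real value is taken to be `unscaled / 10**scale` seconds since the UNIX epoch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class ScaledTimestamp:
    """Hypothetical logical timestamp: seconds since epoch = unscaled / 10**scale."""
    unscaled: int  # the int64 payload as stored
    scale: int     # decimal scale: 0 = seconds, 3 = millis, 6 = micros, 9 = nanos

    def rescale(self, new_scale: int) -> "ScaledTimestamp":
        """Change resolution; narrowing floors toward negative infinity."""
        if new_scale >= self.scale:
            factor = 10 ** (new_scale - self.scale)
            return ScaledTimestamp(self.unscaled * factor, new_scale)
        factor = 10 ** (self.scale - new_scale)
        return ScaledTimestamp(self.unscaled // factor, new_scale)

    def to_datetime(self) -> datetime:
        """Convert to an aware UTC datetime (limited to microsecond precision)."""
        micros = self.rescale(6).unscaled
        return EPOCH + timedelta(microseconds=micros)


# The same instant expressed at two resolutions.
in_millis = ScaledTimestamp(1475533425891, 3)
print(in_millis.to_datetime().isoformat())
print(in_millis.rescale(0))
```

Because only the scale differs between resolutions, the payload can stay a plain int64 in computation, which is the property the message asks for.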
>> >>>>>
>> >>>>> Uwe
>> >>>>>
>> >>>>> [1]
>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>> >>>>>
>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>> >>>>>> they are variable width and can get quite large. I've seen one or two
>> >>>>>> systems that started with binary floating point numbers, which are
>> >>>>>> much worse for business computing, and then changed to Java
>> >>>>>> BigDecimal, which gives the right answer but is horribly inefficient.)
>> >>>>>>
>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>> >>>>>> just has a piece of metadata saying something like "every value in
>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>> >>>>>> program to do that multiplying.
>> >>>>>>
>> >>>>>> My advice is to create a good fixed decimal type and lean on it
>> >>>>>> heavily.
>> >>>>>>
>> >>>>>> Julian
>
> --
> Julien

--
Julien
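Julian Hyde's description of a fixed decimal type, plain integers plus one piece of field-level metadata, can be illustrated with a small sketch. The class `FixedDecimalColumn` and its methods are hypothetical names chosen for this example, and the convention that the stored integer equals the real value times `10**scale` is an assumption about what "multiplied by 1 million" means.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass
class FixedDecimalColumn:
    """Hypothetical fixed decimal field: stored int = real value * 10**scale."""
    values: List[int]  # raw integers, e.g. int64 in a columnar buffer
    scale: int         # field-level metadata; 6 means "multiplied by 1 million"

    def total(self) -> int:
        """Aggregate on the raw integers; no per-value decimal arithmetic."""
        return sum(self.values)

    def to_decimal(self, stored: int) -> Decimal:
        """Apply the scale only when a client needs the real value."""
        return Decimal(stored).scaleb(-self.scale)


# 1.5 and 0.25 stored with six decimal digits.
col = FixedDecimalColumn(values=[1_500_000, 250_000], scale=6)
print(col.to_decimal(col.total()))
```

All arithmetic inside the column runs on integers; the scale is applied once, at the boundary, which is where the low overhead comes from.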