arrow-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: Timestamps with different precision / Timedeltas
Date Mon, 03 Oct 2016 22:23:25 GMT
I created a JIRA for the Timestamp type if you want to comment in it:
https://issues.apache.org/jira/browse/ARROW-315

On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <julien@dremio.com> wrote:

> consistency with Parquet is a plus
> Parquet supports timestamp millis and micros (no nanos)
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>
> currently Arrow timestamps have a timezone field.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
> Wes: regarding your suggestion, do we want to change timestamp as follows?
> - remove the "timezone" field and say it's UTC
> - add unit field (MICROS | MILLIS)
>
>
>
> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <donald.foss@gmail.com>
> wrote:
>
>> +1 for nano or milli, or something else?
>>
>> TL;DR;
>>
>> epochMilli++
>>
>> —
>>
>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>> Regarding your aside, I am also a fan of the
>> http://speleotrove.com/decimal/decarith.html specification, though I
>> must admit I am biased simply because it addresses the Rexx Lost Digits
>> condition.
>>
>> The most commonly used timestamps I see are stored as epoch milliseconds,
>> or epochMillis.  It may not be canonical, however there are many billions
>> of devices and software applications utilizing it.
>>
>> To support extremely fine grained DateTime representations, particularly
>> in common scientific applications, I’m for _epochNano_, with logical
>> casting to work with existing datasets that are in epochMilli instead.  We
>> can deal with the rollover in about 292 years (int64 nanoseconds overflow in 2262).
>>
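The rollover horizon depends on the resolution. The following Python sketch computes how far past the epoch a signed int64 tick count can reach at each unit. The helper name `years_until_overflow` is made up for illustration, and a mean Gregorian year is assumed.

```python
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 86_400  # mean Gregorian year (assumption)

TICKS_PER_SECOND = {
    "second": 1,
    "millisecond": 10**3,
    "microsecond": 10**6,
    "nanosecond": 10**9,
}


def years_until_overflow(unit: str) -> float:
    """Years after the epoch that a signed int64 tick count can represent."""
    return INT64_MAX / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR


for unit in TICKS_PER_SECOND:
    print(f"{unit:>11}: about {years_until_overflow(unit):,.0f} years")
```

At nanosecond resolution this gives roughly 292 years on either side of 1970. A figure near 300 thousand years corresponds to microseconds.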
>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
>> doubt it will ever happen. No, I’m not a millennial.
>>
>> My only concern is for use of 64-bit logical DateTime at the small
>> Physics level.  For that use case, UT2 is more appropriate; measurements
>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>> to logically cast a signed int96, which is supported by Parquet.
>>
>> Timestamp [logical type]
>> extends FixedDecimal [logical type] (int64)
>> extends FixedWidth [physical type] byteArray[8]
>>
>> Timestamp96 [logical type]
>> extends FixedDecimal [logical type] (int96)
>> extends FixedWidth [physical type] byteArray[12]
>>
>> —
>>
>> Although inappurtenant to this specific discussion, I would like to see a
>> standardized DateTime specification that uses a signed int64 as the decimal
>> epochSecond and an unsigned int96 as the fractional representation of a
>> second.
>>
>> TimestampHiggs [logical type]
>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
>> columns, the fixed decimal epochSecond and the fractional second as
>> (n/2^96).
>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>
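To make the two-column layout above concrete, here is a small Python sketch that joins a signed epochSecond with an unsigned 96-bit numerator into one exact rational value. The function name `higgs_to_seconds` is hypothetical. Python integers are unbounded, so the 96-bit limit is checked explicitly.

```python
from fractions import Fraction

FRACTION_BITS = 96


def higgs_to_seconds(epoch_second: int, fraction: int) -> Fraction:
    """Combine the two columns into an exact rational number of seconds."""
    if not 0 <= fraction < 2**FRACTION_BITS:
        raise ValueError("fraction must fit in an unsigned 96-bit integer")
    # The fractional column is interpreted as n / 2^96 of a second.
    return epoch_second + Fraction(fraction, 2**FRACTION_BITS)
```

The smallest step in this scheme is 2^-96 of a second, on the order of 1e-29 seconds.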
>> —Donald
>>
>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <jacques@apache.org> wrote:
>> >
>> > +1
>> >
>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> >
>> >> hello,
>> >>
>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>> >> timestamps with a particular resolution (second through nanosecond),
>> >> as these are reasonably common representations? We can look to expand
>> >> later if it is needed.
>> >>
>> >> Thanks
>> >> Wes
>> >>
>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>> >>> purposes of moving data between systems, at minimum) we should propose
>> >>> timestamp metadata and physical memory representation that maximizes
>> >>> interoperability with other systems. It seems like a fixed decimal
>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>> >>> could pass unmodified with appropriate metadata.
>> >>>
>> >>> We will also need decimal types in Arrow (at least to accommodate
>> >>> common database representations and file formats like Parquet), so
>> >>> this seems like a reasonable potential hierarchy of types:
>> >>>
>> >>> Timestamp [logical type]
>> >>> extends FixedDecimal [logical type]
>> >>> extends FixedWidth [physical type]
>> >>>
>> >>> I did a bit of internet searching but did not find a canonical
>> >>> reference or implementation of fixed decimals; that would be helpful.
>> >>>
>> >>> As an aside: for floating decimal numbers for numerical data we could
>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>> >>> which implements the spec described at
>> >>> http://speleotrove.com/decimal/decarith.html
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <alex@alexsamuel.net> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>> >>>> general fixed-denominator rational representation, for times and other
>> >>>> purposes? Powers of ten are convenient for humans, but powers of two are
>> >>>> more efficient. For some applications, the efficiency of bit operations
>> >>>> over divmod is more useful than an exact representation of integral
>> >>>> nanoseconds.
>> >>>>
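The trade-off described above can be sketched in Python as follows. The denominator 2**30 is a hypothetical choice made only for illustration, and in Python itself the shift is not meaningfully faster; the point is which operations a compiled implementation would need.

```python
DECIMAL_DENOM = 10**9   # ticks per second: nanoseconds
BINARY_SHIFT = 30       # ticks per second: 2**30 (hypothetical choice)
BINARY_MASK = (1 << BINARY_SHIFT) - 1


def split_decimal(ticks: int) -> tuple[int, int]:
    """Whole seconds and leftover ticks, using integer division."""
    return divmod(ticks, DECIMAL_DENOM)


def split_binary(ticks: int) -> tuple[int, int]:
    """Whole seconds and leftover ticks, using only a shift and a mask."""
    return ticks >> BINARY_SHIFT, ticks & BINARY_MASK
```

The cost of the binary denominator is that one nanosecond is no longer an integral number of ticks, since 2**30 is not divisible by 10**9.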
>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> >>>> basically working), which may provide ideas or useful code. It was
>> >>>> intended for precisely this sort of application.
>> >>>>
>> >>>> Regards,
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uwelk@xhochy.com> wrote:
>> >>>>
>> >>>>> I agree that having a Decimal type for timestamps is a nice
>> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
>> >>>>> the same as having a scale of the respective amount. But I would rather
>> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>> >>>>> Parquet approach where decimal is only a logical type and backed by
>> >>>>> either a bytearray, int32 or int64.
>> >>>>>
>> >>>>> Thus a more general timestamp could look like:
>> >>>>>
>> >>>>> * Decimals are logical types, physical types are the same as defined in
>> >>>>> Parquet [1]
>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>> >>>>> are all powers of ten, thus matching the specification of decimal scale
>> >>>>> really well.)
>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>> >>>>> (and optionally may have a timezone) and signalling that we have a Time
>> >>>>> and not just a "simple" decimal.
>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>> >>>>> include a metadata field. Once we're sure the implementation works, we
>> >>>>> can add metadata about it.
>> >>>>>
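One way to read the list above: a timestamp is an integer plus a decimal scale, with seconds at scale 0, milliseconds at scale 3 and nanoseconds at scale 9. The Python sketch below converts between scales under that reading. The names `SCALE_BY_UNIT` and `rescale` are hypothetical.

```python
# Decimal scale for each unit when the base unit is seconds.
SCALE_BY_UNIT = {"second": 0, "millisecond": 3, "microsecond": 6, "nanosecond": 9}


def rescale(value: int, from_scale: int, to_scale: int) -> int:
    """Convert an unscaled integer between decimal scales.

    Increasing the scale is exact. Decreasing it drops digits and floors
    toward negative infinity.
    """
    if to_scale >= from_scale:
        return value * 10 ** (to_scale - from_scale)
    return value // 10 ** (from_scale - to_scale)
```

The stored value stays a plain int64 throughout; only the scale in the metadata changes.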
>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>> >>>>> for a timezone.
>> >>>>>
>> >>>>> For my usages, I don't have the use-case for a larger than int64
>> >>>>> timestamp and would like to have it exactly as such in my computation,
>> >>>>> thus my preference for the Parquet way.
>> >>>>>
>> >>>>> Uwe
>> >>>>>
>> >>>>> [1]
>> >>>>>
>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>> >>>>>
>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>> >>>>>> they are variable width and can get quite large. I've seen one or two
>> >>>>>> systems that started with binary floating point numbers, which are
>> >>>>>> much worse for business computing, and then changed to Java BigDecimal,
>> >>>>>> which gives the right answer but is horribly inefficient.)
>> >>>>>>
>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>> >>>>>> just has a piece of metadata saying something like "every value in
>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>> >>>>>> program to do that multiplying.
>> >>>>>>
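A minimal Python sketch of the fixed decimal idea described above: the column holds plain integers, and a single scale value in the metadata tells the client how to interpret them. The class `FixedDecimalColumn` and its methods are hypothetical, not an existing Arrow or Parquet API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FixedDecimalColumn:
    """Hypothetical column: plain integers plus one piece of scale metadata."""
    unscaled: list[int]  # stored values, e.g. int64
    scale: int           # number of decimal digits after the point

    def total(self) -> "FixedDecimalColumn":
        # Adding values of the same scale is ordinary integer arithmetic.
        return FixedDecimalColumn([sum(self.unscaled)], self.scale)

    def decode(self) -> list[Decimal]:
        # Only the client applies the scale, when it needs real values.
        return [Decimal(v).scaleb(-self.scale) for v in self.unscaled]
```

For example, with `scale=6` the stored integer `1_250_000` stands for 1.25, which matches the "multiplied by 1 million" metadata in the quoted text.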
>> >>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
>> >>>>>>
>> >>>>>> Julian
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>
>>
>>
>
>
> --
> Julien
>



-- 
Julien
