arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Timestamps with different precision / Timedeltas
Date Thu, 29 Sep 2016 22:19:59 GMT
Hello,

For the current iteration of Arrow, can we agree to support int64 UNIX
timestamps with a particular resolution (second through nanosecond),
as these are reasonably common representations? We can look to expand
later if it is needed.

Thanks
Wes
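
For illustration only (this is not from the thread): a minimal sketch of
what "int64 UNIX timestamps with a particular resolution" can look like in
practice. The names TimeUnit, convert and to_datetime are hypothetical and
are not Arrow API. The sketch assumes values count ticks since
1970-01-01 UTC and that the unit travels as metadata next to the data.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class TimeUnit(Enum):
    """Hypothetical unit tag; the value is the number of ticks per second."""
    SECOND = 1
    MILLI = 1_000
    MICRO = 1_000_000
    NANO = 1_000_000_000


def convert(value: int, src: TimeUnit, dst: TimeUnit) -> int:
    """Rescale a tick count from src to dst resolution, staying within int64."""
    if dst.value >= src.value:
        # Finer resolution: exact, but may overflow int64.
        out = value * (dst.value // src.value)
    else:
        # Coarser resolution: lossy, floors toward negative infinity.
        out = value // (src.value // dst.value)
    if not INT64_MIN <= out <= INT64_MAX:
        raise OverflowError(f"{value} {src.name} does not fit int64 as {dst.name}")
    return out


def to_datetime(value: int, unit: TimeUnit) -> datetime:
    """Interpret a tick count as a UTC datetime (truncated to microseconds)."""
    seconds, ticks = divmod(value, unit.value)
    micros = ticks * 1_000_000 // unit.value
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)
```

With these definitions, the send time of this message, 1475187599 seconds,
becomes 1475187599000000000 at nanosecond resolution, which still fits in
an int64. At nanosecond resolution an int64 covers roughly 292 years on
either side of the epoch, so the choice of unit trades range for precision.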

On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> purposes of moving data between systems, at minimum) we should propose
> timestamp metadata and physical memory representation that maximizes
> interoperability with other systems. It seems like a fixed decimal
> would meet this requirement as UNIX-like timestamps at some resolution
> could pass unmodified with appropriate metadata.
>
> We will also need decimal types in Arrow (at least to accommodate
> common database representations and file formats like Parquet), so
> this seems like a reasonable potential hierarchy of types:
>
> Timestamp [logical type]
> extends FixedDecimal [logical type]
> extends FixedWidth [physical type]
>
> I did a bit of internet searching but did not find a canonical
> reference or implementation of fixed decimals; that would be helpful.
>
> As an aside: for floating decimal numbers for numerical data we could
> utilize an implementation like http://www.bytereef.org/mpdecimal/
> which implements the spec described at
> http://speleotrove.com/decimal/decarith.html
>
> Thanks
> Wes
>
> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <alex@alexsamuel.net> wrote:
>> Hi all,
>>
>> May I suggest that instead of fixed-point decimals, you consider a more
>> general fixed-denominator rational representation, for times and other
>> purposes? Powers of ten are convenient for humans, but powers of two more
>> efficient. For some applications, the efficiency of bit operations over
>> divmod is more useful than an exact representation of integral nanoseconds.
>>
>> std::chrono takes this approach. I'll also humbly point you at my own
>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> basically working), which may provide ideas or useful code. It was intended
>> for precisely this sort of application.
>>
>> Regards,
>> Alex
>>
>>
>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uwelk@xhochy.com> wrote:
>>
>>> I agree that having a Decimal type for timestamps is a nice
>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> the same as having a scale of the respective amount. But I would rather
>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> parquet approach where decimal is only a logical type and backed by
>>> either a bytearray, int32 or int64.
>>>
>>> Thus a more general timestamp could look like:
>>>
>>> * Decimals are logical types, physical types are the same as defined in
>>> Parquet [1]
>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> are all powers of ten, thus matching the specification of decimal scale
>>> really well.)
>>> * Timestamp is just another logical type that is referring to Decimal
>>> (and optionally may have a timezone) and signalling that we have a Time
>>> and not just a "simple" decimal.
>>> * For a first iteration, I would assume no timezone or UTC but not
>>> include a metadata field. Once we're sure the implementation works, we
>>> can add metadata about it.
>>>
>>> Timedeltas could be addressed in a similar way, just without the need
>>> for a timezone.
>>>
>>> For my usages, I don't have a use case for a timestamp larger than
>>> int64 and would like to have it exactly as such in my computation,
>>> thus my preference for the Parquet way.
>>>
>>> Uwe
>>>
>>> [1]
>>>
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>
>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> > numbers are floating decimal. They have a few nice properties, but
>>> > they are variable width and can get quite large. I've seen one or two
>>> > systems that started with binary floating point numbers, which are
>>> > much worse for business computing, and then change to Java BigDecimal,
>>> > which gives the right answer but is horribly inefficient.)
>>> >
>>> > A fixed decimal type has virtually zero computational overhead. It
>>> > just has a piece of metadata saying something like "every value in
>>> > this field is multiplied by 1 million" and leaves it to the client
>>> > program to do that multiplying.
>>> >
>>> > My advice is to create a good fixed decimal type and lean on it heavily.
>>> >
>>> > Julian
>>> >
>>>
>>>
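
For illustration only (not from the thread): a small sketch of the
fixed-decimal idea discussed in the quoted messages, where the column
stores plain int64 values and a single scale in the metadata tells the
reader how to interpret them. FixedDecimalColumn and its methods are
hypothetical names, not Arrow or Parquet API. The convention
value = unscaled * 10^(-scale) follows the Parquet decimal definition
linked in [1]. A timestamp in seconds with scale 3 is then simply a
millisecond count.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


@dataclass(frozen=True)
class FixedDecimalColumn:
    """Hypothetical column: int64 storage plus one scale for every value."""
    unscaled: List[int]  # raw stored integers
    scale: int           # number of decimal digits after the point

    def value(self, i: int) -> Decimal:
        """Apply the scale on read: unscaled * 10^(-scale)."""
        return Decimal(self.unscaled[i]).scaleb(-self.scale)

    def rescale(self, new_scale: int) -> "FixedDecimalColumn":
        """Move to a finer scale exactly; refuse lossy or overflowing changes."""
        if new_scale < self.scale:
            raise ValueError("coarser scale would lose digits")
        factor = 10 ** (new_scale - self.scale)
        out = [v * factor for v in self.unscaled]
        if any(not INT64_MIN <= v <= INT64_MAX for v in out):
            raise OverflowError("rescaled value does not fit int64")
        return FixedDecimalColumn(out, new_scale)

    def add(self, other: "FixedDecimalColumn") -> "FixedDecimalColumn":
        """Same scale means addition is plain integer addition."""
        if self.scale != other.scale:
            raise ValueError("scales differ; rescale first")
        summed = [a + b for a, b in zip(self.unscaled, other.unscaled)]
        return FixedDecimalColumn(summed, self.scale)


# Seconds since the epoch stored with scale 3, i.e. milliseconds.
ts = FixedDecimalColumn([1475187599123], scale=3)
delta = FixedDecimalColumn([1500], scale=3)  # a 1.5 second timedelta
later = ts.add(delta)
```

Arithmetic on values with the same scale never touches the scale at all,
which is the sense in which the stored integers carry almost no
computational overhead.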
