arrow-dev mailing list archives

From Uwe Korn <uw...@xhochy.com>
Subject Re: Timestamps with different precision / Timedeltas
Date Thu, 14 Jul 2016 14:27:13 GMT
I agree that having a Decimal type for timestamps is a nice
definition. Having your time encoded as seconds or nanoseconds should be
the same as having a scale of the respective amount. But I would rather
avoid having a separate decimal physical type. Therefore I'd prefer the
Parquet approach where decimal is only a logical type and is backed by
either a byte array, int32, or int64.

Thus a more general timestamp could look like:

* Decimals are logical types, physical types are the same as defined in 
Parquet [1]
* Base unit for timestamps is seconds; you can get milliseconds and
nanoseconds by using a different scale. (Note that seconds, milliseconds,
and nanoseconds are all related by powers of ten, thus matching the
specification of decimal scale really well.)
* Timestamp is just another logical type that refers to Decimal
(and optionally may have a timezone), signalling that we have a Time
and not just a "simple" decimal.
* For a first iteration, I would assume no timezone (or UTC) and not
include a metadata field. Once we're sure the implementation works, we
can add metadata for it.
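
To make the scale idea above concrete, here is a minimal sketch in plain
Python (not Arrow code; the helper names are made up for illustration):

    # A timestamp is an integer count plus a power-of-ten scale relative to
    # seconds, so scale 3 means milliseconds and scale 9 means nanoseconds.
    from decimal import Decimal

    def to_seconds(value, scale):
        """Interpret a stored integer as (decimal) seconds since the epoch."""
        return Decimal(value) / (10 ** scale)

    def from_seconds(seconds, scale):
        """Store decimal seconds since the epoch as an integer at `scale`."""
        return int(seconds * (10 ** scale))

    assert to_seconds(1468508833000, scale=3) == Decimal("1468508833")
    assert from_seconds(Decimal("1468508833.123"), scale=3) == 1468508833123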

Timedeltas could be addressed in a similar way, just without the need 
for a timezone.

For my usage, I don't have a use case for a timestamp larger than int64
and would like to have it stored exactly as such in my computations,
hence my preference for the Parquet approach.

Uwe

[1] 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

On 13.07.16 03:06, Julian Hyde wrote:
> I'm talking about a fixed decimal type, not floating decimal. (Oracle
> numbers are floating decimal. They have a few nice properties, but
> they are variable width and can get quite large. I've seen one or two
> systems that started with binary floating point numbers, which are
> much worse for business computing, and then changed to Java BigDecimal,
> which gives the right answer but is horribly inefficient.)
>
> A fixed decimal type has virtually zero computational overhead. It
> just has a piece of metadata saying something like "every value in
> this field is multiplied by 1 million" and leaves it to the client
> program to do that multiplying.
>
> My advice is to create a good fixed decimal type and lean on it heavily.
>
> Julian
>
>
> On Tue, Jul 12, 2016 at 5:46 PM, Jacques Nadeau <jacques@apache.org> wrote:
>> Julian has some experience with the Oracle internals where the perfect
>> numeric type solves many problems...  :D
>>
>>
>>
>> On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>>> As one data point, none of the systems I work with use decimals for
>>> representing timestamps (UNIX timestamps at some resolution, second /
>>> milli / nano, are the most common), so having decimal as the default
>>> storage class would cause a computational hardship. We may consider
>>> incorporating the Timestamp storage type into the canonical metadata.
>>>
>>> - Wes
>>>
>>> On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>> Is it worth doing a review of different file formats and database
>>>> systems to decide on a timestamp implementation (int64 or int96 with
>>>> some resolution seems to be quite popular as well)? At least in the
>>>> Arrow C++ codebase, we need to add decimal handling logic anyway.
>>>>
>>>> On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jhyde@apache.org> wrote:
>>>>> SQL allows timestamps to be stored with any precision (i.e. number of
>>> digits after the decimal point) between 0 and 9. That strongly indicates to
>>> me that the right implementation of timestamps is as (fixed point) decimal
>>> values.
>>>>> Then devote your efforts to getting the decimal type working correctly.
>>>>>
>>>>>
>>>>>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>>
>>>>>> hi Uwe,
>>>>>>
>>>>>> Thanks for bringing this up. So far we've largely been skirting the
>>>>>> "Logical Types Rabbit Hole", but it would be good to start a document
>>>>>> collecting requirements for various logical types (e.g. timestamps) so
>>>>>> that we can attempt to achieve good solutions on the first try based
>>>>>> on the experiences (good and bad) of other projects.
>>>>>>
>>>>>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>>>>>> prototype implementation earlier this year [1], we do have a Timestamp
>>>>>> logical type containing only an optional timezone field [2]. If you
>>>>>> contrast this with Feather (which uses Arrow's physical memory layout,
>>>>>> but custom metadata to suit Python/R needs), that has both a unit and
>>>>>> timezone [3].
>>>>>>
>>>>>> Since there is little consensus in the units of timestamps (more
>>>>>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>>>>>> uniformity), I believe the best route would be to add a unit to the
>>>>>> metadata to indicate second through nanosecond resolution. Same goes
>>>>>> for a Time type.
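
A rough sketch of what such a unit field could look like (plain Python,
not the actual Message.fbs definition; all names here are illustrative):

    # Hypothetical shape of a Timestamp logical type carrying an explicit unit.
    from enum import Enum

    class TimeUnit(Enum):
        SECOND = 0
        MILLISECOND = 3
        MICROSECOND = 6
        NANOSECOND = 9        # exponent of ten relative to seconds

    class TimestampType:
        def __init__(self, unit=TimeUnit.MILLISECOND, timezone=None):
            self.unit = unit          # resolution of the stored int64 values
            self.timezone = timezone  # e.g. "UTC"; could be left out at first
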
>>>>>>
>>>>>> For example, Parquet has both milliseconds and microseconds (in
>>>>>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>>>>>> [4]. Other systems like Hive and Impala are relying on their own table
>>>>>> metadata to convert back and forth (e.g. embedding timestamps of
>>>>>> whatever resolution in int64 or int96).
>>>>>>
>>>>>> For Python pandas users who want to use Parquet files (via Arrow) in their
>>>>>> workflow, we're stuck with a couple options:
>>>>>>
>>>>>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>>>>>> (or MILLIS? Not all Parquet readers may be aware of the new
>>>>>> microsecond ConvertedType)
>>>>>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>>>>>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>>>>>
>>>>>> I see use cases for both of these -- for Option 1, you may care about
>>>>>> interoperability with another system that uses Parquet. For Option 2,
>>>>>> you may care about preserving the fidelity of your pandas data.
>>>>>> Realistically, #1 seems like the best default option. It makes sense
>>>>>> to offer #2 as an option.
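
To make the trade-off in Option 1 concrete, a quick back-of-the-envelope
in plain Python (the timestamp value is made up):

    # Truncating a pandas nanosecond timestamp to coarser Parquet resolutions
    # silently drops sub-unit precision.
    ns = 1467331200123456789        # nanoseconds since the UNIX epoch
    us = ns // 1000                 # TIMESTAMP_MICROS keeps ...123456 microseconds
    ms = ns // 1000000              # TIMESTAMP_MILLIS keeps only ...123 milliseconds
    assert us * 1000 != ns          # the trailing 789 nanoseconds are lost
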
>>>>>>
>>>>>> I don't think addressing time zones in the first pass is strictly
>>>>>> necessary, but as long as we store timestamps as UTC, we can also put
>>>>>> the time zone in the KeyValue metadata.
>>>>>>
>>>>>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>>>>>> that in a separate discussion. I agree that it merits inclusion as a
>>>>>> logical type, but I'm not sure what storage representation makes the
>>>>>> most sense (e.g. it is not clear to me why Parquet does not store the
>>>>>> interval as an absolute number of milliseconds; perhaps to accommodate
>>>>>> month-based intervals which may have different absolute lengths
>>>>>> depending on where you start).
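
A quick illustration of why a month-based interval cannot be reduced to a
fixed number of milliseconds (plain Python):

    # The same "1 month" interval spans a different absolute length
    # depending on the starting date.
    from datetime import date

    assert (date(2016, 3, 1) - date(2016, 2, 1)).days == 29   # leap-year February
    assert (date(2016, 8, 1) - date(2016, 7, 1)).days == 31   # July
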
>>>>>>
>>>>>> Let me know what you think, and if others have thoughts I'd be
>>> interested too.
>>>>>> thanks,
>>>>>> Wes
>>>>>>
>>>>>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>> [2]:
>>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>>>>>> [3]:
>>> https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>>>>>> [4]:
>>> https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>>>>>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uwelk@xhochy.com> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> in addition to categoricals, we are also currently missing a conversion
>>> from
>>>>>>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two
>>> (exact)
>>>>>>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
>>>>>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
>>> notes there
>>>>>>> are several more. We do not need to cater for all but at least some
>>> of them.
>>>>>>> Therefore I have the following questions which I'd like to have solved
>>> in some
>>>>>>> form before implementing:
>>>>>>>
>>>>>>> * Do we want to cater for other resolutions?
>>>>>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>>>>>>    in Pandas), do we cast with precision loss to the nearest match? Or
>>>>>>>    should we force the user to do it?
>>>>>>> * Not so important for me at the moment: Do we want to support time
>>> zones?
>>>>>>> My current objective is to have them for Parquet file writing. Sadly
>>> this
>>>>>>> has the same limitations. So the two main options seem to be
>>>>>>>
>>>>>>> * "roundtrip will only yield correct timezone and logical type if we
>>>>>>>    read with Arrow/Pandas again (as we use "proprietary" metadata to
>>>>>>>    encode it)"
>>>>>>> * "we restrict ourselves to milliseconds and days as resolution" (for the
>>>>>>>    latter option, we need to decide how graceful we want to be in the
>>>>>>>    Pandas<->Arrow conversion).
>>>>>>>
>>>>>>> A further datatype that we do not yet have in Arrow, but partly have in Parquet, is
>>> timedelta
>>>>>>> (or INTERVAL in Parquet). Probably we need to add another logical
>>> type to
>>>>>>> Arrow to implement them. Open for suggestions here, too.
>>>>>>>
>>>>>>> Also in the Arrow spec there is TIME which seems to be the same as
>>> TIMESTAMP
>>>>>>> (as far as the comments in the C++ code go). Is there maybe some
>>>>>>> distinction I'm missing?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Uwe
>>>>>>>

