arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [DISCUSS][Format] Time Interval Changes
Date Wed, 03 Apr 2019 03:53:13 GMT
Based on the discussion so far, my attempt at concrete schema proposals is
below. I think Jacques summarizes what we've discussed; apologies if
I've misunderstood. Wes, would Option 1 work to support the pandas timedelta
use-case? I'm leaning towards Option 1 if it satisfies everyone (but am
happy to implement whatever we come to a consensus on).

** Option 1:  New Type: **
/// An absolute length of time unrelated to any calendar artifacts.
///
/// For the purposes of Arrow implementations, adding this value to a
/// Timestamp ("t1") naively (i.e. simply summing the two numbers) is
/// acceptable even though in some cases the resulting Timestamp ("t2")
/// would not account for leap seconds during the elapsed time between
/// "t1" and "t2".  Similarly, representing the difference between two
/// Unix timestamps is acceptable, but would yield a value that is
/// possibly a few seconds off from the true elapsed time.
///
/// The resolution defaults to millisecond, but can be any of the other
/// supported TimeUnit values as with Timestamp and Time types.  This
/// type is always represented as an 8-byte integer.
table DurationInterval {
   unit: TimeUnit = MILLISECOND;
}
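
To make the "naive" arithmetic in the comment concrete, here is a minimal sketch in plain Python. It is not an Arrow API: the names `UNITS_PER_SECOND`, `to_timestamp`, `add_duration` and `diff_timestamps` are hypothetical, and the only assumption is that Timestamp and Duration are both 64-bit integer counts in the same unit. It uses the leap second inserted at the end of 2016-12-31 UTC to show the kind of discrepancy the comment accepts.

```python
from datetime import datetime, timezone

# Hypothetical helpers for illustration; not part of any Arrow implementation.
UNITS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 1_000,
    "MICROSECOND": 1_000_000,
    "NANOSECOND": 1_000_000_000,
}
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def check_int64(value):
    """Reject values that would not fit in an 8-byte signed integer."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError("value does not fit in an 8-byte integer")
    return value


def to_timestamp(dt, unit="MILLISECOND"):
    """Unix timestamp of a whole-second, timezone-aware datetime, in `unit`."""
    return check_int64(int(dt.timestamp()) * UNITS_PER_SECOND[unit])


def add_duration(timestamp, duration):
    """Timestamp + Duration: a plain integer sum in a shared unit."""
    return check_int64(timestamp + duration)


def diff_timestamps(t2, t1):
    """Timestamp - Timestamp: a Duration in the same unit."""
    return check_int64(t2 - t1)


# A leap second (23:59:60 UTC) was inserted between these two instants,
# so 3 SI seconds actually elapsed, but Unix time only advanced by 2.
t1 = to_timestamp(datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
t2 = to_timestamp(datetime(2017, 1, 1, 0, 0, 1, tzinfo=timezone.utc))
naive = diff_timestamps(t2, t1)  # 2000 milliseconds
```

The round trip `add_duration(t1, naive) == t2` still holds, which is the property the naive definition preserves.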

** Option 2: New TimeDelta enum on Interval Unit (strong definition around leap-seconds): **

enum IntervalUnit: short { YEAR_MONTH, DAY_TIME, TIME_DELTA }
// A "calendar" interval which models types that don't necessarily
// have a precise duration without the context of a base timestamp (e.g.
// days can differ in length during daylight saving time transitions).
// In the case of TIME_DELTA, a precise definition may not be possible if
// the base timestamp occurs at an instant when a leap second was added
// (but the result would only differ by at most 1 second).
// YEAR_MONTH - Indicates the number of elapsed whole months, stored as
//   4-byte integers.
// DAY_TIME - Indicates the number of elapsed days and milliseconds,
//   stored as 2 contiguous 32-bit integers (8-bytes in total).  Support
//   of this IntervalUnit is not required for full arrow compatibility.
// TIME_DELTA - Indicates the absolute time difference between Unix
//   timestamps (i.e. excluding leap seconds).  This value is always
//   represented as an 8-byte integer.
table Interval {
  unit: IntervalUnit;
  resolution: TimeUnit;  // Only relevant for TIME_DELTA
}
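
As a rough illustration of the three value widths described in the comment, here is a sketch using Python's `struct` module. The function names are hypothetical, and two things are assumptions not stated in the proposal: little-endian byte order, and days preceding milliseconds in the DAY_TIME pair.

```python
import struct

# Hypothetical packing helpers; byte order and DAY_TIME field order
# are assumptions made for this sketch only.


def pack_year_month(months):
    """YEAR_MONTH: one 4-byte signed integer (whole months)."""
    return struct.pack("<i", months)


def pack_day_time(days, milliseconds):
    """DAY_TIME: two contiguous 4-byte signed integers (8 bytes total)."""
    return struct.pack("<ii", days, milliseconds)


def pack_time_delta(count):
    """TIME_DELTA: one 8-byte signed integer in the chosen TimeUnit."""
    return struct.pack("<q", count)


year_month = pack_year_month(14)      # 1 year and 2 months
day_time = pack_day_time(1, 500)      # 1 day and 500 ms
time_delta = pack_time_delta(90_000)  # e.g. 90 s at MILLISECOND resolution
```

DAY_TIME and TIME_DELTA values are both 8 bytes wide, but only the latter is a single integer that can be summed directly with a Timestamp.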

On Tue, Apr 2, 2019 at 10:03 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> Since there were some mentions of leap seconds:
>
> I think the intent of the timedelta/duration type should be to express
> the difference between UNIX timestamps (from second to nanosecond
> resolution), which don't include leap seconds. We use the
> timedelta64[ns] type in pandas for example, which is a
> nanosecond-resolution difference of UNIX timestamps.
>
> On Tue, Apr 2, 2019 at 10:05 AM Jacques Nadeau <jacques@apache.org> wrote:
> >
> > >
> > > I could go either way, it has some benefits for forward compatibility I
> > > suppose, but on the other hand YAGNI, if you feel strongly, I'm ok
> > > including it.  However, the more optional fields we have for a specific
> > > enum value, makes me lean more towards a new type instead of just an
> > > enum.
> > >
> > I'm okay with skipping for now. Appreciate the focus on only what we
> > actually need.
> >
> >
> >
> > > Could you elaborate on defining standard arithmetic conversions between
> > > time-delta/duration in seconds and other time units (days, months,
> > > years) as part of the standard/format? I'm still not sure I understand
> > > what the use-case is here.
> > >
> >
> > Here goes nothing...
> >
> > Seems like there are two options for durations:
> > 1) they aren't related to any other type
> > 2) they have a relationship to timestamps and dates.
> >
> > If 1, then the only thing I could understand is real-world duration as
> > seconds are defined (and fractions thereof). E.g. [1] :D. In this
> > situation, there is no way to express any unit of time coarser
> > than a second (e.g. days) since it is up to the application
> > implementer to define the relationship. This severely limits the
> > expressiveness of the concept (I can't ever use something like
> > TimeUnit.DAYS) and stops the ability to cover the existing interval
> > YEAR_MONTH type, I believe (since it has a resolution of months).
> >
> > If 2, then we must define the canonical value of ts + duration, otherwise
> > durations are somewhat meaningless, thus the proposed translation chart
> > (which causes its own oddities depending on the resolution of the time
> > type you are adding to).
> >
> > That being said, having started to remember previous discussions on this,
> > I'm most inclined to simply pick #1 and ignore the need for anything
> > more.
> > The curiousness of interval math in database systems underscores the fact
> > that it apparently doesn't matter that much. In most cases, today + 3
> > months is close enough to today + 90 days for government work.
> >
> > Let's +2 a patch and get it merged quickly so we never have to think
> > about this again :)
> >
> > [1]  "the duration of 9,192,631,770 periods
> > <https://en.wikipedia.org/wiki/Frequency> of the radiation
> > corresponding to the transition between the two hyperfine levels
> > <https://en.wikipedia.org/wiki/Hyperfine_structure> of the ground state
> > of the caesium-133 <https://en.wikipedia.org/wiki/Caesium-133> atom" (at a
> > temperature of 0 K <https://en.wikipedia.org/wiki/Absolute_zero>)
> >
> > >
>
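
For reference, a small sketch of the quoted "today + 3 months" versus "today + 90 days" comparison, using only the Python standard library. The `add_months` helper is hypothetical, and its end-of-month clamping rule is just one possible policy, which is exactly the kind of choice a calendar interval leaves to the application.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Add calendar months, clamping the day to the target month's length.

    Clamping is an assumed policy for this sketch; other systems may
    roll over into the next month or raise an error instead.
    """
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


today = date(2019, 4, 3)
by_months = add_months(today, 3)          # calendar interval (YEAR_MONTH)
by_days = today + timedelta(days=90)      # fixed-length duration
gap = by_months - by_days                 # how far apart the two results are
```

For this particular date the two results differ by a single day, and the size of the gap depends on which months are crossed.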
