avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Feature for Date/Time Data Types in Avro?
Date Tue, 18 Jan 2011 19:49:49 GMT
We should get this discussion into JIRA soon.

On 1/18/11 10:38 AM, "Ron Bodkin" <rbodkin@thinkbiganalytics.com> wrote:

>Overall, yes. A couple of points worth addressing in a design:
>1) Do we want to allow encoding time zone data in the records? Storing a
>raw timestamp is sometimes not ideal. It's worth looking at how SQL allows
>timestamps with and without time zones. Is that simpler, or is it actually
>more complex?

It is generally 100000x simpler to serialize only in UTC and let libraries
support what they support with respect to time zones.  I have painful
memories of past design mistakes here.
SQL does a lot of time zone work because it supports user input and output
formatting.  On the back end, most databases store timestamps in only a
limited way.

>2) Do we want to allow dates (for storing a day, without a timestamp)?
Days introduce time zone complexity if you want to find out what day a
timestamp falls on.
So if we support day, or hour, that is a significant increase in
complexity.  Furthermore, the time zone may not even be the same for every
row.
We could leave that up to the user and support a day type that is merely
the number of days since some origin point, leaving the time zone
interpretation (and thus the conversion from 'datetime' to 'day') in the
user's hands, perhaps with metadata support.
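
To make the day problem concrete, here is a minimal Python sketch. It
assumes a timestamp stored as milliseconds since 1970 in UTC and a fixed
UTC offset in whole hours; the names day_number and utc_offset_hours are
made up for illustration and are not part of any Avro API. The same
instant lands on different day numbers depending on the offset used to
interpret it:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_number(millis_utc: int, utc_offset_hours: int) -> int:
    """Days since 1970-01-01 for a UTC timestamp, viewed at a fixed UTC offset.

    Hypothetical helper: real time zones also need DST rules, which a
    fixed offset ignores.
    """
    shifted = EPOCH + timedelta(milliseconds=millis_utc, hours=utc_offset_hours)
    return (shifted - EPOCH).days

# The instant this message was sent: 2011-01-18 19:49:49 UTC.
sent = datetime(2011, 1, 18, 19, 49, 49, tzinfo=timezone.utc)
ts = (sent - EPOCH) // timedelta(milliseconds=1)

print(day_number(ts, 0))  # 14992 (still 18 Jan in UTC)
print(day_number(ts, 9))  # 14993 (already 19 Jan at UTC+9)
```

A day type that is just a count of days avoids baking either answer into
the serialized data.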

>3) It would be nice to allow some flexibility in the implementation
>classes for dates, e.g., letting Java users use Joda time classes as well
>as java.util.Date

Absolutely.  This is a per-language feature though, so it may not require
much of the spec.  For example, in Java it could simply be a configuration
parameter passed to the DatumReader/Writers.  It doesn't make a lot of
sense to store metadata on the data that says "this is a Joda object, not
java.util.Date" -- that is a user choice and not intrinsic to describing
the data.
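
As a rough sketch of what "a configuration parameter passed to the
reader" could look like (in Python rather than Java, and with entirely
hypothetical names; TimestampReader is not an Avro class), the choice of
native type lives in the reader, not in the data:

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_datetime(millis: int) -> datetime:
    """Map a serialized long (assumed: milliseconds since 1970 UTC) to a datetime."""
    return EPOCH + timedelta(milliseconds=millis)

def to_raw(millis: int) -> int:
    """Leave the serialized long untouched."""
    return millis

class TimestampReader:
    """Hypothetical reader: the caller chooses the native representation."""

    def __init__(self, convert: Callable[[int], Any] = to_datetime) -> None:
        self.convert = convert

    def read(self, serialized_millis: int) -> Any:
        # The serialized value is the same long either way; only the
        # in-memory type handed back to the user differs.
        return self.convert(serialized_millis)

as_datetime = TimestampReader().read(1295380189000)
as_long = TimestampReader(convert=to_raw).read(1295380189000)
```

Two users can read the same bytes into different types without the
schema or the data saying anything about it.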

There are other questions too -- what are the timestamp units
(milliseconds? configurable?), what is the origin (1970? 2010?
configurable?) -- these decisions affect the serialization size.
I have a manual serialization of timestamps that is a long, in tenths of a
second since 2008, for example.  I have another that is a duration
measured in tenths of a millisecond.  Both were done to reduce the number
of bytes per value for a specific problem domain.
Although I could use such flexibility, I'm not sure that is enough of a
motivator to put it into Avro.  I'm not very bothered by converting
from a long to a human-readable datetime myself.
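
To show how units and origin affect size, here is a small Python sketch.
Avro writes a long as a zig-zag, variable-length integer with 7 payload
bits per byte, so smaller magnitudes take fewer bytes. The helper names
are made up for illustration, and "since 2008" is assumed to mean
2008-01-01 00:00 UTC:

```python
from datetime import datetime, timedelta, timezone

def zigzag(n: int) -> int:
    """Zig-zag map a signed 64-bit value so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)

def encoded_len(n: int) -> int:
    """Bytes needed to write n as a zig-zag varint (7 payload bits per byte)."""
    z = zigzag(n)
    size = 1
    while z >= 0x80:
        z >>= 7
        size += 1
    return size

def units_since(when: datetime, origin: datetime, unit: timedelta) -> int:
    """Whole number of `unit` intervals between origin and when."""
    return (when - origin) // unit

now = datetime(2011, 1, 18, 19, 49, 49, tzinfo=timezone.utc)
origin_1970 = datetime(1970, 1, 1, tzinfo=timezone.utc)
origin_2008 = datetime(2008, 1, 1, tzinfo=timezone.utc)  # assumed origin

millis_1970 = units_since(now, origin_1970, timedelta(milliseconds=1))
tenths_2008 = units_since(now, origin_2008, timedelta(milliseconds=100))

print(encoded_len(millis_1970))  # 6 bytes
print(encoded_len(tenths_2008))  # 5 bytes
```

For this one value the saving is a single byte, which only matters when
there are a great many values.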

>Ron Bodkin
>Think Big Analytics
>m: +1 (415) 509-2895
>On 1/18/11 8:42 AM, "Doug Cutting" <cutting@apache.org> wrote:
>>The way that I have imagined doing this is to specify a standard schema
>>for dates, then implementations can optionally map this to a native date
>>type.
>>The schema could be a record containing a long, e.g.:
>>{"type": "record", "name": "org.apache.avro.lib.Date", "fields": [
>>   {"name": "time", "type": "long"}
>>  ]
>>}
>>Java could read this into a java.util.Date, Python to a datetime, etc.
>>Such conventions could be added to the Avro specification.
>>Does this sound like a reasonable approach?
>>On 01/17/2011 05:54 PM, Ron Bodkin wrote:
>>> Has anyone discussed the possibility of having built-in support for a
>>> date/time stamp data type in Avro? I think it'd be helpful, since dates
>>> and timestamps are often used as keys in processing map/reduce data (and
>>> in RPC systems). It's unpleasant to have to write code that converts
>>> longs or strings into dates or timestamps. Minimally, it would be nice
>>> to allow generating date/time stamps from long timestamps in the client
>>> APIs for the various languages and to have support for working with Dates
>>> in the Java reflection API.
>>> I'd like to get feedback from others if they'd also like to see support
>>> for date/time data types in Avro. It seems like a generally useful
>>> feature that would be worth adding with a patch.
>>> Thanks,
>>> Ron
>>> Ron Bodkin
>>> CEO
>>> Think Big Analytics
>>> m: +1 (415) 509-2895
