asterixdb-dev mailing list archives

From Ian Maxon <ima...@uci.edu>
Subject Re: The "real" ADM format
Date Thu, 16 Jun 2016 03:37:03 GMT
I think the int suffixes can be made to work; however, there is an issue
with the suffixes for floats and doubles. First, the existing grammar
doesn't handle a suffix for doubles at all, only for floats. Second, "NaN"
and "Infinity" are valid values for a double, but making those work with the
suffix doesn't seem trivial to me.
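
To make the NaN/Infinity concern concrete, here is a minimal sketch of a
suffix-aware numeric scanner. It is not the adm.grammar lexer: the function
name parse_numeric, the "f"/"d" suffix spellings, and the rule that a bare
integer becomes an int64 are assumptions made for illustration only. The
point it shows is that the special values need their own alternative in the
pattern, since they are not digit sequences.

```python
import math
import re

# Hypothetical suffix tables; the spellings are assumptions, not taken from adm.grammar.
INT_SUFFIXES = {"i8": 8, "i16": 16, "i32": 32, "i64": 64}
FLOAT_SUFFIXES = {"f": "float", "d": "double"}

# NaN and Infinity get their own alternative because they are not digit sequences.
NUMERIC = re.compile(
    r"(?P<body>[+-]?(?:NaN|Infinity|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?))"
    r"(?P<suffix>i8|i16|i32|i64|f|d)?"
)


def parse_numeric(token):
    """Return (type_name, value) for a numeric token, or raise ValueError."""
    m = NUMERIC.fullmatch(token)
    if m is None:
        raise ValueError("not a numeric literal: %r" % token)
    body, suffix = m.group("body"), m.group("suffix")
    is_integral = re.fullmatch(r"[+-]?\d+", body) is not None

    if suffix in INT_SUFFIXES:
        # Integer suffixes only make sense on plain digit sequences.
        if not is_integral:
            raise ValueError("integer suffix on non-integer: %r" % token)
        bits = INT_SUFFIXES[suffix]
        value = int(body)
        if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            raise ValueError("%r does not fit in int%d" % (token, bits))
        return ("int%d" % bits, value)

    if suffix is None and is_integral:
        # Assumption: an unsuffixed integer defaults to int64.
        return ("int64", int(body))

    # Everything else is floating point; unsuffixed values default to double.
    # Python's float() accepts "NaN" and "Infinity" directly.
    return (FLOAT_SUFFIXES.get(suffix, "double"), float(body))
```

With this shape, "12i8" and "1.5f" go through the ordinary digit path, while
"NaNd" and "-Infinityd" are matched by the dedicated alternative before the
suffix is examined.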

On Wed, Jun 15, 2016 at 3:52 PM, Ian Maxon <imaxon@uci.edu> wrote:

> I've been looking at this a bit more, and it turns out adm.grammar in
> asterix-external-data is the "real" ADM format. It is supposed to
> always accept suffixes of i8/16/32/etc after a digit sequence, but
> something must be wrong with how the grammar is being translated. It
> also appears that in some circumstances the parser can be coaxed into
> taking the output. Therefore it seems to me at this time that the real
> deficiency is in lexer-generator-maven-plugin and not elsewhere.
>
> On 6/8/16, Ian Maxon <imaxon@uci.edu> wrote:
> > I guess I don't view the round-trippability in the same way then, all it
> > means to me is that I can scan/output the data, load it, and end up with
> > the same thing, not necessarily that I can load it without specifying the
> > types and get them anyway because they're inlined to the data. I think if
> > we want that the better thing to do would be to do something like
> > mysqldump (e.g. it dumps the metadata/types as an equivalent query
> > basically). Also, if we changed the format to conflict with the existing
> > output of SocialGen we'd have issues with current experiments and
> > reproducing old results.
> >
> > On Wed, Jun 8, 2016 at 1:17 PM, Chris Hillery <chillery@hillery.land>
> > wrote:
> >
> >> I think the answer there is "round-tripability", right? ADM is meant to
> >> exactly describe the data so that it can be reloaded in the same way it
> >> was. Someone correct me if that isn't a requirement of the format...
> >>
> >> Ceej
> >> On Jun 8, 2016 9:14 AM, "Ian Maxon" <imaxon@uci.edu> wrote:
> >>
> >> > Why should the type be intermingled with the data though when it isn't
> >> > strictly necessary? For example why do I care if someone used an int64
> >> > to wrap something I know is actually a short integer, and so on. It also
> >> > kind of gets rid of the idea of ADM being a superset of JSON.
> >> >
> >> > On Tue, Jun 7, 2016 at 10:49 PM, Preston Carman <prestonc@apache.org>
> >> > wrote:
> >> >
> >> > > The interval type format has been finalized and is the same for AQL
> >> > > and ADM. Below is an example of the format:
> >> > >
> >> > > interval(date("01-01-2011"), date("02-02-2012"))
> >> > >
> >> > > The interval constructor now uses other data type constructors to
> >> > > recreate an interval. The type of interval is defined by the two
> >> > > matching arguments.
> >> > >
> >> > >
> >> > > On Tue, Jun 7, 2016 at 9:36 PM, Chris Hillery <chillery@hillery.land>
> >> > > wrote:
> >> > > > Ah, the other thing I forgot to mention is that I didn't include
> >> > > > interval types, because I'm not sure about their current status. There
> >> > > > was some discussion on the list in January (subject "Round Tripping ADM
> >> > > > Interval Data") but I'm not sure where it ended up as far as the form of
> >> > > > the constructors, and whether that was AQL or ADM or both.
> >> > > >
> >> > > > Ceej
> >> > > > aka Chris Hillery
> >> > > >
> >> > > > On Tue, Jun 7, 2016 at 9:34 PM, Chris Hillery <chillery@hillery.land>
> >> > > > wrote:
> >> > > >
> >> > > >> I started to create the current inventory of types, with the forms
> >> > > >> accepted / produced by the ADM parser, AQL parser, and ADM
> >> > > >> serialization. (I think we all agree that ADM parser and ADM
> >> > > >> serializer should be 100% compatible.) Here it is:
> >> > > >>
> >> > > >> https://docs.google.com/spreadsheets/d/1-11a9ETV1Bdh_bUm9_CszY4hEGJGbEBaVKUWrzeS-As/edit?usp=sharing
> >> > > >>
> >> > > >> I know this is not comprehensive (for instance, I'm pretty sure that a
> >> > > >> naked integer will be parsed by both ADM and AQL as an int64, so that
> >> > > >> form should be listed as an alternative) and I haven't verified that
> >> > > >> the AQL parser forms in particular are accurate, but I think it's
> >> > > >> close. I've set it so anyone can edit that document, so please fill in
> >> > > >> the gaps if you know of any.
> >> > > >>
> >> > > >> We should also fill in the exact accepted forms for the various
> >> > > >> derived types like the datetime, spatial, hex, and UUID types - e.g.,
> >> > > >> the valid forms of the double-quoted string in the duration()
> >> > > >> constructor are as specified by XML schema, and so on.
> >> > > >>
> >> > > >> Ceej
> >> > > >> aka Chris Hillery
> >> > > >>
> >> > > >> On Tue, Jun 7, 2016 at 8:53 PM, Chris Hillery <chillery@hillery.land>
> >> > > >> wrote:
> >> > > >>
> >> > > >>> If it's possible, I think it would be least confusing if the
> >> > > >>> serialized ADM format was identical to the corresponding data
> >> > > >>> constructors in AQL. It should be a goal IMHO that you can
> >> > > >>> cut-and-paste an ADM file into the query box in the web UI and the
> >> > > >>> result would be the same as loading the .adm.
> >> > > >>>
> >> > > >>> For more specifics, I think we need to write out for each data type
> >> > > >>> what the current ADM and AQL formats are, and then pick a final answer
> >> > > >>> for the type (which may possibly be different from either of the
> >> > > >>> current forms, although I suspect not). That will be the spec, and we
> >> > > >>> can update the two parsers (and all the test cases) accordingly.
> >> > > >>>
> >> > > >>> I started an email thread sometime last year about something similar;
> >> > > >>> I think it was about JSON serialization, but it at least had the AQL
> >> > > >>> side of this story for all simple types, I believe.
> >> > > >>>
> >> > > >>> Ceej
> >> > > >>> aka Chris Hillery
> >> > > >>> On Jun 7, 2016 8:17 PM, "Ian Maxon" <imaxon@uci.edu> wrote:
> >> > > >>>
> >> > > >>>> Hi all,
> >> > > >>>> After my experience with having to fix a rather large ADM file dump
> >> > > >>>> from a query to make it load back into the system I was compelled to
> >> > > >>>> try my hand at making that not happen again. The first thing I tried
> >> > > >>>> my hand at was basically what I did to make the file loadable but
> >> > > >>>> inside the type printers; just remove all of the 'i32' and so on
> >> > > >>>> suffixes, as well as making decimals not formatted in scientific
> >> > > >>>> notation. This is pretty easy to do as well, not a huge change
> >> > > >>>> code-wise (but obviously I'll have to fix all of the tests).
> >> > > >>>>
> >> > > >>>> This got me to think though, which is the format that we actually
> >> > > >>>> want? The current format that is output, or the format that we accept
> >> > > >>>> in the loader? Since this is actually perhaps a language level change
> >> > > >>>> either way I figured I should find consensus before spending more
> >> > > >>>> time on it.
> >> > > >>>>
> >> > > >>>> Thoughts/comments are appreciated.
> >> > > >>>>
> >> > > >>>> Thanks,
> >> > > >>>> - Ian
> >> > > >>>>
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> >
>
