drill-dev mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Fwd: Reading Avro Arrays
Date Wed, 13 Apr 2016 12:37:59 GMT
And again with the right dev mailing list...

---------- Forwarded message ----------
From: Johannes Schulte <johannes.schulte@gmail.com>
Date: Wed, Apr 13, 2016 at 2:21 PM
Subject: Fwd: Reading Avro Arrays
To: drill-dev@apache.org


Hi!

The pull request linked below fixes a problem with FLATTEN on arrays of
nested Avro records. Please see the posts from the user list and the issue
https://issues.apache.org/jira/browse/DRILL-4574 for details.

I would love to get some feedback!

Johannes

https://github.com/apache/drill/pull/459



---------- Forwarded message ----------
From: Johannes Schulte <johannes.schulte@gmail.com>
Date: Tue, Apr 12, 2016 at 11:33 PM
Subject: Re: Reading Avro Arrays
To: user@drill.apache.org


After some evenings of digging into the code, I more or less had a lucky
moment and was able to fix the problem. I am really surprised that nobody
else has run into this issue until now; for me it was a blocker to Drill
adoption. I hope that somebody with more knowledge of the codebase can
review this and integrate it soon.
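
For reference, here is a minimal sketch of what FLATTEN is expected to do with
the sample data quoted further down in this thread. It is a plain Python model
of the semantics, not Drill code; the names `flatten` and `records` are made up
for illustration.

```python
# Illustrative model of FLATTEN semantics only; this is not Drill code.
def flatten(rows, column):
    """Yield one output row per element of the array stored in `column`."""
    for row in rows:
        for element in row[column]:
            yield element


# Two records shaped like the sample data quoted in this thread.
records = [
    {"elements": [{"field1": i} for i in range(10)]},
    {"elements": [{"field1": i} for i in range(10)]},
]

flattened = list(flatten(records, "elements"))
print(len(flattened))
```

With two input records of ten elements each, one row per array element is
expected, whereas the query quoted below returned only two rows.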


On Sun, Apr 3, 2016 at 11:29 AM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Alright, thanks! I created a pull request and am very open to any input.
>
> https://github.com/apache/drill/pull/459
>
> Cheers,
>
> Johannes
>
> On Sun, Apr 3, 2016 at 9:10 AM, Abdel Hakim Deneche <adeneche@maprtech.com>
> wrote:
>
>> Pull requests are fine. You still need a JIRA, though.
>>
>> On Sun, Apr 3, 2016 at 8:03 AM, Johannes Schulte <
>> johannes.schulte@gmail.com> wrote:
>>
>> > I have now extended the AvroFormatTest suite with two unit tests that
>> > show that
>> >
>> > * Flattening of primitive arrays works as expected
>> > * Flattening of arrays of records does not work properly
>> >
>> > I spent some time trying to find the reason, but this is my first contact
>> > with the Drill codebase.
>> >
>> > Is the recommended way of making these unit tests available still to
>> > attach a patch to an issue, or is a pull request also an option?
>> >
>> > In the context of the recent Avro maturity discussion I would love to fix
>> > this error myself, but I would need some hints about what goes wrong there
>> > internally.
>> >
>> > Johannes
>> >
>> > On Fri, Mar 25, 2016 at 10:50 PM, Johannes Schulte <
>> > johannes.schulte@gmail.com> wrote:
>> >
>> > > Hi Stefan, hi Jacques, thanks for going after this - I almost gave up,
>> > > but now I think it was because I accessed the data over JDBC with
>> > > SQuirreL and got confused by the unknown-type column there. Nonetheless,
>> > > if the schema looks like this:
>> > >
>> > >
>> > > {
>> > >   "type" : "record",
>> > >   "name" : "MainRecord",
>> > >   "namespace" : "drizz.WriteAvroTestFileForDrill$",
>> > >   "fields" : [ {
>> > >     "name" : "elements",
>> > >     "type" : {
>> > >       "type" : "array",
>> > >       "items" : {
>> > >         "type" : "record",
>> > >         "name" : "NestedRecord",
>> > >         "fields" : [ {
>> > >           "name" : "field1",
>> > >           "type" : "int"
>> > >         } ]
>> > >       },
>> > >       "java-class" : "java.util.List"
>> > >     }
>> > >   } ]
>> > > }
>> > >
>> > > and the contents look like this (according to the avro tojson
>> > > command-line utility):
>> > >
>> > > {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
>> > >
>> > > {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
>> > >
>> > > a query like
>> > >
>> > > select flatten(elements) from
>> > > dfs.`/Users/j.schulte/data/avro-drill/no-union/`;
>> > >
>> > > yields exactly two rows:
>> > > +---------------+
>> > > |    EXPR$0     |
>> > > +---------------+
>> > > | {"field1":9}  |
>> > > | {"field1":9}  |
>> > > +---------------+
>> > >
>> > > as if only the last element of each array survived.
>> > >
>> > > Thanks for your help so far..
>> > >
>> > > On Fri, Mar 25, 2016 at 5:45 PM, Stefán Baxter <stefan@activitystream.com>
>> > > wrote:
>> > >
>> > >> Johannes, Jacques is right.
>> > >>
>> > >> I only tested the flattening of maps and not the flattening of
>> > >> list-of-maps.
>> > >>
>> > >> -Stefan
>> > >>
>> > >> On Fri, Mar 25, 2016 at 4:12 PM, Jacques Nadeau <jacques@dremio.com>
>> > >> wrote:
>> > >>
>> > >> > I think there is some incorrect information and confusion in this
>> > >> > thread. Could you please share a piece of sample data and a specific
>> > >> > query? The error message shown in your original email is suggesting
>> > >> > that you were trying to flatten a map rather than an array of maps.
>> > >> > Flatten is for arrays only. The arrays can have scalars or complex
>> > >> > objects in them.
>> > >> >
>> > >> > --
>> > >> > Jacques Nadeau
>> > >> > CTO and Co-Founder, Dremio
>> > >> >
>> > >> > On Fri, Mar 25, 2016 at 2:00 AM, Johannes Schulte <
>> > >> > johannes.schulte@gmail.com> wrote:
>> > >> >
>> > >> > > Hi Stefan,
>> > >> > >
>> > >> > > thanks for this information - so it seems that there is currently
>> > >> > > no way of accessing nested rich objects with Drill; I somehow got
>> > >> > > that wrong from the documentation...
>> > >> > >
>> > >> > > Cheers,
>> > >> > > Johannes
>> > >> > >
>> > >> > > On Thu, Mar 24, 2016 at 2:14 PM, Stefán Baxter <stefan@activitystream.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > FYI: flattening of embedded structures is not supported in
>> > >> > > > Parquet either.
>> > >> > > >
>> > >> > > > Regards,
>> > >> > > >  -Stefan
>> > >> > > >
>> > >> > > > On Wed, Mar 23, 2016 at 8:51 PM, Johannes Schulte <
>> > >> > > > johannes.schulte@gmail.com> wrote:
>> > >> > > >
>> > >> > > > > Hi Stefan,
>> > >> > > > >
>> > >> > > > > thanks for your response and the link to your UDF repository,
>> > >> > > > > it's a good reference. I tried Drill 1.6; the data is an array
>> > >> > > > > of complex objects, though. I will try to set up a Drill dev
>> > >> > > > > environment and see if I can modify the tests to fail.
>> > >> > > > >
>> > >> > > > > Johannes
>> > >> > > > >
>> > >> > > > > On Wed, Mar 23, 2016 at 8:13 PM, Stefán Baxter <stefan@activitystream.com>
>> > >> > > > > wrote:
>> > >> > > > >
>> > >> > > > > > FYI, this seems to be working in 1.6, at least on the Avro
>> > >> > > > > > data that we have.
>> > >> > > > > >
>> > >> > > > > > On Wed, Mar 23, 2016 at 6:59 PM, Stefán Baxter <stefan@activitystream.com>
>> > >> > > > > > wrote:
>> > >> > > > > >
>> > >> > > > > > > Hi again,
>> > >> > > > > > >
>> > >> > > > > > > What version of Drill are you using?
>> > >> > > > > > >
>> > >> > > > > > > Regards,
>> > >> > > > > > > - Stefán
>> > >> > > > > > >
>> > >> > > > > > > On Wed, Mar 23, 2016 at 4:49 PM, Stefán Baxter <stefan@activitystream.com>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > >> Hi Johannes,
>> > >> > > > > > >>
>> > >> > > > > > >> As great as Drill is, the Avro plugin has been a source of
>> > >> > > > > > >> frustration for us @activitystream.
>> > >> > > > > > >>
>> > >> > > > > > >> We have a small UDF library [1] (Apache licensed) which
>> > >> > > > > > >> contains a function that can return an array (List<String>)
>> > >> > > > > > >> from Avro as a CSV list.
>> > >> > > > > > >>
>> > >> > > > > > >> You could use that to roll your own, or provide me with a
>> > >> > > > > > >> small sample and I can create a custom flatten function for
>> > >> > > > > > >> you.
>> > >> > > > > > >>
>> > >> > > > > > >> The best would be to wait for a fix, but this can potentially
>> > >> > > > > > >> get you out of a rough spot.
>> > >> > > > > > >>
>> > >> > > > > > >> [1] https://github.com/activitystream/asdrill
>> > >> > > > > > >>
>> > >> > > > > > >> Regards,
>> > >> > > > > > >>  -Stefán
>> > >> > > > > > >>
>> > >> > > > > > >> On Wed, Mar 23, 2016 at 9:05 AM, Johannes Schulte <
>> > >> > > > > > >> johannes.schulte@gmail.com> wrote:
>> > >> > > > > > >>
>> > >> > > > > > >>> Hi,
>> > >> > > > > > >>>
>> > >> > > > > > >>> when trying to read simple Avro arrays with select
>> > >> > > > > > >>> flatten(array) from dfs... I get the exception
>> > >> > > > > > >>>
>> > >> > > > > > >>> SQL Query Error: SYSTEM ERROR: ClassCastException: Cannot cast
>> > >> > > > > > >>> org.apache.drill.exec.vector.complex.MapVector to
>> > >> > > > > > >>> org.apache.drill.exec.vector.complex.RepeatedValueVector
>> > >> > > > > > >>> ^
>> > >> > > > > > >>>
>> > >> > > > > > >>> The type of the array is said to be <UnknownType (2,002)>
>> > >> > > > > > >>>
>> > >> > > > > > >>> Is this the expected behaviour? The documentation mostly
>> > >> > > > > > >>> talks about JSON and Parquet complex types and I wonder if
>> > >> > > > > > >>> the Avro storage plugin behaves differently.
>> > >> > > > > > >>>
>> > >> > > > > > >>> Thanks,
>> > >> > > > > > >>>
>> > >> > > > > > >>> Johannes
>> > >> > > > > > >>>
>> > >> > > > > > >>
>> > >> > > > > > >>
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>>
>> Abdelhakim Deneche
>>
>> Software Engineer
>>
>
>
