drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Claude Cote <jcc...@gmail.com>
Subject Re: REPEATED_CONTAINS
Date Wed, 30 Mar 2016 21:01:04 GMT
Hey Jacques,

I have done some rudimentary tests and realized my use case is a bit more
complicated then stated above.

{"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
{"name":"flour","cal":300} ] }

SELECT a,b,c,d,e,f,g, max(suggar_val) FROM (
SELECT a,b,c,d,e,f,g, case when fill['name'] = 'sugar' then fill['cal']
else null as suggar_val FROM (
SELECT a,b,c,d,e,f,g, FLATTEN(t.fillings) AS fill FROM
dfs.flatten.`test.json` t
)
)
group by a,b,c,d,e,f,g

Usng this technique I can "pull" the values that interest me out of the
repeated field which results in an easy to work with schema
a,b,c,d,e,f,g,sugar,flour,butter

However as you pointed out colapsing those rows back from the flattened
might be costly. For example the group by will need to check that all the
a,b,c,d,e,f,g fields are the same. Some of these fields are quite large
VARCHARs so it would need to check every byte of the VARCHAR?

I've done some testing with the above technique and performance was so so.
I'll also try my hands at using the FieldReader so I have something to
compare with.

Cheers


On Wed, 30 Mar 2016 at 00:23 Jacques Nadeau <jacques@dremio.com> wrote:

> I think the best answer is to test it and share your findings.
> Hypothesizing about performance in complicated systems is also suspect :)
>
> That said, I'll make a guess...
>
> In general, I would expect the flatten to be faster in your example since a
> flatten without a cartesian is trivial operation and can be done in
> vectorized fashion because of the shape of how data is held in memory. This
> is different than how complex UDFs are written today (using the FieldReader
> model). These UDFs are object-based execution, record by record. So,
> vectorized and full runtime code generation
>
> That being said, if you changed your code to be something more like [select
> a,b,c,d,e,f,g, flatten(t.fillings) as fill], you might see the two be
> closer together. This is because this would then require a cartesian copy
> of all the fields abcdefg, which then have to be filtered out. In this
> case, the extra cost of the copies might be more expensive than the object
> overhead required for traversing the complex object structure.
>
> In general, start with the methodology that works. If we don't see the
> performance to satisfy your usecase, we can see if we can suggest some
> things. (For example, supporting operation pushdowns that push through
> FLATTEN would probably be very helpful.)
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Mar 29, 2016 at 6:37 PM, Jean-Claude Cote <jccote@gmail.com>
> wrote:
>
> > I've noticed drill offers a REPEATED_CONTAINS which can be applied to
> > fields which are arrays.
> >
> > https://drill.apache.org/docs/repeated-contains/
> >
> > I have a schema stored in parquet files which contain a repeated field
> > containing a key and a value. However such structures can't be queried
> > using the REPEATED_CONTAINS. I was thinking of writing a user defined
> > function to look through it.
> >
> > My question is: is it worth it? Will it be faster than doing this?
> >
> > {"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
> > {"name":"flour","cal":300} ] }
> >
> > SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
> > dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';
> >
> > Specifically what's the cost of using FLATTEN compared to iterating over
> > the array right in a UDF?
> >
> > Thanks
> > Jean-Claude
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message