drill-dev mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)
Date Sun, 04 Jun 2017 11:35:30 GMT
Ok, the data is a bit sensitive.

I'll submit this when I have created a meaningful test set that I can
distribute.

- Stefán

On Sun, Jun 4, 2017 at 6:54 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Jira is always the preferable approach. Thank you.
>
> On Sat, Jun 3, 2017 at 1:38 PM, Stefán Baxter <stefan@activitystream.com>
> wrote:
>
> > Hi Rahul,
> >
> > Sure, but can I perhaps get the files to you directly?
> >
> > Regards,
> >  -Stefán
> >
> > On Sat, Jun 3, 2017 at 8:13 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Can you please raise a jira and attach the required files? I can try to
> > > reproduce it.
> > >
> > > Rahul
> > >
> > > > On Jun 3, 2017 6:19 AM, "Stefán Baxter" <stefan@activitystream.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > > I have a sample data set (a few million records) that is saved to Parquet
> > > > > in 2 ways: a simple file structure with primary types to store dimensions
> > > > > and metrics (String, Double), and one using nested maps (String,String and
> > > > > String,Double), respectively.
> > > >
> > > > Querying the data set with the simple types only:
> > > >
> > > > > select roundTimeStamp(s.occurred_at,'PT1H') as `at`,
> > > > > sum(metrics_price) as price, sum(metrics_kwh) as kwh from
> > > > > dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
> > > > > group by roundTimeStamp(s.occurred_at,'PT1H')
> > > >
> > > >
> > > > > takes: *28.442* sec. (dev. laptop x 1)
> > > >
> > > >
> > > > Same query against the nested structure:
> > > >
> > > > > select roundTimeStamp(s.occurred_at,'PT1H') as `at`,
> > > > > sum(s.metrics.price) as price, sum(s.metrics.kwh) as kwh from
> > > > > dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
> > > > > group by roundTimeStamp(s.occurred_at,'PT1H')
> > > >
> > > > takes: *719.810* sec.
> > > >
> > > > > Even counting the number of records takes very, very long if there is a
> > > > > nested structure involved (select count(*) from).
> > > > > It does not behave like this on our production servers (1.8), but I have not
> > > > > run this particular test on them (their performance has never been an
> > > > > issue).
> > > > > I have these sample files available if anyone wishes to reproduce this
> > > > > consistently.
> > > > Regards,
> > > >  -Stefán
> > > >
> > >
> >
>
