drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: Order of records read in a parquet file
Date Sat, 07 Nov 2015 01:34:41 GMT
Can you confirm if this is a display bug in sqlline or jdbc to string
versus an actual data return?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Jason,
>
> You were partly correct. We are not dropping records however we are
> corrupting dictionary encoded binary columns. I got confused that we are
> returning different records, but we are trimming (or returning unreadable
> chars) some columns which are binary. I was able to reproduce with the
> lineitem data set. I will raise a jira and I think this should be treated
> critical. Thoughts?
>
> - Rahul
>
> On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Jason,
> >
> > I missed that. Let me check whether we are dropping any records. I would
> > be surprised if our regression tests missed that :)
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <altekrusejason@gmail.com>
> > wrote:
> >
> >> Rahul,
> >>
> >> Thanks for working on a reproduction of the issue. You didn't actually
> >> answer my first question, are you getting the same data out of the file,
> >> just in a different order? It seems much more likely that we are dropping
> >> some records at the beginning than reordering them somehow, although I
> >> would have expected an error like this to be caught by the unit or
> >> regression tests.
> >>
> >> Thanks,
> >> Jason
> >>
> >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> >> challapallirahul@gmail.com> wrote:
> >>
> >> > Thanks for your replies. The file is private and I will try to construct
> >> > a file without sensitive data which can expose this behavior.
> >> >
> >> > - Rahul
> >> >
> >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <altekrusejason@gmail.com>
> >> > wrote:
> >> >
> >> > > Is this a large or private parquet file? Can you share it to allow me
> >> > > to debug the read path for it?
> >> > >
> >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <altekrusejason@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > The changes to parquet were not supposed to be functional at all. We
> >> > > > had been maintaining our fork of parquet-mr to have a ByteBuffer based
> >> > > > read and write path to reduce heap memory usage. The work done was just
> >> > > > getting these changes merged back into parquet-mr and making
> >> > > > corresponding changes in Drill to accommodate any interface
> >> > > > modifications introduced since we last rebased (there were mostly just
> >> > > > package renames). There were a lot of comments on the PR, and a decent
> >> > > > amount of refactoring that was done to consolidate and otherwise clean
> >> > > > up the code, but there shouldn't have been any changes to the behavior
> >> > > > of the reader or writer.
> >> > > >
> >> > > > Are you getting all of the same data out if you read the whole file,
> >> > > > just in a different order?
> >> > > >
> >> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> >> > > > challapallirahul@gmail.com> wrote:
> >> > > >
> >> > > >> parquet-meta command suggests that there is only one row group
> >> > > >>
> >> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <jacques@dremio.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> > How many row groups?
> >> > > >> >
> >> > > >> > --
> >> > > >> > Jacques Nadeau
> >> > > >> > CTO and Co-Founder, Dremio
> >> > > >> >
> >> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> >> > > >> > challapallirahul@gmail.com> wrote:
> >> > > >> >
> >> > > >> > > Drillers,
> >> > > >> > >
> >> > > >> > > With the new parquet library update, can someone throw some
> >> > > >> > > light on the order in which the records are read from a single
> >> > > >> > > parquet file?
> >> > > >> > >
> >> > > >> > > With the older library, when I run the below query on a single
> >> > > >> > > parquet file, I used to get a set of records. Now after the
> >> > > >> > > parquet library update, I am seeing a different set of records.
> >> > > >> > > Just wanted to understand what specifically has changed.
> >> > > >> > >
> >> > > >> > > select * from `file.parquet` limit 5;
> >> > > >> > >
> >> > > >> > > - Rahul
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
