drill-dev mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Order of records read in a parquet file
Date Sat, 07 Nov 2015 01:40:15 GMT
From a previous build, I got the data for these columns just fine from
sqlline. So I think we can eliminate any display issues unless I am missing
something?

- Rahul

On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau <jacques@dremio.com> wrote:

> Can you confirm if this is a display bug in sqlline or jdbc to string
> versus an actual data return?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Jason,
> >
> > You were partly correct. We are not dropping records; however, we are
> > corrupting dictionary-encoded binary columns. I mistakenly thought we were
> > returning different records, but we are actually trimming some binary
> > columns (or returning unreadable characters in them). I was able to
> > reproduce this with the lineitem data set. I will raise a JIRA, and I think
> > this should be treated as critical. Thoughts?
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Jason,
> > >
> > > I missed that. Let me check whether we are dropping any records. I would
> > > be surprised if our regression tests missed that :)
> > >
> > > - Rahul
> > >
> > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <altekrusejason@gmail.com>
> > > wrote:
> > >
> > >> Rahul,
> > >>
> > >> Thanks for working on a reproduction of the issue. You didn't actually
> > >> answer my first question, are you getting the same data out of the file,
> > >> just in a different order? It seems much more likely that we are dropping
> > >> some records at the beginning than reordering them somehow, although I
> > >> would have expected an error like this to be caught by the unit or
> > >> regression tests.
> > >>
> > >> Thanks,
> > >> Jason
> > >>
> > >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> > >> challapallirahul@gmail.com> wrote:
> > >>
> > >> > Thanks for your replies. The file is private and I will try to construct a
> > >> > file without sensitive data which can expose this behavior.
> > >> >
> > >> > - Rahul
> > >> >
> > >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <altekrusejason@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Is this a large or private parquet file? Can you share it to allow me to
> > >> > > debug the read path for it?
> > >> > >
> > >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <altekrusejason@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > The changes to parquet were not supposed to be functional at all. We had
> > >> > > > been maintaining our fork of parquet-mr to have a ByteBuffer based read and
> > >> > > > write path to reduce heap memory usage. The work done was just getting
> > >> > > > these changes merged back into parquet-mr and making corresponding changes
> > >> > > > in Drill to accommodate any interface modifications introduced since we
> > >> > > > last rebased (there were mostly just package renames). There were a lot of
> > >> > > > comments on the PR, and a decent amount of refactoring that was done to
> > >> > > > consolidate and otherwise clean up the code, but there shouldn't have been
> > >> > > > any changes to the behavior of the reader or writer.
> > >> > > >
> > >> > > > Are you getting all of the same data out if you read the whole file, just
> > >> > > > in a different order?
> > >> > > >
> > >> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > >> > > > challapallirahul@gmail.com> wrote:
> > >> > > >
> > >> > > >> parquet-meta command suggests that there is only one row group
> > >> > > >>
> > >> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <jacques@dremio.com>
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> > How many row groups?
> > >> > > >> >
> > >> > > >> > --
> > >> > > >> > Jacques Nadeau
> > >> > > >> > CTO and Co-Founder, Dremio
> > >> > > >> >
> > >> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > >> > > >> > challapallirahul@gmail.com> wrote:
> > >> > > >> >
> > >> > > >> > > Drillers,
> > >> > > >> > >
> > >> > > >> > > With the new parquet library update, can someone throw some light on the
> > >> > > >> > > order in which the records are read from a single parquet file?
> > >> > > >> > >
> > >> > > >> > > With the older library, when I run the below query on a single parquet
> > >> > > >> > > file, I used to get a set of records. Now after the parquet library update,
> > >> > > >> > > I am seeing a different set of records. Just wanted to understand what
> > >> > > >> > > specifically has changed.
> > >> > > >> > >
> > >> > > >> > > select * from `file.parquet` limit 5;
> > >> > > >> > >
> > >> > > >> > > - Rahul
> > >> > > >> > >
> > >> > > >> >
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
