drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: Order of records read in a parquet file
Date Sat, 07 Nov 2015 01:52:26 GMT
I wouldn't jump to that conclusion. Sqlline uses toString. If we changed
the toString behavior, it could be a problem. Maybe do a ctas to a json
file to confirm.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Nov 6, 2015 at 5:40 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> From a previous build, I got the data for these columns just fine from
> sqlline. So I think we can eliminate any display issues unless I am missing
> something?
>
> - Rahul
>
> On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>
> > Can you confirm if this is a display bug in sqlline or jdbc to string
> > versus an actual data return?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Jason,
> > >
> > > You were partly correct. We are not dropping records however we are
> > > corrupting dictionary encoded binary columns. I got confused that we are
> > > returning different records, but we are trimming (or returning unreadable
> > > chars) some columns which are binary. I was able to reproduce with the
> > > lineitem data set. I will raise a jira and I think this should be treated
> > > critical. Thoughts?
> > >
> > > - Rahul
> > >
> > > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> > > challapallirahul@gmail.com> wrote:
> > >
> > > > Jason,
> > > >
> > > > I missed that. Let me check whether we are dropping any records. I would
> > > > be surprised if our regression tests missed that :)
> > > >
> > > > - Rahul
> > > >
> > > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <
> > > > altekrusejason@gmail.com> wrote:
> > > >
> > > >> Rahul,
> > > >>
> > > >> Thanks for working on a reproduction of the issue. You didn't actually
> > > >> answer my first question, are you getting the same data out of the file,
> > > >> just in a different order? It seems much more likely that we are dropping
> > > >> some records at the beginning than reordering them somehow, although I
> > > >> would have expected an error like this to be caught by the unit or
> > > >> regression tests.
> > > >>
> > > >> Thanks,
> > > >> Jason
> > > >>
> > > >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> > > >> challapallirahul@gmail.com> wrote:
> > > >>
> > > >> > Thanks for your replies. The file is private and I will try to construct a
> > > >> > file without sensitive data which can expose this behavior.
> > > >> >
> > > >> > - Rahul
> > > >> >
> > > >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
> > > >> > altekrusejason@gmail.com> wrote:
> > > >> >
> > > >> > > Is this a large or private parquet file? Can you share it to allow me to
> > > >> > > debug the read path for it?
> > > >> > >
> > > >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
> > > >> > > altekrusejason@gmail.com> wrote:
> > > >> > >
> > > >> > > > The changes to parquet were not supposed to be functional at all. We
> > > >> > > > had been maintaining our fork of parquet-mr to have a ByteBuffer based
> > > >> > > > read and write path to reduce heap memory usage. The work done was just
> > > >> > > > getting these changes merged back into parquet-mr and making
> > > >> > > > corresponding changes in Drill to accommodate any interface
> > > >> > > > modifications introduced since we last rebased (there were mostly just
> > > >> > > > package renames). There were a lot of comments on the PR, and a decent
> > > >> > > > amount of refactoring that was done to consolidate and otherwise clean
> > > >> > > > up the code, but there shouldn't have been any changes to the behavior
> > > >> > > > of the reader or writer.
> > > >> > > >
> > > >> > > > Are you getting all of the same data out if you read the whole file,
> > > >> > > > just in a different order?
> > > >> > > >
> > > >> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > > >> > > > challapallirahul@gmail.com> wrote:
> > > >> > > >
> > > >> > > >> parquet-meta command suggests that there is only one row group
> > > >> > > >>
> > > >> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <jacques@dremio.com>
> > > >> > > >> wrote:
> > > >> > > >>
> > > >> > > >> > How many row groups?
> > > >> > > >> >
> > > >> > > >> > --
> > > >> > > >> > Jacques Nadeau
> > > >> > > >> > CTO and Co-Founder, Dremio
> > > >> > > >> >
> > > >> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > > >> > > >> > challapallirahul@gmail.com> wrote:
> > > >> > > >> >
> > > >> > > >> > > Drillers,
> > > >> > > >> > >
> > > >> > > >> > > With the new parquet library update, can someone throw some light on
> > > >> > > >> > > the order in which the records are read from a single parquet file?
> > > >> > > >> > >
> > > >> > > >> > > With the older library, when I run the below query on a single parquet
> > > >> > > >> > > file, I used to get a set of records. Now after the parquet library
> > > >> > > >> > > update, I am seeing a different set of records. Just wanted to
> > > >> > > >> > > understand what specifically has changed.
> > > >> > > >> > >
> > > >> > > >> > > select * from `file.parquet` limit 5;
> > > >> > > >> > >
> > > >> > > >> > > - Rahul
> > > >> > > >> > >
> > > >> > > >> >
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
