drill-dev mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Order of records read in a parquet file
Date Sat, 07 Nov 2015 00:30:17 GMT

Jason,

I missed that. Let me check whether we are dropping any records. I would be
surprised if our regression tests missed that :)

- Rahul

On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> Rahul,
>
> Thanks for working on a reproduction of the issue. You didn't actually
> answer my first question: are you getting the same data out of the file,
> just in a different order? It seems much more likely that we are dropping
> some records at the beginning than reordering them somehow, although I
> would have expected an error like this to be caught by the unit or
> regression tests.
>
> Thanks,
> Jason
>
> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Thanks for your replies. The file is private and I will try to construct a
> > file without sensitive data which can expose this behavior.
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <altekrusejason@gmail.com>
> > wrote:
> >
> > > Is this a large or private parquet file? Can you share it to allow me to
> > > debug the read path for it?
> > >
> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <altekrusejason@gmail.com>
> > > wrote:
> > >
> > > > The changes to parquet were not supposed to be functional at all. We had
> > > > been maintaining our fork of parquet-mr to have a ByteBuffer-based read and
> > > > write path to reduce heap memory usage. The work done was just getting
> > > > these changes merged back into parquet-mr and making corresponding changes
> > > > in Drill to accommodate any interface modifications introduced since we
> > > > last rebased (these were mostly just package renames). There were a lot of
> > > > comments on the PR, and a decent amount of refactoring that was done to
> > > > consolidate and otherwise clean up the code, but there shouldn't have been
> > > > any changes to the behavior of the reader or writer.
> > > >
> > > > Are you getting all of the same data out if you read the whole file, just
> > > > in a different order?
> > > >
> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > > > challapallirahul@gmail.com> wrote:
> > > >
> > > >> parquet-meta command suggests that there is only one row group
> > > >>
> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <jacques@dremio.com>
> > > >> wrote:
> > > >>
> > > >> > How many row groups?
> > > >> >
> > > >> > --
> > > >> > Jacques Nadeau
> > > >> > CTO and Co-Founder, Dremio
> > > >> >
> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > > >> > challapallirahul@gmail.com> wrote:
> > > >> >
> > > >> > > Drillers,
> > > >> > >
> > > > >> > > With the new parquet library update, can someone throw some light on
> > > > >> > > the order in which the records are read from a single parquet file?
> > > > >> > >
> > > > >> > > With the older library, when I run the below query on a single parquet
> > > > >> > > file, I used to get a set of records. Now after the parquet library
> > > > >> > > update, I am seeing a different set of records. Just wanted to
> > > > >> > > understand what specifically has changed.
> > > >> > >
> > > >> > > select * from `file.parquet` limit 5;
> > > >> > >
> > > >> > > - Rahul
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
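
The question raised in this thread (is the file returning the same records in a
different order, or are some records being dropped?) can be checked by
comparing the full output of both library versions as multisets rather than
looking at a LIMIT 5 sample, since a query without ORDER BY does not promise
which rows come back first. The following is only a minimal Python sketch of
that comparison, not anything taken from Drill or its test suite. It assumes
the rows from each run have already been exported as lists of tuples; the
function name compare_result_sets and the sample rows are hypothetical.

```python
from collections import Counter


def compare_result_sets(before, after):
    """Compare two lists of rows as multisets, ignoring row order.

    Rows must be hashable (for example, tuples of column values).
    Duplicated rows are counted, so a row that appears twice in `before`
    and once in `after` is reported as missing once.
    """
    before_counts = Counter(before)
    after_counts = Counter(after)
    return {
        # True only if both runs returned exactly the same rows,
        # with the same multiplicities, in any order.
        "same_records": before_counts == after_counts,
        # Rows present in the old output but absent from the new one.
        "missing": sorted((before_counts - after_counts).elements()),
        # Rows present in the new output but absent from the old one.
        "extra": sorted((after_counts - before_counts).elements()),
    }


if __name__ == "__main__":
    # Hypothetical rows standing in for the old and new reader output.
    old_rows = [(1, "a"), (2, "b"), (3, "c")]
    reordered_rows = [(3, "c"), (1, "a"), (2, "b")]
    dropped_rows = [(2, "b"), (3, "c")]

    print(compare_result_sets(old_rows, reordered_rows))
    print(compare_result_sets(old_rows, dropped_rows))
```

If same_records is true, the two runs differ only in row order; a non-empty
missing list would instead point at records dropped by the new read path.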
