arrow-dev mailing list archives

From Andy Grove <andygrov...@gmail.com>
Subject Re: [DISCUSS] Proposal about integration test of arrow parquet reader
Date Sat, 12 Oct 2019 16:12:29 GMT
I also think that there are valid use cases for checking in binary files,
but we have to be careful not to abuse this. For example, we might want to
check in a Parquet file created by a particular version of Apache Spark to
ensure that Arrow implementations can read it successfully (hypothetical
example).

It would also be good to have a small set of Parquet files using every
possible data type that all implementations can use in their tests. I
suppose we might want one set per Arrow format version as well.
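One way to keep such a standard set honest is to check it against an explicit list of the types it must cover, so a missing case is a test failure rather than an oversight. A stdlib-only Python sketch (the type list here is an illustrative subset, not the full Arrow type system):

```python
# Illustrative subset of Arrow types a shared test corpus could cover; the
# real list would come from the Arrow format specification.
EXPECTED_TYPES = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float32", "float64", "utf8", "binary",
    "date32", "timestamp", "list", "struct",
}

def missing_coverage(corpus_types: set) -> set:
    """Return the expected types that no file in the corpus exercises."""
    return EXPECTED_TYPES - corpus_types

# A corpus covering only a few primitive types would fail the check:
gaps = missing_coverage({"bool", "int32", "int64", "float64", "utf8"})
assert "struct" in gaps and "list" in gaps
```

Each implementation's test suite could run the same check against the shared files, so gaps surface in every language at once.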

The problem we have now, in my opinion, is that we're proposing adding
files on a pretty ad-hoc basis, driven by the needs of individual
contributors in one language implementation, and this is perhaps happening
because we don't already have a good set of standard test files.

Renjie - perhaps you could comment on this. If we had these standard files
covering all data types, would that have worked for you in this instance?

Thanks,

Andy.

On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Wes,
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files.
>
>
> Can you elaborate on this? I think both data generated on the fly and
> checked-in example files are useful.
>
> The checked-in files catch regressions even when readers/writers can read
> their own data but have incorrect or undefined behavior with regard to the
> specification (for example, I would imagine checking in a file as part of
> the fix for ARROW-6844
> <https://issues.apache.org/jira/browse/ARROW-6844>).
>
> Thanks,
> Micah
>
> On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <liurenjie2008@gmail.com>
> wrote:
>
> > Thanks, Wes. Sure, I'll fix it.
> >
> > Wes McKinney <wesmckinn@gmail.com> wrote on Friday, October 11, 2019 at 6:10 AM:
> >
> > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > >
> > > Various aspects of this make me uncomfortable, so I hope they can be
> > > addressed in follow-up work.
> > >
> > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <liurenjie2008@gmail.com>
> > > wrote:
> > > >
> > > > I've created a ticket to track this:
> > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > >
> > > > For the moment, can we check in the pregenerated data to unblock the
> > > > Rust version's Arrow reader?
> > > >
> > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2008@gmail.com>
> > > wrote:
> > > >
> > > > > It would be fine in that case.
> > > > >
> > > > > Wes McKinney <wesmckinn@gmail.com> wrote on Thursday, October 10, 2019 at 12:58 PM:
> > > > >
> > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2008@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > 1. There already exists a low-level Parquet writer which can
> > > > >> > produce Parquet files, so unit tests should be fine. But a writer
> > > > >> > from Arrow to Parquet doesn't exist yet, and it may take some time
> > > > >> > to finish it.
> > > > >> > 2. In fact my data is randomly generated and it's definitely
> > > > >> > reproducible. However, I don't think it would be a good idea to
> > > > >> > randomly generate data every time we run CI, because it would be
> > > > >> > difficult to debug. For example, if PR A introduced a bug that is
> > > > >> > triggered in another PR's build, it would be confusing for
> > > > >> > contributors.
> > > > >>
> > > > >> Presumably any random data generation would use a fixed seed
> > > > >> precisely to be reproducible.
> > > > >>
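The fixed-seed point above can be sketched in a few lines (plain Python for illustration; the actual generator discussed in this thread is a Java program, and the row shape here is made up):

```python
import random

def generate_rows(seed: int, n: int) -> list:
    """Generate n reproducible pseudo-random rows from a fixed seed."""
    rng = random.Random(seed)  # local RNG, so the seed fully determines output
    return [
        {"id": i, "value": rng.randint(0, 1000), "flag": rng.random() < 0.5}
        for i in range(n)
    ]

# The same seed always yields the same corpus, so a CI failure can be
# reproduced locally by rerunning with the seed recorded by the test.
assert generate_rows(seed=42, n=100) == generate_rows(seed=42, n=100)
```

This keeps the debuggability Renjie asks for (failures are replayable from the logged seed) while still exercising varied data across test runs if the seed is rotated deliberately.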
> > > > >> > 3. I think it would be a good idea to spend effort on integration
> > > > >> > tests with Parquet because it's an important use case of Arrow. A
> > > > >> > similar approach could also be extended to other languages and
> > > > >> > other file formats (Avro, ORC).
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney
> > > > >> > <wesmckinn@gmail.com> wrote:
> > > > >> >
> > > > >> > > There are a number of issues worth discussing.
> > > > >> > >
> > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > > >> > > _writer_? It's OK to be reliant on other libraries in the short
> > > > >> > > term to produce files to test against, but that does not strike
> > > > >> > > me as a sustainable long-term plan. Fixing bugs can be a lot
> > > > >> > > more difficult than it needs to be if you can't write targeted
> > > > >> > > "endogenous" unit tests.
> > > > >> > >
> > > > >> > > 2. Reproducible data generation
> > > > >> > >
> > > > >> > > I think if you're going to test against a pre-generated corpus,
> > > > >> > > you should make sure that generating the corpus is reproducible
> > > > >> > > for other developers (i.e. with a Dockerfile), and can be
> > > > >> > > extended by adding new files or random data generation.
> > > > >> > >
> > > > >> > > I additionally would prefer generating the test corpus at test
> > > > >> > > time rather than checking in binary files. If this isn't viable
> > > > >> > > right now we can create an "arrow-rust-crutch" git repository
> > > > >> > > for you to stash binary files until some of these testing
> > > > >> > > scalability issues are addressed.
> > > > >> > >
> > > > >> > > If we're going to spend energy on Parquet integration testing
> > > > >> > > with Java, this would be a good opportunity to do the work in a
> > > > >> > > way where the C++ Parquet library can also participate (since we
> > > > >> > > ought to be doing integration tests with Java, and we can also
> > > > >> > > read JSON files to Arrow).
> > > > >> > >
> > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu
> > > > >> > > <liurenjie2008@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove
> > > > >> > > > <andygrove73@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > > I'm very interested in helping to find a solution to this
> > > > >> > > > > because we really do need integration tests for Rust to make
> > > > >> > > > > sure we're compatible with other implementations... there is
> > > > >> > > > > also the ongoing CI dockerization work that I feel is
> > > > >> > > > > related.
> > > > >> > > > >
> > > > >> > > > > I haven't looked at the current integration tests yet and
> > > > >> > > > > would appreciate some pointers on how all of this works (do
> > > > >> > > > > we have docs?) or where to start looking.
> > > > >> > > > >
> > > > >> > > > I have a test in my latest PR:
> > > > >> > > > https://github.com/apache/arrow/pull/5523
> > > > >> > > > And here is the generated data:
> > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > >> > > > As for the program that generates this data, it's just a
> > > > >> > > > simple Java program. I'm not sure whether we need to
> > > > >> > > > integrate it into Arrow.
> > > > >> > > >
> > > > >> > > > >
> > > > >> > > > > I imagine the integration test could follow the approach
> > > > >> > > > > that Renjie is outlining, where we call Java to generate
> > > > >> > > > > some files and then call Rust to parse them?
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Andy.
> > > > >> > > > >
> > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu
> > > > >> > > > > <liurenjie2008@gmail.com> wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi:
> > > > >> > > > > >
> > > > >> > > > > > I'm developing the Rust version of a reader that reads
> > > > >> > > > > > Parquet into Arrow arrays. To verify the correctness of
> > > > >> > > > > > this reader, I use the following approach:
> > > > >> > > > > >
> > > > >> > > > > >    1. Define the schema with protobuf.
> > > > >> > > > > >    2. Generate JSON data for this schema using another
> > > > >> > > > > >    language with a more sophisticated implementation
> > > > >> > > > > >    (e.g. Java).
> > > > >> > > > > >    3. Generate Parquet data for this schema using another
> > > > >> > > > > >    language with a more sophisticated implementation
> > > > >> > > > > >    (e.g. Java).
> > > > >> > > > > >    4. Write tests that read the JSON file and the Parquet
> > > > >> > > > > >    file into memory (Arrow arrays), then compare the JSON
> > > > >> > > > > >    data with the Arrow data.
> > > > >> > > > > >
> > > > >> > > > > > I think with this method we can guarantee the correctness
> > > > >> > > > > > of the Arrow reader, because the JSON format is ubiquitous
> > > > >> > > > > > and its implementations are more stable.
> > > > >> > > > > >
> > > > >> > > > > > Any comments are appreciated.
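Step 4 of this approach, comparing the JSON reference data with what the Parquet reader produced, might look roughly like the following (Python stdlib only; `decoded_rows` is a hypothetical stand-in for the values the Rust Arrow reader would return, and the newline-delimited JSON layout is an assumption):

```python
import json

def compare_with_reference(json_path: str, decoded_rows: list) -> None:
    """Compare rows decoded from Parquet against JSON reference data.

    `decoded_rows` stands in for the data the Arrow reader produced; in the
    real test it would come from the Rust Parquet-to-Arrow reader.
    """
    with open(json_path) as f:
        expected = [json.loads(line) for line in f]  # newline-delimited JSON
    assert len(decoded_rows) == len(expected), "row count mismatch"
    for i, (got, want) in enumerate(zip(decoded_rows, expected)):
        assert got == want, f"row {i} differs: {got!r} != {want!r}"

# Tiny self-contained demo with a reference file written on the fly:
with open("reference_demo.jsonl", "w") as f:
    f.write('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
compare_with_reference("reference_demo.jsonl",
                       [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
```

The comparison deliberately reports the first differing row, which keeps debugging tractable when a regression is triggered by generated data.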
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Renjie Liu
> > > > >> > > > Software Engineer, MVAD
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Renjie Liu
> > > > >> > Software Engineer, MVAD
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
> > >
> >
>
