arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [C++] Failing constructors and internal state
Date Sun, 10 Mar 2019 21:01:15 GMT
I agree there should always be a path to avoid the validation but I think
there should also be an easy way to have validation included and a clear
way to tell the difference.  IMO, having strong naming convention so
callers can tell the difference, and code reviewers can focus more on less
safe method usage, is important.  I will help new-comers to the project
write safer code.  Which can either be refactored or called out in code
review for performance issues.  It also provides a cue for all developers
to consider if they are meeting the necessary requirements when using less
safe methods.

A straw-man proposal for naming conventions:
- Constructors are always unvalidated (should still have appropriate
DCHECKs)
- "Make" calls are always unvalidated (should still have appropriate
DHCECKs)
- "MakeValidated" ensures proper structural validation occur (but not data
is validation).
- "MakeSanitized" ensures proper structural and data is validations occur

As noted above it might only pay to refactor a small amount of current
usage to the safer APIs.

We could potentially go even further down the rabbit hole and try to define
standard for a Hungarian notation [1] to make it more obvious what
invariants are expected for a particular data-structure variable (I'm
actually -.5 on this).

As a personal bias, I would rather have slower code that has lower risk of
crashing in production than faster code that does.  Obviously, there is a
tradeoff here, and the ideal is faster code that won't segfault.

Thoughts?

-Micah

[1] https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/

On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi folks,
>
> I think some issues are being conflated here, so let me try to dig
> through them. Let's first look at the two cited bugs that were fixed,
> if I have this right:
>
> * ARROW-4766: root cause dereferencing a null pointer
> * ARROW-4774: root cause unsanitized Python user input
>
> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
> since we currently allow for null buffers (the object, not the pointer
> inside) in ArrayData. This has been discussed on the mailing list in
> the past; "sanitizing" ArrayData to be free of null objects would be
> expensive and my general attitude in the C++ library is that we should
> not be in the business of paying extra CPU cycles for the 1-5% case
> when it is unneeded in the 95-99% of cases. We have DCHECK assertions
> to check these issues in debug builds while avoiding the costs in
> release builds. In the case of checking edge cases in computational
> kernels, suffice to say that we should check every kernel on length-0
> input with null buffers to make sure this case is properly handled
>
> In the case of ARROW-4774, we should work at the language binding
> interface to make sure we have convenient "validating" constructors
> that check user input for common problems. This can prevent the
> duplication of this code across the binding layers (GLib, Python, R,
> MATLAB, etc.)
>
>  On the specific 4 steps mentioned by Francois, here are my thoughts:
>
> 1. Having StatusOr would be useful, but this is a programming convenience
>
> 2. There are a couple purposes of the static factory methods that
> exist now, like Table::Make and RecordBatch::Make. One of the reasons
> that I added them initially was because of the implicit constructor
> behavior of std::vector inside a call to std::make_shared. If you have
> a std::vector<T> argument in a class's constructor, then
> std::make_shared<Klass>(..., {...}) will not result in the initializer
> list constructing the std::vector<T>. This means some awkwardness like
> having to assign a std::vector<T> lvalue and _then_ pass that to
> std::make_shared<Klass>(..., vector_arg, ...).
>
> I do not agree with refactoring these methods to use "validating"
> constructors. Users of these C++ APIs should know what their
> requirements are, and we provide in some cases a Validate() to spend
> the extra cycles to assert preconditions. Therefore:
>
> 3. -1 on this
> 4. -1 also
>
> Thanks
> Wes
>
> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
> >
> > HI Fran├žois,
> > This sounds great.  I would hope that as part of this we document the
> > invariants (and possible sharp edges like zero length/all null and no
> null
> > Arrays).
> >
> > Is your intent to allow languages binding to the C++ library go through
> the
> > new API or will they still be able to use the "dangerous" ones?
> >
> > -Micah
> >
> > On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
> > fsaintjacques@gmail.com> wrote:
> >
> > > Greetings,
> > >
> > > I noted that the current C++ API permits constructing core objects
> breaking
> > > said classes invariants. The following recent issues were affected by
> this:
> > >
> > > - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
> > > - ARROW-4774: segfault due to invalid Table with columns of different
> > > length
> > >
> > > Consumers often assumes that the objects respect the invariants, e.g.
> by
> > > dereferencing `array_data->buffers[i]->data()` or
> > > `Array::GetValue(table.n_rows - 1)` .
> > >
> > > Sample of classes which requires invariant in the constructor:
> > > - ArrayData/Array: number and size of buffers depending on type
> > > - ChunkedArray: Arrays of same type
> > > - Column: same as ChunkedArray and Field must match array's type
> > > - RecordBatch: number of columns and schema must match, columns of same
> > > size
> > > - Table: columns must be of same size
> > >
> > > Some classes provide static factory methods, notably:
> > > - Most `shared_ptr<T> *::Make` methods, but they lack the Status return
> > > type to indicate
> > >   failure, the consumer can at least check for nullptr
> > > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good
> > > candidate to follow
> > >
> > > I suspect that mis-usage is only going to grow with the number of
> users and
> > > language that binds to C++. I would like to propose a plan to tackle
> for
> > > the
> > > 0.14.0 release
> > >
> > > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by
> minimizing
> > >    output parameters.
> > > 2. Refactor normalized factory methods for each core object (ArrayData,
> > >    ChunkedArray, Column, RecordBatch, Table)
> > >     - Common naming, I suggest we stay with `Make`.
> > >     - Common return type, `StatusOr<shared_ptr<T>>`
> > > 3. Refactor existing Make methods to use new methods but preserve
> original
> > >    signature by losing error message, on top of marking them
> deprecated.
> > > 4. Mark non-validating constructors as deprecated and ideailly make
> every
> > > "dangerous" constructor non-public.
> > >
> > > We'd give 1-2 release for consumers to stop using the deprecated
> > > methods/constructors.
> > >
> > > Fran├žois
> > >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message