arrow-commits mailing list archives

From w...@apache.org
Subject [13/18] arrow git commit: ARROW-838: [Python] Expand pyarrow.array to handle NumPy arrays not originating in pandas
Date Tue, 03 Oct 2017 12:59:55 GMT
ARROW-838: [Python] Expand pyarrow.array to handle NumPy arrays not originating in pandas

This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, the conversion is always zero-copy. However, when `from_pandas=True` and no null mask is passed, a null bitmap is constructed and populated.
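
The null bitmap step can be sketched in pure Python. This is an illustration only, not the C++ code in this commit: `values_to_bitmap` is a hypothetical helper that mirrors what `ValuesToBitmap` does for floating-point data, setting bit `i` (least-significant bit first within each byte) when value `i` is not NaN and counting the NaNs as nulls.

```python
import math


def values_to_bitmap(values):
    """Return (bitmap, null_count) for a sequence of floats.

    Hypothetical helper for illustration: NaN is treated as the null
    sentinel, and a set bit means "valid".
    """
    # One bit per value, rounded up to whole bytes, zero-initialized.
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, value in enumerate(values):
        if math.isnan(value):
            null_count += 1
        else:
            # Bit i lives in byte i // 8 at position i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap), null_count


bitmap, null_count = values_to_bitmap(
    [float("nan"), 1.0, 2.0, float("nan"), 4.0]
)
print(null_count, bin(bitmap[0]))
```

A scan like this touches every element, which likely accounts for most of the gap between the two timings above.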

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```
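
The narrower integer converters reject values that do not fit the target type rather than truncating them, and `None` becomes a null. A rough pure-Python sketch of that behavior follows; `coerce_ints` and `INT_BOUNDS` are hypothetical names used for illustration, and the real check lives in the C++ converters in `builtin_convert.cc`.

```python
# Hypothetical bounds table keyed by NumPy-style alias (signed types only).
INT_BOUNDS = {
    "i1": (-2**7, 2**7 - 1),
    "i2": (-2**15, 2**15 - 1),
    "i4": (-2**31, 2**31 - 1),
    "i8": (-2**63, 2**63 - 1),
}


def coerce_ints(values, alias):
    """Range-check a sequence of Python ints against a signed integer type.

    None is passed through as a null; an out-of-range value raises
    instead of being silently truncated.
    """
    lo, hi = INT_BOUNDS[alias]
    out = []
    for value in values:
        if value is None:
            out.append(None)
        elif not lo <= value <= hi:
            raise ValueError(
                "Cannot coerce values to array type that would lose data"
            )
        else:
            out.append(value)
    return out


small = coerce_ints([1, 2, None, 4], "i1")
print(small)
```

Under this sketch, `coerce_ints([128], "i1")` raises `ValueError`, since 128 exceeds the int8 maximum of 127.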

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```
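
One way such a comparison could work is to resolve the string through an alias table before comparing. The sketch below is an assumption about the general approach, not the pyarrow implementation: `DataType` here is a stand-in class, and only `'i1'` and `'i4'` are confirmed by this message; the other aliases follow NumPy's dtype string convention.

```python
# Hypothetical alias table following NumPy's "<kind><bytes>" convention.
_ALIASES = {
    "i1": "int8", "i2": "int16", "i4": "int32", "i8": "int64",
    "u1": "uint8", "u2": "uint16", "u4": "uint32", "u8": "uint64",
}


class DataType:
    """Stand-in for a type object that compares equal to its string alias."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, str):
            # Resolve an alias such as 'i4' to its full name first.
            return self.name == _ALIASES.get(other, other)
        if isinstance(other, DataType):
            return self.name == other.name
        return NotImplemented

    def __hash__(self):
        return hash(self.name)


int32 = DataType("int32")
print(int32 == "i4", int32 == "i8")
```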

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1146 from wesm/expand-py-array-method and squashes the following commits:

1570e525 [Wes McKinney] Code review comments
d3bbb3c3 [Wes McKinney] Handle type aliases in cast, too
797f0151 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc7 [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575a [Wes McKinney] Add direct types sequence converters for more data types
cf40b767 [Wes McKinney] Add type aliases, some unit tests
7b530e4b [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/ccbf6446
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/ccbf6446
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/ccbf6446

Branch: refs/heads/master
Commit: ccbf6446bccda9856f7e86f5d9ccccd80273eba2
Parents: a03e093
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Fri Sep 29 23:02:58 2017 -0500
Committer: Wes McKinney <wes.mckinney@twosigma.com>
Committed: Tue Oct 3 08:59:22 2017 -0400

----------------------------------------------------------------------
 cpp/src/arrow/python/CMakeLists.txt         |    4 +-
 cpp/src/arrow/python/api.h                  |    2 +-
 cpp/src/arrow/python/builtin_convert.cc     |  223 ++--
 cpp/src/arrow/python/numpy_to_arrow.cc      | 1228 ++++++++++++++++++++++
 cpp/src/arrow/python/numpy_to_arrow.h       |   56 +
 cpp/src/arrow/python/pandas_to_arrow.cc     | 1215 ---------------------
 cpp/src/arrow/python/pandas_to_arrow.h      |   59 --
 python/pyarrow/__init__.py                  |    2 +-
 python/pyarrow/array.pxi                    |  279 ++---
 python/pyarrow/includes/libarrow.pxd        |   11 +-
 python/pyarrow/pandas_compat.py             |   22 +-
 python/pyarrow/scalar.pxi                   |    8 +-
 python/pyarrow/table.pxi                    |   13 +-
 python/pyarrow/tests/test_array.py          |   58 +-
 python/pyarrow/tests/test_convert_pandas.py |   95 +-
 python/pyarrow/tests/test_parquet.py        |   42 +-
 python/pyarrow/tests/test_schema.py         |   50 +
 python/pyarrow/types.pxi                    |   72 +-
 18 files changed, 1841 insertions(+), 1598 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/ccbf6446/cpp/src/arrow/python/CMakeLists.txt
----------------------------------------------------------------------
diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt
index 84aad82..7938d84 100644
--- a/cpp/src/arrow/python/CMakeLists.txt
+++ b/cpp/src/arrow/python/CMakeLists.txt
@@ -57,7 +57,7 @@ set(ARROW_PYTHON_SRCS
   init.cc
   io.cc
   numpy_convert.cc
-  pandas_to_arrow.cc
+  numpy_to_arrow.cc
   python_to_arrow.cc
   pyarrow.cc
 )
@@ -100,7 +100,7 @@ install(FILES
   io.h
   numpy_convert.h
   numpy_interop.h
-  pandas_to_arrow.h
+  numpy_to_arrow.h
   python_to_arrow.h
   platform.h
   pyarrow.h

http://git-wip-us.apache.org/repos/asf/arrow/blob/ccbf6446/cpp/src/arrow/python/api.h
----------------------------------------------------------------------
diff --git a/cpp/src/arrow/python/api.h b/cpp/src/arrow/python/api.h
index 4ceb3f1..a000ac5 100644
--- a/cpp/src/arrow/python/api.h
+++ b/cpp/src/arrow/python/api.h
@@ -25,7 +25,7 @@
 #include "arrow/python/helpers.h"
 #include "arrow/python/io.h"
 #include "arrow/python/numpy_convert.h"
-#include "arrow/python/pandas_to_arrow.h"
+#include "arrow/python/numpy_to_arrow.h"
 #include "arrow/python/python_to_arrow.h"
 
 #endif  // ARROW_PYTHON_API_H

http://git-wip-us.apache.org/repos/asf/arrow/blob/ccbf6446/cpp/src/arrow/python/builtin_convert.cc
----------------------------------------------------------------------
diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc
index 747b872..f9d7361 100644
--- a/cpp/src/arrow/python/builtin_convert.cc
+++ b/cpp/src/arrow/python/builtin_convert.cc
@@ -20,6 +20,7 @@
 #include <datetime.h>
 
 #include <algorithm>
+#include <limits>
 #include <sstream>
 #include <string>
 
@@ -359,7 +360,11 @@ class TypedConverterVisitor : public TypedConverter<BuilderType> {
     if (PySequence_Check(obj)) {
       for (int64_t i = 0; i < size; ++i) {
         OwnedRef ref(PySequence_GetItem(obj, i));
-        RETURN_NOT_OK(static_cast<Derived*>(this)->AppendItem(ref));
+        if (ref.obj() == Py_None) {
+          RETURN_NOT_OK(this->typed_builder_->AppendNull());
+        } else {
+          RETURN_NOT_OK(static_cast<Derived*>(this)->AppendItem(ref));
+        }
       }
     } else if (PyObject_HasAttrString(obj, "__iter__")) {
       PyObject* iter = PyObject_GetIter(obj);
@@ -370,7 +375,11 @@ class TypedConverterVisitor : public TypedConverter<BuilderType> {
       // consuming at size.
       while ((item = PyIter_Next(iter)) && i < size) {
         OwnedRef ref(item);
-        RETURN_NOT_OK(static_cast<Derived*>(this)->AppendItem(ref));
+        if (ref.obj() == Py_None) {
+          RETURN_NOT_OK(this->typed_builder_->AppendNull());
+        } else {
+          RETURN_NOT_OK(static_cast<Derived*>(this)->AppendItem(ref));
+        }
         ++i;
       }
       if (size != i) {
@@ -388,52 +397,136 @@ class TypedConverterVisitor : public TypedConverter<BuilderType> {
 class NullConverter : public TypedConverterVisitor<NullBuilder, NullConverter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      return Status::Invalid("NullConverter: passed non-None value");
-    }
+    return Status::Invalid("NullConverter: passed non-None value");
   }
 };
 
 class BoolConverter : public TypedConverterVisitor<BooleanBuilder, BoolConverter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      if (item.obj() == Py_True) {
-        return typed_builder_->Append(true);
-      } else {
-        return typed_builder_->Append(false);
-      }
+    return typed_builder_->Append(item.obj() == Py_True);
+  }
+};
+
+class Int8Converter : public TypedConverterVisitor<Int8Builder, Int8Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    int64_t val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<int8_t>::max() ||
+                            val < std::numeric_limits<int8_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
     }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<int8_t>(val));
+  }
+};
+
+class Int16Converter : public TypedConverterVisitor<Int16Builder, Int16Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    int64_t val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<int16_t>::max() ||
+                            val < std::numeric_limits<int16_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
+    }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<int16_t>(val));
+  }
+};
+
+class Int32Converter : public TypedConverterVisitor<Int32Builder, Int32Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    int64_t val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<int32_t>::max() ||
+                            val < std::numeric_limits<int32_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
+    }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<int32_t>(val));
   }
 };
 
 class Int64Converter : public TypedConverterVisitor<Int64Builder, Int64Converter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    int64_t val;
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
-      RETURN_IF_PYERROR();
-      return typed_builder_->Append(val);
+    int64_t val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(val);
+  }
+};
+
+class UInt8Converter : public TypedConverterVisitor<UInt8Builder, UInt8Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    uint64_t val = static_cast<uint64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<uint8_t>::max() ||
+                            val < std::numeric_limits<uint8_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
     }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<uint8_t>(val));
   }
 };
 
-class DateConverter : public TypedConverterVisitor<Date64Builder, DateConverter> {
+class UInt16Converter : public TypedConverterVisitor<UInt16Builder, UInt16Converter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      PyDateTime_Date* pydate = reinterpret_cast<PyDateTime_Date*>(item.obj());
-      return typed_builder_->Append(PyDate_to_ms(pydate));
+    uint64_t val = static_cast<uint64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<uint16_t>::max() ||
+                            val < std::numeric_limits<uint16_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
     }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<uint16_t>(val));
+  }
+};
+
+class UInt32Converter : public TypedConverterVisitor<UInt32Builder, UInt32Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    uint64_t val = static_cast<uint64_t>(PyLong_AsLongLong(item.obj()));
+
+    if (ARROW_PREDICT_FALSE(val > std::numeric_limits<uint32_t>::max() ||
+                            val < std::numeric_limits<uint32_t>::min())) {
+      return Status::Invalid(
+          "Cannot coerce values to array type that would "
+          "lose data");
+    }
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(static_cast<uint32_t>(val));
+  }
+};
+
+class UInt64Converter : public TypedConverterVisitor<UInt64Builder, UInt64Converter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    int64_t val = static_cast<int64_t>(PyLong_AsLongLong(item.obj()));
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(val);
+  }
+};
+
+class DateConverter : public TypedConverterVisitor<Date64Builder, DateConverter> {
+ public:
+  inline Status AppendItem(const OwnedRef& item) {
+    auto pydate = reinterpret_cast<PyDateTime_Date*>(item.obj());
+    return typed_builder_->Append(PyDate_to_ms(pydate));
   }
 };
 
@@ -441,27 +534,17 @@ class TimestampConverter
     : public TypedConverterVisitor<Date64Builder, TimestampConverter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      PyDateTime_DateTime* pydatetime =
-          reinterpret_cast<PyDateTime_DateTime*>(item.obj());
-      return typed_builder_->Append(PyDateTime_to_us(pydatetime));
-    }
+    auto pydatetime = reinterpret_cast<PyDateTime_DateTime*>(item.obj());
+    return typed_builder_->Append(PyDateTime_to_us(pydatetime));
   }
 };
 
 class DoubleConverter : public TypedConverterVisitor<DoubleBuilder, DoubleConverter> {
  public:
   inline Status AppendItem(const OwnedRef& item) {
-    double val;
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      val = PyFloat_AsDouble(item.obj());
-      RETURN_IF_PYERROR();
-      return typed_builder_->Append(val);
-    }
+    double val = PyFloat_AsDouble(item.obj());
+    RETURN_IF_PYERROR();
+    return typed_builder_->Append(val);
   }
 };
 
@@ -473,10 +556,7 @@ class BytesConverter : public TypedConverterVisitor<BinaryBuilder, BytesConverte
     Py_ssize_t length;
     OwnedRef tmp;
 
-    if (item.obj() == Py_None) {
-      RETURN_NOT_OK(typed_builder_->AppendNull());
-      return Status::OK();
-    } else if (PyUnicode_Check(item.obj())) {
+    if (PyUnicode_Check(item.obj())) {
       tmp.reset(PyUnicode_AsUTF8String(item.obj()));
       RETURN_IF_PYERROR();
       bytes_obj = tmp.obj();
@@ -504,10 +584,7 @@ class FixedWidthBytesConverter
     Py_ssize_t expected_length =
         std::dynamic_pointer_cast<FixedSizeBinaryType>(typed_builder_->type())
             ->byte_width();
-    if (item.obj() == Py_None) {
-      RETURN_NOT_OK(typed_builder_->AppendNull());
-      return Status::OK();
-    } else if (PyUnicode_Check(item.obj())) {
+    if (PyUnicode_Check(item.obj())) {
       tmp.reset(PyUnicode_AsUTF8String(item.obj()));
       RETURN_IF_PYERROR();
       bytes_obj = tmp.obj();
@@ -535,9 +612,7 @@ class UTF8Converter : public TypedConverterVisitor<StringBuilder, UTF8Converter>
     Py_ssize_t length;
 
     PyObject* obj = item.obj();
-    if (obj == Py_None) {
-      return typed_builder_->AppendNull();
-    } else if (PyBytes_Check(obj)) {
+    if (PyBytes_Check(obj)) {
       tmp.reset(
           PyUnicode_FromStringAndSize(PyBytes_AS_STRING(obj), PyBytes_GET_SIZE(obj)));
       RETURN_IF_PYERROR();
@@ -565,14 +640,10 @@ class ListConverter : public TypedConverterVisitor<ListBuilder, ListConverter> {
   Status Init(ArrayBuilder* builder) override;
 
   inline Status AppendItem(const OwnedRef& item) override {
-    if (item.obj() == Py_None) {
-      return typed_builder_->AppendNull();
-    } else {
-      RETURN_NOT_OK(typed_builder_->Append());
-      PyObject* item_obj = item.obj();
-      int64_t list_size = static_cast<int64_t>(PySequence_Size(item_obj));
-      return value_converter_->AppendData(item_obj, list_size);
-    }
+    RETURN_NOT_OK(typed_builder_->Append());
+    PyObject* item_obj = item.obj();
+    int64_t list_size = static_cast<int64_t>(PySequence_Size(item_obj));
+    return value_converter_->AppendData(item_obj, list_size);
   }
 
  protected:
@@ -584,16 +655,12 @@ class DecimalConverter
  public:
   inline Status AppendItem(const OwnedRef& item) {
     /// TODO(phillipc): Check for nan?
-    if (item.obj() != Py_None) {
-      std::string string;
-      RETURN_NOT_OK(PythonDecimalToString(item.obj(), &string));
-
-      Decimal128 value;
-      RETURN_NOT_OK(Decimal128::FromString(string, &value));
-      return typed_builder_->Append(value);
-    }
+    std::string string;
+    RETURN_NOT_OK(PythonDecimalToString(item.obj(), &string));
 
-    return typed_builder_->AppendNull();
+    Decimal128 value;
+    RETURN_NOT_OK(Decimal128::FromString(string, &value));
+    return typed_builder_->Append(value);
   }
 };
 
@@ -604,8 +671,22 @@ std::shared_ptr<SeqConverter> GetConverter(const std::shared_ptr<DataType>& type
       return std::make_shared<NullConverter>();
     case Type::BOOL:
       return std::make_shared<BoolConverter>();
+    case Type::INT8:
+      return std::make_shared<Int8Converter>();
+    case Type::INT16:
+      return std::make_shared<Int16Converter>();
+    case Type::INT32:
+      return std::make_shared<Int32Converter>();
     case Type::INT64:
       return std::make_shared<Int64Converter>();
+    case Type::UINT8:
+      return std::make_shared<UInt8Converter>();
+    case Type::UINT16:
+      return std::make_shared<UInt16Converter>();
+    case Type::UINT32:
+      return std::make_shared<UInt32Converter>();
+    case Type::UINT64:
+      return std::make_shared<UInt64Converter>();
     case Type::DATE64:
       return std::make_shared<DateConverter>();
     case Type::TIMESTAMP:

http://git-wip-us.apache.org/repos/asf/arrow/blob/ccbf6446/cpp/src/arrow/python/numpy_to_arrow.cc
----------------------------------------------------------------------
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc
new file mode 100644
index 0000000..7151c94
--- /dev/null
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -0,0 +1,1228 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Functions for pandas conversion via NumPy
+
+#define ARROW_NO_DEFAULT_MEMORY_POOL
+
+#include "arrow/python/numpy_to_arrow.h"
+#include "arrow/python/numpy_interop.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+#include <limits>
+#include <memory>
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/type_fwd.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit-util.h"
+#include "arrow/util/decimal.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/macros.h"
+#include "arrow/visitor_inline.h"
+
+#include "arrow/compute/cast.h"
+#include "arrow/compute/context.h"
+
+#include "arrow/python/builtin_convert.h"
+#include "arrow/python/common.h"
+#include "arrow/python/config.h"
+#include "arrow/python/helpers.h"
+#include "arrow/python/numpy-internal.h"
+#include "arrow/python/numpy_convert.h"
+#include "arrow/python/type_traits.h"
+#include "arrow/python/util/datetime.h"
+
+namespace arrow {
+namespace py {
+
+using internal::NumPyTypeSize;
+
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max();
+
+// ----------------------------------------------------------------------
+// Conversion utilities
+
+namespace {
+
+inline bool PyFloat_isnan(const PyObject* obj) {
+  if (PyFloat_Check(obj)) {
+    double val = PyFloat_AS_DOUBLE(obj);
+    return val != val;
+  } else {
+    return false;
+  }
+}
+
+inline bool PandasObjectIsNull(const PyObject* obj) {
+  return obj == Py_None || obj == numpy_nan || PyFloat_isnan(obj);
+}
+
+inline bool PyObject_is_string(const PyObject* obj) {
+#if PY_MAJOR_VERSION >= 3
+  return PyUnicode_Check(obj) || PyBytes_Check(obj);
+#else
+  return PyString_Check(obj) || PyUnicode_Check(obj);
+#endif
+}
+
+inline bool PyObject_is_float(const PyObject* obj) { return PyFloat_Check(obj); }
+
+inline bool PyObject_is_integer(const PyObject* obj) {
+  return (!PyBool_Check(obj)) && PyArray_IsIntegerScalar(obj);
+}
+
+template <int TYPE>
+inline int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) {
+  typedef internal::npy_traits<TYPE> traits;
+  typedef typename traits::value_type T;
+
+  int64_t null_count = 0;
+
+  Ndarray1DIndexer<T> values(arr);
+  for (int i = 0; i < values.size(); ++i) {
+    if (traits::isnull(values[i])) {
+      ++null_count;
+    } else {
+      BitUtil::SetBit(bitmap, i);
+    }
+  }
+
+  return null_count;
+}
+
+// Returns null count
+int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) {
+  int64_t null_count = 0;
+
+  Ndarray1DIndexer<uint8_t> mask_values(mask);
+  for (int i = 0; i < length; ++i) {
+    if (mask_values[i]) {
+      ++null_count;
+    } else {
+      BitUtil::SetBit(bitmap, i);
+    }
+  }
+  return null_count;
+}
+
+Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) {
+  if (PyArray_NDIM(numpy_array) != 1) {
+    return Status::Invalid("only handle 1-dimensional arrays");
+  }
+
+  const int received_type = PyArray_DESCR(numpy_array)->type_num;
+  if (received_type != np_type) {
+    std::stringstream ss;
+    ss << "trying to convert NumPy type " << GetNumPyTypeName(np_type) << " but got "
+       << GetNumPyTypeName(received_type);
+    return Status::Invalid(ss.str());
+  }
+
+  return Status::OK();
+}
+
+}  // namespace
+
+/// Append as many string objects from NumPy arrays to a `StringBuilder` as we
+/// can fit
+///
+/// \param[in] offset starting offset for appending
+/// \param[out] values_consumed ending offset where we stopped appending. Will
+/// be length of arr if fully consumed
+/// \param[out] have_bytes true if we encountered any PyBytes object
+static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset,
+                                  StringBuilder* builder, int64_t* end_offset,
+                                  bool* have_bytes) {
+  PyObject* obj;
+
+  Ndarray1DIndexer<PyObject*> objects(arr);
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
+  if (mask != nullptr) {
+    mask_values.Init(mask);
+    have_mask = true;
+  }
+
+  for (; offset < objects.size(); ++offset) {
+    OwnedRef tmp_obj;
+    obj = objects[offset];
+    if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder->AppendNull());
+      continue;
+    } else if (PyUnicode_Check(obj)) {
+      obj = PyUnicode_AsUTF8String(obj);
+      if (obj == NULL) {
+        PyErr_Clear();
+        return Status::Invalid("failed converting unicode to UTF8");
+      }
+      tmp_obj.reset(obj);
+    } else if (PyBytes_Check(obj)) {
+      *have_bytes = true;
+    } else {
+      std::stringstream ss;
+      ss << "Error converting to Python objects to String/UTF8: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "str, bytes", &ss));
+      return Status::Invalid(ss.str());
+    }
+
+    const int32_t length = static_cast<int32_t>(PyBytes_GET_SIZE(obj));
+    if (ARROW_PREDICT_FALSE(builder->value_data_length() + length > kBinaryMemoryLimit)) {
+      break;
+    }
+    RETURN_NOT_OK(builder->Append(PyBytes_AS_STRING(obj), length));
+  }
+
+  // If we consumed the whole array, this will be the length of arr
+  *end_offset = offset;
+  return Status::OK();
+}
+
+static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mask,
+                                          int byte_width, int64_t offset,
+                                          FixedSizeBinaryBuilder* builder,
+                                          int64_t* end_offset) {
+  PyObject* obj;
+
+  Ndarray1DIndexer<PyObject*> objects(arr);
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
+  if (mask != nullptr) {
+    mask_values.Init(mask);
+    have_mask = true;
+  }
+
+  for (; offset < objects.size(); ++offset) {
+    OwnedRef tmp_obj;
+    obj = objects[offset];
+    if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder->AppendNull());
+      continue;
+    } else if (PyUnicode_Check(obj)) {
+      obj = PyUnicode_AsUTF8String(obj);
+      if (obj == NULL) {
+        PyErr_Clear();
+        return Status::Invalid("failed converting unicode to UTF8");
+      }
+
+      tmp_obj.reset(obj);
+    } else if (!PyBytes_Check(obj)) {
+      std::stringstream ss;
+      ss << "Error converting to Python objects to FixedSizeBinary: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "str, bytes", &ss));
+      return Status::Invalid(ss.str());
+    }
+
+    RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width));
+    if (ARROW_PREDICT_FALSE(builder->value_data_length() + byte_width >
+                            kBinaryMemoryLimit)) {
+      break;
+    }
+    RETURN_NOT_OK(
+        builder->Append(reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(obj))));
+  }
+
+  // If we consumed the whole array, this will be the length of arr
+  *end_offset = offset;
+  return Status::OK();
+}
+
+// ----------------------------------------------------------------------
+// Conversion from NumPy-in-Pandas to Arrow
+
+class NumPyConverter {
+ public:
+  NumPyConverter(MemoryPool* pool, PyObject* ao, PyObject* mo,
+                 const std::shared_ptr<DataType>& type, bool use_pandas_null_sentinels)
+      : pool_(pool),
+        type_(type),
+        arr_(reinterpret_cast<PyArrayObject*>(ao)),
+        mask_(nullptr),
+        use_pandas_null_sentinels_(use_pandas_null_sentinels) {
+    if (mo != nullptr && mo != Py_None) {
+      mask_ = reinterpret_cast<PyArrayObject*>(mo);
+    }
+    length_ = static_cast<int64_t>(PyArray_SIZE(arr_));
+  }
+
+  bool is_strided() const {
+    npy_intp* astrides = PyArray_STRIDES(arr_);
+    return astrides[0] != PyArray_DESCR(arr_)->elsize;
+  }
+
+  Status Convert();
+
+  const std::vector<std::shared_ptr<Array>>& result() const { return out_arrays_; }
+
+  template <typename T>
+  typename std::enable_if<std::is_base_of<PrimitiveCType, T>::value ||
+                              std::is_same<BooleanType, T>::value,
+                          Status>::type
+  Visit(const T& type) {
+    return VisitNative<T>();
+  }
+
+  Status Visit(const Date32Type& type) { return VisitNative<Date32Type>(); }
+  Status Visit(const Date64Type& type) { return VisitNative<Int64Type>(); }
+  Status Visit(const TimestampType& type) { return VisitNative<TimestampType>(); }
+  Status Visit(const Time32Type& type) { return VisitNative<Int32Type>(); }
+  Status Visit(const Time64Type& type) { return VisitNative<Int64Type>(); }
+
+  Status Visit(const NullType& type) { return TypeNotImplemented(type.ToString()); }
+
+  Status Visit(const BinaryType& type) { return TypeNotImplemented(type.ToString()); }
+
+  Status Visit(const FixedSizeBinaryType& type) {
+    return TypeNotImplemented(type.ToString());
+  }
+
+  Status Visit(const DecimalType& type) { return TypeNotImplemented(type.ToString()); }
+
+  Status Visit(const DictionaryType& type) { return TypeNotImplemented(type.ToString()); }
+
+  Status Visit(const NestedType& type) { return TypeNotImplemented(type.ToString()); }
+
+ protected:
+  Status InitNullBitmap() {
+    int64_t null_bytes = BitUtil::BytesForBits(length_);
+
+    null_bitmap_ = std::make_shared<PoolBuffer>(pool_);
+    RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+    null_bitmap_data_ = null_bitmap_->mutable_data();
+    memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
+
+    return Status::OK();
+  }
+
+  // ----------------------------------------------------------------------
+  // Traditional visitor conversion for non-object arrays
+
+  template <typename ArrowType>
+  Status ConvertData(std::shared_ptr<Buffer>* data);
+
+  template <typename T>
+  Status PushBuilderResult(T* builder) {
+    std::shared_ptr<Array> out;
+    RETURN_NOT_OK(builder->Finish(&out));
+    out_arrays_.emplace_back(out);
+    return Status::OK();
+  }
+
+  template <int TYPE, typename BuilderType>
+  Status AppendNdarrayToBuilder(PyArrayObject* array, BuilderType* builder) {
+    typedef internal::npy_traits<TYPE> traits;
+    typedef typename traits::value_type T;
+
+    const bool null_sentinels_possible =
+        (use_pandas_null_sentinels_ && traits::supports_nulls);
+
+    // TODO(wesm): Vector append when not strided
+    Ndarray1DIndexer<T> values(array);
+    if (null_sentinels_possible) {
+      for (int64_t i = 0; i < values.size(); ++i) {
+        if (traits::isnull(values[i])) {
+          RETURN_NOT_OK(builder->AppendNull());
+        } else {
+          RETURN_NOT_OK(builder->Append(values[i]));
+        }
+      }
+    } else {
+      for (int64_t i = 0; i < values.size(); ++i) {
+        RETURN_NOT_OK(builder->Append(values[i]));
+      }
+    }
+    return Status::OK();
+  }
+
+  Status PushArray(const std::shared_ptr<ArrayData>& data) {
+    std::shared_ptr<Array> result;
+    RETURN_NOT_OK(MakeArray(data, &result));
+    out_arrays_.emplace_back(std::move(result));
+    return Status::OK();
+  }
+
+  template <typename ArrowType>
+  Status VisitNative() {
+    using traits = internal::arrow_traits<ArrowType::type_id>;
+
+    const bool null_sentinels_possible =
+        (use_pandas_null_sentinels_ && traits::supports_nulls);
+
+    if (mask_ != nullptr || null_sentinels_possible) {
+      RETURN_NOT_OK(InitNullBitmap());
+    }
+
+    std::shared_ptr<Buffer> data;
+    RETURN_NOT_OK(ConvertData<ArrowType>(&data));
+
+    int64_t null_count = 0;
+    if (mask_ != nullptr) {
+      null_count = MaskToBitmap(mask_, length_, null_bitmap_data_);
+    } else if (null_sentinels_possible) {
+      // TODO(wesm): this presumes the NumPy C type and arrow C type are the
+      // same
+      null_count = ValuesToBitmap<traits::npy_type>(arr_, null_bitmap_data_);
+    }
+
+    BufferVector buffers = {null_bitmap_, data};
+    auto arr_data =
+        std::make_shared<ArrayData>(type_, length_, std::move(buffers), null_count, 0);
+    return PushArray(arr_data);
+  }
+
+  Status TypeNotImplemented(std::string type_name) {
+    std::stringstream ss;
+    ss << "NumPyConverter doesn't implement <" << type_name << "> conversion. ";
+    return Status::NotImplemented(ss.str());
+  }
+
+  // ----------------------------------------------------------------------
+  // Conversion logic for various object dtype arrays
+
+  Status ConvertObjects();
+
+  template <int ITEM_TYPE, typename ArrowType>
+  Status ConvertTypedLists(const std::shared_ptr<DataType>& type, ListBuilder* builder,
+                           PyObject* list);
+
+  template <typename ArrowType>
+  Status ConvertDates();
+
+  Status ConvertBooleans();
+  Status ConvertObjectStrings();
+  Status ConvertObjectFloats();
+  Status ConvertObjectFixedWidthBytes(const std::shared_ptr<DataType>& type);
+  Status ConvertObjectIntegers();
+  Status ConvertLists(const std::shared_ptr<DataType>& type);
+  Status ConvertLists(const std::shared_ptr<DataType>& type, ListBuilder* builder,
+                      PyObject* list);
+  Status ConvertDecimals();
+  Status ConvertTimes();
+  Status ConvertObjectsInfer();
+  Status ConvertObjectsInferAndCast();
+
+  MemoryPool* pool_;
+  std::shared_ptr<DataType> type_;
+  PyArrayObject* arr_;
+  PyArrayObject* mask_;
+  int64_t length_;
+
+  bool use_pandas_null_sentinels_;
+
+  // Used in visitor pattern
+  std::vector<std::shared_ptr<Array>> out_arrays_;
+
+  std::shared_ptr<ResizableBuffer> null_bitmap_;
+  uint8_t* null_bitmap_data_;
+};
+
+Status NumPyConverter::Convert() {
+  if (PyArray_NDIM(arr_) != 1) {
+    return Status::Invalid("only handle 1-dimensional arrays");
+  }
+
+  if (PyArray_DESCR(arr_)->type_num == NPY_OBJECT) {
+    return ConvertObjects();
+  }
+
+  if (type_ == nullptr) {
+    return Status::Invalid("Must pass data type for non-object arrays");
+  }
+
+  // Visit the type to perform conversion
+  return VisitTypeInline(*type_, this);
+}
+
+template <typename T, typename T2>
+void CopyStrided(T* input_data, int64_t length, int64_t stride, T2* output_data) {
+  // Passing input_data as non-const is a concession to PyObject*
+  int64_t j = 0;
+  for (int64_t i = 0; i < length; ++i) {
+    output_data[i] = static_cast<T2>(input_data[j]);
+    j += stride;
+  }
+}
+
+template <>
+void CopyStrided<PyObject*, PyObject*>(PyObject** input_data, int64_t length,
+                                       int64_t stride, PyObject** output_data) {
+  int64_t j = 0;
+  for (int64_t i = 0; i < length; ++i) {
+    output_data[i] = input_data[j];
+    if (output_data[i] != nullptr) {
+      Py_INCREF(output_data[i]);
+    }
+    j += stride;
+  }
+}
+
+static Status CastBuffer(const std::shared_ptr<Buffer>& input, const int64_t length,
+                         const std::shared_ptr<DataType>& in_type,
+                         const std::shared_ptr<DataType>& out_type, MemoryPool* pool,
+                         std::shared_ptr<Buffer>* out) {
+  // Must cast
+  std::vector<std::shared_ptr<Buffer>> buffers = {nullptr, input};
+  auto tmp_data = std::make_shared<ArrayData>(in_type, length, buffers, 0);
+
+  std::shared_ptr<Array> tmp_array, casted_array;
+  RETURN_NOT_OK(MakeArray(tmp_data, &tmp_array));
+
+  compute::FunctionContext context(pool);
+  compute::CastOptions cast_options;
+  cast_options.allow_int_overflow = false;
+
+  RETURN_NOT_OK(
+      compute::Cast(&context, *tmp_array, out_type, cast_options, &casted_array));
+  *out = casted_array->data()->buffers[1];
+  return Status::OK();
+}
+
+template <typename ArrowType>
+inline Status NumPyConverter::ConvertData(std::shared_ptr<Buffer>* data) {
+  using traits = internal::arrow_traits<ArrowType::type_id>;
+  using T = typename traits::T;
+
+  if (is_strided()) {
+    // Strided, must copy into new contiguous memory
+    const int64_t stride = PyArray_STRIDES(arr_)[0];
+    const int64_t stride_elements = stride / static_cast<int64_t>(sizeof(T));
+
+    auto new_buffer = std::make_shared<PoolBuffer>(pool_);
+    RETURN_NOT_OK(new_buffer->Resize(sizeof(T) * length_));
+    CopyStrided(reinterpret_cast<T*>(PyArray_DATA(arr_)), length_, stride_elements,
+                reinterpret_cast<T*>(new_buffer->mutable_data()));
+    *data = new_buffer;
+  } else {
+    // Can zero-copy
+    *data = std::make_shared<NumPyBuffer>(reinterpret_cast<PyObject*>(arr_));
+  }
+
+  std::shared_ptr<DataType> input_type;
+  RETURN_NOT_OK(
+      NumPyDtypeToArrow(reinterpret_cast<PyObject*>(PyArray_DESCR(arr_)), &input_type));
+
+  if (!input_type->Equals(*type_)) {
+    RETURN_NOT_OK(CastBuffer(*data, length_, input_type, type_, pool_, data));
+  }
+
+  return Status::OK();
+}
+
+template <>
+inline Status NumPyConverter::ConvertData<Date32Type>(std::shared_ptr<Buffer>* data) {
+  // Handle LONGLONG->INT64 and other fun things
+  int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num);
+  int type_size = NumPyTypeSize(type_num_compat);
+
+  if (type_size == 4) {
+    // Source and target are INT32, so can refer to the main implementation.
+    return ConvertData<Int32Type>(data);
+  } else if (type_size == 8) {
+    // We need to scale down from int64 to int32
+    auto new_buffer = std::make_shared<PoolBuffer>(pool_);
+    RETURN_NOT_OK(new_buffer->Resize(sizeof(int32_t) * length_));
+
+    auto input = reinterpret_cast<const int64_t*>(PyArray_DATA(arr_));
+    auto output = reinterpret_cast<int32_t*>(new_buffer->mutable_data());
+
+    if (is_strided()) {
+      // Strided, must copy into new contiguous memory
+      const int64_t stride = PyArray_STRIDES(arr_)[0];
+      const int64_t stride_elements = stride / static_cast<int64_t>(sizeof(int64_t));
+      CopyStrided(input, length_, stride_elements, output);
+    } else {
+      // TODO(wesm): int32 overflow checks
+      for (int64_t i = 0; i < length_; ++i) {
+        *output++ = static_cast<int32_t>(*input++);
+      }
+    }
+    *data = new_buffer;
+  } else {
+    std::stringstream ss;
+    ss << "Cannot convert NumPy array of element size ";
+    ss << type_size << " to a Date32 array";
+    return Status::NotImplemented(ss.str());
+  }
+
+  return Status::OK();
+}
+
+template <>
+inline Status NumPyConverter::ConvertData<BooleanType>(std::shared_ptr<Buffer>* data) {
+  int64_t nbytes = BitUtil::BytesForBits(length_);
+  auto buffer = std::make_shared<PoolBuffer>(pool_);
+  RETURN_NOT_OK(buffer->Resize(nbytes));
+
+  Ndarray1DIndexer<uint8_t> values(arr_);
+
+  uint8_t* bitmap = buffer->mutable_data();
+
+  memset(bitmap, 0, nbytes);
+  for (int64_t i = 0; i < length_; ++i) {
+    if (values[i] > 0) {
+      BitUtil::SetBit(bitmap, i);
+    }
+  }
+
+  *data = buffer;
+  return Status::OK();
+}
+
+template <typename T>
+struct UnboxDate {};
+
+template <>
+struct UnboxDate<Date32Type> {
+  static int32_t Unbox(PyObject* obj) {
+    return PyDate_to_days(reinterpret_cast<PyDateTime_Date*>(obj));
+  }
+};
+
+template <>
+struct UnboxDate<Date64Type> {
+  static int64_t Unbox(PyObject* obj) {
+    return PyDate_to_ms(reinterpret_cast<PyDateTime_Date*>(obj));
+  }
+};
+
+template <typename ArrowType>
+Status NumPyConverter::ConvertDates() {
+  PyAcquireGIL lock;
+
+  using BuilderType = typename TypeTraits<ArrowType>::BuilderType;
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+
+  if (mask_ != nullptr) {
+    return Status::NotImplemented("mask not supported in object conversions yet");
+  }
+
+  BuilderType builder(pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  /// We have to run this in this compilation unit, since we cannot use the
+  /// datetime API otherwise
+  PyDateTime_IMPORT;
+
+  PyObject* obj;
+  for (int64_t i = 0; i < length_; ++i) {
+    obj = objects[i];
+    if (PyDate_CheckExact(obj)) {
+      RETURN_NOT_OK(builder.Append(UnboxDate<ArrowType>::Unbox(obj)));
+    } else if (PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder.AppendNull());
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Date: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "datetime.date", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+
+  return PushBuilderResult(&builder);
+}
+
+Status NumPyConverter::ConvertDecimals() {
+  PyAcquireGIL lock;
+
+  // Import the decimal module and Decimal class
+  OwnedRef decimal;
+  OwnedRef Decimal;
+  RETURN_NOT_OK(ImportModule("decimal", &decimal));
+  RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal));
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+  PyObject* object = objects[0];
+
+  int precision;
+  int scale;
+
+  RETURN_NOT_OK(InferDecimalPrecisionAndScale(object, &precision, &scale));
+
+  type_ = std::make_shared<DecimalType>(precision, scale);
+
+  DecimalBuilder builder(type_, pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  for (int64_t i = 0; i < length_; ++i) {
+    object = objects[i];
+    if (PyObject_IsInstance(object, Decimal.obj())) {
+      std::string string;
+      RETURN_NOT_OK(PythonDecimalToString(object, &string));
+
+      Decimal128 value;
+      RETURN_NOT_OK(Decimal128::FromString(string, &value));
+      RETURN_NOT_OK(builder.Append(value));
+    } else if (PandasObjectIsNull(object)) {
+      RETURN_NOT_OK(builder.AppendNull());
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Decimal: ";
+      RETURN_NOT_OK(InvalidConversion(object, "decimal.Decimal", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+  return PushBuilderResult(&builder);
+}
+
+Status NumPyConverter::ConvertTimes() {
+  // Convert array of datetime.time objects to Arrow
+  PyAcquireGIL lock;
+  PyDateTime_IMPORT;
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+
+  // datetime.time stores microsecond resolution
+  Time64Builder builder(::arrow::time64(TimeUnit::MICRO), pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  PyObject* obj;
+  for (int64_t i = 0; i < length_; ++i) {
+    obj = objects[i];
+    if (PyTime_Check(obj)) {
+      RETURN_NOT_OK(builder.Append(PyTime_to_us(obj)));
+    } else if (PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder.AppendNull());
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Time: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "datetime.time", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+  return PushBuilderResult(&builder);
+}
+
+Status NumPyConverter::ConvertObjectStrings() {
+  PyAcquireGIL lock;
+
+  // The output type at this point is inconclusive because there may be bytes
+  // and unicode mixed in the object array
+  StringBuilder builder(pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  bool global_have_bytes = false;
+  int64_t offset = 0;
+  while (offset < length_) {
+    bool chunk_have_bytes = false;
+    RETURN_NOT_OK(
+        AppendObjectStrings(arr_, mask_, offset, &builder, &offset, &chunk_have_bytes));
+
+    global_have_bytes = global_have_bytes | chunk_have_bytes;
+    std::shared_ptr<Array> chunk;
+    RETURN_NOT_OK(builder.Finish(&chunk));
+    out_arrays_.emplace_back(std::move(chunk));
+  }
+
+  // If we saw PyBytes, convert everything to BinaryArray
+  if (global_have_bytes) {
+    for (size_t i = 0; i < out_arrays_.size(); ++i) {
+      auto binary_data = out_arrays_[i]->data()->ShallowCopy();
+      binary_data->type = ::arrow::binary();
+      out_arrays_[i] = std::make_shared<BinaryArray>(binary_data);
+    }
+  }
+  return Status::OK();
+}
+
+Status NumPyConverter::ConvertObjectFloats() {
+  PyAcquireGIL lock;
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
+  if (mask_ != nullptr) {
+    mask_values.Init(mask_);
+    have_mask = true;
+  }
+
+  DoubleBuilder builder(pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  PyObject* obj;
+  for (int64_t i = 0; i < objects.size(); ++i) {
+    obj = objects[i];
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder.AppendNull());
+    } else if (PyFloat_Check(obj)) {
+      double val = PyFloat_AsDouble(obj);
+      RETURN_IF_PYERROR();
+      RETURN_NOT_OK(builder.Append(val));
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Double: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "float", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+
+  return PushBuilderResult(&builder);
+}
+
+Status NumPyConverter::ConvertObjectIntegers() {
+  PyAcquireGIL lock;
+
+  Int64Builder builder(pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
+  if (mask_ != nullptr) {
+    mask_values.Init(mask_);
+    have_mask = true;
+  }
+
+  PyObject* obj;
+  for (int64_t i = 0; i < objects.size(); ++i) {
+    obj = objects[i];
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
+      RETURN_NOT_OK(builder.AppendNull());
+    } else if (PyObject_is_integer(obj)) {
+      const int64_t val = static_cast<int64_t>(PyLong_AsLongLong(obj));
+      RETURN_IF_PYERROR();
+      RETURN_NOT_OK(builder.Append(val));
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Int64: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "integer", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+
+  return PushBuilderResult(&builder);
+}
+
+Status NumPyConverter::ConvertObjectFixedWidthBytes(
+    const std::shared_ptr<DataType>& type) {
+  PyAcquireGIL lock;
+
+  int32_t byte_width = static_cast<const FixedSizeBinaryType&>(*type).byte_width();
+
+  // The output type at this point is inconclusive because there may be bytes
+  // and unicode mixed in the object array
+  FixedSizeBinaryBuilder builder(type, pool_);
+  RETURN_NOT_OK(builder.Resize(length_));
+
+  int64_t offset = 0;
+  while (offset < length_) {
+    RETURN_NOT_OK(
+        AppendObjectFixedWidthBytes(arr_, mask_, byte_width, offset, &builder, &offset));
+
+    std::shared_ptr<Array> chunk;
+    RETURN_NOT_OK(builder.Finish(&chunk));
+    out_arrays_.emplace_back(std::move(chunk));
+  }
+  return Status::OK();
+}
+
+Status NumPyConverter::ConvertBooleans() {
+  PyAcquireGIL lock;
+
+  Ndarray1DIndexer<PyObject*> objects(arr_);
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
+  if (mask_ != nullptr) {
+    mask_values.Init(mask_);
+    have_mask = true;
+  }
+
+  int64_t nbytes = BitUtil::BytesForBits(length_);
+  auto data = std::make_shared<PoolBuffer>(pool_);
+  RETURN_NOT_OK(data->Resize(nbytes));
+  uint8_t* bitmap = data->mutable_data();
+  memset(bitmap, 0, nbytes);
+
+  int64_t null_count = 0;
+  PyObject* obj;
+  for (int64_t i = 0; i < length_; ++i) {
+    obj = objects[i];
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
+      ++null_count;
+    } else if (obj == Py_True) {
+      BitUtil::SetBit(bitmap, i);
+      BitUtil::SetBit(null_bitmap_data_, i);
+    } else if (obj == Py_False) {
+      BitUtil::SetBit(null_bitmap_data_, i);
+    } else {
+      std::stringstream ss;
+      ss << "Error converting from Python objects to Boolean: ";
+      RETURN_NOT_OK(InvalidConversion(obj, "bool", &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+
+  out_arrays_.push_back(
+      std::make_shared<BooleanArray>(length_, data, null_bitmap_, null_count));
+  return Status::OK();
+}
+
+Status NumPyConverter::ConvertObjectsInfer() {
+  Ndarray1DIndexer<PyObject*> objects;
+
+  PyAcquireGIL lock;
+  objects.Init(arr_);
+  PyDateTime_IMPORT;
+
+  OwnedRef decimal;
+  OwnedRef Decimal;
+  RETURN_NOT_OK(ImportModule("decimal", &decimal));
+  RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal));
+
+  for (int64_t i = 0; i < length_; ++i) {
+    PyObject* obj = objects[i];
+    if (PandasObjectIsNull(obj)) {
+      continue;
+    } else if (PyObject_is_string(obj)) {
+      return ConvertObjectStrings();
+    } else if (PyObject_is_float(obj)) {
+      return ConvertObjectFloats();
+    } else if (PyBool_Check(obj)) {
+      return ConvertBooleans();
+    } else if (PyObject_is_integer(obj)) {
+      return ConvertObjectIntegers();
+    } else if (PyDate_CheckExact(obj)) {
+      // We could choose Date32 or Date64
+      return ConvertDates<Date32Type>();
+    } else if (PyTime_Check(obj)) {
+      return ConvertTimes();
+    } else if (PyObject_IsInstance(const_cast<PyObject*>(obj), Decimal.obj())) {
+      return ConvertDecimals();
+    } else if (PyList_Check(obj) || PyArray_Check(obj)) {
+      std::shared_ptr<DataType> inferred_type;
+      RETURN_NOT_OK(InferArrowType(obj, &inferred_type));
+      return ConvertLists(inferred_type);
+    } else {
+      const std::string supported_types =
+          "string, bool, float, int, date, time, decimal, list, array";
+      std::stringstream ss;
+      ss << "Error inferring Arrow type for Python object array. ";
+      RETURN_NOT_OK(InvalidConversion(obj, supported_types, &ss));
+      return Status::Invalid(ss.str());
+    }
+  }
+  out_arrays_.push_back(std::make_shared<NullArray>(length_));
+  return Status::OK();
+}
+
+Status NumPyConverter::ConvertObjectsInferAndCast() {
+  size_t position = out_arrays_.size();
+  RETURN_NOT_OK(ConvertObjectsInfer());
+
+  std::shared_ptr<Array> arr = out_arrays_[position];
+
+  // Perform cast
+  compute::FunctionContext context(pool_);
+  compute::CastOptions options;
+  options.allow_int_overflow = false;
+
+  std::shared_ptr<Array> casted;
+  RETURN_NOT_OK(compute::Cast(&context, *arr, type_, options, &casted));
+
+  // Replace with casted values
+  out_arrays_[position] = casted;
+
+  return Status::OK();
+}
+
+Status NumPyConverter::ConvertObjects() {
+  // Python object arrays are annoying, since we could have one of:
+  //
+  // * Strings
+  // * Booleans with nulls
+  // * decimal.Decimals
+  // * Mixed type (not supported at the moment by arrow format)
+  //
+  // Additionally, nulls may be encoded either as np.nan or None. So we have to
+  // do some type inference and conversion
+
+  RETURN_NOT_OK(InitNullBitmap());
+
+  // This means we received an explicit type from the user
+  if (type_) {
+    switch (type_->id()) {
+      case Type::STRING:
+        return ConvertObjectStrings();
+      case Type::FIXED_SIZE_BINARY:
+        return ConvertObjectFixedWidthBytes(type_);
+      case Type::BOOL:
+        return ConvertBooleans();
+      case Type::DATE32:
+        return ConvertDates<Date32Type>();
+      case Type::DATE64:
+        return ConvertDates<Date64Type>();
+      case Type::LIST: {
+        const auto& list_field = static_cast<const ListType&>(*type_);
+        return ConvertLists(list_field.value_field()->type());
+      }
+      case Type::DECIMAL:
+        return ConvertDecimals();
+      default:
+        return ConvertObjectsInferAndCast();
+    }
+  } else {
+    // Re-acquire GIL
+    return ConvertObjectsInfer();
+  }
+}
+
+template <typename T>
+Status LoopPySequence(PyObject* sequence, T func) {
+  if (PySequence_Check(sequence)) {
+    OwnedRef ref;
+    Py_ssize_t size = PySequence_Size(sequence);
+    if (PyArray_Check(sequence)) {
+      auto array = reinterpret_cast<PyArrayObject*>(sequence);
+      Ndarray1DIndexer<PyObject*> objects(array);
+      for (int64_t i = 0; i < size; ++i) {
+        RETURN_NOT_OK(func(objects[i]));
+      }
+    } else {
+      for (int64_t i = 0; i < size; ++i) {
+        ref.reset(PySequence_GetItem(sequence, i));
+        RETURN_NOT_OK(func(ref.obj()));
+      }
+    }
+  } else if (PyObject_HasAttrString(sequence, "__iter__")) {
+    OwnedRef iter = OwnedRef(PyObject_GetIter(sequence));
+    PyObject* item;
+    while ((item = PyIter_Next(iter.obj()))) {
+      OwnedRef ref = OwnedRef(item);
+      RETURN_NOT_OK(func(ref.obj()));
+    }
+  } else {
+    return Status::TypeError("Object is not a sequence or iterable");
+  }
+
+  return Status::OK();
+}
+
+template <int ITEM_TYPE, typename ArrowType>
+inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr<DataType>& type,
+                                                ListBuilder* builder, PyObject* list) {
+  typedef internal::npy_traits<ITEM_TYPE> traits;
+  typedef typename traits::BuilderClass BuilderT;
+
+  PyAcquireGIL lock;
+
+  // TODO: mask not supported here
+  if (mask_ != nullptr) {
+    return Status::NotImplemented("mask not supported in object conversions yet");
+  }
+
+  BuilderT* value_builder = static_cast<BuilderT*>(builder->value_builder());
+
+  auto foreach_item = [&](PyObject* object) {
+    if (PandasObjectIsNull(object)) {
+      return builder->AppendNull();
+    } else if (PyArray_Check(object)) {
+      auto numpy_array = reinterpret_cast<PyArrayObject*>(object);
+      RETURN_NOT_OK(builder->Append(true));
+
+      // TODO(uwe): Support more complex numpy array structures
+      RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE));
+
+      return AppendNdarrayToBuilder<ITEM_TYPE, BuilderT>(numpy_array, value_builder);
+    } else if (PyList_Check(object)) {
+      int64_t size;
+      std::shared_ptr<DataType> inferred_type;
+      RETURN_NOT_OK(builder->Append(true));
+      RETURN_NOT_OK(InferArrowTypeAndSize(object, &size, &inferred_type));
+      if (inferred_type->id() != Type::NA && inferred_type->id() != type->id()) {
+        std::stringstream ss;
+        ss << inferred_type->ToString() << " cannot be converted to " << type->ToString();
+        return Status::TypeError(ss.str());
+      }
+      return AppendPySequence(object, size, type, value_builder);
+    } else {
+      return Status::TypeError("Unsupported Python type for list items");
+    }
+  };
+
+  return LoopPySequence(list, foreach_item);
+}
+
+template <>
+inline Status NumPyConverter::ConvertTypedLists<NPY_OBJECT, NullType>(
+    const std::shared_ptr<DataType>& type, ListBuilder* builder, PyObject* list) {
+  PyAcquireGIL lock;
+
+  // TODO: mask not supported here
+  if (mask_ != nullptr) {
+    return Status::NotImplemented("mask not supported in object conversions yet");
+  }
+
+  auto value_builder = static_cast<NullBuilder*>(builder->value_builder());
+
+  auto foreach_item = [&](PyObject* object) {
+    if (PandasObjectIsNull(object)) {
+      return builder->AppendNull();
+    } else if (PyArray_Check(object)) {
+      auto numpy_array = reinterpret_cast<PyArrayObject*>(object);
+      RETURN_NOT_OK(builder->Append(true));
+
+      // TODO(uwe): Support more complex numpy array structures
+      RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
+
+      for (int64_t i = 0; i < static_cast<int64_t>(PyArray_SIZE(numpy_array)); ++i) {
+        RETURN_NOT_OK(value_builder->AppendNull());
+      }
+      return Status::OK();
+    } else if (PyList_Check(object)) {
+      RETURN_NOT_OK(builder->Append(true));
+      const Py_ssize_t size = PySequence_Size(object);
+      for (Py_ssize_t i = 0; i < size; ++i) {
+        RETURN_NOT_OK(value_builder->AppendNull());
+      }
+      return Status::OK();
+    } else {
+      return Status::TypeError("Unsupported Python type for list items");
+    }
+  };
+
+  return LoopPySequence(list, foreach_item);
+}
+
+template <>
+inline Status NumPyConverter::ConvertTypedLists<NPY_OBJECT, StringType>(
+    const std::shared_ptr<DataType>& type, ListBuilder* builder, PyObject* list) {
+  PyAcquireGIL lock;
+  // TODO: If there are bytes involved, convert to Binary representation
+  bool have_bytes = false;
+
+  // TODO: mask not supported here
+  if (mask_ != nullptr) {
+    return Status::NotImplemented("mask not supported in object conversions yet");
+  }
+
+  auto value_builder = static_cast<StringBuilder*>(builder->value_builder());
+
+  auto foreach_item = [&](PyObject* object) {
+    if (PandasObjectIsNull(object)) {
+      return builder->AppendNull();
+    } else if (PyArray_Check(object)) {
+      auto numpy_array = reinterpret_cast<PyArrayObject*>(object);
+      RETURN_NOT_OK(builder->Append(true));
+
+      // TODO(uwe): Support more complex numpy array structures
+      RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
+
+      int64_t offset = 0;
+      RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset,
+                                        &have_bytes));
+      if (offset < PyArray_SIZE(numpy_array)) {
+        return Status::Invalid("Array cell value exceeded 2GB");
+      }
+      return Status::OK();
+    } else if (PyList_Check(object)) {
+      int64_t size;
+      std::shared_ptr<DataType> inferred_type;
+      RETURN_NOT_OK(builder->Append(true));
+      RETURN_NOT_OK(InferArrowTypeAndSize(object, &size, &inferred_type));
+      if (inferred_type->id() != Type::NA && inferred_type->id() != Type::STRING) {
+        std::stringstream ss;
+        ss << inferred_type->ToString() << " cannot be converted to STRING.";
+        return Status::TypeError(ss.str());
+      }
+      return AppendPySequence(object, size, inferred_type, value_builder);
+    } else {
+      return Status::TypeError("Unsupported Python type for list items");
+    }
+  };
+
+  return LoopPySequence(list, foreach_item);
+}
+
+#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType)                            \
+  case Type::TYPE: {                                                      \
+    return ConvertTypedLists<NUMPY_TYPE, ArrowType>(type, builder, list); \
+  }
+
+Status NumPyConverter::ConvertLists(const std::shared_ptr<DataType>& type,
+                                    ListBuilder* builder, PyObject* list) {
+  switch (type->id()) {
+    LIST_CASE(NA, NPY_OBJECT, NullType)
+    LIST_CASE(UINT8, NPY_UINT8, UInt8Type)
+    LIST_CASE(INT8, NPY_INT8, Int8Type)
+    LIST_CASE(UINT16, NPY_UINT16, UInt16Type)
+    LIST_CASE(INT16, NPY_INT16, Int16Type)
+    LIST_CASE(UINT32, NPY_UINT32, UInt32Type)
+    LIST_CASE(INT32, NPY_INT32, Int32Type)
+    LIST_CASE(UINT64, NPY_UINT64, UInt64Type)
+    LIST_CASE(INT64, NPY_INT64, Int64Type)
+    LIST_CASE(TIMESTAMP, NPY_DATETIME, TimestampType)
+    LIST_CASE(FLOAT, NPY_FLOAT, FloatType)
+    LIST_CASE(DOUBLE, NPY_DOUBLE, DoubleType)
+    LIST_CASE(STRING, NPY_OBJECT, StringType)
+    case Type::LIST: {
+      const ListType& list_type = static_cast<const ListType&>(*type);
+      auto value_builder = static_cast<ListBuilder*>(builder->value_builder());
+
+      auto foreach_item = [&](PyObject* object) {
+        if (PandasObjectIsNull(object)) {
+          return builder->AppendNull();
+        } else {
+          RETURN_NOT_OK(builder->Append(true));
+          return ConvertLists(list_type.value_type(), value_builder, object);
+        }
+      };
+
+      return LoopPySequence(list, foreach_item);
+    }
+    default: {
+      std::stringstream ss;
+      ss << "Unknown list item type: ";
+      ss << type->ToString();
+      return Status::TypeError(ss.str());
+    }
+  }
+}
+
+Status NumPyConverter::ConvertLists(const std::shared_ptr<DataType>& type) {
+  std::unique_ptr<ArrayBuilder> array_builder;
+  RETURN_NOT_OK(MakeBuilder(pool_, arrow::list(type), &array_builder));
+  ListBuilder* list_builder = static_cast<ListBuilder*>(array_builder.get());
+  RETURN_NOT_OK(ConvertLists(type, list_builder, reinterpret_cast<PyObject*>(arr_)));
+  return PushBuilderResult(list_builder);
+}
+
+Status NdarrayToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo,
+                      bool use_pandas_null_sentinels,
+                      const std::shared_ptr<DataType>& type,
+                      std::shared_ptr<ChunkedArray>* out) {
+  NumPyConverter converter(pool, ao, mo, type, use_pandas_null_sentinels);
+  RETURN_NOT_OK(converter.Convert());
+  DCHECK(converter.result()[0]);
+  *out = std::make_shared<ChunkedArray>(converter.result());
+  return Status::OK();
+}
+
+}  // namespace py
+}  // namespace arrow
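
The strided branch of `ConvertData` above copies non-contiguous NumPy data into a fresh contiguous buffer. As a rough, standalone illustration of that idea (not the patch's code), the sketch below gathers elements at a fixed element stride and converts a byte stride to an element stride with a signed division. The names `CopyStridedSketch`, `ElementStride`, and `EveryThird` are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the patch's CopyStrided: gathers `length`
// elements spaced `stride` elements apart into a contiguous output,
// casting each value from T to T2 along the way.
template <typename T, typename T2>
void CopyStridedSketch(const T* input, int64_t length, int64_t stride, T2* output) {
  int64_t j = 0;
  for (int64_t i = 0; i < length; ++i) {
    output[i] = static_cast<T2>(input[j]);
    j += stride;
  }
}

// Converts a byte stride (the unit NumPy reports) to an element stride.
// Casting sizeof(T) to int64_t keeps the division signed, so a negative
// byte stride yields a negative element stride.
template <typename T>
int64_t ElementStride(int64_t byte_stride) {
  return byte_stride / static_cast<int64_t>(sizeof(T));
}

// Example: take every third int32 value and widen it to int64.
std::vector<int64_t> EveryThird() {
  const int32_t source[] = {10, 11, 12, 13, 14, 15, 16};
  std::vector<int64_t> out(3);
  // A byte stride of 12 corresponds to 3 int32 elements.
  CopyStridedSketch(source, 3, ElementStride<int32_t>(12), out.data());
  return out;
}
```

Here `EveryThird()` picks source indices 0, 3, and 6. The signed division matters for reversed views, where the byte stride is negative.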

http://git-wip-us.apache.org/repos/asf/arrow/blob/ccbf6446/cpp/src/arrow/python/numpy_to_arrow.h
----------------------------------------------------------------------
diff --git a/cpp/src/arrow/python/numpy_to_arrow.h b/cpp/src/arrow/python/numpy_to_arrow.h
new file mode 100644
index 0000000..4a70b4b
--- /dev/null
+++ b/cpp/src/arrow/python/numpy_to_arrow.h
@@ -0,0 +1,56 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Converting from pandas memory representation to Arrow data structures
+
+#ifndef ARROW_PYTHON_NUMPY_TO_ARROW_H
+#define ARROW_PYTHON_NUMPY_TO_ARROW_H
+
+#include "arrow/python/platform.h"
+
+#include <memory>
+
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+class Array;
+class ChunkedArray;
+class DataType;
+class MemoryPool;
+class Status;
+
+namespace py {
+
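+/// Convert a NumPy array to Arrow. If the target data type is not known,
+/// pass nullptr for type
+///
+/// \param[in] pool Memory pool for any memory allocations
+/// \param[in] ao an ndarray with the array data
+/// \param[in] mo an ndarray with a null mask (True is null), optional
+/// \param[in] use_pandas_null_sentinels whether to treat pandas-style null
+/// sentinels (such as NaN) as nulls
+/// \param[in] type the target Arrow data type; may be nullptr only for
+/// object-dtype arrays, in which case the type is inferred
+/// \param[out] out a ChunkedArray, to accommodate chunked output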
+ARROW_EXPORT
+Status NdarrayToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo,
+                      bool use_pandas_null_sentinels,
+                      const std::shared_ptr<DataType>& type,
+                      std::shared_ptr<ChunkedArray>* out);
+
+}  // namespace py
+}  // namespace arrow
+
+#endif  // ARROW_PYTHON_NUMPY_TO_ARROW_H
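
To make the interaction between the `mo` mask and `use_pandas_null_sentinels` more concrete, here is a minimal standalone sketch for float64 data. It follows the precedence used in `VisitNative` (an explicit mask wins, otherwise sentinels are checked only when requested), but it is not the patch's implementation: `ValiditySketch` and `BuildValidity` are hypothetical names, and unlike the patch the sketch always allocates a bitmap, even when no nulls are possible.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch; not part of the patch.
struct ValiditySketch {
  std::vector<uint8_t> bitmap;  // bit i set => value i is valid (LSB first)
  int64_t null_count = 0;
};

// Builds an Arrow-style validity bitmap for float64 values.
// If `mask` is non-null it takes precedence (non-zero byte means null).
// Otherwise NaN counts as null only when `use_pandas_null_sentinels` is true.
ValiditySketch BuildValidity(const double* values, const uint8_t* mask, int64_t length,
                             bool use_pandas_null_sentinels) {
  ValiditySketch out;
  out.bitmap.assign(static_cast<size_t>((length + 7) / 8), 0);
  for (int64_t i = 0; i < length; ++i) {
    bool is_null = false;
    if (mask != nullptr) {
      is_null = mask[i] != 0;
    } else if (use_pandas_null_sentinels) {
      is_null = std::isnan(values[i]);
    }
    if (is_null) {
      ++out.null_count;
    } else {
      out.bitmap[static_cast<size_t>(i / 8)] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return out;
}
```

With sentinel checking off and no mask, every slot is reported valid and the null count stays at zero, which mirrors the `from_pandas=False` behaviour shown in the commit message.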

