arrow-commits mailing list archives

From w...@apache.org
Subject arrow git commit: ARROW-446: [Python] Expand Sphinx documentation for 0.3
Date Mon, 08 May 2017 04:49:42 GMT
Repository: arrow
Updated Branches:
  refs/heads/master d7a2a1e18 -> cb5e7b6fa


ARROW-446: [Python] Expand Sphinx documentation for 0.3

I am going to finish the data model section and revamp the Parquet section, so we can get
this pushed out with the release announcement tomorrow. We should continue to add a lot of
new documentation over the coming weeks.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #656 from wesm/ARROW-446 and squashes the following commits:

b92c6d2 [Wes McKinney] Make pass over Parquet docs
a46f846 [Wes McKinney] Make a pass over Parquet documentation
066d0b9 [Wes McKinney] Finish first cut at data model section
4f510fb [Wes McKinney] Install IPython before building docs
4885222 [Wes McKinney] Start on a data model section
1d512e9 [Wes McKinney] Add barebones IPC section
0f800d8 [Wes McKinney] Add section on OSFile, MemoryMappedFile
aabf5b2 [Wes McKinney] Add draft about memory/io
5968847 [Wes McKinney] More on Memory/IO section


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/cb5e7b6f
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/cb5e7b6f
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/cb5e7b6f

Branch: refs/heads/master
Commit: cb5e7b6fa7d75e14e163ce43cb333b02e9fe1c03
Parents: d7a2a1e
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Mon May 8 00:49:37 2017 -0400
Committer: Wes McKinney <wes.mckinney@twosigma.com>
Committed: Mon May 8 00:49:37 2017 -0400

----------------------------------------------------------------------
 ci/travis_script_python.sh        |   2 +-
 python/doc/requirements.txt       |   2 +
 python/doc/source/api.rst         |  19 +-
 python/doc/source/conf.py         |  14 +-
 python/doc/source/data.rst        | 316 +++++++++++++++++++++++++++++++++
 python/doc/source/filesystems.rst |   8 +-
 python/doc/source/index.rst       |   5 +-
 python/doc/source/ipc.rst         | 136 ++++++++++++++
 python/doc/source/jemalloc.rst    |   9 +-
 python/doc/source/memory.rst      | 235 ++++++++++++++++++++++++
 python/doc/source/pandas.rst      |  36 ++--
 python/doc/source/parquet.rst     | 243 +++++++++++++++++++------
 python/pyarrow/_io.pyx            |   1 +
 13 files changed, 936 insertions(+), 90 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/ci/travis_script_python.sh
----------------------------------------------------------------------
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index 20b0f2a..ce5f7ec 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -117,7 +117,7 @@ python_version_tests() {
   # Build documentation once
   if [[ "$PYTHON_VERSION" == "3.6" ]]
   then
-      pip install -r doc/requirements.txt
+      conda install -y -q --file=doc/requirements.txt
       python setup.py build_sphinx -s doc/source
   fi
 }

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/requirements.txt
----------------------------------------------------------------------
diff --git a/python/doc/requirements.txt b/python/doc/requirements.txt
index ce0793c..f3c3414 100644
--- a/python/doc/requirements.txt
+++ b/python/doc/requirements.txt
@@ -1,3 +1,5 @@
+ipython
+matplotlib
 numpydoc
 sphinx
 sphinx_rtd_theme

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/api.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst
index 08a0694..a8dd8c5 100644
--- a/python/doc/source/api.rst
+++ b/python/doc/source/api.rst
@@ -22,7 +22,7 @@
 API Reference
 *************
 
-.. _api.functions:
+.. _api.types:
 
 Type and Schema Factory Functions
 ---------------------------------
@@ -58,6 +58,8 @@ Type and Schema Factory Functions
    schema
    from_numpy_dtype
 
+.. _api.value:
+
 Scalar Value Types
 ------------------
 
@@ -88,6 +90,7 @@ Scalar Value Types
    TimestampValue
    DecimalValue
 
+.. _api.array:
 
 Array Types and Constructors
 ----------------------------
@@ -122,6 +125,8 @@ Array Types and Constructors
    DecimalArray
    ListArray
 
+.. _api.table:
+
 Tables and Record Batches
 -------------------------
 
@@ -134,6 +139,8 @@ Tables and Record Batches
    Table
    get_record_batch_size
 
+.. _api.tensor:
+
 Tensor type and Functions
 -------------------------
 
@@ -145,6 +152,8 @@ Tensor type and Functions
    get_tensor_size
    read_tensor
 
+.. _api.io:
+
 Input / Output and Shared Memory
 --------------------------------
 
@@ -160,6 +169,8 @@ Input / Output and Shared Memory
    create_memory_map
    PythonFile
 
+.. _api.ipc:
+
 Interprocess Communication and Messaging
 ----------------------------------------
 
@@ -171,6 +182,8 @@ Interprocess Communication and Messaging
    StreamReader
    StreamWriter
 
+.. _api.memory_pool:
+
 Memory Pools
 ------------
 
@@ -183,6 +196,8 @@ Memory Pools
    total_allocated_bytes
    set_memory_pool
 
+.. _api.type_classes:
+
 Type Classes
 ------------
 
@@ -201,6 +216,8 @@ Type Classes
 
 .. currentmodule:: pyarrow.parquet
 
+.. _api.parquet:
+
 Apache Parquet
 --------------
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/conf.py
----------------------------------------------------------------------
diff --git a/python/doc/source/conf.py b/python/doc/source/conf.py
index a9262bf..7f98979 100644
--- a/python/doc/source/conf.py
+++ b/python/doc/source/conf.py
@@ -25,19 +25,11 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
-import inspect
 import os
 import sys
 
 import sphinx_rtd_theme
 
-on_rtd = os.environ.get('READTHEDOCS') == 'True'
-
-if not on_rtd:
-    # Hack: On RTD we use the pyarrow package from conda-forge as we cannot
-    # build pyarrow there.
-    sys.path.insert(0, os.path.abspath('..'))
-
 sys.path.extend([
     os.path.join(os.path.dirname(__file__),
                  '..', '../..')
@@ -60,6 +52,8 @@ extensions = [
     'sphinx.ext.mathjax',
     'sphinx.ext.viewcode',
     'sphinx.ext.napoleon',
+    'IPython.sphinxext.ipython_directive',
+    'IPython.sphinxext.ipython_console_highlighting'
 ]
 
 # numpydoc configuration
@@ -86,7 +80,7 @@ master_doc = 'index'
 
 # General information about the project.
 project = u'pyarrow'
-copyright = u'2016 Apache Software Foundation'
+copyright = u'2016-2017 Apache Software Foundation'
 author = u'Apache Software Foundation'
 
 # The version info for the project you're documenting, acts as replacement for
@@ -156,7 +150,7 @@ todo_include_todos = False
 # The theme to use for HTML and HTML Help pages.  See the documentation for
 # a list of builtin themes.
 #
-html_theme = 'sphinx_rtd_theme'
+html_theme = 'sphinxdoc'
 
 # Theme options are theme-specific and customize the look and feel of a theme
 # further.  For a list of options available for each theme, see the

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/data.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/data.rst b/python/doc/source/data.rst
new file mode 100644
index 0000000..04e74ae
--- /dev/null
+++ b/python/doc/source/data.rst
@@ -0,0 +1,316 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _data:
+
+In-Memory Data Model
+====================
+
+Apache Arrow defines columnar array data structures by composing type metadata
+with memory buffers, like the ones explained in the documentation on
+:ref:`Memory and IO <io>`. These data structures are exposed in Python through
+a series of interrelated classes:
+
+* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
+  array type
+* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
+  collection of types. These can be thought of as the column types in a
+  table-like object.
+* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
+  columnar data structures composed from Arrow Buffer objects
+* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
+  collection of Array objects with a particular Schema
+* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
+  which each column consists of one or more ``pyarrow.Array`` objects of the
+  same type.
+
+We will examine these in the sections below in a series of examples.
+
+.. _data.types:
+
+Type Metadata
+-------------
+
+Apache Arrow defines language agnostic column-oriented data structures for
+array data. These include:
+
+* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed
+  size binary, decimals, and other values that fit into a given number of bits
+* **Variable-length primitive types**: binary, string
+* **Nested types**: list, struct, and union
+* **Dictionary type**: An encoded categorical type (more on this later)
+
+Each logical data type in Arrow has a corresponding factory function for
+creating an instance of that type object in Python:
+
+.. ipython:: python
+
+   import pyarrow as pa
+   t1 = pa.int32()
+   t2 = pa.string()
+   t3 = pa.binary()
+   t4 = pa.binary(10)
+   t5 = pa.timestamp('ms')
+
+   t1
+   print(t1)
+   print(t4)
+   print(t5)
+
+We use the name **logical type** because the **physical** storage may be the
+same for one or more types. For example, ``int64``, ``float64``, and
+``timestamp[ms]`` all occupy 64 bits per value.
+
+These objects are `metadata`; they are used for describing the data in arrays,
+schemas, and record batches. In Python, they can be used in functions where the
+input data (e.g. Python objects) may be coerced to more than one Arrow type.
+
+The :class:`~pyarrow.Field` type is a type plus a name and optional
+user-defined metadata:
+
+.. ipython:: python
+
+   f0 = pa.field('int32_field', t1)
+   f0
+   f0.name
+   f0.type
+
+Arrow supports **nested value types** like list, struct, and union. When
+creating these, you must pass types or fields to indicate the data types of the
+types' children. For example, we can define a list of int32 values with:
+
+.. ipython:: python
+
+   t6 = pa.list_(t1)
+   t6
+
+A `struct` is a collection of named fields:
+
+.. ipython:: python
+
+   fields = [
+       pa.field('s0', t1),
+       pa.field('s1', t2),
+       pa.field('s2', t4),
+       pa.field('s3', t6)
+   ]
+
+   t7 = pa.struct(fields)
+   print(t7)
+
+See :ref:`Data Types API <api.types>` for a full listing of data type
+functions.
+
+.. _data.schema:
+
+Schemas
+-------
+
+The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
+defines the column names and types in a record batch or table data
+structure. The ``pyarrow.schema`` factory function makes new Schema objects in
+Python:
+
+.. ipython:: python
+
+   fields = [
+       pa.field('s0', t1),
+       pa.field('s1', t2),
+       pa.field('s2', t4),
+       pa.field('s3', t6)
+   ]
+
+   my_schema = pa.schema(fields)
+   my_schema
+
+In some applications, you may not create schemas directly, only using the ones
+that are embedded in :ref:`IPC messages <ipc>`.
+
+.. _data.array:
+
+Arrays
+------
+
+For each data type, there is an accompanying array data structure for holding
+memory buffers that define a single contiguous chunk of columnar array
+data. When you are using PyArrow, this data may come from IPC tools, though it
+can also be created from various types of Python sequences (lists, NumPy
+arrays, pandas data).
+
+A simple way to create arrays is with ``pyarrow.array``, which is similar to
+the ``numpy.array`` function:
+
+.. ipython:: python
+
+   arr = pa.array([1, 2, None, 3])
+   arr
+
+The array's ``type`` attribute is the corresponding piece of type metadata:
+
+.. ipython:: python
+
+   arr.type
+
+Each in-memory array has a known length and null count (which will be 0 if
+there are no null values):
+
+.. ipython:: python
+
+   len(arr)
+   arr.null_count
+
+Scalar values can be selected with normal indexing.  ``pyarrow.array`` converts
+``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
+nulls:
+
+.. ipython:: python
+
+   arr[0]
+   arr[2]
+
+Arrow data is immutable, so values can be selected but not assigned.
+
+Arrays can be sliced without copying:
+
+.. ipython:: python
+
+   arr[1:3]
+
+``pyarrow.array`` can create simple nested data structures like lists:
+
+.. ipython:: python
+
+   nested_arr = pa.array([[], None, [1, 2], [None, 1]])
+   print(nested_arr.type)
+
+Dictionary Arrays
+~~~~~~~~~~~~~~~~~
+
+The **Dictionary** type in PyArrow is a special array type that is similar to a
+factor in R or a ``pandas.Categorical``. It enables one or more record batches
+in a file or stream to transmit integer *indices* referencing a shared
+**dictionary** containing the distinct values in the logical array. This is
+most often used with strings to save memory and improve performance.
+
+The way that dictionaries are handled in the Apache Arrow format and the way
+they appear in C++ and Python differ slightly. We define a special
+:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
+consider an example:
+
+.. ipython:: python
+
+   indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
+   dictionary = pa.array(['foo', 'bar', 'baz'])
+
+   dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
+   dict_array
+
+Here we have:
+
+.. ipython:: python
+
+   print(dict_array.type)
+   dict_array.indices
+   dict_array.dictionary
+
+When using :class:`~.DictionaryArray` with pandas, the analogue is
+``pandas.Categorical`` (more on this later):
+
+.. ipython:: python
+
+   dict_array.to_pandas()
+
+.. _data.record_batch:
+
+Record Batches
+--------------
+
+A **Record Batch** in Apache Arrow is a collection of equal-length array
+instances. Let's consider a collection of arrays:
+
+.. ipython:: python
+
+   data = [
+       pa.array([1, 2, 3, 4]),
+       pa.array(['foo', 'bar', 'baz', None]),
+       pa.array([True, None, False, True])
+   ]
+
+A record batch can be created from this list of arrays using
+``RecordBatch.from_arrays``:
+
+.. ipython:: python
+
+   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
+   batch.num_columns
+   batch.num_rows
+   batch.schema
+
+   batch[1]
+
+A record batch can be sliced without copying memory like an array:
+
+.. ipython:: python
+
+   batch2 = batch.slice(1, 3)
+   batch2[1]
+
+.. _data.table:
+
+Tables
+------
+
+The PyArrow :class:`~.Table` type is not part of the Apache Arrow
+specification, but is rather a tool to help with wrangling multiple record
+batches and array pieces as a single logical dataset. As a relevant example, we
+may receive multiple small record batches in a socket stream, then need to
+concatenate them into contiguous memory for use in NumPy or pandas. The Table
+object makes this efficient without requiring additional memory copying.
+
+Considering the record batch we created above, we can create a Table containing
+one or more copies of the batch using ``Table.from_batches``:
+
+.. ipython:: python
+
+   batches = [batch] * 5
+   table = pa.Table.from_batches(batches)
+   table
+   table.num_rows
+
+The table's columns are instances of :class:`~.Column`, which is a container
+for one or more arrays of the same type.
+
+.. ipython:: python
+
+   c = table[0]
+   c
+   c.data
+   c.data.num_chunks
+   c.data.chunk(0)
+
+As you'll see in the :ref:`pandas section <pandas>`, we can convert these
+objects to contiguous NumPy arrays for use in pandas:
+
+.. ipython:: python
+
+   c.to_pandas()
+
+Custom Schema and Field Metadata
+--------------------------------
+
+TODO
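
The dictionary-array example in data.rst above pairs integer indices with a dictionary of distinct values. As a minimal pure-Python sketch of that idea, the encoding can be illustrated as follows. The helper names `dictionary_encode` and `dictionary_decode` are hypothetical and are not part of pyarrow, and `None` stands in for an Arrow null:

```python
def dictionary_encode(values):
    """Split values into (indices, dictionary); None stays None (a null)."""
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> index into dictionary
    indices = []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Rebuild the logical values from indices and dictionary."""
    return [None if i is None else dictionary[i] for i in indices]


values = ['foo', 'bar', 'foo', 'bar', 'baz', 'foo', None, 'baz']
indices, dictionary = dictionary_encode(values)
print(indices)     # [0, 1, 0, 1, 2, 0, None, 2]
print(dictionary)  # ['foo', 'bar', 'baz']
```

The indices and dictionary produced here match the inputs passed to ``DictionaryArray.from_arrays`` in the documentation example, which is why repeated strings cost only one small integer each.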

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/filesystems.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/filesystems.rst b/python/doc/source/filesystems.rst
index 9e00ddd..61c03c5 100644
--- a/python/doc/source/filesystems.rst
+++ b/python/doc/source/filesystems.rst
@@ -15,10 +15,12 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-File interfaces and Memory Maps
-===============================
+Filesystem Interfaces
+=====================
 
-PyArrow features a number of file-like interfaces
+In this section, we discuss filesystem-like interfaces in PyArrow.
+
+.. _hdfs:
 
 Hadoop File System (HDFS)
 -------------------------

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/index.rst b/python/doc/source/index.rst
index 55b4efc..4bfbe44 100644
--- a/python/doc/source/index.rst
+++ b/python/doc/source/index.rst
@@ -36,8 +36,11 @@ structures.
 
    install
    development
-   pandas
+   memory
+   data
+   ipc
    filesystems
+   pandas
    parquet
    api
    getting_involved

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/ipc.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/ipc.rst b/python/doc/source/ipc.rst
new file mode 100644
index 0000000..e63e745
--- /dev/null
+++ b/python/doc/source/ipc.rst
@@ -0,0 +1,136 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+
+.. _ipc:
+
+IPC: Fast Streaming and Serialization
+=====================================
+
+Arrow defines two types of binary formats for serializing record batches:
+
+* **Streaming format**: for sending an arbitrary length sequence of record
+  batches. The format must be processed from start to end, and does not support
+  random access
+
+* **File or Random Access format**: for serializing a fixed number of record
+  batches. Supports random access, and thus is very useful when used with
+  memory maps
+
+To follow this section, make sure to first read the section on :ref:`Memory and
+IO <io>`.
+
+Writing and Reading Streams
+---------------------------
+
+First, let's create a small record batch:
+
+.. ipython:: python
+
+   import pyarrow as pa
+
+   data = [
+       pa.array([1, 2, 3, 4]),
+       pa.array(['foo', 'bar', 'baz', None]),
+       pa.array([True, None, False, True])
+   ]
+
+   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
+   batch.num_rows
+   batch.num_columns
+
+Now, we can begin writing a stream containing some number of these batches. For
+this we use :class:`~pyarrow.StreamWriter`, which can write to a writeable
+``NativeFile`` object or a writeable Python object:
+
+.. ipython:: python
+
+   sink = pa.InMemoryOutputStream()
+   writer = pa.StreamWriter(sink, batch.schema)
+
+Here we used an in-memory Arrow buffer stream, but this could have been a
+socket or some other IO sink.
+
+When creating the ``StreamWriter``, we pass the schema, since the schema
+(column names and types) must be the same for all of the batches sent in this
+particular stream. Now we can do:
+
+.. ipython:: python
+
+   for i in range(5):
+      writer.write_batch(batch)
+   writer.close()
+
+   buf = sink.get_result()
+   buf.size
+
+Now ``buf`` contains the complete stream as an in-memory byte buffer. We can
+read such a stream with :class:`~pyarrow.StreamReader`:
+
+.. ipython:: python
+
+   reader = pa.StreamReader(buf)
+   reader.schema
+
+   batches = [b for b in reader]
+   len(batches)
+
+We can check that the returned batches are the same as the original input:
+
+.. ipython:: python
+
+   batches[0].equals(batch)
+
+An important point is that if the input source supports zero-copy reads
+(e.g. a memory map or ``pyarrow.BufferReader``), then the returned
+batches are also zero-copy and do not allocate any new memory on read.
+
+Writing and Reading Random Access Files
+---------------------------------------
+
+The :class:`~pyarrow.FileWriter` has the same API as
+:class:`~pyarrow.StreamWriter`:
+
+.. ipython:: python
+
+   sink = pa.InMemoryOutputStream()
+   writer = pa.FileWriter(sink, batch.schema)
+
+   for i in range(10):
+      writer.write_batch(batch)
+   writer.close()
+
+   buf = sink.get_result()
+   buf.size
+
+The difference between :class:`~pyarrow.FileReader` and
+:class:`~pyarrow.StreamReader` is that the input source must have a ``seek``
+method for random access. The stream reader only requires read operations:
+
+.. ipython:: python
+
+   reader = pa.FileReader(buf)
+
+Because we have access to the entire payload, we know the number of record
+batches in the file, and can read any at random:
+
+.. ipython:: python
+
+   reader.num_record_batches
+   b = reader.get_batch(3)
+   b.equals(batch)
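
To illustrate why the stream reader in ipc.rst only needs sequential reads while the file reader needs ``seek``, here is a toy sketch in pure Python. It is not the Arrow wire format: the length-prefixed framing, the footer layout, and the helper names (`write_stream`, `read_stream`, `write_file`, `read_message`) are all assumptions made up for illustration.

```python
import io
import struct


def write_stream(sink, messages):
    """Toy stream layout: 4-byte little-endian length, then the payload."""
    for payload in messages:
        sink.write(struct.pack('<I', len(payload)))
        sink.write(payload)


def read_stream(source):
    """Sequential read: only read() is needed, never seek()."""
    while True:
        header = source.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack('<I', header)
        yield source.read(length)


def write_file(sink, messages):
    """Toy file layout: payloads, a footer of (offset, length), footer size."""
    entries = []
    for payload in messages:
        entries.append((sink.tell(), len(payload)))
        sink.write(payload)
    footer = b''.join(struct.pack('<QI', off, n) for off, n in entries)
    sink.write(footer)
    sink.write(struct.pack('<I', len(footer)))


def read_message(source, i):
    """Random access: find entry i in the footer, then seek to its payload."""
    source.seek(-4, io.SEEK_END)
    (footer_size,) = struct.unpack('<I', source.read(4))
    source.seek(-4 - footer_size + 12 * i, io.SEEK_END)  # 12 bytes per entry
    offset, length = struct.unpack('<QI', source.read(12))
    source.seek(offset)
    return source.read(length)


messages = [b'batch-0', b'batch-1', b'batch-2']

stream = io.BytesIO()
write_stream(stream, messages)
decoded = list(read_stream(io.BytesIO(stream.getvalue())))

f = io.BytesIO()
write_file(f, messages)
third = read_message(f, 2)   # fetched without reading the first two
print(decoded, third)
```

The trade-off mirrors the one described above: the stream can be consumed from a socket as it arrives, whereas the footer-based layout needs the whole payload to be addressable but can jump straight to any message.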

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/jemalloc.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/jemalloc.rst b/python/doc/source/jemalloc.rst
index 8d7a5dc..9389dcb 100644
--- a/python/doc/source/jemalloc.rst
+++ b/python/doc/source/jemalloc.rst
@@ -18,7 +18,7 @@
 jemalloc MemoryPool
 ===================
 
-Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator
+Arrow's default :class:`~pyarrow.MemoryPool` uses the system's allocator
 through the POSIX APIs. Although this already provides aligned allocation, the
 POSIX interface doesn't support aligned reallocation. The default reallocation
 strategy is to allocate a new region, copy over the old data and free the
@@ -27,10 +27,9 @@ the existing memory allocation to the requested size. While this may still be
 linear in the size of allocated memory, it is magnitudes faster as only the page
 mapping in the kernel is touched, not the actual data.
 
-The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the
-use of the system allocator and/or other allocators like ``tcmalloc``. You can
-either explicitly make it the default allocator or pass it only to single
-operations.
+The jemalloc-based allocator is not enabled by default to allow the use of the
+system allocator and/or other allocators like ``tcmalloc``. You can either
+explicitly make it the default allocator or pass it only to single operations.
 
 .. code:: python
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/memory.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/memory.rst b/python/doc/source/memory.rst
new file mode 100644
index 0000000..d1020da
--- /dev/null
+++ b/python/doc/source/memory.rst
@@ -0,0 +1,235 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _io:
+
+Memory and IO Interfaces
+========================
+
+This section will introduce you to the major concepts in PyArrow's memory
+management and IO systems:
+
+* Buffers
+* File-like and stream-like objects
+* Memory pools
+
+pyarrow.Buffer
+--------------
+
+The :class:`~pyarrow.Buffer` object wraps the C++ ``arrow::Buffer`` type and is
+the primary tool for memory management in Apache Arrow in C++. It permits
+higher-level array classes to safely interact with memory which they may or may
+not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers to cheaply
+reference other Buffers, while preserving memory lifetime and clean
+parent-child relationships.
+
+There are many implementations of ``arrow::Buffer``, but they all provide a
+standard interface: a data pointer and length. This is similar to Python's
+built-in `buffer protocol` and ``memoryview`` objects.
+
+A :class:`~pyarrow.Buffer` can be created from any Python object which
+implements the buffer protocol. Let's consider a bytes object:
+
+.. ipython:: python
+
+   import pyarrow as pa
+
+   data = b'abcdefghijklmnopqrstuvwxyz'
+   buf = pa.frombuffer(data)
+   buf
+   buf.size
+
+Creating a Buffer in this way does not allocate any memory; it is a zero-copy
+view on the memory exported from the ``data`` bytes object.
+
+The Buffer's ``to_pybytes`` method can convert to a Python byte string:
+
+.. ipython:: python
+
+   buf.to_pybytes()
+
+Buffers can be used in circumstances where a Python buffer or memoryview is
+required, and such conversions are also zero-copy:
+
+.. ipython:: python
+
+   memoryview(buf)
+
+.. _io.native_file:
+
+Native Files
+------------
+
+The Arrow C++ libraries have several abstract interfaces for different kinds of
+IO objects:
+
+* Read-only streams
+* Read-only files supporting random access
+* Write-only streams
+* Write-only files supporting random access
+* Files supporting reads, writes, and random access
+
+In the interest of making these objects behave more like Python's built-in
+``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class
+which is intended to mimic Python files and can be used in functions where
+a Python file (such as ``file`` or ``BytesIO``) is expected.
+
+:class:`~pyarrow.NativeFile` has some important features which make it
+preferable to using Python files with PyArrow where possible:
+
+* Other Arrow classes can access the internal C++ IO objects natively, and do
+  not need to acquire the Python GIL
+* Native C++ IO may be able to do zero-copy IO, such as with memory maps
+
+There are several kinds of :class:`~pyarrow.NativeFile` options available:
+
+* :class:`~pyarrow.OSFile`, a native file that uses your operating system's
+  file descriptors
+* :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with
+  memory maps
+* :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects
+  as a file
+* :class:`~pyarrow.InMemoryOutputStream`, for writing data in-memory, producing
+  a Buffer at the end
+* :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem
+* :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++
+
+We will discuss these in the following sections after explaining memory pools.
+
+Memory Pools
+------------
+
+All memory allocations and deallocations (like ``malloc`` and ``free`` in C)
+are tracked in an instance of ``arrow::MemoryPool``. This means that we can
+then precisely track the amount of memory that has been allocated:
+
+.. ipython:: python
+
+   pa.total_allocated_bytes()
+
+PyArrow uses a default built-in memory pool, but in the future there may be
+additional memory pools (and subpools) to choose from. Let's consider an
+``InMemoryOutputStream``, which is like a ``BytesIO``:
+
+.. ipython:: python
+
+   stream = pa.InMemoryOutputStream()
+   stream.write(b'foo')
+   pa.total_allocated_bytes()
+   for i in range(1024): stream.write(b'foo')
+   pa.total_allocated_bytes()
+
+The default allocator requests memory in a minimum increment of 64 bytes. If
+the stream is garbage-collected, all of the memory is freed:
+
+.. ipython:: python
+
+   stream = None
+   pa.total_allocated_bytes()
+
+Classes and functions that may allocate memory will often have an option to
+pass in a custom memory pool:
+
+.. ipython:: python
+
+   my_pool = pa.jemalloc_memory_pool()
+   my_pool
+   my_pool.bytes_allocated()
+   stream = pa.InMemoryOutputStream(my_pool)
+   stream.write(b'foo')
+   my_pool.bytes_allocated()
+
+On-Disk and Memory Mapped Files
+-------------------------------
+
+PyArrow includes two ways to interact with data on disk: standard operating
+system-level file APIs, and memory-mapped files. In regular Python we can
+write:
+
+.. ipython:: python
+
+   with open('example.dat', 'wb') as f:
+       f.write(b'some example data')
+
+Using pyarrow's :class:`~pyarrow.OSFile` class, you can write:
+
+.. ipython:: python
+
+   with pa.OSFile('example2.dat', 'wb') as f:
+       f.write(b'some example data')
+
+For reading files, you can use ``OSFile`` or
+:class:`~pyarrow.MemoryMappedFile`. The difference between these is that
+:class:`~pyarrow.OSFile` allocates new memory on each read, like Python file
+objects. In reads from memory maps, the library constructs a buffer referencing
+the mapped memory without any memory allocation or copying:
+
+.. ipython:: python
+
+   file_obj = pa.OSFile('example.dat')
+   mmap = pa.memory_map('example.dat')
+   file_obj.read(4)
+   mmap.read(4)
+
+The ``read`` method implements the standard Python file ``read`` API. To read
+into Arrow Buffer objects, use ``read_buffer``:
+
+.. ipython:: python
+
+   mmap.seek(0)
+   buf = mmap.read_buffer(4)
+   print(buf)
+   buf.to_pybytes()
+
+Many tools in PyArrow, particularly the Apache Parquet interface and the file
+and stream messaging tools, are more efficient when used with these
+``NativeFile`` types than with normal Python file objects.
+
+.. ipython:: python
+   :suppress:
+
+   buf = mmap = file_obj = None
+   !rm example.dat
+   !rm example2.dat
+
+In-Memory Reading and Writing
+-----------------------------
+
+To assist with serialization and deserialization of in-memory data, we have
+file interfaces that can read and write to Arrow Buffers.
+
+.. ipython:: python
+
+   writer = pa.InMemoryOutputStream()
+   writer.write(b'hello, friends')
+
+   buf = writer.get_result()
+   buf
+   buf.size
+   reader = pa.BufferReader(buf)
+   reader.seek(7)
+   reader.read(7)
+
+These have similar semantics to Python's built-in ``io.BytesIO``.
+
+Hadoop Filesystem
+-----------------
+
+:class:`~pyarrow.HdfsFile` is an implementation of :class:`~pyarrow.NativeFile`
+that can read and write to the Hadoop filesystem. Read more in the
+:ref:`Filesystems Section <hdfs>`.
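
The memory.rst text above says that ``OSFile`` allocates new memory on each
read, that reads from memory maps reference the mapped memory without copying,
and that the in-memory interfaces behave much like ``io.BytesIO``. A rough
standard-library analogy follows. It uses Python's ``mmap``, ``memoryview`` and
``io.BytesIO`` rather than pyarrow, so it only illustrates the idea and is not
how pyarrow implements it; ``read_copy`` and ``read_view`` are hypothetical
helper names.

```python
import io
import mmap
import os
import tempfile


def read_copy(path, nbytes):
    """Read nbytes through a regular file object; the result is a new bytes object."""
    with open(path, 'rb') as f:
        return f.read(nbytes)


def read_view(mapped, nbytes):
    """Return a memoryview over the first nbytes of the mapping, without copying."""
    return memoryview(mapped)[:nbytes]


# Create a small file to read back (same payload as the docs example).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'some example data')

# Regular read: allocates a fresh bytes object.
copied = read_copy(path, 4)

# Memory-mapped read: the view references the mapped pages directly.
with open(path, 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
view = read_view(mapped, 4)
view_bytes = bytes(view)  # a copy is made only here, on request

# Release the view before closing the map, then clean up the file.
view.release()
mapped.close()
os.remove(path)

# In-memory stream: write, seek and read, as with io.BytesIO.
stream = io.BytesIO()
stream.write(b'hello, friends')
stream.seek(7)
tail = stream.read(7)
```

Here ``copied`` and ``view_bytes`` hold the same four bytes, but only the first
was allocated by the read itself; ``tail`` holds the last seven bytes of the
stream.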

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/pandas.rst b/python/doc/source/pandas.rst
index 34445ae..cb7a56d 100644
--- a/python/doc/source/pandas.rst
+++ b/python/doc/source/pandas.rst
@@ -15,17 +15,17 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-Pandas Interface
-================
+Using PyArrow with pandas
+=========================
 
-To interface with Pandas, PyArrow provides various conversion routines to
-consume Pandas structures and convert back to them.
+To interface with pandas, PyArrow provides various conversion routines to
+consume pandas structures and convert back to them.
 
 DataFrames
 ----------
 
-The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
-Both consist of a set of named columns of equal length. While Pandas only
+The equivalent to a pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
+Both consist of a set of named columns of equal length. While pandas only
 supports flat columns, the Table also provides nested columns, thus it can
 represent more data than a DataFrame, so a full conversion is not always possible.
 
@@ -33,9 +33,9 @@ Conversion from a Table to a DataFrame is done by calling
 :meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using
 :meth:`pyarrow.Table.from_pandas`. This conversion routine provides the
 convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of
-different resolutions, Pandas only supports nanosecond timestamps and most
+different resolutions, pandas only supports nanosecond timestamps and most
 other systems (e.g. Parquet) only work on millisecond timestamps. This parameter
-can be used to already do the time conversion during the Pandas to Arrow
+can be used to already do the time conversion during the pandas to Arrow
 conversion.
 
 .. code-block:: python
@@ -44,35 +44,35 @@ conversion.
     import pandas as pd
 
     df = pd.DataFrame({"a": [1, 2, 3]})
-    # Convert from Pandas to Arrow
+    # Convert from pandas to Arrow
     table = pa.Table.from_pandas(df)
-    # Convert back to Pandas
+    # Convert back to pandas
     df_new = table.to_pandas()
 
 
 Series
 ------
 
-In Arrow, the most similar structure to a Pandas Series is an Array.
+In Arrow, the most similar structure to a pandas Series is an Array.
 It is a vector that contains data of the same type as linear memory. You can
-convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
+convert a pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
 As Arrow Arrays are always nullable, you can supply an optional mask using
 the ``mask`` parameter to mark all null-entries.
 
 Type differences
 ----------------
 
-With the current design of Pandas and Arrow, it is not possible to convert all
-column types unmodified. One of the main issues here is that Pandas has no
+With the current design of pandas and Arrow, it is not possible to convert all
+column types unmodified. One of the main issues here is that pandas has no
 support for nullable columns of arbitrary type. Also ``datetime64`` is currently
 fixed to nanosecond resolution. On the other side, Arrow might be still missing
 support for some types.
 
-Pandas -> Arrow Conversion
+pandas -> Arrow Conversion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 +------------------------+--------------------------+
-| Source Type (Pandas)   | Destination Type (Arrow) |
+| Source Type (pandas)   | Destination Type (Arrow) |
 +========================+==========================+
 | ``bool``               | ``BOOL``                 |
 +------------------------+--------------------------+
@@ -91,11 +91,11 @@ Pandas -> Arrow Conversion
 | ``datetime.date``      | ``DATE``                 |
 +------------------------+--------------------------+
 
-Arrow -> Pandas Conversion
+Arrow -> pandas Conversion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 +-------------------------------------+--------------------------------------------------------+
-| Source Type (Arrow)                 | Destination Type (Pandas)                              |
+| Source Type (Arrow)                 | Destination Type (pandas)                              |
 +=====================================+========================================================+
 | ``BOOL``                            | ``bool``                                               |
 +-------------------------------------+--------------------------------------------------------+
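
The pandas.rst text above notes that pandas works with nanosecond timestamps
while systems such as Parquet work with millisecond timestamps. The unit change
itself is plain integer arithmetic, sketched below in pure Python. This is an
illustration of the arithmetic only, not pyarrow's implementation of
``timestamps_to_ms``; ``ns_to_ms`` and ``NS_PER_MS`` are hypothetical names.

```python
NS_PER_MS = 1_000_000  # nanoseconds in one millisecond


def ns_to_ms(ns_values):
    """Convert epoch nanoseconds to epoch milliseconds using floor division.

    Sub-millisecond precision is discarded.
    """
    return [ns // NS_PER_MS for ns in ns_values]


# One timestamp with sub-millisecond digits, plus the epoch itself.
ns_values = [1_494_218_982_123_456_789, 0]
ms_values = ns_to_ms(ns_values)
```

The trailing ``456_789`` nanoseconds of the first value are dropped, which is
the precision loss to expect from any nanosecond-to-millisecond conversion.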

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/doc/source/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index 8e011e4..3317b99 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -15,77 +15,218 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-Reading/Writing Parquet files
-=============================
+.. currentmodule:: pyarrow
+.. _parquet:
 
-If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was
-found during the build, you can read files in the Parquet format to/from Arrow
-memory structures. The Parquet support code is located in the
-:mod:`pyarrow.parquet` module and your package needs to be built with the
-``--with-parquet`` flag for ``build_ext``.
+Reading and Writing the Apache Parquet Format
+=============================================
 
-Reading Parquet
----------------
+The `Apache Parquet <http://parquet.apache.org/>`_ project provides a
+standardized open-source columnar storage format for use in data analysis
+systems. It was created originally for use in `Apache Hadoop
+<http://hadoop.apache.org/>`_ with systems like `Apache Drill
+<http://drill.apache.org>`_, `Apache Hive <http://hive.apache.org>`_, `Apache
+Impala (incubating) <http://impala.apache.org>`_, and `Apache Spark
+<http://spark.apache.org>`_ adopting it as a shared standard for high
+performance data IO.
 
-To read a Parquet file into Arrow memory, you can use the following code
-snippet. It will read the whole Parquet file into memory as an
-:class:`~pyarrow.table.Table`.
+Apache Arrow is an ideal in-memory transport layer for data that is being read
+or written with Parquet files. We have been concurrently developing the `C++
+implementation of Apache Parquet <http://github.com/apache/parquet-cpp>`_,
+which includes a native, multithreaded C++ adapter to and from in-memory Arrow
+data. PyArrow includes Python bindings to this code, which thus enables reading
+and writing Parquet files with pandas as well.
 
-.. code-block:: python
+Obtaining PyArrow with Parquet Support
+--------------------------------------
+
+If you installed ``pyarrow`` with pip or conda, it should be built with Parquet
+support bundled:
+
+.. ipython:: python
+
+   import pyarrow.parquet as pq
+
+If you are building ``pyarrow`` from source, you must also build `parquet-cpp
+<http://github.com/apache/parquet-cpp>`_ and enable the Parquet extensions when
+building ``pyarrow``. See the :ref:`Development <development>` page for more
+details.
+
+Reading and Writing Single Files
+--------------------------------
+
+The functions :func:`~.parquet.read_table` and :func:`~.parquet.write_table`
+read and write :ref:`pyarrow.Table <data.table>` objects, respectively.
+
+Let's look at a simple table:
+
+.. ipython:: python
+
+   import numpy as np
+   import pandas as pd
+   import pyarrow as pa
+
+   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
+                      'two': ['foo', 'bar', 'baz'],
+                      'three': [True, False, True]})
+   table = pa.Table.from_pandas(df)
+
+We write this to Parquet format with ``write_table``:
+
+.. ipython:: python
+
+   import pyarrow.parquet as pq
+   pq.write_table(table, 'example.parquet')
+
+This creates a single Parquet file. In practice, a Parquet dataset may consist
+of many files in many directories. We can read a single file back with
+``read_table``:
+
+.. ipython:: python
+
+   table2 = pq.read_table('example.parquet')
+   table2.to_pandas()
+
+You can pass a subset of columns to read, which can be much faster than reading
+the whole file (due to the columnar layout):
+
+.. ipython:: python
+
+   pq.read_table('example.parquet', columns=['one', 'three'])
+
+We need not use a string to specify the origin of the file. It can be any of:
+
+* A file path as a string
+* A :ref:`NativeFile <io.native_file>` from PyArrow
+* A Python file object
+
+In general, a Python file object will have the worst read performance, while a
+string file path or an instance of :class:`~.NativeFile` (especially memory
+maps) will perform the best.
 
-    import pyarrow.parquet as pq
+Finer-grained Reading and Writing
+---------------------------------
 
-    table = pq.read_table('<filename>')
+``read_table`` uses the :class:`~.ParquetFile` class, which has other features:
 
-As DataFrames stored as Parquet are often stored in multiple files, a
-convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided.
+.. ipython:: python
 
-If you already have the Parquet available in memory or get it via non-file
-source, you can utilize :class:`pyarrow.io.BufferReader` to read it from
-memory. As input to the :class:`~pyarrow.io.BufferReader` you can either supply
-a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`.
+   parquet_file = pq.ParquetFile('example.parquet')
+   parquet_file.metadata
+   parquet_file.schema
 
-.. code:: python
+As described in the `Apache Parquet format
+<https://github.com/apache/parquet-format>`_ specification, a Parquet file
+consists of multiple row groups. ``read_table`` will read all of the row groups
+and concatenate them into a single table. You can read individual row groups
+with ``read_row_group``:
 
-    import pyarrow.io as paio
-    import pyarrow.parquet as pq
+.. ipython:: python
 
-    buf = ... # either bytes or paio.Buffer
-    reader = paio.BufferReader(buf)
-    table = pq.read_table(reader)
+   parquet_file.num_row_groups
+   parquet_file.read_row_group(0)
 
-Writing Parquet
----------------
+We can similarly write a Parquet file with multiple row groups by using
+``ParquetWriter``:
 
-Given an instance of :class:`pyarrow.table.Table`, the most simple way to
-persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table`
-method.
+.. ipython:: python
+
+   writer = pq.ParquetWriter('example2.parquet', table.schema)
+   for i in range(3):
+       writer.write_table(table)
+   writer.close()
+
+   pf2 = pq.ParquetFile('example2.parquet')
+   pf2.num_row_groups
+
+.. ipython:: python
+   :suppress:
+
+   !rm example.parquet
+   !rm example2.parquet
+
+Compression, Encoding, and File Compatibility
+---------------------------------------------
+
+The most commonly used Parquet implementations use dictionary encoding when
+writing files; if the dictionaries grow too large, then they "fall back" to
+plain encoding. Whether dictionary encoding is used can be toggled using the
+``use_dictionary`` option:
 
 .. code-block:: python
 
-    import pyarrow as pa
-    import pyarrow.parquet as pq
+   pq.write_table(table, where, use_dictionary=False)
 
-    table = pa.Table(..)
-    pq.write_table(table, '<filename>')
+The data pages within a column in a row group can be compressed after the
+encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy
+compression by default, but Brotli, Gzip, and uncompressed are also supported:
 
-By default this will write the Table as a single RowGroup using ``DICTIONARY``
-encoding. To increase the potential of parallelism a query engine can process
-a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows.
+.. code-block:: python
+
+   pq.write_table(table, where, compression='snappy')
+   pq.write_table(table, where, compression='gzip')
+   pq.write_table(table, where, compression='brotli')
+   pq.write_table(table, where, compression='none')
+
+Snappy generally results in better performance, while Gzip may yield smaller
+files.
+
+These settings can also be set on a per-column basis:
+
+.. code-block:: python
 
-If you also want to compress the columns, you can select a compression
-method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
-you want to minimize size and ``SNAPPY`` for performance.
+   pq.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'},
+                  use_dictionary=['foo', 'bar'])
 
-Instead of writing to a file, you can also write to Python ``bytes`` by
-utilizing an :class:`pyarrow.io.InMemoryOutputStream()`:
+Reading Multiple Files and Partitioned Datasets
+-----------------------------------------------
 
-.. code:: python
+Multiple Parquet files constitute a Parquet *dataset*. These may be presented
+in a number of ways:
 
-    import pyarrow.io as paio
-    import pyarrow.parquet as pq
+* A list of Parquet absolute file paths
+* A directory name containing nested directories defining a partitioned dataset
+
+A dataset partitioned by year and month may look like this on disk:
+
+.. code-block:: text
+
+   dataset_name/
+     year=2007/
+       month=01/
+          0.parq
+          1.parq
+          ...
+       month=02/
+          0.parq
+          1.parq
+          ...
+       month=03/
+       ...
+     year=2008/
+       month=01/
+       ...
+     ...
+
+The :class:`~.ParquetDataset` class accepts either a directory name or a list
+of file paths, and can discover and infer some common partition structures,
+such as those produced by Hive:
+
+.. code-block:: python
+
+   dataset = pq.ParquetDataset('dataset_name/')
+   table = dataset.read()
+
+Multithreaded Reads
+-------------------
+
+Each of the reading functions has an ``nthreads`` argument which will read
+columns with the indicated level of parallelism. Depending on the speed of IO
+and how expensive it is to decode the columns in a particular file
+(particularly with GZIP compression), this can yield significantly higher data
+throughput:
+
+.. code-block:: python
 
-    table = ...
-    output = paio.InMemoryOutputStream()
-    pq.write_table(table, output)
-    pybytes = output.get_result().to_pybytes()
+   pq.read_table(where, nthreads=4)
+   pq.ParquetDataset(where).read(nthreads=4)
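
The parquet.rst text above shows a dataset laid out in ``key=value``
directories such as ``year=2007/month=01``. The sketch below shows how
partition keys could be recovered from one such path using only the standard
library. It is an illustration of the directory convention, not the logic
``ParquetDataset`` uses; ``parse_partition_keys`` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def parse_partition_keys(path, root):
    """Return (key, value) pairs from the key=value directories under root."""
    relative = PurePosixPath(path).relative_to(root)
    keys = []
    for part in relative.parts[:-1]:  # the last part is the file name
        key, sep, value = part.partition('=')
        if not sep:
            raise ValueError('not a key=value directory: %r' % part)
        keys.append((key, value))
    return keys


keys = parse_partition_keys('dataset_name/year=2007/month=01/0.parq',
                            'dataset_name')
```

The values come back as strings (``'01'``, not ``1``); any conversion to
numeric types would be a separate step.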

http://git-wip-us.apache.org/repos/asf/arrow/blob/cb5e7b6f/python/pyarrow/_io.pyx
----------------------------------------------------------------------
diff --git a/python/pyarrow/_io.pyx b/python/pyarrow/_io.pyx
index 40c76f8..e9e2ba0 100644
--- a/python/pyarrow/_io.pyx
+++ b/python/pyarrow/_io.pyx
@@ -522,6 +522,7 @@ cdef class Buffer:
         buffer.strides = self.strides
         buffer.suboffsets = NULL
 
+
 cdef shared_ptr[PoolBuffer] allocate_buffer(CMemoryPool* pool):
     cdef shared_ptr[PoolBuffer] result
     result.reset(new PoolBuffer(pool))

