quickstep-dev mailing list archives

From jianqiao <...@git.apache.org>
Subject [GitHub] incubator-quickstep pull request #19: Improve text scan operator
Date Thu, 09 Jun 2016 08:44:29 GMT
GitHub user jianqiao opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/19

    Improve text scan operator

    This PR updates the `TextScanOperator` to improve its performance.
    
    There are three main changes:
    (1) Pass `text_offset` and `text_segment_size` as parameters to each `TextScanWorkOrder`
instead of loading the data up front. Each `TextScanWorkOrder` then reads its own
piece of data directly from disk.
    (2) Avoid extra string copying by passing `const char **` buffer pointers into `parseRow()`
and `extractFieldString()`.
    (3) Use `ColumnVectorsValueAccessor` as the temporary container to store the parsed tuples.
Then call `output_destination_->bulkInsertTuples()` to bulk insert the tuples.
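
    The first two changes can be illustrated with a minimal sketch. It is not the code in this
PR: `PlanSegments`, `FieldView` and `NextField` are hypothetical names invented for illustration.
The first helper shows how a file can be described as (`text_offset`, `text_segment_size`) pairs
without reading any data, and the second shows how a `const char **` cursor lets a parser walk
a buffer without copying field contents.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: describe a file of `file_size` bytes as a list of
// (text_offset, text_segment_size) pairs, one pair per work order.
// No file data is read here; each work order would read its own range later.
std::vector<std::pair<std::size_t, std::size_t>> PlanSegments(
    std::size_t file_size, std::size_t segment_size) {
  std::vector<std::pair<std::size_t, std::size_t>> segments;
  if (segment_size == 0) {
    return segments;
  }
  for (std::size_t offset = 0; offset < file_size; offset += segment_size) {
    segments.emplace_back(offset, std::min(segment_size, file_size - offset));
  }
  return segments;
}

// Hypothetical non-owning view of one field inside the text buffer.
struct FieldView {
  const char *data;
  std::size_t length;
  bool end_of_row;  // True if the field was terminated by '\n' or buffer end.
};

// Hypothetical helper: scan one field starting at *cursor and advance *cursor
// past the terminating delimiter or newline. Nothing is copied; the returned
// view points into the original buffer.
FieldView NextField(const char **cursor, const char *end, char delimiter) {
  const char *begin = *cursor;
  const char *p = begin;
  while (p != end && *p != delimiter && *p != '\n') {
    ++p;
  }
  FieldView view{begin, static_cast<std::size_t>(p - begin),
                 p == end || *p == '\n'};
  *cursor = (p == end) ? p : p + 1;
  return view;
}
```

    The sketch ignores two things a real work order has to deal with: a segment boundary
can fall in the middle of a row, and escape sequences inside a field must not be mistaken for
delimiters.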
    
    **Note 1:** This updated version follows the semantics of the old `TextScanOperator`, except
that it does not support backslash + newline escaping, e.g.
    (a)
    ```
    aaaa\
    bbbb
    ```
    which is semantically equivalent to
    (b)
    ```
    aaaa\nbbbb
    ```
    The updated version supports (b) but not (a), because (a) requires extra logic that complicates
the code. Format (a) also seems to be specific to PostgreSQL, and the [documentation](http://www.postgresql.org/docs/9.6/static/sql-copy.html)
of PostgreSQL 9.6 says:
    _It is strongly recommended that applications generating COPY data convert data newlines
and carriage returns to the \n and \r sequences respectively. At present it is possible to
represent a data carriage return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might not be accepted in
future releases. They are also highly vulnerable to corruption if the COPY file is transferred
across different machines (for example, from Unix to Windows or vice versa)._
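
    As a rough illustration of format (b), here is a minimal sketch of decoding escape sequences
in a single field. `UnescapeField` is a hypothetical name and not a function from this PR. The
set of sequences handled (`\n`, `\r`, `\t`, `\\`) and the fallback for unknown escapes are
assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode backslash escape sequences within one field.
// "\n", "\r" and "\t" become the corresponding control characters.
// Assumption: for any other escaped character the backslash is dropped and
// the character is kept as-is (so "\\\\" in source yields one backslash).
// A trailing lone backslash is kept unchanged.
std::string UnescapeField(const std::string &raw) {
  std::string out;
  out.reserve(raw.size());
  for (std::size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] != '\\' || i + 1 == raw.size()) {
      out.push_back(raw[i]);
      continue;
    }
    const char next = raw[++i];
    switch (next) {
      case 'n': out.push_back('\n'); break;
      case 'r': out.push_back('\r'); break;
      case 't': out.push_back('\t'); break;
      default:  out.push_back(next); break;
    }
  }
  return out;
}
```

    Because rows are split on literal newlines before fields are decoded, a helper like this
never sees the backslash + literal newline of format (a) as a single unit, which is one way to
see why supporting (a) would need extra logic in the row-splitting step.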
    
    **Note 2:** This PR relies on the fix from #18 to work correctly for loading `TYPE compressed_columnstore`
tables.
    
    **Note 3:** Using 40 workers, the expected loading times on CloudLab machines with the current
SQL-benchmark settings are ~465s for SSB SF100 and ~1050s for TPC-H SF100.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep improve-text-scan-operator-column-vectors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19
    
----
commit 55b06fab1bd336f2cc7ee4bd557d3328a428e4ab
Author: Jianqiao Zhu <jianqiao@cs.wisc.edu>
Date:   2016-06-09T08:18:37Z

    Improve text scan operator

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
