beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-840) Add Java SDK extension to support non-distributed sorting
Date Wed, 09 Nov 2016 22:27:58 GMT

    [ https://issues.apache.org/jira/browse/BEAM-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652236#comment-15652236 ]

ASF GitHub Bot commented on BEAM-840:
-------------------------------------

GitHub user mizitch opened a pull request:

    https://github.com/apache/incubator-beam/pull/1327

    [BEAM-840] Some minor changes and fixes for sorter module. 

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:
    
     - [x] Make sure the PR title is formatted like:
       `[BEAM-<Jira issue #>] Description of pull request`
     - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [x] Replace `<Jira issue #>` in the title with the actual Jira issue
           number, if there is one.
     - [x] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).
    
    ---
    Includes:
    * Limit max memory for ExternalSorter and BufferedExternalSorter to 2047 MB to prevent
      int overflow within Hadoop's sorting library
    * Fix int overflow for large memory values in InMemorySorter
    * Add note about estimated disk use to README.MD
    * Make Hadoop's sorting library put all temp files under the specified directory
    * Have Hadoop clean up the temp directory on exit
    * Stop shading Hadoop dependencies. Some context:
    ** The existing shading is broken (modules that depend on this one cannot use it successfully).
    ** Hadoop's use of reflection in several places makes shading the dependency "in a
       good way" nearly impossible. It requires a couple of rather brittle hacks, and for clients
       that depend on certain conflicting versions of Hadoop, these hacks can mean the shading
       doesn't meet its intended goal of preventing conflicts anyway.
    ** From what I can tell, there's no good way to shade this to make it universally usable,
       so leaving it unshaded seems like a reasonable default.
    ** Without shading Hadoop, this module can be used successfully from Beam's wordcount
       example (which already has Hadoop dependencies of its own).
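
The two int-overflow items in the list above presumably come down to the same arithmetic: a memory size given in megabytes is converted to bytes using 32-bit multiplication. The following is a minimal sketch of that effect, not the code from this pull request. The class and method names (`SorterMemoryLimit`, `toBytesUnsafe`, `toBytes`, `checkedMemoryMb`) are hypothetical; only the 2047 MB cap is taken from the description.

```java
final class SorterMemoryLimit {
    // Cap taken from the 2047 MB figure in the description; the constant name is hypothetical.
    static final int MAX_MEMORY_MB = 2047;

    /** Naive conversion: the multiplication happens in 32-bit int arithmetic and can wrap. */
    static int toBytesUnsafe(int memoryMb) {
        return memoryMb * 1024 * 1024;
    }

    /** Widening to long before multiplying avoids the wraparound. */
    static long toBytes(int memoryMb) {
        return memoryMb * 1024L * 1024L;
    }

    /** Rejects values whose byte count would not fit in an int. */
    static int checkedMemoryMb(int memoryMb) {
        if (memoryMb <= 0 || memoryMb > MAX_MEMORY_MB) {
            throw new IllegalArgumentException(
                "memoryMb must be between 1 and " + MAX_MEMORY_MB + ", got " + memoryMb);
        }
        return memoryMb;
    }

    public static void main(String[] args) {
        System.out.println(toBytesUnsafe(2047)); // 2146435072, still below Integer.MAX_VALUE
        System.out.println(toBytesUnsafe(2048)); // -2147483648, wrapped around
        System.out.println(toBytes(2048));       // 2147483648, correct as a long
    }
}
```

2047 MB is the largest whole-megabyte value whose byte count (2,146,435,072) still fits in a signed 32-bit int, which is consistent with the limit chosen above. Where the library underneath only accepts an int, validating the input is the remaining option; where the code is under the author's control, widening to `long` removes the problem.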

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mizitch/incubator-beam sorter-gcs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1327
    
----
commit d07c4ce9349abac4d0c53223072f1c84a1dc98c6
Author: Mitch Shanklin <mshanklin@google.com>
Date:   2016-11-09T22:09:49Z

    Some minor changes and fixes for sorter module.

----


> Add Java SDK extension to support non-distributed sorting
> ---------------------------------------------------------
>
>                 Key: BEAM-840
>                 URL: https://issues.apache.org/jira/browse/BEAM-840
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>    Affects Versions: 0.4.0-incubating
>            Reporter: Mitch Shanklin
>            Assignee: Mitch Shanklin
>            Priority: Minor
>
> Add an extension that provides a PTransform which performs local (non-distributed) sorting.
> It will sort in memory until the buffer is full, then flush to disk and use external sorting.
>
> Consumes a PCollection of KVs from primary key to an iterable of secondary-key/value
> KVs and sorts the iterables. Would probably be applied after a GroupByKey. Uses coders to
> convert secondary keys and values into byte arrays and does a lexicographical comparison on
> the encoded secondary keys.
> Uses Hadoop as an external sorting library.
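
To illustrate the ordering described in the issue summary, here is a minimal sketch of a lexicographical comparison on encoded keys. It is not the extension's implementation. The class `LexicographicByteOrder` and its members are hypothetical, UTF-8 string encoding stands in for a Beam coder, and treating each byte as unsigned is an assumption (the summary only says "lexicographical").

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LexicographicByteOrder {
    /** Compares byte arrays position by position, treating each byte as unsigned (0..255). */
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared positions are equal, so the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    };

    /** Encodes each key (UTF-8 here, standing in for a coder), sorts the bytes, decodes again. */
    static List<String> sortByEncodedKey(List<String> keys) {
        List<byte[]> encoded = new ArrayList<>();
        for (String key : keys) {
            encoded.add(key.getBytes(StandardCharsets.UTF_8));
        }
        encoded.sort(UNSIGNED_LEXICOGRAPHIC);
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : encoded) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Prints [10, 9, a, ab, b]
        System.out.println(sortByEncodedKey(Arrays.asList("b", "ab", "a", "10", "9")));
    }
}
```

Note that "10" sorts before "9": the order follows the encoded bytes, not the logical value of the key, so the resulting order depends on how the chosen coder lays out its bytes.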



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)