incubator-hama-dev mailing list archives

From "Thomas Jungblut (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-540) Create distributed sort BSP
Date Fri, 13 Apr 2012 11:47:17 GMT

    [ https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253302#comment-13253302 ]

Thomas Jungblut commented on HAMA-540:
--------------------------------------

Here's my first prototype:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/SamplingSort.java

I am really astonished that this works.

You can see that the pivoting is a bit naive, because the distribution is far from even (see
the "->" mapping lines in the log).

{noformat}
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total input paths to process : 1
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total # of splits: 7
12/04/13 13:39:19 WARN bsp.BSPJobClient: No job jar file set.  User classes may not be found. See BSPJob#setJar(String) or check Your jar file.
12/04/13 13:39:20 INFO bsp.BSPJobClient: Running job: job_localrunner_0001
12/04/13 13:39:22 INFO bsp.LocalBSPRunner: Setting up a new barrier for 7 tasks!
local:6 -> 176
local:2 -> 133
local:5 -> 189
local:0 -> 113
local:3 -> 92
local:4 -> 29
local:1 -> 78
12/04/13 13:39:23 INFO bsp.BSPJobClient: Current supersteps number: 1
12/04/13 13:39:23 INFO bsp.BSPJobClient: The total number of supersteps: 1
from file:/tmp/hama-sampling-out/part-00000
-2145373038 -2135777393 -2127418941 -2127349118 -2116694526 -2112753401 -2111019858 -2109843938
-2109467658 
-1775154178 -1771096268 -1768609402 -1767599475 -1753155542 -1744884630 -1736545907 -1734220768
-1727656934 
-1727161209 -1724429198 -1712603905 -1711206669 -1693536736 
from file:/tmp/hama-sampling-out/part-00001
-1684778946 -1683715271 -1677988183 -1673772158 -1672941153 -1669199897 -1661791404 -1660526886
-1658572801 
-1579967204 -1577470192 -1569276585
<rest omitted>
{noformat}
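
To illustrate the pivoting step behind the uneven "->" numbers in the log above (assuming they are per-peer counts), here is a minimal sketch of how splitters taken from a sample map values to partitions. It is not the code from the linked SamplingSort.java; the names SplitterSketch, chooseSplitters and partitionOf are made up for illustration, and it assumes plain int keys and a non-empty sample.

```java
import java.util.Arrays;
import java.util.Random;

final class SplitterSketch {

  // Pick numPartitions - 1 evenly spaced splitters from a sorted copy of the sample.
  // Assumes the sample is non-empty.
  static int[] chooseSplitters(int[] sample, int numPartitions) {
    int[] sorted = sample.clone();
    Arrays.sort(sorted);
    int[] splitters = new int[numPartitions - 1];
    for (int i = 0; i < splitters.length; i++) {
      splitters[i] = sorted[(i + 1) * sorted.length / numPartitions];
    }
    return splitters;
  }

  // Partition 0 gets values below splitters[0]; partition i gets values
  // in [splitters[i - 1], splitters[i]); the last partition gets the rest.
  static int partitionOf(int value, int[] splitters) {
    int idx = Arrays.binarySearch(splitters, value);
    return idx >= 0 ? idx + 1 : -(idx + 1);
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    int[] data = new int[1000];
    for (int i = 0; i < data.length; i++) {
      data[i] = random.nextInt();
    }
    // The data is random, so a short prefix serves as the sample here.
    int[] splitters = chooseSplitters(Arrays.copyOf(data, 20), 7);
    int[] counts = new int[7];
    for (int value : data) {
      counts[partitionOf(value, splitters)]++;
    }
    // With a small sample the counts per partition can differ noticeably.
    System.out.println(Arrays.toString(counts));
  }
}
```

A larger sample usually gives more even partitions, at the cost of more data exchanged before the splitters are known.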

However, I very much doubt that the algorithm is faster than MapReduce. I think we can use
the Quicksort class in Hadoop to optimize further. I used Java 7's new Timsort via Arrays.sort()
because it sorts in place, but to get there I have a huge collections overhead and RAM usage.
Still, the idea of the algorithm is very cool.
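
One possible way to cut the collections overhead, sketched under the assumption that the keys are plain ints, is to unbox the received values into an int[] once and sort that array. Note that in Java 7 Arrays.sort(int[]) uses a dual-pivot quicksort, while Timsort is used for object arrays. PrimitiveSortSketch and toSortedArray are hypothetical names, not part of the prototype or of Hama.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PrimitiveSortSketch {

  // Hypothetical helper: copy boxed keys into a primitive array and sort it.
  static int[] toSortedArray(List<Integer> received) {
    int[] keys = new int[received.size()];
    int i = 0;
    for (Integer key : received) {
      keys[i++] = key; // auto-unboxing, one int per key
    }
    Arrays.sort(keys); // sorts the array itself, no second copy of the keys
    return keys;
  }

  public static void main(String[] args) {
    // Arbitrary example values.
    List<Integer> received = new ArrayList<>(Arrays.asList(42, -7, 19, 0, 3));
    System.out.println(Arrays.toString(toSortedArray(received)));
  }
}
```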
                
> Create distributed sort BSP
> ---------------------------
>
>                 Key: HAMA-540
>                 URL: https://issues.apache.org/jira/browse/HAMA-540
>             Project: Hama
>          Issue Type: New Feature
>          Components: bsp, examples
>            Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks this could be as well practical.

