Return-Path:
X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3ED2A9E07 for ; Thu, 20 Oct 2011 01:46:33 +0000 (UTC)
Received: (qmail 22225 invoked by uid 500); 20 Oct 2011 01:46:33 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 22186 invoked by uid 500); 20 Oct 2011 01:46:32 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 22178 invoked by uid 99); 20 Oct 2011 01:46:32 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Oct 2011 01:46:32 +0000
X-ASF-Spam-Status: No, hits=-2000.5 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Oct 2011 01:46:31 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id B5844312CCD for ; Thu, 20 Oct 2011 01:46:11 +0000 (UTC)
Date: Thu, 20 Oct 2011 01:46:11 +0000 (UTC)
From: "Todd Lipcon (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1540890187.13704.1319075171744.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-1639) Grouping using hashing instead of sorting
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131259#comment-13131259 ]

Todd Lipcon commented on MAPREDUCE-1639:
----------------------------------------

The JNI cost makes sense, but the linked HBase JIRA doesn't use JNI. It uses sun.misc.Unsafe calls, which are actually JVM intrinsics (i.e. they get compiled directly into assembly, rather than going through the whole calling-convention + safepoint shenanigans that JNI does).

> Grouping using hashing instead of sorting
> -----------------------------------------
>
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
>
> Most applications of map-reduce care about grouping and not sorting. Sorting is a (relatively expensive) way to achieve grouping. In order to achieve just grouping, one can:
> - replace the sort on the Mappers with a HashTable, and maintain lists of key-values against each hash bucket.
> - sort the key-value tuples inside each hash bucket before spilling or sending to the Reducer. Anytime this is done, the Combiner can be invoked.
> - serialize the HashTable by hash-bucket id, so that merges (of either spills or Map Outputs) work similarly to today (at least there's no change in the overall compute complexity of the merge).
> Of course this hashtable has nothing to do with partitioning. It's just a replacement for the map-side sort.
> --
> This is (pretty much) straight from the MARS project paper: http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf. They report a 45% speedup in inverted index calculation using hashing instead of sorting (the reference implementation is NOT against Hadoop though).

--
This message is automatically generated by JIRA.
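
The grouping scheme in the quoted issue description can be sketched roughly as follows. This is a minimal in-memory illustration, not Hadoop code and not the MARS implementation: the class `HashGroupingSketch`, the methods `bucketOf` and `groupByHash`, the `numBuckets` parameter, and the use of String keys and values are all assumptions made for the example. It puts each record into a hash bucket, sorts only within each bucket, and emits groups in (bucket id, key) order, which is the ordering a later merge of spills would presumably rely on.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not Hadoop code and not the MARS implementation. */
final class HashGroupingSketch {

    /** Maps a key to a bucket id in [0, numBuckets); the mask clears the sign bit. */
    static int bucketOf(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    /**
     * Groups values by key without a global sort. The result is ordered by
     * (bucket id, key), so outputs built with the same numBuckets share one order.
     * Keys are assumed to be non-null.
     */
    static List<Map.Entry<String, List<String>>> groupByHash(
            List<Map.Entry<String, String>> records, int numBuckets) {
        // Step 1: append each record to the list kept for its hash bucket.
        List<List<Map.Entry<String, String>>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> record : records) {
            buckets.get(bucketOf(record.getKey(), numBuckets)).add(record);
        }

        // Step 2: walk buckets in id order and sort only inside each bucket.
        List<Map.Entry<String, List<String>>> grouped = new ArrayList<>();
        for (List<Map.Entry<String, String>> bucket : buckets) {
            bucket.sort(Map.Entry.comparingByKey()); // stable: input order of values is kept
            String currentKey = null;
            List<String> currentValues = null;
            for (Map.Entry<String, String> record : bucket) {
                if (!record.getKey().equals(currentKey)) {
                    // A new key starts a new group; a combiner could run per finished group.
                    currentKey = record.getKey();
                    currentValues = new ArrayList<>();
                    grouped.add(new AbstractMap.SimpleEntry<>(currentKey, currentValues));
                }
                currentValues.add(record.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("apple", "1"));
        records.add(new AbstractMap.SimpleEntry<>("banana", "2"));
        records.add(new AbstractMap.SimpleEntry<>("apple", "3"));
        records.add(new AbstractMap.SimpleEntry<>("cherry", "4"));
        records.add(new AbstractMap.SimpleEntry<>("banana", "5"));

        for (Map.Entry<String, List<String>> group : groupByHash(records, 4)) {
            System.out.println("bucket " + bucketOf(group.getKey(), 4)
                    + ": " + group.getKey() + " -> " + group.getValue());
        }
    }
}
```

The sort cost here is paid per bucket rather than over the whole buffer, which is where any saving over a full map-side sort would come from. Whether that holds in practice depends on key distribution and bucket count, which this sketch does not measure.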