Date: Thu, 12 Jul 2012 18:13:34 +0000 (UTC)
From: "Eli Reisman (JIRA)"
To: giraph-dev@incubator.apache.org
Reply-To: dev@giraph.apache.org
Message-ID: <855955299.43055.1342116814796.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <2109313802.38096.1342041635191.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (GIRAPH-247) Introduce edge based partitioning for InputSplits

[ https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-247:
-------------------------------
    Attachment: GIRAPH-247-3.patch

Just a quick change to the patch to make it apply more cleanly with GIRAPH-246, if you want to try them both (I recommend it) :)

> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
>                 Key: GIRAPH-247
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-247-1.patch, GIRAPH-247-2.patch, GIRAPH-247-3.patch
>
>
> Experiments on larger input data sets while maintaining a low memory profile have revealed that typical social graph data is very lumpy, and partitioning by vertices can easily overload some unlucky worker nodes that end up with partitions containing highly-connected vertices, while other nodes process partitions with the same number of vertices but far fewer out-edges per vertex. This often results in cascading failures during data load-in, even on tiny data sets.
> By partitioning using edges (the default I set in GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 per partition, or the old default # of vertices, whichever the user's input format reaches first when reading InputSplits), I have seen dramatic "de-lumpification" of data, allowing the processing of 8x larger data sets before memory problems occur at a given configuration setting.
> This needs more tuning, but comes with a -Dgiraph.maxEdgesPerPartition option that can be set to more edges/partition as your data sets grow or memory limitations shrink. This might be considered a first attempt; perhaps simply allowing us to default to either this type of partitioning or the old version would be more compatible with existing users' needs? That would not be a hard feature to add to this. But I think this method of partition production has merit for the typical large-scale graph data that Giraph is designed to process.

--
This message is automatically generated by JIRA.
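
The cut-off rule described in the issue (close a partition once it reaches either the edge cap or the vertex cap, whichever comes first) can be sketched roughly as below. This is a minimal standalone Java illustration, not the code from the attached patch: the class EdgeBoundedSplitter, the method split, and the representation of vertices as an array of out-edge counts are all hypothetical. The issue does not say whether the real patch closes a partition before or after the vertex that crosses the cap; this sketch assumes after.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Giraph.
class EdgeBoundedSplitter {

    /**
     * Groups vertices (identified by their index in outEdgeCounts) into
     * partitions. A partition is closed as soon as it holds maxVertices
     * vertices or at least maxEdges out-edges, whichever happens first.
     */
    static List<List<Integer>> split(int[] outEdgeCounts, int maxVertices, long maxEdges) {
        List<List<Integer>> partitions = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long edgesInCurrent = 0;

        for (int v = 0; v < outEdgeCounts.length; v++) {
            current.add(v);
            edgesInCurrent += outEdgeCounts[v];

            // Assumption: the vertex that crosses a cap stays in the partition it closes.
            if (current.size() >= maxVertices || edgesInCurrent >= maxEdges) {
                partitions.add(current);
                current = new ArrayList<>();
                edgesInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            partitions.add(current); // trailing, partially filled partition
        }
        return partitions;
    }

    public static void main(String[] args) {
        // One highly connected vertex followed by five sparse ones.
        int[] outEdgeCounts = {5, 1, 1, 1, 1, 1};
        // Tiny caps for readability; the issue's default edge cap is 200,000.
        System.out.println(split(outEdgeCounts, 4, 5));
        // prints [[0], [1, 2, 3, 4], [5]]
    }
}
```

With a vertex cap alone, vertex 0 would share a partition with three other vertices; with the edge cap it gets a partition to itself, which is the "de-lumpification" effect described above.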