Date: Mon, 5 Aug 2013 22:51:48 +0000 (UTC)
From: "Nitay Joffe (JIRA)"
To: giraph-dev@incubator.apache.org
Reply-To: dev@giraph.apache.org
Mailing-List: contact dev-help@giraph.apache.org; run by ezmlm
Subject: [jira] [Updated] (GIRAPH-684) Improve Writable API

[ https://issues.apache.org/jira/browse/GIRAPH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated GIRAPH-684:
-------------------------------
    Description:
While working on GIRAPH-683 I realized something: the Python code the user has to write is fairly cumbersome, because they can't just say setValue(4); they have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so the "everything must be Writable" requirement spreads throughout the codebase.
I think we need to change e.g. Vertex to just Vertex.

We store for each type a SerDe that knows how to serialize/deserialize that type. If the user passes us a Writable then we use a WritableSerDe. This means no changes are required to existing code.

Note that the SerDe interface does not allow for using a type like Long directly. This is by design, since immutable types don't work with Giraph.

The I,V,E,M parameters, in order to get serialized, would need to adhere to one of the following:
1) Be a type we know how to serialize, e.g. LongWritable.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if it is and if so we use their code. This makes everything backwards compatible.
3) The user has registered his own serializer. This lets them serialize completely new types, for example a fastutil map, without having to subclass that type to make it Writable.

With this improved API in place, all computation code (and user code in general) would be much cleaner and simpler. It will also make things like Jython much more intuitive.

I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10 workers. The change is insignificant: 319 seconds total time vs 311. The new version is actually faster (but I think that is mostly just variance noise).

Here is the code: https://reviews.apache.org/r/13306/

  was:
While working on GIRAPH-683 I realized something: the Python code the user has to write is fairly cumbersome, because they can't just say setValue(4); they have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so the "everything must be Writable" requirement spreads throughout the codebase.

I think we need to change e.g. Vertex to just Vertex.

We keep a Map that tells us how to serialize classes. This map can be initialized with things we know how to serialize, e.g. Long, Double, and String.
So then the I,V,E,M parameters, in order to get serialized, would need to adhere to one of the following:
1) Be a type we know how to serialize, e.g. Long.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if it is and if so we use their code. This makes everything backwards compatible.
3) The user has registered his own serializer. This lets them serialize completely new types, for example a fastutil map, without having to subclass that type to make it Writable.

With this improved API in place, all computation code (and user code in general) would be much cleaner and simpler. It will also make things like Jython much more intuitive.

> Improve Writable API
> --------------------
>
>                 Key: GIRAPH-684
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-684
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>
> While working on GIRAPH-683 I realized something: the Python code the user has to write is fairly cumbersome, because they can't just say setValue(4); they have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.
> The problem is that we have a tight coupling between user types and their serialization, so the "everything must be Writable" requirement spreads throughout the codebase.
> I think we need to change e.g. Vertex to just Vertex.
> We store for each type a SerDe that knows how to serialize/deserialize that type. If the user passes us a Writable then we use a WritableSerDe. This means no changes are required to existing code.
> Note that the SerDe interface does not allow for using a type like Long directly. This is by design, since immutable types don't work with Giraph.
> The I,V,E,M parameters, in order to get serialized, would need to adhere to one of the following:
> 1) Be a type we know how to serialize, e.g. LongWritable.
> 2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if it is and if so we use their code.
> This makes everything backwards compatible.
> 3) The user has registered his own serializer. This lets them serialize completely new types, for example a fastutil map, without having to subclass that type to make it Writable.
> With this improved API in place, all computation code (and user code in general) would be much cleaner and simpler. It will also make things like Jython much more intuitive.
> I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10 workers. The change is insignificant: 319 seconds total time vs 311. The new version is actually faster (but I think that is mostly just variance noise).
> Here is the code: https://reviews.apache.org/r/13306/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
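
For readers of the archive, the lookup rules in the description (a known type, a Writable, or a user-registered serializer) could look roughly like the sketch below. It is not the code from the review linked above. SerDeRegistry, Counter, CounterSerDe, SerDeSketch, and the method signatures on SerDe are all hypothetical names chosen for illustration, and the Writable interface is a local stand-in with the same two methods as Hadoop's so the sketch compiles on its own. The read side is assumed to fill an existing instance, which is one way the "no immutable types like Long" restriction could arise.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Local stand-in with the same two methods as Hadoop's Writable,
// declared here only so the sketch compiles without Hadoop.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical SerDe (assumed shape): it fills an existing instance on
// read, so an immutable type such as Long cannot be used directly.
interface SerDe<T> {
  void write(T value, DataOutput out) throws IOException;
  void readFields(T value, DataInput in) throws IOException;
}

// Rule 2: a Writable keeps using its own serialization code.
final class WritableSerDe implements SerDe<Writable> {
  public void write(Writable value, DataOutput out) throws IOException {
    value.write(out);
  }
  public void readFields(Writable value, DataInput in) throws IOException {
    value.readFields(in);
  }
}

// Hypothetical registry: rules 1 and 3 are entries in the map (known or
// user-registered types); rule 2 is the fallback for any Writable.
final class SerDeRegistry {
  private final Map<Class<?>, SerDe<?>> registered = new HashMap<>();

  <T> void register(Class<T> type, SerDe<T> serDe) {
    registered.put(type, serDe);
  }

  @SuppressWarnings("unchecked")
  <T> SerDe<T> lookup(Class<T> type) {
    SerDe<?> serDe = registered.get(type);
    if (serDe == null && Writable.class.isAssignableFrom(type)) {
      serDe = new WritableSerDe();
    }
    if (serDe == null) {
      throw new IllegalArgumentException("No SerDe for " + type.getName());
    }
    return (SerDe<T>) serDe;
  }
}

// A mutable user type that is not Writable (rule 3).
final class Counter {
  int count;
}

final class CounterSerDe implements SerDe<Counter> {
  public void write(Counter value, DataOutput out) throws IOException {
    out.writeInt(value.count);
  }
  public void readFields(Counter value, DataInput in) throws IOException {
    value.count = in.readInt();
  }
}

final class SerDeSketch {
  // Serializes a Counter through the registry and reads it back.
  static int roundTrip(int count) {
    SerDeRegistry registry = new SerDeRegistry();
    registry.register(Counter.class, new CounterSerDe());
    SerDe<Counter> serDe = registry.lookup(Counter.class);
    Counter original = new Counter();
    original.count = count;
    Counter copy = new Counter();
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serDe.write(original, new DataOutputStream(bytes));
      serDe.readFields(copy,
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return copy.count;
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(4));
  }
}
```

In this sketch the Counter type never has to subclass or implement Writable; registering CounterSerDe is enough, while anything that already is Writable is picked up by the fallback without registration.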