Mailing-List: contact dev-help@giraph.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@giraph.apache.org
Date: Mon, 5 Aug 2013 22:51:48 +0000 (UTC)
From: "Nitay Joffe (JIRA)" <jira@apache.org>
To: giraph-dev@incubator.apache.org
Message-ID: <JIRA.12651656.1370599750450.4555.1375743108153@arcas>
In-Reply-To: <JIRA.12651656.1370599750450@arcas>
References: <JIRA.12651656.1370599750450@arcas>
Subject: [jira] [Updated] (GIRAPH-684) Improve Writable API
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/GIRAPH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated GIRAPH-684:
-------------------------------

    Description: 
While working on GIRAPH-683 I realized something: the Python code the user has to write is fairly cumbersome, because they can't just say setValue(4); they have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so the "everything must be Writable" requirement spreads throughout the codebase.

I think we need to change e.g. Vertex<I extends WritableComparable, V extends Writable, E extends Writable> to just Vertex<I, V, E>.

For each type we store a SerDe that knows how to serialize and deserialize that type. If the user passes us a Writable, we use a WritableSerDe, so no changes are required to existing code.

Note that the SerDe interface does not allow for using a type like Long directly. This is by design since immutable types don't work with Giraph.
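
To make the SerDe idea concrete, here is a minimal sketch of what a per-type SerDe and a Writable adapter could look like. It is not the code under review: the SerDe method names, the readInto signature, and the IntBox type are assumptions made up for illustration, and Writable below is a local stand-in with the same two methods as Hadoop's org.apache.hadoop.io.Writable so the example compiles without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SerDeSketch {
  /** Local stand-in with the same two methods as Hadoop's Writable. */
  interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical SerDe: serialization logic lives outside the value type. */
  interface SerDe<T> {
    void write(T value, DataOutput out) throws IOException;
    /** Reads into an existing instance, so T has to be mutable. */
    void readInto(T reusable, DataInput in) throws IOException;
  }

  /** Adapter that delegates to the value's own Writable methods. */
  static final class WritableSerDe<T extends Writable> implements SerDe<T> {
    @Override public void write(T value, DataOutput out) throws IOException {
      value.write(out);
    }
    @Override public void readInto(T reusable, DataInput in) throws IOException {
      reusable.readFields(in);
    }
  }

  /** Toy Writable value, standing in for something like IntWritable. */
  static final class IntBox implements Writable {
    int value;
    IntBox(int value) { this.value = value; }
    @Override public void write(DataOutput out) throws IOException {
      out.writeInt(value);
    }
    @Override public void readFields(DataInput in) throws IOException {
      value = in.readInt();
    }
  }

  /** Serializes src, then reads the bytes back into dst with the same SerDe. */
  static <T> void roundTrip(SerDe<T> serDe, T src, T dst) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serDe.write(src, new DataOutputStream(bytes));
      serDe.readInto(dst,
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    IntBox dst = new IntBox(0);
    roundTrip(new WritableSerDe<IntBox>(), new IntBox(4), dst);
    System.out.println(dst.value); // prints 4
  }
}
```

Reading into an existing instance is one way an interface like this could rule out immutable types such as Long: there is no way to implement readInto for a value that cannot be changed after construction.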

To be serialized, the I, V, E, and M parameters would need to satisfy one of the following:
1) Be a type we know how to serialize, e.g. LongWritable.
2) Be Writable. The key is that we don't _require_ it on the generic parameter, but we check whether the type is Writable and, if so, use its own serialization code. This makes everything backwards compatible.
3) Be a type for which the user has registered their own serializer. This lets them serialize completely new types, for example a fastutil map, without having to subclass that type to make it Writable.
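
The three cases above can be sketched as a small lookup. This is only an illustration under stated assumptions: the Registry class, its register/resolve methods, the order in which the cases are checked, and the Point and Flag types are all hypothetical, and Writable is again a local stand-in for the write side of Hadoop's interface.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class SerDeResolutionSketch {
  /** Local stand-in for the write side of Hadoop's Writable. */
  interface Writable {
    void write(DataOutput out) throws IOException;
  }

  /** Hypothetical serializer interface (write side only, to keep this short). */
  interface SerDe<T> {
    void write(T value, DataOutput out) throws IOException;
  }

  static final class Registry {
    // Holds built-in serializers (case 1) and user-registered ones (case 3).
    private final Map<Class<?>, SerDe<?>> registered = new HashMap<>();

    <T> void register(Class<T> type, SerDe<T> serDe) {
      registered.put(type, serDe);
    }

    @SuppressWarnings("unchecked")
    <T> SerDe<T> resolve(Class<T> type) {
      SerDe<?> explicit = registered.get(type);
      if (explicit != null) {
        // Safe because register() pairs each class with a SerDe of that class.
        return (SerDe<T>) explicit;
      }
      if (Writable.class.isAssignableFrom(type)) {
        // Case 2: not required by the generic bound, but used when present.
        return (value, out) -> ((Writable) value).write(out);
      }
      throw new IllegalArgumentException("No serializer for " + type.getName());
    }
  }

  /** Plain user type that is not Writable; needs a registered serializer. */
  static final class Point {
    int x;
    int y;
    Point(int x, int y) { this.x = x; this.y = y; }
  }

  /** User type that is Writable; works without any registration. */
  static final class Flag implements Writable {
    boolean on;
    Flag(boolean on) { this.on = on; }
    @Override public void write(DataOutput out) throws IOException {
      out.writeBoolean(on);
    }
  }

  /** Resolves the serializer for the given type and returns the written bytes. */
  static <T> byte[] toBytes(Registry registry, Class<T> type, T value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try {
      registry.resolve(type).write(value, new DataOutputStream(bytes));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    Registry registry = new Registry();
    registry.register(Point.class, (p, out) -> {
      out.writeInt(p.x);
      out.writeInt(p.y);
    });
    System.out.println(toBytes(registry, Point.class, new Point(1, 2)).length); // prints 8
    System.out.println(toBytes(registry, Flag.class, new Flag(true)).length);   // prints 1
  }
}
```

In this sketch an explicit registration wins over the Writable fallback, which would let a user override how an existing Writable type is serialized. Whether the actual patch resolves the cases in that order is not stated here.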

With this improved API in place, all computation code (and user code in general) would be much cleaner and simpler. It will also make things like Jython much more intuitive.

I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10 workers. The difference is insignificant: 319 seconds total time without the change vs 311 with it. The new version is actually faster, but I think that is mostly run-to-run variance.

Here is the code: https://reviews.apache.org/r/13306/

  was:
While working on GIRAPH-683 I realized something: the Python code the user has to write is fairly cumbersome, because they can't just say setValue(4); they have to say setValue(IntWritable(4)). This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so the "everything must be Writable" requirement spreads throughout the codebase.

I think we need to change e.g. Vertex<I extends WritableComparable, V extends Writable, E extends Writable> to just Vertex<I extends Comparable, V, E>.

We keep a Map<Class, Serializer> that tells us how to serialize classes. This map can be initialized with things we know how to serialize, e.g. Long, Double, and String.

So then the I,V,E,M parameters, in order to get serialized, would need to adhere to one of the following:
1) Be a type we know how to serialize, e.g. Long.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if it is and if so we use their code. This makes everything backwards compatible.
3) The user has registered their own serializer. This lets them serialize completely new types, for example a fastutil map, without having to subclass that type to make it Writable.

With this improved API in place, all computation code (and user code in general) would be much cleaner and simpler. It will also make things like Jython much more intuitive.

    
> Improve Writable API
> --------------------
>
>                 Key: GIRAPH-684
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-684
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>
