giraph-dev mailing list archives

From "Nitay Joffe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-684) Improve Writable API
Date Mon, 05 Aug 2013 22:51:48 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated GIRAPH-684:
-------------------------------

    Description: 
While working on GIRAPH-683 I realized something: The Python code the user has to write is
fairly cumbersome, because they can't just say setValue(4), they have to say setValue(IntWritable(4)).
This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so
the "everything must be Writable" spreads throughout the codebase.

I think we need to change e.g. Vertex<I extends WritableComparable, V extends Writable,
E extends Writable> to just Vertex<I, V, E>.

We store for each type a SerDe that knows how to serialize/deserialize that type. If the user
passes us a Writable then we use a WritableSerDe. This means no changes required to existing
code.

Note that the SerDe interface does not allow for using a type like Long directly. This is
by design since immutable types don't work with Giraph.
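
To make the SerDe idea concrete, here is a minimal sketch of what such an interface and the Writable adapter could look like. This is not the code from the review; the method names on SerDe, the IntBox type, and SerDeDemo are made up for illustration, and the local Writable interface is only a stand-in for org.apache.hadoop.io.Writable so the sketch compiles without Hadoop. Deserialization fills an existing object in place, which is why an immutable type like Long cannot be plugged in directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for org.apache.hadoop.io.Writable (same two methods).
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical SerDe shape: reading mutates an existing instance,
// so the type must be mutable.
interface SerDe<T> {
  void write(T value, DataOutput out) throws IOException;
  void readFields(T reuse, DataInput in) throws IOException;
}

// Adapter that delegates to the Writable's own code, so existing
// Writable types keep working unchanged.
final class WritableSerDe<T extends Writable> implements SerDe<T> {
  public void write(T value, DataOutput out) throws IOException {
    value.write(out);
  }

  public void readFields(T reuse, DataInput in) throws IOException {
    reuse.readFields(in);
  }
}

// Minimal mutable value type, standing in for Hadoop's IntWritable.
final class IntBox implements Writable {
  int value;

  IntBox(int value) {
    this.value = value;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }
}

final class SerDeDemo {
  // Serializes n through the adapter and reads it back into a reused object.
  static int roundTrip(int n) {
    try {
      SerDe<IntBox> serDe = new WritableSerDe<>();
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serDe.write(new IntBox(n), new DataOutputStream(bytes));

      IntBox reused = new IntBox(0);
      serDe.readFields(reused,
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return reused.value;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Under these assumptions, SerDeDemo.roundTrip(4) writes an IntBox and reads the bytes back into a second, reused IntBox.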

The I,V,E,M parameters, in order to get serialized, would need to adhere to one of the following:
1) Be a type we know how to serialize, e.g. LongWritable.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if
it is and if so we use their code. This makes everything backwards compatible.
3) The user has registered their own serializer. This lets them serialize completely new types,
for example a fastutil map, without having to subclass that type to make it Writable.
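
The three rules above could be resolved roughly like this. It is only a sketch under assumptions: SerDeRegistry, register, lookup, Counter, and IntBox are hypothetical names, the local Writable interface stands in for org.apache.hadoop.io.Writable, and checking explicit registrations before the Writable fallback is my own choice of order, not something the description specifies. Built-in types (rule 1) would simply be pre-registered the same way a user registers theirs (rule 3).

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.io.Writable (same two methods).
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical SerDe shape, reading into an existing mutable object.
interface SerDe<T> {
  void write(T value, DataOutput out) throws IOException;
  void readFields(T reuse, DataInput in) throws IOException;
}

final class SerDeRegistry {
  // Fallback for rule 2: delegate to the Writable's own code.
  private static final SerDe<Writable> WRITABLE_SERDE = new SerDe<Writable>() {
    public void write(Writable value, DataOutput out) throws IOException {
      value.write(out);
    }

    public void readFields(Writable reuse, DataInput in) throws IOException {
      reuse.readFields(in);
    }
  };

  private final Map<Class<?>, SerDe<?>> registered = new HashMap<>();

  // Rules 1 and 3: built-in and user-supplied serializers share one map.
  <T> void register(Class<T> type, SerDe<T> serDe) {
    registered.put(type, serDe);
  }

  @SuppressWarnings("unchecked")
  <T> SerDe<T> lookup(Class<T> type) {
    SerDe<?> found = registered.get(type);
    if (found == null && Writable.class.isAssignableFrom(type)) {
      found = WRITABLE_SERDE; // rule 2: not required, but used if present
    }
    if (found == null) {
      throw new IllegalArgumentException("No serializer for " + type.getName());
    }
    return (SerDe<T>) found;
  }
}

final class RegistryDemo {
  // Mutable type that is NOT Writable; needs a registered serializer.
  static final class Counter {
    long count;
  }

  // Writable type; needs no registration.
  static final class IntBox implements Writable {
    int value;

    public void write(DataOutput out) throws IOException {
      out.writeInt(value);
    }

    public void readFields(DataInput in) throws IOException {
      value = in.readInt();
    }
  }

  static SerDeRegistry newRegistry() {
    SerDeRegistry registry = new SerDeRegistry();
    registry.register(Counter.class, new SerDe<Counter>() {
      public void write(Counter value, DataOutput out) throws IOException {
        out.writeLong(value.count);
      }

      public void readFields(Counter reuse, DataInput in) throws IOException {
        reuse.count = in.readLong();
      }
    });
    return registry;
  }
}
```

With this registry, looking up Counter returns the registered serializer, looking up IntBox falls back to the Writable adapter, and looking up a type that matches none of the rules (say StringBuilder) fails with an IllegalArgumentException.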

With this improved API in place, all computation code (and user code in general) would be
much cleaner and simpler. It will also make things like Jython much more intuitive.

I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10 workers. The
difference is insignificant: 319 seconds total time without the change vs 311 with it. The new
version is actually faster (but I think that is mostly just run-to-run variance).

Here is the code: https://reviews.apache.org/r/13306/

  was:
While working on GIRAPH-683 I realized something: The Python code the user has to write is
fairly cumbersome, because they can't just say setValue(4), they have to say setValue(IntWritable(4)).
This is incredibly ugly in my opinion.

The problem is that we have a tight coupling between user types and their serialization, so
the "everything must be Writable" spreads throughout the codebase.

I think we need to change e.g. Vertex<I extends WritableComparable, V extends Writable,
E extends Writable> to just Vertex<I extends Comparable, V, E>.

We keep a Map<Class, Serializer> that tells us how to serialize classes. This map can
be initialized with things we know how to serialize, e.g. Long, Double, and String.

So then the I,V,E,M parameters, in order to get serialized, would need to adhere to one of
the following:
1) Be a type we know how to serialize, e.g. Long.
2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check if
it is and if so we use their code. This makes everything backwards compatible.
3) The user has registered their own serializer. This lets them serialize completely new types,
for example a fastutil map, without having to subclass that type to make it Writable.

With this improved API in place, all computation code (and user code in general) would be
much cleaner and simpler. It will also make things like Jython much more intuitive.

    
> Improve Writable API
> --------------------
>
>                 Key: GIRAPH-684
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-684
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>
> While working on GIRAPH-683 I realized something: The Python code the user has to write
is fairly cumbersome, because they can't just say setValue(4), they have to say setValue(IntWritable(4)).
This is incredibly ugly in my opinion.
> The problem is that we have a tight coupling between user types and their serialization,
so the "everything must be Writable" spreads throughout the codebase.
> I think we need to change e.g. Vertex<I extends WritableComparable, V extends Writable,
E extends Writable> to just Vertex<I, V, E>.
> We store for each type a SerDe that knows how to serialize/deserialize that type. If
the user passes us a Writable then we use a WritableSerDe. This means no changes required
to existing code.
> Note that the SerDe interface does not allow for using a type like Long directly. This
is by design since immutable types don't work with Giraph.
> The I,V,E,M parameters, in order to get serialized, would need to adhere to one of the
following:
> 1) Be a type we know how to serialize, e.g. LongWritable.
> 2) Be Writable. The key is we don't _require_ it on the generic parameter, but we check
if it is and if so we use their code. This makes everything backwards compatible.
> 3) The user has registered their own serializer. This lets them serialize completely new
types, for example a fastutil map, without having to subclass that type to make it Writable.
> With this improved API in place, all computation code (and user code in general) would
be much cleaner and simpler. It will also make things like Jython much more intuitive.
> I ran PageRankBenchmark with this diff using 100M vertices, 10B edges, and 10 workers.
The difference is insignificant: 319 seconds total time without the change vs 311 with it. The new
version is actually faster (but I think that is mostly just run-to-run variance).
> Here is the code: https://reviews.apache.org/r/13306/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
