giraph-dev mailing list archives

From "Nitay Joffe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-717) HiveJythonRunner with support for pure Jython value types.
Date Wed, 17 Jul 2013 16:24:49 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated GIRAPH-717:
-------------------------------

    Description: 
This adds support for pure Jython jobs. Currently this runner is hooked up to work with Hive.
I'll make it more generic later.

Running a Jython job is simply:

HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> org.apache.giraph.hive.jython.HiveJythonRunner [jython1.py] [jython2.py]

You can pass in any number of scripts. They will be parsed in order and sent to all the workers
using DistributedCache.

There are examples and tests in the diff. Here is one example:
launcher: https://gist.github.com/nitay/a62e0a5d369a5e701fa3
worker: https://gist.github.com/nitay/7834fd2b059527e65a36

There are a few pieces to a Jython job; I'll go over each part here.

The launcher defines the graph types (those IVEMM writables) and sets up the Hive vertex/edge
inputs and output. Each graph type is one of the following:
1) A Java type. For example, the user can simply specify IntWritable.
2) A Jython type that implements Writable. In the example above, the message value implements
Writable.
3) A pure Jython type. The Java code will wrap these objects in a Writable wrapper that serializes
Jython values using pickle (Jython's object serialization module).
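
To make option 3 concrete, here is a minimal sketch of the pickle round trip such a wrapper performs. It is not the code from the diff: the class name PickledValue, its method names, and the length-prefixed layout are assumptions for illustration, and an in-memory byte buffer stands in for the Java DataOutput/DataInput streams.

```python
import io
import pickle
import struct


class PickledValue(object):
    """Hypothetical stand-in for the Java-side Writable wrapper.

    Holds any picklable value and exposes write/read_fields methods
    loosely modeled on the Writable contract (names are assumptions).
    """

    def __init__(self, value=None):
        self.value = value

    def write(self, out):
        # Length-prefix the pickled payload so the reader knows where it ends.
        payload = pickle.dumps(self.value, 2)
        out.write(struct.pack(">i", len(payload)))
        out.write(payload)

    def read_fields(self, inp):
        # Read the 4-byte big-endian length, then exactly that many bytes.
        (size,) = struct.unpack(">i", inp.read(4))
        self.value = pickle.loads(inp.read(size))


# Round trip: serialize a plain value, then restore it into a fresh wrapper.
original = {"rank": 0.15, "neighbors": [2, 3]}
buf = io.BytesIO()
PickledValue(original).write(buf)
buf.seek(0)
restored = PickledValue()
restored.read_fields(buf)
```

The point of the sketch is only that the user's value type needs no serialization code of its own; anything pickle can handle survives the trip.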

For Hive usage: if your value type is a primitive, e.g. IntWritable or LongWritable, then
you need not do anything. The Java code will automatically read/write the specified Hive table
and convert between Hive types and the primitive Writable. The vertex_id type in the example
works like this.
If your value is a custom Jython type, you must create classes that implement JythonHiveReader/JythonHiveWriter
(or JythonHiveIO, which is both). These objects read/write Jython types from/to Hive. There are
wrappers in the Java code that take the HiveIO data normally used in giraph-hive and turn it
into Jython types. This means, for example, that getMap() will return a Jython dictionary
instead of a Java Map.
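
As a rough illustration of the reader/writer idea, the sketch below converts a Hive map column into a pure Jython value and back. Only getMap() and the JythonHiveIO role come from the description above. FakeHiveColumn, Profile, ProfileHiveIO, and the method names read_from_hive/write_to_hive are hypothetical stand-ins, not the actual giraph-hive interface; see the linked gists for the real signatures.

```python
class FakeHiveColumn(object):
    """Hypothetical stand-in for the Java wrapper around a Hive column."""

    def __init__(self, raw):
        self._raw = raw

    def getMap(self):
        # The real wrapper is described as returning a Jython dictionary.
        return dict(self._raw)


class Profile(object):
    """A pure Jython value type, e.g. used as the vertex value."""

    def __init__(self, scores=None):
        self.scores = scores if scores is not None else {}


class ProfileHiveIO(object):
    """Plays the role of a JythonHiveIO implementation (reader and writer).

    Method names are illustrative assumptions.
    """

    def read_from_hive(self, column):
        # Build the Jython value from the dictionary the wrapper hands back.
        return Profile(column.getMap())

    def write_to_hive(self, value, row, column_name):
        # Store a copy so later changes to the value do not alter the row.
        row[column_name] = dict(value.scores)


hive_io = ProfileHiveIO()
profile = hive_io.read_from_hive(FakeHiveColumn({"a": 1.0, "b": 2.5}))
output_row = {}
hive_io.write_to_hive(profile, output_row, "scores")
```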

There is also a PageRankBenchmark (from a previous diff) implemented in Jython. Here's a run
for comparison / sanity check:

PageRankBenchmark with 10 workers, 100M vertices, 10B edges, 10 compute threads
trunk:
  https://gist.github.com/nitay/3170fa3b575d4d2e22a9
  total time: 302466
with this diff:
  https://gist.github.com/nitay/a52b6d1d64e50ab9829e
  total time: 306517
in jython:
  https://gist.github.com/nitay/3f2e758b2933c3521727
  total time: 434730

So we see that existing code is essentially unaffected (the difference from trunk is about 1%;
is there something else I should test?) and that Jython has around 40% overhead.
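
For reference, the overhead figures follow directly from the three totals reported above. This small sketch just does the arithmetic; the constant and function names are my own, and the units of "total time" are not stated in the message.

```python
# Total times as reported above (units are not stated in the message).
TRUNK = 302466
WITH_DIFF = 306517
JYTHON = 434730


def overhead_percent(baseline, measured):
    """Relative slowdown of `measured` against `baseline`, in percent."""
    return 100.0 * (measured - baseline) / baseline


diff_vs_trunk = overhead_percent(TRUNK, WITH_DIFF)
jython_vs_diff = overhead_percent(WITH_DIFF, JYTHON)
jython_vs_trunk = overhead_percent(TRUNK, JYTHON)
```

This gives roughly 1.3% for the diff against trunk, about 41.8% for Jython against the Java run with this diff, and about 43.7% for Jython against trunk, consistent with the "around 40%" estimate.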

ReviewBoard: https://reviews.apache.org/r/12543/ (Sorry it's a big one, hard to split up :/)

  was:
This adds support for pure Jython jobs. Currently this runner is hooked up to work with Hive.
I'll make it more generic later.

A Jython job is made up of two Jython scripts:

1) launcher - this script is used to configure the job; it is only interpreted locally.
2) worker - this script is distributed to every worker and is used there.

Running a Jython job is simply:

HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> org.apache.giraph.hive.jython.HiveJythonRunner jython --launcher <launcher.py> --worker <worker.py>

There are examples and tests in the diff. Here is one example:
launcher: https://gist.github.com/nitay/a62e0a5d369a5e701fa3
worker: https://gist.github.com/nitay/7834fd2b059527e65a36

There are a few pieces to a Jython job; I'll go over each part here.

The launcher defines the graph types (those IVEMM writables) and sets up the Hive vertex/edge
inputs and output. Each graph type is one of the following:
1) A Java type. For example, the user can simply specify IntWritable.
2) A Jython type that implements Writable. In the example above, the message value implements
Writable.
3) A pure Jython type. The Java code will wrap these objects in a Writable wrapper that serializes
Jython values using pickle (Jython's object serialization module).

For Hive usage: if your value type is a primitive, e.g. IntWritable or LongWritable, then
you need not do anything. The Java code will automatically read/write the specified Hive table
and convert between Hive types and the primitive Writable. The vertex_id type in the example
works like this.
If your value is a custom Jython type, you must create classes that implement JythonHiveReader/JythonHiveWriter
(or JythonHiveIO, which is both). These objects read/write Jython types from/to Hive. There are
wrappers in the Java code that take the HiveIO data normally used in giraph-hive and turn it
into Jython types. This means, for example, that getMap() will return a Jython dictionary
instead of a Java Map.

There is also a PageRankBenchmark (from a previous diff) implemented in Jython. Here's a run
for comparison / sanity check:

PageRankBenchmark with 10 workers, 100M vertices, 10B edges, 10 compute threads
trunk:
  https://gist.github.com/nitay/3170fa3b575d4d2e22a9
  total time: 302466
with this diff:
  https://gist.github.com/nitay/a52b6d1d64e50ab9829e
  total time: 306517
in jython:
  https://gist.github.com/nitay/3f2e758b2933c3521727
  total time: 434730

So we see that existing code is essentially unaffected (the difference from trunk is about 1%;
is there something else I should test?) and that Jython has around 40% overhead.

ReviewBoard: https://reviews.apache.org/r/12543/ (Sorry it's a big one, hard to split up :/)

    
> HiveJythonRunner with support for pure Jython value types.
> ----------------------------------------------------------
>
>                 Key: GIRAPH-717
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-717
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
