hama-dev mailing list archives

From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-531) Data re-partitioning in BSPJobClient
Date Tue, 11 Dec 2012 09:07:20 GMT

    [ https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528827#comment-13528827 ]

Thomas Jungblut commented on HAMA-531:

Okay Edward, I don't think you've understood how it works yet.

Maybe it will be clearer if I show you some code.
> Mindist search example

If you define in the job:
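The job snippet itself did not survive in the archive; as a rough sketch of what the wiring might look like (GraphJob and the method names here, e.g. setVertexInputReaderClass, are assumed from Hama's graph API and not quoted from the original message):

```java
// Hypothetical sketch of the job definition; class and method names
// are assumptions based on Hama's graph package, not this message.
HamaConfiguration conf = new HamaConfiguration();
GraphJob job = new GraphJob(conf, MindistSearchVertex.class);
job.setInputFormat(SequenceFileInputFormat.class);
job.setVertexInputReaderClass(MindistSearchCountReader.class);
```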

And you define your reader like this:

public static class MindistSearchCountReader extends
    VertexInputReader<Text, TextArrayWritable, Text, NullWritable, Text> {

  @Override
  public boolean parseVertex(Text key, TextArrayWritable value,
      Vertex<Text, NullWritable, Text> vertex) {
    for (Text edgeName : value.get()) {
      vertex.addEdge(new Edge<Text, NullWritable>(new Text(edgeName), null));
    }
    return true;
  }
}

Then you can support your binary sequencefile format, as well as every other damn format
that exists in the whole world.

If you want to have a binary sequencefile format, then do this. But I will quit committing
to Hama then, because I'm not going to support ONLY a binary format. This is not what I built
a framework for.

What if you want to change this binary format? Do you want to recreate every file on the whole
planet? You would have to take care of versioning then, and that is exactly why we need a proxy
between the input format and our vertex API.
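The proxy argument can be sketched in plain Java, independent of Hama's classes (everything below is a toy illustration with hypothetical names, not Hama API): each format gets its own small reader, and the graph runtime only ever sees the one Vertex shape.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a reader proxy: two input formats, one vertex API.
public class ReaderProxySketch {

  static class Vertex {
    final String id;
    final List<String> edges = new ArrayList<>();
    Vertex(String id) { this.id = id; }
  }

  // The "proxy" contract: turn one raw record into a Vertex.
  interface VertexReader<R> {
    Vertex parse(R record);
  }

  // Reader for a tab-separated text format: "id<TAB>edge1<TAB>edge2".
  static class TextReader implements VertexReader<String> {
    public Vertex parse(String line) {
      String[] parts = line.split("\t");
      Vertex v = new Vertex(parts[0]);
      for (int i = 1; i < parts.length; i++) {
        v.edges.add(parts[i]);
      }
      return v;
    }
  }

  // Reader for an already-decoded record, standing in for a binary format.
  static class ArrayReader implements VertexReader<String[]> {
    public Vertex parse(String[] record) {
      Vertex v = new Vertex(record[0]);
      for (int i = 1; i < record.length; i++) {
        v.edges.add(record[i]);
      }
      return v;
    }
  }

  public static void main(String[] args) {
    Vertex a = new TextReader().parse("a\tb\tc");
    Vertex b = new ArrayReader().parse(new String[] { "a", "b", "c" });
    // Both formats end up as the same Vertex shape.
    System.out.println(a.id + " -> " + a.edges);
    System.out.println(b.id + " -> " + b.edges);
  }
}
```

If the on-disk format changes, only the matching reader changes; the vertex API and the computation stay untouched.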
> Data re-partitioning in BSPJobClient
> ------------------------------------
>                 Key: HAMA-531
>                 URL: https://issues.apache.org/jira/browse/HAMA-531
>             Project: Hama
>          Issue Type: Improvement
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.6.1
>         Attachments: HAMA-531_1.patch, HAMA-531_2.patch, HAMA-531_final.patch, patch.txt, patch_v02.txt, patch_v03.txt, patch_v04.txt
> Re-partitioning the data is a very expensive operation. Currently, we process read/write operations sequentially using the HDFS API in BSPJobClient on the client side. This causes potential "too many open files" errors, adds HDFS overhead, and performs slowly.
> We have to find another way to re-partition the data.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
