hama-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apurv Verma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-700) BSPPartitioner should be configurable to be optional and allow input format conversion
Date Tue, 15 Jan 2013 03:52:13 GMT

    [ https://issues.apache.org/jira/browse/HAMA-700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553465#comment-13553465
] 

Apurv Verma commented on HAMA-700:
----------------------------------

Hey,
 Although this has been committed, but here is my review. I also have a few questions.

1)Why did you choose to introduce getPartitionId() method in PartitioningRunner.RecordConverter.
Doesn't it do exactly what Partitioner's getPartition does? Can we avoid this method here.

  In the current implementation Partitioner's partition method just gets the input key and
value and numTasks, based on which it has to decide to the output partition. If you argue
that partitioner might need some additional information to decide the partition, as has been
used in your patch at some places, I would say we should implement the Partitioner by making
it JobConfigurable. See Partitioner implementation of hadoop to get a feel of the implementation.

2) I abide by these simple rules for the partitioning runner.

   a) PartiotionerRunner job executes before the actual bsp job gets executed.
   b) Its role is to partition the input and now additionally to convert the input records
to a record format   that will be fed into the main bsp job.
   
   So abiding by these rules, earlier this used to happen for graph jobs.

   input records were partitioned by PartitioniningRunner using HashPartitioner.
   these records were fed into GraphJob which parsed the records into Vertices.
     

   wherease now, with your implementation this happens.
   input records get parsed into vertices
   vertices are fed as input to GraphJob.
   
   We should maintain consistency. Let's have record partitioning as a separate feature and
graph partitioning as separate. In a way I can see you have tried to use record partitioning
for graph partitioning also. Do I understand it properly?
   
   So what I suggest is as follows. Although you have already implemented it and its just
a suggestion. Sorry for being a pain here.
   input records ---(x)---> intermediate records --(y)--> vertex

   (x) Will be done by partitioning runner. This will also do record conversion.
   (y) Will be done by VertexInputReader. (IMO we shouldn't make VertexInputReader implement
RecordConverter)




(3) abstract V createVertexIDObject(); 
    abstract E createEdgeCostObject();
    abstract M createVertexValue();
    Can't we use reflection to avoid these methods. Anyways user is specifying the VertexID.class,
Edge.class and also their types. So we should be able to create them via reflection instead
of delegating this responsibility to each implementing class everytime.




(4) if(outKeyClass == null && outputValueClass == null){
       outputKeyClass = //initialize here
       //...
    }
    This is a nice trick but IMO it would be good to make RecordConverter generic and also
specify the .class for these in the bsp job configuration parameters. See line 148 PartitioningRunner.


(5) Line 415, BSPJobClient
    
        if (numTasks == 0) {
          numTasks = numSplits;
        }

   This numTasks is actually the numTasks for the PartitioningRunner job, do you intend to
do it for the main job or the partitioning job that will execute prior to the main bsp job?

                
> BSPPartitioner should be configurable to be optional and allow input format conversion
> --------------------------------------------------------------------------------------
>
>                 Key: HAMA-700
>                 URL: https://issues.apache.org/jira/browse/HAMA-700
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp core
>            Reporter: Suraj Menon
>            Assignee: Suraj Menon
>             Fix For: 0.6.1
>
>         Attachments: HAMA-700.patch_Jan7, HAMA-700.patch.v2, HAMA-700-v1.patch
>
>
> There should be a provisioning for skipping the PartitionRunner if needed. Also we can
have a RecordConverter interface so that the PartitionRunner can read the input in any format
and create new splits. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message