hama-user mailing list archives

From Leonidas Fegaras <fega...@cse.uta.edu>
Subject Re: Question about FileInputFormat splits
Date Mon, 20 Oct 2014 21:31:14 GMT
Hi Edward,
OK. It works now. I used the following in hama-site.xml:

   <property>
     <name>bsp.input.runtime.partitioning</name>
     <value>false</value>
   </property>

and re-started bspd. The correct code for the Job is:

job.setNumBspTask(10);
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

Maybe you should explain this in the Hama Wiki.
Thanks.
Leonidas
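
The effect of setting a hash partitioner together with setNumBspTask(10) can be illustrated with a minimal, self-contained sketch. It does not use Hama and is not copied from Hama's source: the class HashPartitionSketch and the helpers partitionFor and partition are hypothetical names, and the modulo-of-hashCode rule is only an assumption about how a typical hash partitioner assigns a record to a task. The point it shows is that the number of partitions follows the requested task count, not the number of HDFS blocks in the input file.

```java
import java.util.ArrayList;
import java.util.List;

public class HashPartitionSketch {

    // Hypothetical helper (assumption, not Hama's code): map a key to one of
    // numTasks partitions. floorMod keeps the result in [0, numTasks) even
    // when hashCode() is negative.
    static int partitionFor(Object key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Distribute records into numTasks buckets, one bucket per task.
    static List<List<String>> partition(List<String> records, int numTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String record : records) {
            buckets.get(partitionFor(record, numTasks)).add(record);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Stand-in for the records of a small input file.
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        // Ten buckets, mirroring job.setNumBspTask(10).
        List<List<String>> buckets = partition(records, 10);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("task " + i + ": " + buckets.get(i).size() + " records");
        }
    }
}
```

Under this assumption, every record lands in exactly one of the ten buckets, however many blocks the file occupies.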

On 10/20/2014 02:19 PM, Leonidas Fegaras wrote:
> Hi Edward,
> Thank you for the reply.
> But I want the opposite: I want to create more tasks than blocks, not
> fewer tasks than blocks.
> That is, I want to be able to send less than one block to each task (for
> example, only 10000 bytes). Sending less data to a task will speed up
> execution and will require less memory at each node. Hadoop map-reduce,
> Spark, and Flink allow you to use a split size smaller than a block.
> Also, I used to be able to do this with Hama 0.5.0 but not with Hama
> 0.6.4. Did you remove this capability because it is a bad idea or
> because it is very hard to implement?
>
> Based on your instructions, I tried the following:
>
>       job.setNumBspTask(10);
>       job.setBoolean("bsp.input.runtime.partitioning",false);
>       job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
>
> I get the following error:
>
> java.lang.ArrayIndexOutOfBoundsException: 1
>       at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
>       at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
>       at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
>       at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
>       at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)
>
> Thanks.
> Leonidas
>
>
> On 10/20/2014 10:06 AM, Edward J. Yoon wrote:
>> Hi Leonidas,
>>
>> The bsp.min.split.size property is used to prevent the creation of too
>> many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than
>> the block size, then one block is sent to each task).
>>
>> I guess this will work fine. BTW, if you set an input partitioner,
>> it creates as many new partitions as you specified in the
>> setNumBspTask() method (a graph job pre-processes the input with hash
>> partitioning by default).
>>
>> Thanks.
>>
>> --
>> Best Regards, Edward J. Yoon
>> Chief Executive Officer
>> DataSayer Co., Ltd.
>>
>>> On Oct 20, 2014, at 10:51 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>> wrote:
>>>
>>> Dear Hama developers,
>>> I still have a problem setting the split size of an HDFS input file
>>> using Hama 0.6.4.  For example, when I use:
>>>
>>> BSPJob job = new BSPJob(conf,BSPop.class);
>>> job.setNumBspTask(10);
>>> job.setLong("bsp.min.split.size",10000L);   // 10000 bytes
>>>
>>> For a small file with 2 blocks, this will use only 2 BSP tasks (one
>>> for each block), instead of 10.
>>> This used to work in Hama 0.5.0.
>>> Any suggestions?
>>> Thanks.
>>> Leonidas Fegaras
>>>
>>> On 01/04/2013 05:45 PM, Edward J. Yoon wrote:
>>>> Hello,
>>>>
>>>>> than a block. But if you have more nodes in your cluster than data blocks,
>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>> You're right. So, we're working on partitioning issues now.
>>>>
>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>>> there any way to use splits smaller than a block in Hama 0.6.0?
>>>> Yes. But, Hama 0.6.1 version will support it.
>>>>
>>>> On Sat, Jan 5, 2013 at 4:59 AM, Leonidas Fegaras
>>>> <fegaras@cse.uta.edu> wrote:
>>>>> Dear Hama developers,
>>>>> It seems that the splits generated by the FileInputFormat in Hama 0.6.0
>>>>> cannot be smaller than a block. In Hama 0.5.0, I could set any split size
>>>>> using job.set("bsp.min.split.size",...) and set the task numbers using
>>>>> job.setNumBspTask(...). This is ignored by Hama 0.6.0 for a split smaller
>>>>> than a block. But if you have more nodes in your cluster than data blocks,
>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>>> there any way to use splits smaller than a block in Hama 0.6.0?
>>>>> Thanks for your help,
>>>>> Leonidas
>>>>>
>>>>

