hama-dev mailing list archives

From Suraj Menon <surajsme...@apache.org>
Subject Re: Partitioner in Hama
Date Tue, 08 Jan 2013 17:37:38 GMT
Could you explain the nature of the problems you are facing with the Partitioner?

>Any reasons for deciding to move the
> PartitioningJob inside BSPJobClient from BSPJob?

Two reasons. BSPJob was just a configuration holder object, and I didn't want
to add the partitioning responsibility to that class.
I also wanted to know the number of splits before deciding whether to
repartition or not.
Repartitioning is done if any of the following holds:
- the number of splits found is not equal to the number of BSP tasks
configured for the job, OR
- the user has set the flag "bsp.input.runtime.partitioning" to true, OR
- the user has specified a Runtime Partitioner class and enabled runtime
partitioning
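
The repartitioning conditions listed above can be sketched roughly as the
predicate below. This is only an illustration of the rule, not the code in the
HAMA-700 patch: the class name, method name and parameters are made up for the
example, and only the property name "bsp.input.runtime.partitioning" comes
from this thread.

```java
// Illustrative sketch only; names here are hypothetical, not Hama's API.
final class RepartitionDecision {

  private RepartitionDecision() {}

  /**
   * Returns true when the input should be repartitioned.
   *
   * @param numSplits             number of input splits found
   * @param numBspTasks           number of BSP tasks configured for the job
   * @param runtimePartitioning   value of "bsp.input.runtime.partitioning"
   * @param hasRuntimePartitioner whether the user named a runtime partitioner class
   */
  static boolean shouldRepartition(int numSplits, int numBspTasks,
      boolean runtimePartitioning, boolean hasRuntimePartitioner) {
    // Condition 1: splits and tasks do not line up one-to-one.
    boolean sizeMismatch = numSplits != numBspTasks;
    // Condition 2: the user forced runtime partitioning through the flag.
    boolean forcedByFlag = runtimePartitioning;
    // Condition 3: a runtime partitioner class is set AND the flag is enabled.
    // As worded, this case is already covered by condition 2; it is kept
    // separate here only to mirror the list above.
    boolean customPartitioner = hasRuntimePartitioner && runtimePartitioning;
    return sizeMismatch || forcedByFlag || customPartitioner;
  }

  public static void main(String[] args) {
    // One split on the local fs but two tasks requested: repartition.
    System.out.println(shouldRepartition(1, 2, false, false)); // true
    // Splits match tasks and nothing is forced: skip repartitioning.
    System.out.println(shouldRepartition(2, 2, false, false)); // false
    // Splits match tasks but the flag is set: repartition anyway.
    System.out.println(shouldRepartition(2, 2, true, true));   // true
  }
}
```

Note that under this reading a partitioner class on its own does not trigger
repartitioning when the split and task counts already match; the flag has to
be enabled as well.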

Thanks,
Suraj

On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <dapurv5@gmail.com> wrote:

> Thanks, let me have a careful look at it. On a cursory look, I seem to
> understand the basic idea. Any reasons for deciding to move the
> PartitioningJob inside BSPJobClient from BSPJob?
> BTW the current partitioner doesn't work as intended, only the default
> partitioner HashPartitioner works fine, if I try to put some custom
> partitioner there are problems.
>
> Let's resolve the partitioning completely before the spilling message
> queue.
>
>
> --
> Regards,
> Apurv Verma
>
>
>
>
> On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <surajsmenon@apache.org>
> wrote:
>
> > Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
> > suggestions or even work on it.
> >
> > Thanks,
> > Suraj
> >
> > On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <dapurv5@gmail.com> wrote:
> >
> > > Hey Edward,
> > > >  There was a compile bug which i fixed temporarily. isPartitioned was not
> > > > being initialized. Could you please check the last commit. I have currently
> > > > initialized it to false but I guess this should be configurable.
> > > > There was some jira where we wanted partitioning to be skipped if user
> > > > thinks his data is already partitioned.
> > >
> > > Thanks again.
> > >
> > >
> > > --
> > > Regards,
> > > Apurv Verma
> > >
> > >
> > >
> > >
> > > > On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >
> > > > Thanks, then I'll finish tomorrow. Please feel free to comment there.
> > > >
> > > > On Tue, Jan 8, 2013 at 7:04 PM, Tommaso Teofili
> > > > <tommaso.teofili@gmail.com> wrote:
> > > > > thanks Edward, it looks good.
> > > > > Tommaso
> > > > >
> > > > >
> > > > > 2013/1/8 Edward J. Yoon <edwardyoon@apache.org>
> > > > >
> > > > >> Please review this:
> > > > >>
> > > > >> http://wiki.apache.org/hama/Partitioning
> > > > >>
> > > > > >> On Mon, Jan 7, 2013 at 6:17 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > >> > I mean, the pre-partitioning or resizing partitions is really important.
> > > > >> >
> > > > > >> > On Mon, Jan 7, 2013 at 6:15 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > >> >> This is another talk ...
> > > > >> >>
> > > > > >> >> Unlike MapReduce, I think, Hama BSP will handle tasks that input is
> > > > > >> >> small in size but large in computational complexity, such as graph,
> > > > > >> >> sparse matrix, machine learning algorithms.
> > > > >> >>
> > > > > >> >> On Mon, Jan 7, 2013 at 5:57 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > >> >>> Even though the numbers of splits and tasks are the same, user-defined
> > > > > >> >>> partitioning job should be run (because it is not only for resizing
> > > > > >> >>> partitions. For example, range partitioning of unsorted data set or
> > > > > >> >>> hash key partitioning, ..., etc).
> > > > >> >>>
> > > > > >> >>> On Mon, Jan 7, 2013 at 5:28 AM, Suraj Menon <surajsmenon@apache.org> wrote:
> > > > > >> >>>>>    1. I am referring to org.apache.hama.bsp.PartitioningRunner, it's named
> > > > > >> >>>>>    as so in the HEAD (1429573) of trunk. It isn't removed but it isn't
> > > > > >> >>>>>    referred to anywhere else. I can't find any references to it in the
> > > > > >> >>>>>    workspace.
> > > > >> >>>>>
> > > > >> >>>>
> > > > > >> >>>> It is referred in BSPJob#waitForCompletion function as a separate BSP job
> > > > > >> >>>> to create the specified splits.
> > > > >> >>>>
> > > > >> >>>>
> > > > > >> >>>>>    2. job.setPartitioner is the same as setting
> > > > > >> >>>>>    "bsp.input.partitioner.class". Anyway, according to me, partitions are
> > > > > >> >>>>>    not being created, because of which the following happens.
> > > > > >> >>>>>    If I am running the task on local fs and not hdfs, there's just one
> > > > > >> >>>>>    input split and even if I set a partitioner to create two partitions and
> > > > > >> >>>>>    set bsp.setNumTasks(2), this is overridden and only one task is executed.
> > > > > >> >>>>>    See BSPJobClient#submitJobInternal()
> > > > > >> >>>>>    where it does the following
> > > > > >> >>>>>    job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks)); Line
> > > > > >> >>>>>    326.
> > > > >> >>>>>
> > > > > >> >>>> This job is set to run if the number of splits != number of Tasks or if
> > > > > >> >>>> forced by the configuration. I can share my HAMA-700 current state of patch
> > > > > >> >>>> with you.
> > > > >> >>>>
> > > > >> >>>>
> > > > > >> >>>>>    3. So here is what I think is happening, Partitioner is not in the
> > > > > >> >>>>>    codepath (try putting a breakpoint inside the partitioner and executing
> > > > > >> >>>>>    and non graph bsp task), so partitions are not being created and
> > > > > >> >>>>>    writeSplits() is returning 1.
> > > > > >> >>>>>    [ writeSplits() returns the number of splits in the input. ]
> > > > >> >>>>>
> > > > >> >>>>
> > > > > >> >>>> Probably because it is running as a separate process?
> > > > >> >>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> --
> > > > >> >>> Best Regards, Edward J. Yoon
> > > > >> >>> @eddieyoon
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Best Regards, Edward J. Yoon
> > > > >> >> @eddieyoon
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Best Regards, Edward J. Yoon
> > > > >> > @eddieyoon
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Best Regards, Edward J. Yoon
> > > > >> @eddieyoon
> > > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Edward J. Yoon
> > > > @eddieyoon
> > > >
> > >
> >
>
