hama-dev mailing list archives

From Suraj Menon <surajsme...@apache.org>
Subject Re: Partitioner in Hama
Date Tue, 08 Jan 2013 22:07:47 GMT
Hi Apurv, yes, those are pending test cases to be fixed. GraphJobRunner
expects its input in Vertex format, but our input files, as well as the
record keys and values, are defined as Text. I have fixed only one unit test
case so far.
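
For anyone following along, here is a minimal, self-contained sketch of the
failure mode, assuming the loader casts each record to the type it expects.
TextRecord and ArrayRecord are hypothetical stand-ins for
org.apache.hadoop.io.Text and org.apache.hadoop.io.ArrayWritable, and neither
method below is Hama's actual code.

```java
// Sketch only: stand-in types, not the real Hadoop or Hama classes.
final class RecordCastSketch {
    private RecordCastSketch() {}

    /** Hypothetical stand-in for a Text record. */
    static final class TextRecord {
        final String value;
        TextRecord(String value) { this.value = value; }
    }

    /** Hypothetical stand-in for an ArrayWritable-style record. */
    static final class ArrayRecord {
        final String[] values;
        ArrayRecord(String... values) { this.values = values; }
    }

    /** Blind cast: throws ClassCastException when handed a TextRecord. */
    static ArrayRecord loadUnchecked(Object record) {
        return (ArrayRecord) record;
    }

    /** Checks the type first so the error names the type that actually arrived. */
    static ArrayRecord loadChecked(Object record) {
        if (!(record instanceof ArrayRecord)) {
            String actual = (record == null) ? "null" : record.getClass().getSimpleName();
            throw new IllegalArgumentException("Expected ArrayRecord but got " + actual);
        }
        return (ArrayRecord) record;
    }
}
```

The checked variant does not fix the mismatch; it only reports it more
clearly. The fix still has to make the test input match what the loader
expects.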

On Tue, Jan 8, 2013 at 4:45 PM, Apurv Verma <dapurv5@gmail.com> wrote:

> Hey all,
>  I found the problem: the partitioner was not being set for the
> PartitionerRunner BSP task. :P I have fixed the partitioner with portions
> from your patch, Suraj. After this commit the partitioner will obey what you
> specified earlier. Just to recapitulate:
>
> Repartitioning is done if :
> - the number of splits found is not equal to the number of BSP tasks
> configured for the job. OR
> - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
> - user has specified a Runtime Partitioner class and enabled runtime
> partitioning
>
> There was one special thing that I discovered about the partitioner, just
> sharing it with you guys. Suppose I implement a partitioner which returns 0
> for a record. It isn't guaranteed that this record will go to the peer with
> index 0; it might go to peer 1. The only guarantee partitioners provide is
> that all records returning 0 will go to the same peer. I needed the
> partitioner to work for the PrefixSum I was implementing.
>
> Things to do next.
> 1) RecordConverter, which Suraj is implementing in HAMA-700. (Suraj, please
> post an update.)
>
> B.T.W there are problems in mvn test.
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.ArrayWritable
>   at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:287)
>
> I don't think my commit is breaking this.
>
> Thanks
>
>
> --
> Regards,
> Apurv Verma
>
>
>
>
> On Tue, Jan 8, 2013 at 11:07 PM, Suraj Menon <surajsmenon@apache.org>
> wrote:
>
> > Please explain the nature of problems you are facing with Partitioner?
> >
> > >Any reasons for deciding to move the
> > > PartitioningJob inside BSPJobClient from BSPJob?
> >
> > Twofold. BSPJob was just a configuration holder object, and I didn't want
> > to add the partitioning responsibility to that class.
> > I also wanted to know the number of splits before deciding whether to
> > repartition or not.
> > Repartitioning is done if :
> > - the number of splits found is not equal to the number of BSP tasks
> > configured for the job. OR
> > - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
> > - user has specified a Runtime Partitioner class and enabled runtime
> > partitioning
> >
> > Thanks,
> > Suraj
> >
> > On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <dapurv5@gmail.com> wrote:
> >
> > > Thanks, let me have a careful look at it. On a cursory look, I seem to
> > > understand the basic idea. Any reasons for deciding to move the
> > > PartitioningJob inside BSPJobClient from BSPJob?
> > > BTW, the current partitioner doesn't work as intended. Only the default
> > > partitioner, HashPartitioner, works fine; if I try to plug in a custom
> > > partitioner, there are problems.
> > >
> > > Let's resolve the partitioning completely before the spilling message
> > > queue.
> > >
> > >
> > > --
> > > Regards,
> > > Apurv Verma
> > >
> > >
> > >
> > >
> > > On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <surajsmenon@apache.org>
> > > wrote:
> > >
> > > > Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
> > > > suggestions or even work on it.
> > > >
> > > > Thanks,
> > > > Suraj
> > > >
> > > > On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <dapurv5@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Edward,
> > > > >  There was a compile bug which I fixed temporarily: isPartitioned was
> > > > > not being initialized. Could you please check the last commit? I have
> > > > > currently initialized it to false, but I guess this should be
> > > > > configurable. There was some JIRA where we wanted partitioning to be
> > > > > skipped if the user thinks his data is already partitioned.
> > > > >
> > > > > Thanks again.
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Apurv Verma
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon
> > > > > <edwardyoon@apache.org> wrote:
> > > > >
> > > > > > Thanks, then I'll finish tomorrow. Please feel free to comment
> > > > > > there.
> > > > > >
> > > > > > On Tue, Jan 8, 2013 at 7:04 PM, Tommaso Teofili
> > > > > > <tommaso.teofili@gmail.com> wrote:
> > > > > > > thanks Edward, it looks good.
> > > > > > > Tommaso
> > > > > > >
> > > > > > >
> > > > > > > 2013/1/8 Edward J. Yoon <edwardyoon@apache.org>
> > > > > > >
> > > > > > >> Please review this:
> > > > > > >>
> > > > > > >> http://wiki.apache.org/hama/Partitioning
> > > > > > >>
> > > > > > > >> On Mon, Jan 7, 2013 at 6:17 AM, Edward J. Yoon
> > > > > > > >> <edwardyoon@apache.org> wrote:
> > > > > > > >> > I mean, the pre-partitioning or resizing partitions is really
> > > > > > > >> > important.
> > > > > > > >> >
> > > > > > > >> > On Mon, Jan 7, 2013 at 6:15 AM, Edward J. Yoon
> > > > > > > >> > <edwardyoon@apache.org> wrote:
> > > > > > >> >> This is another talk ...
> > > > > > >> >>
> > > > > > > >> >> Unlike MapReduce, I think, Hama BSP will handle tasks whose input
> > > > > > > >> >> is small in size but large in computational complexity, such as
> > > > > > > >> >> graph, sparse matrix, and machine learning algorithms.
> > > > > > >> >>
> > > > > > > >> >> On Mon, Jan 7, 2013 at 5:57 AM, Edward J. Yoon
> > > > > > > >> >> <edwardyoon@apache.org> wrote:
> > > > > > > >> >>> Even when the number of splits equals the number of tasks, a
> > > > > > > >> >>> user-defined partitioning job should still be run, because it
> > > > > > > >> >>> is not only for resizing partitions (for example, range
> > > > > > > >> >>> partitioning of an unsorted data set, or hash key
> > > > > > > >> >>> partitioning, etc.).
> > > > > > >> >>>
> > > > > > > >> >>> On Mon, Jan 7, 2013 at 5:28 AM, Suraj Menon
> > > > > > > >> >>> <surajsmenon@apache.org> wrote:
> > > > > > > >> >>>>>    1. I am referring to org.apache.hama.bsp.PartitioningRunner;
> > > > > > > >> >>>>>    it's named so in the HEAD (1429573) of trunk. It isn't
> > > > > > > >> >>>>>    removed, but it isn't referred to anywhere else. I can't
> > > > > > > >> >>>>>    find any references to it in the workspace.
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > > >> >>>> It is referenced in the BSPJob#waitForCompletion function, where
> > > > > > > >> >>>> it runs as a separate BSP job to create the specified splits.
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > > >> >>>>>    2. job.setPartitioner is the same as setting
> > > > > > > >> >>>>>    "bsp.input.partitioner.class". Anyway, as I see it,
> > > > > > > >> >>>>>    partitions are not being created, because of which the
> > > > > > > >> >>>>>    following happens.
> > > > > > > >> >>>>>    If I am running the task on the local fs and not hdfs,
> > > > > > > >> >>>>>    there's just one input split, and even if I set a
> > > > > > > >> >>>>>    partitioner to create two partitions and set
> > > > > > > >> >>>>>    bsp.setNumTasks(2), this is overridden and only one task
> > > > > > > >> >>>>>    is executed.
> > > > > > > >> >>>>>    See BSPJobClient#submitJobInternal(),
> > > > > > > >> >>>>>    where it does the following:
> > > > > > > >> >>>>>    job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks));
> > > > > > > >> >>>>>    (line 326).
> > > > > > >> >>>>>
> > > > > > > >> >>>> This job is set to run if the number of splits != number of
> > > > > > > >> >>>> tasks, or if forced by the configuration. I can share the
> > > > > > > >> >>>> current state of my HAMA-700 patch with you.
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > > >> >>>>>    3. So here is what I think is happening: the Partitioner
> > > > > > > >> >>>>>    is not in the code path (try putting a breakpoint inside
> > > > > > > >> >>>>>    the partitioner and executing a non-graph BSP task), so
> > > > > > > >> >>>>>    partitions are not being created and writeSplits() is
> > > > > > > >> >>>>>    returning 1.
> > > > > > > >> >>>>>    [ writeSplits() returns the number of splits in the input. ]
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > > >> >>>> Probably because it is running as a separate process?
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> --
> > > > > > >> >>> Best Regards, Edward J. Yoon
> > > > > > >> >>> @eddieyoon
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> --
> > > > > > >> >> Best Regards, Edward J. Yoon
> > > > > > >> >> @eddieyoon
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Best Regards, Edward J. Yoon
> > > > > > >> > @eddieyoon
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Best Regards, Edward J. Yoon
> > > > > > >> @eddieyoon
> > > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards, Edward J. Yoon
> > > > > > @eddieyoon
> > > > > >
> > > > >
> > > >
> > >
> >
>
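
Apurv's observation quoted above (a partitioner returning 0 does not mean
the record lands on peer 0, only that all records returning 0 land on the
same peer) can be sketched as follows. The partitionToPeer table is purely
an assumption for illustration; the thread does not say how Hama actually
assigns partitions to peers, and route() is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Sketch only: models "same partition id => same peer", nothing more.
final class PartitionToPeerSketch {
    private PartitionToPeerSketch() {}

    /**
     * Groups records by the peer that receives them.
     * partitionToPeer[p] is the (assumed, fixed) peer index for partition id p.
     */
    static Map<Integer, List<String>> route(List<String> records,
                                            ToIntFunction<String> partitioner,
                                            int[] partitionToPeer) {
        Map<Integer, List<String>> byPeer = new TreeMap<>();
        for (String record : records) {
            int partitionId = partitioner.applyAsInt(record);
            int peerIndex = partitionToPeer[partitionId];
            byPeer.computeIfAbsent(peerIndex, k -> new ArrayList<>()).add(record);
        }
        return byPeer;
    }
}
```

With partitionToPeer = {1, 0}, every record whose partition id is 0 ends up
on peer 1. Code that needs an ordering across peers (PrefixSum, for
instance) therefore cannot assume that partition id equals peer index.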

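The three repartitioning conditions listed in the thread can be written as
a single predicate. This is a minimal sketch under assumptions: the class,
method, and parameter names are hypothetical, and it assumes that "enabled
runtime partitioning" refers to the same "bsp.input.runtime.partitioning"
flag. It is not the code in BSPJobClient.

```java
// Sketch only: hypothetical names, mirrors the three conditions from the thread.
final class RepartitionDecision {
    private RepartitionDecision() {}

    static boolean needsRepartitioning(int numSplits,
                                       int numBspTasks,
                                       boolean runtimePartitioningFlag,
                                       boolean customPartitionerSet) {
        // 1) number of splits found != number of BSP tasks configured
        boolean splitMismatch = numSplits != numBspTasks;
        // 2) the user set "bsp.input.runtime.partitioning" to true
        boolean forcedByFlag = runtimePartitioningFlag;
        // 3) a runtime partitioner class is specified AND runtime partitioning is enabled
        boolean customPartitionerEnabled = customPartitionerSet && runtimePartitioningFlag;
        return splitMismatch || forcedByFlag || customPartitionerEnabled;
    }
}
```

Under that reading, condition 3 is already implied by condition 2, so a
custom partitioner class with the flag left off does not by itself trigger
repartitioning when the split count matches the task count.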