hama-dev mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 17:08:50 GMT
2013/5/6 Suraj Menon <surajsmenon@apache.org>

> I am assuming that the storage of vertices (NoSQL or any other format) need
> not be updated after every iteration.
>
> Based on the above assumption, I have the following suggestions:
>
> - Instead of running a separate job, we inject a partitioning superstep
> before the first superstep of the job. (This has a dependency on the
> Superstep API)
>

Could we do that without introducing that dependency? I mean, would it also
work when the Superstep API is not used on the client side?
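
For concreteness, here is a rough sketch (plain Java, deliberately not the
actual Superstep API) of the routing work such an injected partitioning step
would have to do for each raw record; the class and method names below are
only illustrative assumptions:

import org.apache.hadoop.io.Text;

// Illustrative only: hash-based partition assignment, the core of an
// injected partitioning step. Each raw record would be routed (sent) to the
// peer that owns its partition before the first "real" superstep runs.
public class PartitionRouting {

  /** Returns the index of the peer that should own the given record key. */
  static int targetPeer(Text recordKey, int numPeers) {
    // Mask the sign bit so the modulo result is always non-negative.
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numPeers;
  }

  public static void main(String[] args) {
    // Example: route a few vertex IDs across 4 peers.
    int numPeers = 4;
    for (String id : new String[] { "vertexA", "vertexB", "vertexC" }) {
      System.out.println(id + " -> peer " + targetPeer(new Text(id), numPeers));
    }
  }
}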


> - Instead of being written to HDFS, which creates a copy of the input files
> in the HDFS cluster (too costly, I believe), the partitions should be
> written to local files and read from there.
>

+1
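
To make the local-file idea concrete, here is a minimal sketch of writing one
partition's raw records to local disk and reading them back, using Hadoop's
local FileSystem and a SequenceFile; the path and the key/value types are
placeholders, not what the framework would actually use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: keep a partition on the task's local disk instead of copying the
// input into HDFS again. Path and record types are illustrative placeholders.
public class LocalPartitionFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem localFs = FileSystem.getLocal(conf); // local FS, no HDFS copy
    Path part = new Path("/tmp/partition-0.seq");   // placeholder location

    // Write the raw records that belong to this partition.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(localFs, conf, part, Text.class, Text.class);
    try {
      writer.append(new Text("vertexA"), new Text("vertexB|vertexC"));
    } finally {
      writer.close();
    }

    // Read them back when the next superstep (or loadVertices) needs them.
    SequenceFile.Reader reader = new SequenceFile.Reader(localFs, part, conf);
    try {
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + " -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}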


> - For graph jobs, we can configure this partitioning superstep class to be a
> graph-specific partitioning class that partitions and loads vertices.
>

This seems to be in line with the above assumption, so it probably makes
sense.
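
If the graph-specific step both partitions and loads vertices, the per-record
work would roughly be: parse the raw record into a vertex ID plus adjacency
list, then route it with the partitioner. A small sketch, assuming an
"id<TAB>neighbor|neighbor|..." text layout purely as an example format:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of parsing one raw text record into a vertex structure. The
// "id<TAB>neighbor|neighbor|..." layout is an assumed example format,
// not something the graph module prescribes.
public class VertexParsing {

  static final class ParsedVertex {
    final String id;
    final List<String> neighbors;
    ParsedVertex(String id, List<String> neighbors) {
      this.id = id;
      this.neighbors = neighbors;
    }
  }

  static ParsedVertex parse(String rawRecord) {
    String[] parts = rawRecord.split("\t", 2);
    List<String> neighbors = (parts.length > 1 && !parts[1].isEmpty())
        ? Arrays.asList(parts[1].split("\\|"))
        : Collections.<String>emptyList();
    return new ParsedVertex(parts[0], neighbors);
  }

  public static void main(String[] args) {
    ParsedVertex v = parse("vertexA\tvertexB|vertexC");
    System.out.println(v.id + " -> " + v.neighbors); // vertexA -> [vertexB, vertexC]
  }
}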


>
> This surely has some dependencies, but it would be a graceful solution and
> could tackle every problem. This is what I want to achieve in the end. Please
> proceed if you have an intermediate way to get there faster.
>

Your solution generally sounds good to me; it would be better if we can avoid
the dependency, but it is still OK if not.
Let's also collect others' opinions and try to reach a shared consensus.

Tommaso




>
> Regards,
> Suraj
>
>
>
>
> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>
> > P.S., BSPJob (with table input) is also the same. It's not only for
> > GraphJob.
> >
> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
> > wrote:
> > > All,
> > >
> > > I've also roughly described some details about the design of the Graph
> > > APIs[1]. To reduce our misunderstandings (please read the Partitioning
> > > and GraphModuleInternals documents first):
> > >
> > >  * In the NoSQL case, there's obviously no need for hash-partitioning or
> > > rewriting partition files on HDFS. So, for these inputs, I think the
> > > vertex structure should be parsed in the GraphJobRunner.loadVertices()
> > > method.
> > >
> > > Here we face two options: 1) The current implementation of
> > > 'PartitioningRunner' writes converted vertices to sequence-format
> > > partition files, and GraphJobRunner reads only Vertex Writable objects.
> > > If the input is a table, we may have to skip the partitioning job and
> > > parse the vertex structure in the loadVertices() method after checking
> > > some conditions. 2) PartitioningRunner just writes raw records to the
> > > proper partition files after checking their partition IDs, and
> > > GraphJobRunner.loadVertices() always parses and loads vertices.
> > >
> > > I meant that I prefer the latter, so there's no need to write
> > > VertexWritable files. It's not related to whether the graph module will
> > > support only the Seq format or not. I hope my explanation is enough!
> > >
> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
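
To make the two options above concrete: option 1 keeps pre-converted vertices
in the partition files and needs a condition check for inputs (such as tables)
that skip the partitioning job, while option 2 always stores raw records and
lets loadVertices() parse them. A small sketch of that condition check; the
property name below is hypothetical, used only for illustration, and is not an
existing Hama configuration key:

import org.apache.hadoop.conf.Configuration;

// Sketch of the "was the input already converted?" branch that option 1 would
// need. The property name is hypothetical. Option 2 removes the branch
// entirely by always parsing raw records inside loadVertices().
public class LoadVerticesDecision {

  static String describe(Configuration conf) {
    boolean alreadyConverted =
        conf.getBoolean("graph.input.already.converted", false); // hypothetical key
    return alreadyConverted
        ? "read pre-converted vertex objects straight from the partition files"
        : "parse raw records into vertices inside loadVertices()";
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    System.out.println(describe(conf)); // default: parse raw records
    conf.setBoolean("graph.input.already.converted", true);
    System.out.println(describe(conf)); // table input path: skip re-parsing
  }
}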
> > >
> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >> I've described my big picture here:
> > >> http://wiki.apache.org/hama/Partitioning
> > >>
> > >> Please review and give feedback on whether this is acceptable.
> > >>
> > >>
> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
> > >>> P.S., I think there's a misunderstanding. It doesn't mean that the
> > >>> graph module will support only the sequence file format. The main
> > >>> question is whether to convert at the partitioning stage or the
> > >>> loadVertices stage.
> > >>>
> > >>> Sent from my iPhone
> > >>>
> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
> > >>>
> > >>>> Sure, Please go ahead.
> > >>>>
> > >>>>
> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >>>>
> > >>>>>>> Please let me know before this is changed, I would like to work on
> > >>>>>>> a separate branch.
> > >>>>>
> > >>>>> Personally, I think we have to focus on high-priority tasks and on
> > >>>>> getting more feedback and contributions from users. So, as changes
> > >>>>> are made, I'll release periodically. If you want to work in another
> > >>>>> place, please do. I don't want to wait for your patches.
> > >>>>>
> > >>>>>
> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >>>>>> To prepare for integration with NoSQLs, of course, maybe a condition
> > >>>>>> check (whether converted or not) can be used without removing the
> > >>>>>> record converter.
> > >>>>>>
> > >>>>>> We need to discuss everything.
> > >>>>>>
> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org> wrote:
> > >>>>>>> I am still -1 if this means our graph module can work only on the
> > >>>>>>> sequential file format.
> > >>>>>>> Please note that you can set the record converter to null and make
> > >>>>>>> changes to loadVertices for what you desire here.
> > >>>>>>>
> > >>>>>>> If we came to this design because TextInputFormat is inefficient,
> > >>>>>>> would this work for an Avro or Thrift input format?
> > >>>>>>> Please let me know before this is changed; I would like to work on
> > >>>>>>> a separate branch.
> > >>>>>>> You may proceed as you wish.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Suraj
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >>>>>>>
> > >>>>>>>> I think the 'record converter' should be removed. It's not a good
> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep the vertex
> > >>>>>>>> input reader, we can move the related classes into the common
> > >>>>>>>> module.
> > >>>>>>>>
> > >>>>>>>> Let's go with my original plan.
> > >>>>>>>>
> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > >>>>>>>>> Hi all,
> > >>>>>>>>>
> > >>>>>>>>> I'm reading our old discussions about record converter, superstep
> > >>>>>>>>> injection, and common module:
> > >>>>>>>>>
> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
> > >>>>>>>>>
> > >>>>>>>>> To clarify goals and objectives:
> > >>>>>>>>>
> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining
> > >>>>>>>>> scalability and elasticity in Bulk Synchronous Parallel
> > >>>>>>>>> processing (it's not a memory issue, or a Disk/Spilling Queue
> > >>>>>>>>> issue, or HAMA-644; please don't mix these up).
> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
> > >>>>>>>>> level, and it applies to every Hama job, not only Graph jobs.
> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
> > >>>>>>>>> inputs should also be considered.
> > >>>>>>>>>
> > >>>>>>>>> The current problem is that every input of a graph job has to
> > >>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
> > >>>>>>>>> know.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Best Regards, Edward J. Yoon
> > >>>>>>>>> @eddieyoon
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Best Regards, Edward J. Yoon
> > >>>>>>>> @eddieyoon
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Best Regards, Edward J. Yoon
> > >>>>>> @eddieyoon
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Best Regards, Edward J. Yoon
> > >>>>> @eddieyoon
> > >>>>>
> > >>
> > >>
> > >>
> > >> --
> > >> Best Regards, Edward J. Yoon
> > >> @eddieyoon
> > >
> > >
> > >
> > > --
> > > Best Regards, Edward J. Yoon
> > > @eddieyoon
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
> >
>
