hama-dev mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 17:55:19 GMT
2013/5/6 Edward J. Yoon <edwardyoon@apache.org>

> > - Instead of running a separate job, we inject a partitioning superstep
> > before the first superstep of the job. (This has a dependency on the
> > Superstep API)
> > - The partitions instead of being written to HDFS, which is creating a copy
> > of input files in HDFS Cluster (too costly I believe), should be written to
> > local files and read from.
> > - For graph jobs, we can configure this partitioning superstep class
> > specific to graph partitioning class that partitions and loads vertices.
>
> I believe that above suggestion can be a future improvement task.
>
> > This sure has some dependencies. But would be a graceful solution and can
> > tackle every problem. This is what I want to achieve in the end. Please
> > proceed if you have any intermediate ways to reach here faster.
>
> If you understand my plan now, Please let me know so that I can start
> the work. My patch will change only few lines.
>

while Suraj's proposal is clear to me, I'm not completely sure what your
final proposal would be. Could you explain it in more detail (or perhaps
a patch to review would be enough)?


>
> Finally, I think now we can prepare the integration with NoSQLs table
> input format.
>

as I said, I'd like to have broad consensus before making any significant
change to core stuff.

thanks,
Tommaso

p.s.:
probably worth a different thread: what's the NoSQL usage scenario with
regard to Hama?



>
> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <surajsmenon@apache.org>
> wrote:
> > I am assuming that the storage of vertices (NoSQL or any other format) need
> > not be updated after every iteration.
> >
> > Based on the above assumption, I have the following suggestions:
> >
> > - Instead of running a separate job, we inject a partitioning superstep
> > before the first superstep of the job. (This has a dependency on the
> > Superstep API)
> > - The partitions instead of being written to HDFS, which is creating a copy
> > of input files in HDFS Cluster (too costly I believe), should be written to
> > local files and read from.
> > - For graph jobs, we can configure this partitioning superstep class
> > specific to graph partitioning class that partitions and loads vertices.
> >
> > This sure has some dependencies. But would be a graceful solution and can
> > tackle every problem. This is what I want to achieve in the end. Please
> > proceed if you have any intermediate ways to reach here faster.
> >
> > Regards,
> > Suraj
> >
> >
> >
> >
> > On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >
> >> P.S., BSPJob (with table input) also the same. It's not only for GraphJob.
> >>
> >> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
> >> wrote:
> >> > All,
> >> >
> >> > I've also roughly described details about design of Graph APIs[1]. To
> >> > reduce our misunderstandings (please read first Partitioning and
> >> > GraphModuleInternals documents),
> >> >
> >> >  * In NoSQLs case, there's obviously no need to Hash-partitioning or
> >> > rewrite partition files on HDFS. So, in these input cases, I think
> >> > vertex structure should be parsed at GraphJobRunner.loadVertices()
> >> > method.
> >> >
> >> > At here, we faced two options: 1) The current implementation of
> >> > 'PartitioningRunner' writes converted vertices on sequence format
> >> > partition files. And GraphJobRunner reads only Vertex Writable
> >> > objects. If input is table, we maybe have to skip the Partitioning job
> >> > and have to parse vertex structure at loadVertices() method after
> >> > checking some conditions. 2) PartitioningRunner just writes raw
> >> > records to proper partition files after checking its partition ID. And
> >> > GraphJobRunner.loadVertices() always parses and loads vertices.
> >> >
> >> > I was mean that I prefer the latter and there's no need to write
> >> > VertexWritable files. It's not related whether graph will support only
> >> > Seq format or not. Hope my explanation is enough!
> >> >
> >> > 1. http://wiki.apache.org/hama/GraphModuleInternals
> >> >
> >> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> >> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
> >> >>
> >> >> Please review and feedback whether this is acceptable.
> >> >>
> >> >>
> >> >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
> >> >>> p.s., i think theres mis understand. it doesn't mean that graph will
> >> >>> support only sequence file format. Main is whether converting at
> >> >>> patitioning stage or loadVertices stage.
> >> >>>
> >> >>> Sent from my iPhone
> >> >>>
> >> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
> >> >>>
> >> >>>> Sure, Please go ahead.
> >> >>>>
> >> >>>>
> >> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> >>>>
> >> >>>>>>> Please let me know before this is changed, I would like to work on a
> >> >>>>>>> separate branch.
> >> >>>>>
> >> >>>>> I personally, we have to focus on high priority tasks. and more
> >> >>>>> feedbacks and contributions from users. So, if changes made, I'll
> >> >>>>> release periodically. If you want to work on another place, please do.
> >> >>>>> I don't want to wait your patches.
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> >>>>>> For preparing integration with NoSQLs, of course, maybe condition
> >> >>>>>> check (whether converted or not) can be used without removing record
> >> >>>>>> converter.
> >> >>>>>>
> >> >>>>>> We need to discuss everything.
> >> >>>>>>
> >> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org> wrote:
> >> >>>>>>> I am still -1 if this means our graph module can work only on sequential
> >> >>>>>>> file format.
> >> >>>>>>> Please note that you can set record converter to null and make changes to
> >> >>>>>>> loadVertices for what you desire here.
> >> >>>>>>>
> >> >>>>>>> If we came to this design, because TextInputFormat is inefficient, would
> >> >>>>>>> this work for Avro or Thrift input format?
> >> >>>>>>> Please let me know before this is changed, I would like to work on a
> >> >>>>>>> separate branch.
> >> >>>>>>> You may proceed as you wish.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Suraj
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> >>>>>>>
> >> >>>>>>>> I think 'record converter' should be removed. It's not good idea.
> >> >>>>>>>> Moreover, it's unnecessarily complex. To keep vertex input reader, we
> >> >>>>>>>> can move related classes into common module.
> >> >>>>>>>>
> >> >>>>>>>> Let's go with my original plan.
> >> >>>>>>>>
> >> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> >>>>>>>>> Hi all,
> >> >>>>>>>>>
> >> >>>>>>>>> I'm reading our old discussions about record converter, superstep
> >> >>>>>>>>> injection, and common module:
> >> >>>>>>>>>
> >> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
> >> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
> >> >>>>>>>>>
> >> >>>>>>>>> To clarify goals and objectives:
> >> >>>>>>>>>
> >> >>>>>>>>> 1. A parallel input partition is necessary for obtaining scalability
> >> >>>>>>>>> and elasticity of a Bulk Synchronous Parallel processing (It's not a
> >> >>>>>>>>> memory issue, or Disk/Spilling Queue, or HAMA-644. Please don't
> >> >>>>>>>>> shake).
> >> >>>>>>>>> 2. Input partitioning should be handled at BSP framework level, and it
> >> >>>>>>>>> is for every Hama jobs, not only for Graph jobs.
> >> >>>>>>>>> 3. Unnecessary I/O Overhead need to be avoided, and NoSQLs input also
> >> >>>>>>>>> should be considered.
> >> >>>>>>>>>
> >> >>>>>>>>> The current problem is that every input of graph jobs should be
> >> >>>>>>>>> rewritten on HDFS. If you have a good idea, Please let me know.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Best Regards, Edward J. Yoon
> >> >>>>>>>>> @eddieyoon
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> Best Regards, Edward J. Yoon
> >> >>>>>>>> @eddieyoon
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best Regards, Edward J. Yoon
> >> >>>>>> @eddieyoon
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Best Regards, Edward J. Yoon
> >> >>>>> @eddieyoon
> >> >>>>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards, Edward J. Yoon
> >> > @eddieyoon
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>
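
For reference, option 2 from Edward's quoted message (the partitioning step
routes raw, unparsed records to partitions by partition ID, and parsing into
vertices happens only at load time) can be sketched roughly as below. This is
a language-neutral illustration in Python, not Hama code: the function names,
the tab-separated adjacency-list record format, and the CRC32-based hash are
all assumptions made for the example, and in-memory lists stand in for the
partition files.

```python
import zlib
from collections import defaultdict


def partition_id(key, num_partitions):
    """Hypothetical hash partitioner. CRC32 is used because Python's
    built-in hash() for strings is salted per process."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_raw_records(lines, num_partitions):
    """Partitioning step: route each raw record, unparsed, to a partition.
    Only the leading key field is inspected (assumed tab-separated)."""
    partitions = defaultdict(list)
    for line in lines:
        vertex_id = line.split("\t", 1)[0]
        partitions[partition_id(vertex_id, num_partitions)].append(line)
    return partitions


def load_vertices(raw_records):
    """Load step: parse raw records into vertex_id -> list of neighbours.
    This plays the role the thread assigns to loadVertices()."""
    vertices = {}
    for line in raw_records:
        vertex_id, *neighbours = line.rstrip("\n").split("\t")
        vertices[vertex_id] = neighbours
    return vertices
```

With this split, an input that is already partitioned (a table, for example)
could skip `partition_raw_records` and go straight to `load_vertices`, which
is the case the thread raises for NoSQL inputs.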
