hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 18:46:59 GMT
Currently, the PartitioningRunner writes converted records to the
partition files, and GraphJobRunner then reads VertexWritable/NullWritable
key/value records. In other words:

1) input record: 'a\tb\tc'  // assume the input is Text
2) partition files: a sequence of VertexWritable records
3) GraphJobRunner.loadVertices() reads the sequence-format partition files.

My suggestion is to just write the raw records to the partition files in
PartitioningRunner (see the sketch after the list below):

1) input record: 'a\tb\tc'  // assume the input is Text
2) partition files: 'a\tb\tc'  // data is shuffled by partition ID, but
the format is the same as the original.
3) GraphJobRunner.loadVertices() reads the records from its assigned
partition and parses the vertex structure.
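
To make the idea concrete, below is a minimal sketch of the
pass-through write, in plain Java rather than against Hama's actual
classes: RawPartitioningSketch and assignPartition are hypothetical
names, and a real PartitioningRunner would write to per-partition
files via the job's configured Partitioner instead of the in-memory
map used here.

import java.util.HashMap;
import java.util.Map;

// A minimal sketch of the proposed pass-through partitioning step.
// Records are NOT converted to VertexWritable here; they are written
// out exactly as they were read. All names are illustrative, not
// Hama's actual API.
public class RawPartitioningSketch {

  // Assign a partition ID by hashing the vertex ID (the first
  // tab-separated field of a record like "a\tb\tc").
  static int assignPartition(String rawRecord, int numPartitions) {
    String vertexId = rawRecord.split("\t", 2)[0];
    return Math.abs(vertexId.hashCode() % numPartitions);
  }

  public static void main(String[] args) {
    int numPartitions = 3;
    String[] input = { "a\tb\tc", "b\tc", "c\ta" };

    // Partition ID -> raw lines; stands in for the partition files.
    Map<Integer, StringBuilder> partitions = new HashMap<>();
    for (String record : input) {
      int pid = assignPartition(record, numPartitions);
      partitions.computeIfAbsent(pid, k -> new StringBuilder())
                .append(record).append('\n');  // unchanged raw record
    }
    // Each partition now holds raw Text records; parsing into Vertex
    // objects is deferred to GraphJobRunner.loadVertices().
    partitions.forEach((pid, data) ->
        System.out.println("partition-" + pid + ":\n" + data));
  }
}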

Only a few lines will need to change.

Why? As I described in the wiki, in the NoSQL table input case (which
supports range or random access by sorted key), there's no need for
re-partitioning, because the data is already range-partitioned. It means
that parsing the vertex structure is needed in GraphJobRunner.

With or without Suraj's suggestion, parsing the vertex structure should be
done in the GraphJobRunner.loadVertices() method to prepare for NoSQL
input formats.
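
For the loadVertices() side, here is a matching sketch of the parsing
step it would then always perform, assuming the tab-separated Text
format from the example above; LoadVerticesSketch, parseVertex, and
the simplified Vertex record are hypothetical stand-ins for the real
graph classes.

import java.util.Arrays;
import java.util.List;

// A sketch of the parsing that loadVertices() would always do under
// this proposal, regardless of whether the input came from HDFS
// partition files or an already range-partitioned NoSQL table.
public class LoadVerticesSketch {

  // Hypothetical, simplified vertex: an ID plus outgoing edge IDs.
  record Vertex(String id, List<String> outEdges) {}

  // Parse a raw record like "a\tb\tc" into vertex 'a' with outgoing
  // edges to 'b' and 'c'.
  static Vertex parseVertex(String rawRecord) {
    String[] fields = rawRecord.split("\t");
    return new Vertex(fields[0],
        Arrays.asList(fields).subList(1, fields.length));
  }

  public static void main(String[] args) {
    // Raw records read back from this task's assigned partition.
    for (String raw : new String[] { "a\tb\tc", "b\tc" }) {
      System.out.println(parseVertex(raw));
    }
  }
}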

Does that make sense?


On Tue, May 7, 2013 at 2:55 AM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:
> 2013/5/6 Edward J. Yoon <edwardyoon@apache.org>
>
>> > - Instead of running a separate job, we inject a partitioning superstep
>> > before the first superstep of the job. (This has a dependency on the
>> > Superstep API.)
>> > - The partitions, instead of being written to HDFS, which creates a copy
>> > of the input files in the HDFS cluster (too costly, I believe), should be
>> > written to local files and read from there.
>> > - For graph jobs, we can configure this partitioning superstep class to a
>> > graph-specific partitioning class that partitions and loads the vertices.
>>
>> I believe that the above suggestion can be a future improvement task.
>>
>> > This surely has some dependencies, but it would be a graceful solution
>> > that can tackle every problem. This is what I want to achieve in the end.
>> > Please proceed if you have any intermediate way to get there faster.
>>
>> If you understand my plan now, please let me know so that I can start
>> the work. My patch will change only a few lines.
>>
>
> while to me it's clear what Suraj's proposal is, I'm not completely sure
> about what your final proposal would be; could you explain it in more
> detail (or otherwise perhaps a patch to review is enough)?
>
>
>>
>> Finally, I think we can now prepare the integration with NoSQL table
>> input formats.
>>
>
> as I said, I'd like to have broad consensus before making any significant
> change to core stuff.
>
> thanks,
> Tommaso
>
> p.s.:
> probably worth a different thread: what's the NoSQL usage scenario with
> regard to Hama?
>
>
>
>>
>> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <surajsmenon@apache.org>
>> wrote:
>> > I am assuming that the storage of vertices (NoSQL or any other format)
>> > need not be updated after every iteration.
>> >
>> > Based on the above assumption, I have the following suggestions:
>> >
>> > - Instead of running a separate job, we inject a partitioning superstep
>> > before the first superstep of the job. (This has a dependency on the
>> > Superstep API.)
>> > - The partitions, instead of being written to HDFS, which creates a copy
>> > of the input files in the HDFS cluster (too costly, I believe), should be
>> > written to local files and read from there.
>> > - For graph jobs, we can configure this partitioning superstep class to a
>> > graph-specific partitioning class that partitions and loads the vertices.
>> >
>> > This surely has some dependencies, but it would be a graceful solution
>> > that can tackle every problem. This is what I want to achieve in the end.
>> > Please proceed if you have any intermediate way to get there faster.
>> >
>> > Regards,
>> > Suraj
>> >
>> >
>> >
>> >
>> >> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org>
>> >> wrote:
>> >
>> >> P.S.: BSPJob (with table input) is also the same; it's not only for
>> >> GraphJob.
>> >>
>> >> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>> >> wrote:
>> >> > All,
>> >> >
>> >> > I've also roughly described the design details of the Graph APIs [1].
>> >> > To reduce our misunderstandings (please first read the Partitioning
>> >> > and GraphModuleInternals documents):
>> >> >
>> >> >  * In the NoSQL case, there's obviously no need for hash-partitioning
>> >> > or rewriting partition files on HDFS. So, for these inputs, I think
>> >> > the vertex structure should be parsed in the
>> >> > GraphJobRunner.loadVertices() method.
>> >> >
>> >> > Here, we face two options: 1) the current implementation of
>> >> > 'PartitioningRunner' writes converted vertices to sequence-format
>> >> > partition files, and GraphJobRunner reads only VertexWritable
>> >> > objects; if the input is a table, we may have to skip the
>> >> > partitioning job and parse the vertex structure in the loadVertices()
>> >> > method after checking some conditions. 2) PartitioningRunner just
>> >> > writes raw records to the proper partition files after checking their
>> >> > partition IDs, and GraphJobRunner.loadVertices() always parses and
>> >> > loads vertices.
>> >> >
>> >> > I meant that I prefer the latter, so there's no need to write
>> >> > VertexWritable files. It's not related to whether graph jobs will
>> >> > support only the sequence format or not. I hope my explanation is
>> >> > enough!
>> >> >
>> >> > 1. http://wiki.apache.org/hama/GraphModuleInternals
>> >> >
>> >> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org>
>> >> > wrote:
>> >> >> I've described my big picture here:
>> >> >> http://wiki.apache.org/hama/Partitioning
>> >> >>
>> >> >> Please review it and give feedback on whether this is acceptable.
>> >> >>
>> >> >>
>> >> >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>> >> >>> P.S., I think there's a misunderstanding. It doesn't mean that
>> >> >>> graph will support only the sequence file format. The main
>> >> >>> question is whether to convert at the partitioning stage or at
>> >> >>> the loadVertices stage.
>> >> >>>
>> >> >>> Sent from my iPhone
>> >> >>>
>> >> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com>
>> >> >>> wrote:
>> >> >>>
>> >> >>>> Sure, please go ahead.
>> >> >>>>
>> >> >>>>
>> >> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org>
>> >> >>>> wrote:
>> >> >>>>
>> >> >>>>>>> Please let me know before this is changed, I would like to
>> >> >>>>>>> work on a separate branch.
>> >> >>>>>
>> >> >>>>> Personally, I think we have to focus on high-priority tasks,
>> >> >>>>> and on more feedback and contributions from users. So, if
>> >> >>>>> changes are made, I'll release periodically. If you want to
>> >> >>>>> work in another place, please do. I don't want to wait for
>> >> >>>>> your patches.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org>
>> >> >>>>> wrote:
>> >> >>>>>> To prepare the integration with NoSQLs, of course, maybe a
>> >> >>>>>> condition check (whether it's converted or not) can be used
>> >> >>>>>> without removing the record converter.
>> >> >>>>>>
>> >> >>>>>> We need to discuss everything.
>> >> >>>>>>
>> >> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org>
>> >> >>>>>> wrote:
>> >> >>>>>>> I am still -1 if this means our graph module can work only
>> >> >>>>>>> on the sequence file format.
>> >> >>>>>>> Please note that you can set the record converter to null
>> >> >>>>>>> and make changes to loadVertices for what you desire here.
>> >> >>>>>>>
>> >> >>>>>>> If we came to this design because TextInputFormat is
>> >> >>>>>>> inefficient, would this work for Avro or Thrift input
>> >> >>>>>>> formats?
>> >> >>>>>>> Please let me know before this is changed, I would like to
>> >> >>>>>>> work on a separate branch.
>> >> >>>>>>> You may proceed as you wish.
>> >> >>>>>>>
>> >> >>>>>>> Regards,
>> >> >>>>>>> Suraj
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>> >> >>>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> I think the 'record converter' should be removed. It's not
>> >> >>>>>>>> a good idea; moreover, it's unnecessarily complex. To keep
>> >> >>>>>>>> the vertex input reader, we can move the related classes
>> >> >>>>>>>> into the common module.
>> >> >>>>>>>>
>> >> >>>>>>>> Let's go with my original plan.
>> >> >>>>>>>>
>> >> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org>
>> >> >>>>>>>> wrote:
>> >> >>>>>>>>> Hi all,
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm reading our old discussions about the record converter,
>> >> >>>>>>>>> superstep injection, and the common module:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>> >> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>> >> >>>>>>>>>
>> >> >>>>>>>>> To clarify goals and objectives:
>> >> >>>>>>>>>
>> >> >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining
>> >> >>>>>>>>> the scalability and elasticity of Bulk Synchronous Parallel
>> >> >>>>>>>>> processing (it's not a memory issue, or the disk/spilling
>> >> >>>>>>>>> queue, or HAMA-644; please don't mix these up).
>> >> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>> >> >>>>>>>>> level, and it is for every Hama job, not only for graph
>> >> >>>>>>>>> jobs.
>> >> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>> >> >>>>>>>>> input should also be considered.
>> >> >>>>>>>>>
>> >> >>>>>>>>> The current problem is that every graph job input has to be
>> >> >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me
>> >> >>>>>>>>> know.
>> >> >>>>>>>>>
>> >> >>>>>>>>> --
>> >> >>>>>>>>> Best Regards, Edward J. Yoon
>> >> >>>>>>>>> @eddieyoon
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> --
>> >> >>>>>>>> Best Regards, Edward J. Yoon
>> >> >>>>>>>> @eddieyoon
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> --
>> >> >>>>>> Best Regards, Edward J. Yoon
>> >> >>>>>> @eddieyoon
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Best Regards, Edward J. Yoon
>> >> >>>>> @eddieyoon
>> >> >>>>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards, Edward J. Yoon
>> >> > @eddieyoon
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
