hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 19:07:18 GMT
In short,

= Current =

BSP core: Input partitioning + converting to VertexWritable
Graph module: Reads only VertexWritable

= Future =

BSP core: Input partitioning
Graph module: Reads its partition and parses the Vertex structure
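
To make the future flow concrete, here is a minimal sketch in plain
Java (the Vertex class and parseVertex() are illustrative stand-ins,
not the actual Hama API) of how loadVertices() could parse raw Text
records such as 'a\tb\tc' directly:

import java.util.ArrayList;
import java.util.List;

public class LoadVerticesSketch {

  // Minimal stand-in for a parsed vertex: an ID plus outgoing edges.
  static class Vertex {
    final String id;
    final List<String> outEdges = new ArrayList<String>();
    Vertex(String id) { this.id = id; }
  }

  // Parses one raw record, e.g. "a\tb\tc" meaning vertex 'a' with edges
  // to 'b' and 'c'. In the future design this parsing happens inside
  // GraphJobRunner.loadVertices(), not in the partitioning step.
  static Vertex parseVertex(String rawRecord) {
    String[] tokens = rawRecord.split("\t");
    Vertex v = new Vertex(tokens[0]);
    for (int i = 1; i < tokens.length; i++) {
      v.outEdges.add(tokens[i]);
    }
    return v;
  }

  public static void main(String[] args) {
    Vertex v = parseVertex("a\tb\tc");
    System.out.println(v.id + " -> " + v.outEdges); // prints: a -> [b, c]
  }
}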



On Tue, May 7, 2013 at 3:53 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> I think there was a misunderstanding of the terms 'remove' and 'record converter'.
>
> PartitioningRunner converts records; this is what I call a 'record
> converter'. But there's no need to write converted records in
> PartitioningRunner. The Partitioner is just a partitioner in the BSP
> core module.
>
> On Tue, May 7, 2013 at 3:46 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> Currently, the PartitioningRunner writes converted records to the
>> partition files, and then GraphJobRunner reads VertexWritable/NullWritable
>> K/V records. In other words:
>>
>> 1) input record: 'a\tb\tc'  // assume the input is Text
>> 2) partition files: sequences of VertexWritable objects
>> 3) GraphJobRunner.loadVertices() reads the sequence-format partition files.
>>
>> My suggestion is to just write raw records to the partition files in
>> PartitioningRunner:
>>
>> 1) input record: 'a\tb\tc'  // assume the input is Text
>> 2) partition files: 'a\tb\tc'  // data shuffled by partition ID, but the
>> format is the same as the original.
>> 3) GraphJobRunner.loadVertices() reads records from the assigned
>> partition and parses the Vertex structure.
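>>
>> A rough sketch of this flow in plain Java (not the actual Hama API;
>> partitionFor() mimics a hash-partitioner-style assignment):
>>
>> import java.util.ArrayList;
>> import java.util.HashMap;
>> import java.util.List;
>> import java.util.Map;
>>
>> public class RawRecordPartitionSketch {
>>
>>   // Assigns a record to a partition by hashing its key (the first
>>   // field), the way a hash partitioner would.
>>   static int partitionFor(String rawRecord, int numPartitions) {
>>     String key = rawRecord.split("\t", 2)[0];
>>     return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
>>   }
>>
>>   public static void main(String[] args) {
>>     String[] input = { "a\tb\tc", "b\tc", "c\ta" };
>>     Map<Integer, List<String>> partitions =
>>         new HashMap<Integer, List<String>>();
>>     for (String record : input) {
>>       int id = partitionFor(record, 2);
>>       if (!partitions.containsKey(id)) {
>>         partitions.put(id, new ArrayList<String>());
>>       }
>>       // The record is stored as-is: shuffled by partition ID, but the
>>       // format stays identical to the original input.
>>       partitions.get(id).add(record);
>>     }
>>     System.out.println(partitions);
>>   }
>> }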
>>
>> Only a few lines will change.
>>
>> Why? As I described in the Wiki, in the NoSQL table input case (which
>> supports range or random access by sorted key), there's no need for
>> re-partitioning, because the data is already range-partitioned. This
>> means the vertex structure must be parsed in GraphJobRunner.
>>
>> With or without Suraj's suggestion, parsing the vertex structure should
>> be done in the GraphJobRunner.loadVertices() method to prepare for
>> NoSQL input formats.
>>
>> Does that make sense?
>>
>>
>> On Tue, May 7, 2013 at 2:55 AM, Tommaso Teofili
>> <tommaso.teofili@gmail.com> wrote:
>>> 2013/5/6 Edward J. Yoon <edwardyoon@apache.org>
>>>
>>>> > - Instead of running a separate job, we inject a partitioning superstep
>>>> > before the first superstep of the job. (This has a dependency on the
>>>> > Superstep API)
>>>> > - The partitions, instead of being written to HDFS, which creates a
>>>> > copy of the input files in the HDFS cluster (too costly, I believe),
>>>> > should be written to local files and read from there.
>>>> > - For graph jobs, we can configure this partitioning superstep class
>>>> > to a graph-specific partitioning class that partitions and loads
>>>> > vertices.
>>>>
>>>> I believe the above suggestion can be a future improvement task.
>>>>
>>>> > This sure has some dependencies, but it would be a graceful solution
>>>> > and can tackle every problem. This is what I want to achieve in the
>>>> > end. Please proceed if you have any intermediate ways to get there
>>>> > faster.
>>>>
>>>> If you understand my plan now, please let me know so that I can start
>>>> the work. My patch will change only a few lines.
>>>>
>>>
>>> while it's clear to me what Suraj's proposal is, I'm not completely
>>> sure about what your final proposal would be. Could you explain it in
>>> more detail (or otherwise perhaps a patch to review would be enough)?
>>>
>>>
>>>>
>>>> Finally, I think we can now prepare the integration with NoSQL table
>>>> input formats.
>>>>
>>>
>>> as I said, I'd like to have broad consensus before making any
>>> significant change to core stuff.
>>>
>>> thanks,
>>> Tommaso
>>>
>>> p.s.:
>>> probably worth a different thread: what's the NoSQL usage scenario with
>>> regard to Hama?
>>>
>>>
>>>
>>>>
>>>> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <surajsmenon@apache.org>
>>>> wrote:
>>>> > I am assuming that the storage of vertices (NoSQL or any other format)
>>>> > need not be updated after every iteration.
>>>> >
>>>> > Based on the above assumption, I have the following suggestions:
>>>> >
>>>> > - Instead of running a separate job, we inject a partitioning superstep
>>>> > before the first superstep of the job. (This has a dependency on the
>>>> > Superstep API; see the sketch below.)
>>>> > - The partitions, instead of being written to HDFS, which creates a
>>>> > copy of the input files in the HDFS cluster (too costly, I believe),
>>>> > should be written to local files and read from there.
>>>> > - For graph jobs, we can configure this partitioning superstep class
>>>> > to a graph-specific partitioning class that partitions and loads
>>>> > vertices.
>>>> >
>>>> > This sure has some dependencies, but it would be a graceful solution
>>>> > and can tackle every problem. This is what I want to achieve in the
>>>> > end. Please proceed if you have any intermediate ways to get there
>>>> > faster.
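>>>> >
>>>> > A rough sketch of the injection idea in plain Java (hypothetical
>>>> > API; Hama's real Superstep class may look different):
>>>> >
>>>> > // Hypothetical minimal superstep abstraction.
>>>> > abstract class Superstep {
>>>> >   abstract void compute();
>>>> > }
>>>> >
>>>> > class PartitioningSuperstep extends Superstep {
>>>> >   // Injected before the user's first superstep: reads the raw
>>>> >   // input split, routes each record to the peer owning its
>>>> >   // partition, and spills the partition to a *local* file instead
>>>> >   // of copying it to HDFS.
>>>> >   void compute() {
>>>> >     // 1. read raw records from the assigned input split
>>>> >     // 2. send each record to the peer owning its partition ID
>>>> >     // 3. write received records to a local file for later supersteps
>>>> >   }
>>>> > }
>>>> >
>>>> > class GraphPartitioningSuperstep extends PartitioningSuperstep {
>>>> >   // Graph-job variant: would additionally parse each raw record
>>>> >   // into a Vertex and load it.
>>>> > }
>>>> >
>>>> > // Injection: the framework would prepend the configured class to
>>>> > // the job's superstep chain, conceptually:
>>>> > //   supersteps = [PartitioningSuperstep] + userSupersteps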
>>>> >
>>>> > Regards,
>>>> > Suraj
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >
>>>> >> P.S., BSPJob (with table input) is also the same; it's not only for
>>>> >> GraphJob.
>>>> >>
>>>> >> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>>>> >> wrote:
>>>> >> > All,
>>>> >> >
>>>> >> > I've also roughly described the details of the Graph API design[1].
>>>> >> > To reduce our misunderstandings (please first read the Partitioning
>>>> >> > and GraphModuleInternals documents):
>>>> >> >
>>>> >> >  * In the NoSQL case, there's obviously no need to hash-partition
>>>> >> > or rewrite partition files on HDFS. So, for these inputs, I think the
>>>> >> > vertex structure should be parsed in the GraphJobRunner.loadVertices()
>>>> >> > method.
>>>> >> >
>>>> >> > Here we face two options: 1) The current implementation of
>>>> >> > PartitioningRunner writes converted vertices to sequence-format
>>>> >> > partition files, and GraphJobRunner reads only VertexWritable
>>>> >> > objects. If the input is a table, we may have to skip the
>>>> >> > partitioning job and parse the vertex structure in the
>>>> >> > loadVertices() method after checking some conditions (a rough
>>>> >> > sketch of this check follows below). 2) PartitioningRunner just
>>>> >> > writes raw records to the proper partition files after checking
>>>> >> > their partition IDs, and GraphJobRunner.loadVertices() always
>>>> >> > parses and loads vertices.
>>>> >> >
>>>> >> > I meant that I prefer the latter, and that there's no need to write
>>>> >> > VertexWritable files. This is not about whether the graph module
>>>> >> > will support only the Seq format or not. I hope my explanation is
>>>> >> > enough!
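>>>> >> >
>>>> >> > A tiny sketch of the option-1 check in plain Java (the helper
>>>> >> > name is hypothetical, not actual Hama API):
>>>> >> >
>>>> >> > public class InputCheckSketch {
>>>> >> >   // A NoSQL table input with sorted-key range access is already
>>>> >> >   // range-partitioned, so the partitioning job can be skipped
>>>> >> >   // and vertices parsed directly in loadVertices().
>>>> >> >   static boolean needsPartitioningJob(boolean isRangePartitionedTable) {
>>>> >> >     return !isRangePartitionedTable;
>>>> >> >   }
>>>> >> >
>>>> >> >   public static void main(String[] args) {
>>>> >> >     System.out.println(needsPartitioningJob(true));  // false: skip it
>>>> >> >     System.out.println(needsPartitioningJob(false)); // true: run it
>>>> >> >   }
>>>> >> > }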
>>>> >> >
>>>> >> > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>> >> >
>>>> >> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> >> I've described my big picture here:
>>>> >> >> http://wiki.apache.org/hama/Partitioning
>>>> >> >>
>>>> >> >> Please review and give feedback on whether this is acceptable.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>>>> >> >>> P.S., I think there's a misunderstanding. It doesn't mean that the
>>>> >> >>> graph module will support only the sequence file format. The main
>>>> >> >>> question is whether to convert at the partitioning stage or at the
>>>> >> >>> loadVertices stage.
>>>> >> >>>
>>>> >> >>> Sent from my iPhone
>>>> >> >>>
>>>> >> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
>>>> >> >>>
>>>> >> >>>> Sure, Please go ahead.
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> >>>>
>>>> >> >>>>>>> Please let me know before this is changed; I would like to
>>>> >> >>>>>>> work on a separate branch.
>>>> >> >>>>>
>>>> >> >>>>> Personally, I think we have to focus on high-priority tasks, and
>>>> >> >>>>> on getting more feedback and contributions from users. So, if
>>>> >> >>>>> changes are made, I'll release periodically. If you want to work
>>>> >> >>>>> in another place, please do. I don't want to wait for your
>>>> >> >>>>> patches.
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> >>>>>> To prepare the integration with NoSQLs, of course, a condition
>>>> >> >>>>>> check (whether converted or not) could be used without removing
>>>> >> >>>>>> the record converter.
>>>> >> >>>>>>
>>>> >> >>>>>> We need to discuss everything.
>>>> >> >>>>>>
>>>> >> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org> wrote:
>>>> >> >>>>>>> I am still -1 if this means our graph module can work only on
>>>> >> >>>>>>> the sequential file format.
>>>> >> >>>>>>> Please note that you can set the record converter to null and
>>>> >> >>>>>>> make changes to loadVertices for what you desire here.
>>>> >> >>>>>>>
>>>> >> >>>>>>> If we came to this design because TextInputFormat is
>>>> >> >>>>>>> inefficient, would this work for Avro or Thrift input formats?
>>>> >> >>>>>>> Please let me know before this is changed; I would like to
>>>> >> >>>>>>> work on a separate branch.
>>>> >> >>>>>>> You may proceed as you wish.
>>>> >> >>>>>>>
>>>> >> >>>>>>> Regards,
>>>> >> >>>>>>> Suraj
>>>> >> >>>>>>>
>>>> >> >>>>>>>
>>>> >> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> >>>>>>>
>>>> >> >>>>>>>> I think the 'record converter' should be removed. It's not a
>>>> >> >>>>>>>> good idea; moreover, it's unnecessarily complex. To keep the
>>>> >> >>>>>>>> vertex input reader, we can move the related classes into the
>>>> >> >>>>>>>> common module.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Let's go with my original plan.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> >>>>>>>>> Hi all,
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I'm reading our old discussions about the record converter,
>>>> >> >>>>>>>>> superstep injection, and the common module:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>> >> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> To clarify goals and objectives:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining the
>>>> >> >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>>>> >> >>>>>>>>> processing (it's not a memory issue, nor the Disk/Spilling
>>>> >> >>>>>>>>> Queue, nor HAMA-644, so please don't mix these up).
>>>> >> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>>>> >> >>>>>>>>> level, and it applies to every Hama job, not only graph jobs.
>>>> >> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>>> >> >>>>>>>>> inputs should also be considered.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> The current problem is that every input of a graph job must
>>>> >> >>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
>>>> >> >>>>>>>>> know.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> --
>>>> >> >>>>>>>>> Best Regards, Edward J. Yoon
>>>> >> >>>>>>>>> @eddieyoon
>>>> >> >>>>>>>>
>>>> >> >>>>>>>>
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> --
>>>> >> >>>>>>>> Best Regards, Edward J. Yoon
>>>> >> >>>>>>>> @eddieyoon
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>> --
>>>> >> >>>>>> Best Regards, Edward J. Yoon
>>>> >> >>>>>> @eddieyoon
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>>>> --
>>>> >> >>>>> Best Regards, Edward J. Yoon
>>>> >> >>>>> @eddieyoon
>>>> >> >>>>>
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Best Regards, Edward J. Yoon
>>>> >> >> @eddieyoon
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Best Regards, Edward J. Yoon
>>>> >> > @eddieyoon
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Best Regards, Edward J. Yoon
>>>> >> @eddieyoon
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
