hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 07:09:05 GMT
All,

I've also roughly described the design of the Graph APIs[1]. To
reduce our misunderstandings, please read the Partitioning and
GraphModuleInternals documents first.

 * In the NoSQL case, there's obviously no need to hash-partition or
rewrite partition files on HDFS. So for these inputs, I think the
vertex structure should be parsed in the GraphJobRunner.loadVertices()
method.

Here we face two options: 1) The current implementation of
PartitioningRunner writes the converted vertices to sequence-format
partition files, and GraphJobRunner reads only VertexWritable
objects. If the input is a table, we may have to skip the partitioning
job and parse the vertex structure in the loadVertices() method after
checking some conditions. 2) PartitioningRunner just writes the raw
records to the proper partition files after checking their partition
IDs, and GraphJobRunner.loadVertices() always parses and loads the
vertices.
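As a rough illustration of option 2, the partitioning step could look something like the sketch below (a hypothetical class and record layout, not the actual PartitioningRunner code; it assumes tab-separated text records whose vertex ID precedes the first tab):

```java
// Hypothetical sketch of option 2's partitioning step: raw records are
// routed to partition buckets by a hash of the vertex ID, without being
// converted to vertex objects. Parsing is deferred to loadVertices().
import java.util.ArrayList;
import java.util.List;

public class RawRecordPartitioner {
  // Assumed record layout: the vertex ID is everything before the first tab.
  static String vertexIdOf(String rawRecord) {
    int tab = rawRecord.indexOf('\t');
    return tab < 0 ? rawRecord : rawRecord.substring(0, tab);
  }

  // Route each raw record, unmodified, to one of numPartitions buckets.
  static List<List<String>> partition(List<String> rawRecords, int numPartitions) {
    List<List<String>> buckets = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      buckets.add(new ArrayList<>());
    }
    for (String record : rawRecords) {
      int partitionId = Math.abs(vertexIdOf(record).hashCode() % numPartitions);
      buckets.get(partitionId).add(record); // the record stays raw here
    }
    return buckets;
  }
}
```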

I meant that I prefer the latter, so there's no need to write
VertexWritable files. This isn't related to whether the graph module
will support only the sequence format or not. I hope my explanation is
enough!

1. http://wiki.apache.org/hama/GraphModuleInternals
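For concreteness, the latter approach (always parsing the raw records at load time) might be sketched as follows. This is a hypothetical, simplified stand-in for GraphJobRunner.loadVertices(), assuming tab-separated adjacency-list records; it is not the actual Hama API:

```java
// Sketch: loadVertices() always parses raw records into vertex objects,
// regardless of whether the input came from HDFS partition files or a
// NoSQL table. Class names and record layout are illustrative only.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoadVerticesSketch {
  // Minimal stand-in for a parsed vertex: an ID plus outgoing edges.
  static class Vertex {
    final String id;
    final List<String> edges;
    Vertex(String id, List<String> edges) {
      this.id = id;
      this.edges = edges;
    }
  }

  // Parse every raw record; nothing was pre-converted at partitioning time.
  static List<Vertex> loadVertices(List<String> rawRecords) {
    List<Vertex> vertices = new ArrayList<>();
    for (String record : rawRecords) {
      // Assumed record layout: "vertexId\tneighbor1 neighbor2 ..."
      String[] parts = record.split("\t", 2);
      List<String> edges = (parts.length > 1 && !parts[1].isEmpty())
          ? Arrays.asList(parts[1].split(" "))
          : new ArrayList<>();
      vertices.add(new Vertex(parts[0], edges));
    }
    return vertices;
  }
}
```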

On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
>
> Please review and feedback whether this is acceptable.
>
>
> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>> p.s., I think there's a misunderstanding. It doesn't mean that the graph module will support only
>> the sequence file format. The main question is whether to convert at the partitioning stage or the loadVertices stage.
>>
>> Sent from my iPhone
>>
>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
>>
>>> Sure, Please go ahead.
>>>
>>>
>>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>
>>>>>> Please let me know before this is changed, I would like to work on a
>>>>>> separate branch.
>>>>
>>>> Personally, I think we have to focus on high-priority tasks, and on more
>>>> feedback and contributions from users. So, if changes are made, I'll
>>>> release periodically. If you want to work in another place, please do.
>>>> I don't want to wait for your patches.
>>>>
>>>>
>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>> wrote:
>>>>> To prepare for integration with NoSQLs, of course, maybe a condition
>>>>> check (whether converted or not) can be used without removing the
>>>>> record converter.
>>>>>
>>>>> We need to discuss everything.
>>>>>
>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org>
>>>>> wrote:
>>>>>> I am still -1 if this means our graph module can work only on the
>>>>>> sequence file format.
>>>>>> Please note that you can set the record converter to null and make
>>>>>> changes to loadVertices for what you desire here.
>>>>>>
>>>>>> If we came to this design because TextInputFormat is inefficient, would
>>>>>> this work for the Avro or Thrift input formats?
>>>>>> Please let me know before this is changed, I would like to work on a
>>>>>> separate branch.
>>>>>> You may proceed as you wish.
>>>>>>
>>>>>> Regards,
>>>>>> Suraj
>>>>>>
>>>>>>
>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I think the 'record converter' should be removed. It's not a good
>>>>>>> idea; moreover, it's unnecessarily complex. To keep the vertex input
>>>>>>> reader, we can move the related classes into the common module.
>>>>>>>
>>>>>>> Let's go with my original plan.
>>>>>>>
>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>>> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm reading our old discussions about record converter, superstep
>>>>>>>> injection, and common module:
>>>>>>>>
>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>>>
>>>>>>>> To clarify goals and objectives:
>>>>>>>>
>>>>>>>> 1. A parallel input partition is necessary for obtaining scalability
>>>>>>>> and elasticity of Bulk Synchronous Parallel processing (it's not a
>>>>>>>> memory issue, a Disk/Spilling Queue issue, or HAMA-644; please don't
>>>>>>>> mix these up).
>>>>>>>> 2. Input partitioning should be handled at the BSP framework level,
>>>>>>>> and it applies to every Hama job, not only to graph jobs.
>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL input
>>>>>>>> should also be considered.
>>>>>>>>
>>>>>>>> The current problem is that every input of a graph job must be
>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> @eddieyoon
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards, Edward J. Yoon
>>>>> @eddieyoon
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
