hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Mon, 06 May 2013 19:35:36 GMT
1. Graph/matrix data is often small, but graph/matrix algorithms require
huge computations. Hence, the number of BSP processors should be
adjustable, i.e. not tied to the number of file blocks.

2. I'm -1 on using local disk to store partitions. Writing to HDFS is
costly, but reuse of partitions should be considered.
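
To illustrate point 1 together with the second option quoted further down in this thread (the partitioner only routes raw records, and parsing happens at load time), here is a minimal sketch. It is written in Python for brevity and is not Hama's API: the function names, the tab-separated record layout, and the use of crc32 as the hash are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


def partition_id(vertex_id, num_tasks):
    """Map a vertex ID to a partition using a stable hash (crc32 is an arbitrary choice)."""
    return crc32(vertex_id.encode("utf-8")) % num_tasks


def partition_raw_records(lines, num_tasks):
    """Route raw text records to partitions without parsing the vertex structure.

    num_tasks is chosen by the caller and is independent of how many
    file blocks the input happens to occupy.
    """
    partitions = defaultdict(list)
    for line in lines:
        vertex_id = line.split("\t", 1)[0]  # only the key is inspected here
        partitions[partition_id(vertex_id, num_tasks)].append(line)
    return partitions


def load_vertices(raw_lines):
    """Parse raw records into {vertex_id: [neighbor_ids]} at load time."""
    vertices = {}
    for line in raw_lines:
        vertex_id, *neighbors = line.rstrip("\n").split("\t")
        vertices[vertex_id] = neighbors
    return vertices


# Hypothetical input: one vertex per line, "id<TAB>neighbor<TAB>neighbor...".
records = ["a\tb\tc", "b\ta", "c\ta", "d"]
parts = partition_raw_records(records, num_tasks=8)
loaded = {pid: load_vertices(lines) for pid, lines in parts.items()}
```

In this sketch the partitioning step never builds vertex objects, so an input that is already partitioned (for example, a table) could skip it and go straight to the load step.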

On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:
> 2013/5/6 Suraj Menon <surajsmenon@apache.org>
>
>> I am assuming that the storage of vertices (NoSQL or any other format) need
>> not be updated after every iteration.
>>
>> Based on the above assumption, I have the following suggestions:
>>
>> - Instead of running a separate job, we inject a partitioning superstep
>> before the first superstep of the job. (This has a dependency on the
>> Superstep API)
>>
>
> could we do that without introducing that dependency? I mean would that
> work also if not using the Superstep API on the client side?
>
>
>> - The partitions instead of being written to HDFS, which is creating a copy
>> of input files in HDFS Cluster (too costly I believe), should be written to
>> local files and read from.
>>
>
> +1
>
>
>> - For graph jobs, we can configure this partitioning superstep class
>> specific to graph partitioning class that partitions and loads vertices.
>>
>
> this seems to be in line with the above assumption, thus it probably makes
> sense.
>
>
>>
>> This sure has some dependencies. But would be a graceful solution and can
>> tackle every problem. This is what I want to achieve in the end. Please
>> proceed if you have any intermediate ways to reach here faster.
>>
>
> Your solution sounds good to me generally, better if we can avoid the
> dependency, but still ok if not.
> Let's collect also others' opinions and try to reach a shared consensus.
>
> Tommaso
>
>
>
>
>>
>> Regards,
>> Suraj
>>
>>
>>
>>
>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>
>> > P.S., BSPJob (with table input) is also the same. It's not only for
>> > GraphJob.
>> >
>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>> > wrote:
>> > > All,
>> > >
>> > > I've also roughly described details about design of Graph APIs[1]. To
>> > > reduce our misunderstandings (please read first Partitioning and
>> > > GraphModuleInternals documents),
>> > >
>> > >  * In the NoSQL case, there's obviously no need for hash partitioning
>> > > or for rewriting partition files on HDFS. So, for these inputs, I think
>> > > the vertex structure should be parsed in the
>> > > GraphJobRunner.loadVertices() method.
>> > >
>> > > Here, we face two options: 1) The current implementation of
>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>> > > partition files, and GraphJobRunner reads only Vertex Writable
>> > > objects. If the input is a table, we may have to skip the Partitioning
>> > > job and parse the vertex structure in the loadVertices() method after
>> > > checking some conditions. 2) PartitioningRunner just writes raw
>> > > records to the proper partition files after checking their partition
>> > > ID, and GraphJobRunner.loadVertices() always parses and loads vertices.
>> > >
>> > > I meant that I prefer the latter, and there's no need to write
>> > > VertexWritable files. This is unrelated to whether graph will support
>> > > only the Seq format or not. Hope my explanation is enough!
>> > >
>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>> > >
>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > >> I've described my big picture here:
>> > >> http://wiki.apache.org/hama/Partitioning
>> > >>
>> > >> Please review and give feedback on whether this is acceptable.
>> > >>
>> > >>
>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>> > >>> P.S., I think there's a misunderstanding. It doesn't mean that graph
>> > >>> will support only the sequence file format. The main question is
>> > >>> whether to convert at the partitioning stage or the loadVertices stage.
>> > >>>
>> > >>> Sent from my iPhone
>> > >>>
>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
>> > >>>
>> > >>>> Sure, Please go ahead.
>> > >>>>
>> > >>>>
>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > >>>>
>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>> > >>>>>>> separate branch.
>> > >>>>>
>> > >>>>> Personally, I think we have to focus on high-priority tasks, and on
>> > >>>>> more feedback and contributions from users. So, if changes are made,
>> > >>>>> I'll release periodically. If you want to work elsewhere, please do.
>> > >>>>> I don't want to wait for your patches.
>> > >>>>>
>> > >>>>>
>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > >>>>>> To prepare for integration with NoSQLs, of course, maybe a condition
>> > >>>>>> check (whether converted or not) can be used without removing the
>> > >>>>>> record converter.
>> > >>>>>>
>> > >>>>>> We need to discuss everything.
>> > >>>>>>
>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org> wrote:
>> > >>>>>>> I am still -1 if this means our graph module can work only on
>> > >>>>>>> sequential file format.
>> > >>>>>>> Please note that you can set record converter to null and make
>> > >>>>>>> changes to loadVertices for what you desire here.
>> > >>>>>>>
>> > >>>>>>> If we came to this design, because TextInputFormat is inefficient,
>> > >>>>>>> would this work for Avro or Thrift input format?
>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>> > >>>>>>> separate branch.
>> > >>>>>>> You may proceed as you wish.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Suraj
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > >>>>>>>
>> > >>>>>>>> I think 'record converter' should be removed. It's not a good idea.
>> > >>>>>>>> Moreover, it's unnecessarily complex. To keep the vertex input
>> > >>>>>>>> reader, we can move related classes into the common module.
>> > >>>>>>>>
>> > >>>>>>>> Let's go with my original plan.
>> > >>>>>>>>
>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > >>>>>>>>> Hi all,
>> > >>>>>>>>>
>> > >>>>>>>>> I'm reading our old discussions about record converter, superstep
>> > >>>>>>>>> injection, and common module:
>> > >>>>>>>>>
>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>> > >>>>>>>>>
>> > >>>>>>>>> To clarify goals and objectives:
>> > >>>>>>>>>
>> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining
>> > >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel processing
>> > >>>>>>>>> (It's not a memory issue, or Disk/Spilling Queue, or HAMA-644.
>> > >>>>>>>>> Please don't shake).
>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework level,
>> > >>>>>>>>> and it is for all Hama jobs, not only for Graph jobs.
>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL input
>> > >>>>>>>>> should also be considered.
>> > >>>>>>>>>
>> > >>>>>>>>> The current problem is that every input of graph jobs should be
>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>> > >>>>>>>>>
>> > >>>>>>>>> --
>> > >>>>>>>>> Best Regards, Edward J. Yoon
>> > >>>>>>>>> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
