hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Wed, 08 May 2013 09:35:47 GMT
I think this is an important step forward, but let's close this
discussion by lazy consensus if no one objects within the next three
days.



On Tue, May 7, 2013 at 11:26 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> I've noted Suraj's suggestion and added my opinions, too -
> http://wiki.apache.org/hama/Partitioning
>
> In this thread, please focus on the problem of integration with
> NoSQLs. Since PartitioningRunner converts the records of the input
> data, and GraphJobRunner reads the converted records from partition
> files, table input must unnecessarily pass through PartitioningRunner.
> That's the problem with the current "Partitioning and Record converter".
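
The direct-parse alternative discussed below, where GraphJobRunner.loadVertices() parses raw records itself rather than reading pre-converted partition files, can be sketched roughly as follows. This is a hypothetical standalone illustration: the Vertex class and the tab-separated record layout are assumptions, not Hama's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: parse a raw input record into a vertex directly,
// as loadVertices() would, instead of requiring a prior PartitioningRunner
// pass that rewrites the input as converted partition files on HDFS.
public class VertexParseSketch {
    static class Vertex {
        final String id;
        final List<String> edges;
        Vertex(String id, List<String> edges) { this.id = id; this.edges = edges; }
    }

    // Assumes a "vertexId<TAB>neighbor1<TAB>neighbor2..." layout; the real
    // record format depends on the configured input reader.
    static Vertex parse(String rawRecord) {
        String[] parts = rawRecord.split("\t");
        return new Vertex(parts[0], Arrays.asList(parts).subList(1, parts.length));
    }

    public static void main(String[] args) {
        Vertex v = parse("A\tB\tC");
        System.out.println(v.id + " -> " + v.edges); // A -> [B, C]
    }
}
```

With a table input, such a parse step would run once per record at load time, so no second copy of the data would land on HDFS.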
>
>
> On Tue, May 7, 2013 at 9:50 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> And, using the Superstep API is one approach to improving partition
>> processing. So, the main question is whether we will parse vertices in
>> the BSP core or in the graph job runner. Please stay on topic.
>>
>> On Tue, May 7, 2013 at 9:45 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>> Hello all,
>>>
>>> A GSoC student who wants to integrate NoSQLs with Graph is looking
>>> at this thread. My suggestion is not a quick-fix solution; it's a
>>> must. Please let me know whether or not you understand my suggestion.
>>>
>>>> On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> Do you also need a separate Wiki page? :-) If not, please feel free
>>>> to describe your ideas on the Wiki, dividing them into short-term and
>>>> long-term plans.
>>>>
>>>> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>> 1. Graph/Matrix data is small, but Graph/Matrix algorithms require
>>>>> huge computation. Hence, the number of BSP processors should be
>>>>> adjustable ( != the number of file blocks).
>>>>>
>>>>> 2. I'm -1 on using the local disk to store partitions. HDFS is
>>>>> high-cost, but reuse of partitions should be considered.
>>>>>
>>>>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili <tommaso.teofili@gmail.com> wrote:
>>>>>> 2013/5/6 Suraj Menon <surajsmenon@apache.org>
>>>>>>
>>>>>>> I am assuming that the storage of vertices (NoSQL or any other
>>>>>>> format) need not be updated after every iteration.
>>>>>>>
>>>>>>> Based on the above assumption, I have the following suggestions:
>>>>>>>
>>>>>>> - Instead of running a separate job, we inject a partitioning
>>>>>>> superstep before the first superstep of the job. (This has a
>>>>>>> dependency on the Superstep API)
>>>>>>>
>>>>>>
>>>>>> could we do that without introducing that dependency? I mean, would
>>>>>> that also work when not using the Superstep API on the client side?
>>>>>>
>>>>>>
>>>>>>> - The partitions, instead of being written to HDFS, which creates
>>>>>>> a copy of the input files in the HDFS cluster (too costly, I
>>>>>>> believe), should be written to and read from local files.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> - For graph jobs, we can configure this partitioning superstep
>>>>>>> class to a graph-specific partitioning class that partitions and
>>>>>>> loads vertices.
>>>>>>>
>>>>>>
>>>>>> this seems to be in line with the above assumption, thus it
>>>>>> probably makes sense.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> This sure has some dependencies, but it would be a graceful
>>>>>>> solution and can tackle every problem. This is what I want to
>>>>>>> achieve in the end. Please proceed if you have any intermediate
>>>>>>> ways to get there faster.
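
Suraj's idea of injecting a partitioning superstep ahead of the user's supersteps could look roughly like this. This is a hedged sketch: the Superstep interface below is a simplified stand-in for the in-progress Superstep API, and all names are illustrative, not Hama's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the Superstep API discussed in this thread;
// not Hama's actual interface.
public class InjectionSketch {
    interface Superstep {
        void compute(List<String> records);
    }

    // Injected before any user superstep: routes each raw record to the
    // peer that owns its partition, so later supersteps see local data only.
    static class PartitioningSuperstep implements Superstep {
        final int numPeers;
        PartitioningSuperstep(int numPeers) { this.numPeers = numPeers; }
        public void compute(List<String> records) {
            for (String r : records) {
                String vertexId = r.split("\t")[0];
                int peer = Math.floorMod(vertexId.hashCode(), numPeers);
                System.out.println(r + " -> peer " + peer);
            }
        }
    }

    // Builds the superstep pipeline with the partitioning step injected first.
    static List<Superstep> buildPipeline(List<Superstep> userSupersteps, int numPeers) {
        List<Superstep> pipeline = new ArrayList<>();
        pipeline.add(new PartitioningSuperstep(numPeers));
        pipeline.addAll(userSupersteps);
        return pipeline;
    }
}
```

Because the injected step runs inside the same job, no separate partitioning job (and no extra HDFS copy of the input) would be needed.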
>>>>>>>
>>>>>>
>>>>>> Your solution sounds generally good to me; better if we can avoid
>>>>>> the dependency, but still OK if not.
>>>>>> Let's collect also others' opinions and try to reach a shared consensus.
>>>>>>
>>>>>> Tommaso
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Suraj
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>>
>>>>>>> > P.S., a BSPJob (with table input) has the same issue. It's not
>>>>>>> > only for GraphJob.
>>>>>>> >
>>>>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > > All,
>>>>>>> > >
>>>>>>> > > I've also roughly described the design of the Graph APIs[1]. To
>>>>>>> > > reduce our misunderstandings (please first read the Partitioning
>>>>>>> > > and GraphModuleInternals documents),
>>>>>>> > >
>>>>>>> > >  * In the NoSQL case, there's obviously no need to
>>>>>>> > > hash-partition or rewrite partition files on HDFS. So, for these
>>>>>>> > > inputs, I think the vertex structure should be parsed in the
>>>>>>> > > GraphJobRunner.loadVertices() method.
>>>>>>> > >
>>>>>>> > > Here, we face two options: 1) The current implementation of
>>>>>>> > > PartitioningRunner writes converted vertices to sequence-format
>>>>>>> > > partition files, and GraphJobRunner reads only VertexWritable
>>>>>>> > > objects. If the input is a table, we may have to skip the
>>>>>>> > > partitioning job and parse the vertex structure in the
>>>>>>> > > loadVertices() method after checking some conditions.
>>>>>>> > > 2) PartitioningRunner just writes raw records to the proper
>>>>>>> > > partition files after checking each record's partition ID, and
>>>>>>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>>>>>>> > >
>>>>>>> > > I meant that I prefer the latter, and there's no need to write
>>>>>>> > > VertexWritable files. It's not related to whether Graph will
>>>>>>> > > support only the Seq format or not. I hope my explanation is
>>>>>>> > > enough!
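
Option 2 above, PartitioningRunner routing raw, unconverted records by partition ID, can be sketched like this. Names are hypothetical; in Hama the buckets would be partition files on disk rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of option 2: route raw records into per-partition buckets by
// partition ID only; parsing is deferred to GraphJobRunner.loadVertices().
public class RawPartitionSketch {
    // Assumes the partition key is the first tab-separated field.
    static int partitionId(String rawRecord, int numPartitions) {
        String key = rawRecord.split("\t", 2)[0];
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static Map<Integer, List<String>> partition(List<String> records, int numPartitions) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String r : records) {
            buckets.computeIfAbsent(partitionId(r, numPartitions),
                                    k -> new ArrayList<>()).add(r);
        }
        return buckets; // each bucket stands in for one raw partition file
    }
}
```

Records with the same key always land in the same bucket, so loadVertices() can parse each partition independently without any prior format conversion.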
>>>>>>> > >
>>>>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>>> > >
>>>>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > >> I've described my big picture here:
>>>>>>> > >> http://wiki.apache.org/hama/Partitioning
>>>>>>> > >>
>>>>>>> > >> Please review it and give feedback on whether this is acceptable.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>>>>>>> > >>> P.S., I think there's a misunderstanding. It doesn't mean
>>>>>>> > >>> that Graph will support only the sequence file format. The
>>>>>>> > >>> main question is whether to convert at the partitioning stage
>>>>>>> > >>> or the loadVertices stage.
>>>>>>> > >>>
>>>>>>> > >>> Sent from my iPhone
>>>>>>> > >>>
>>>>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Sure, Please go ahead.
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>>>> Please let me know before this is changed, I would like
>>>>>>> > >>>>>>> to work on a separate branch.
>>>>>>> > >>>>>
>>>>>>> > >>>>> Personally, I think we have to focus on high-priority
>>>>>>> > >>>>> tasks, and on getting more feedback and contributions from
>>>>>>> > >>>>> users. So, as changes are made, I'll release periodically.
>>>>>>> > >>>>> If you want to work in another place, please do. I don't
>>>>>>> > >>>>> want to wait for your patches.
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > >>>>>> To prepare for integration with NoSQLs, of course, a
>>>>>>> > >>>>>> condition check (whether converted or not) could be used
>>>>>>> > >>>>>> without removing the record converter.
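
The condition check mentioned here, keeping the record converter but bypassing it when the input needs no conversion, might look like this. RecordConverter is a stand-in interface for illustration, not Hama's actual class.

```java
// Sketch of a "converted or not" condition check: when no converter is
// configured (e.g. NoSQL/table input), pass the raw record straight
// through instead of rewriting it on HDFS first.
public class ConditionCheckSketch {
    interface RecordConverter {
        String convert(String rawRecord);
    }

    static String toVertexRecord(String rawRecord, RecordConverter converter) {
        // A null converter means the input is already directly parseable.
        return (converter == null) ? rawRecord : converter.convert(rawRecord);
    }
}
```

This keeps the converter path intact for formats that need it while letting table inputs skip the partitioning rewrite entirely.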
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> We need to discuss everything.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org> wrote:
>>>>>>> > >>>>>>> I am still -1 if this means our graph module can work
>>>>>>> > >>>>>>> only on the sequence file format.
>>>>>>> > >>>>>>> Please note that you can set the record converter to null
>>>>>>> > >>>>>>> and make changes to loadVertices for what you desire here.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> If we came to this design because TextInputFormat is
>>>>>>> > >>>>>>> inefficient, would this work for the Avro or Thrift input
>>>>>>> > >>>>>>> formats?
>>>>>>> > >>>>>>> Please let me know before this is changed, I would like
>>>>>>> > >>>>>>> to work on a separate branch.
>>>>>>> > >>>>>>> You may proceed as you wish.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> Regards,
>>>>>>> > >>>>>>> Suraj
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>> I think the 'record converter' should be removed. It's
>>>>>>> > >>>>>>>> not a good idea; moreover, it's unnecessarily complex. To
>>>>>>> > >>>>>>>> keep the vertex input reader, we can move the related
>>>>>>> > >>>>>>>> classes into the common module.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> Let's go with my original plan.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>> > >>>>>>>>> Hi all,
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> I'm reading our old discussions about the record
>>>>>>> > >>>>>>>>> converter, superstep injection, and the common module:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> To clarify goals and objectives:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> 1. Parallel input partitioning is necessary for
>>>>>>> > >>>>>>>>> obtaining the scalability and elasticity of Bulk
>>>>>>> > >>>>>>>>> Synchronous Parallel processing (it's not a memory
>>>>>>> > >>>>>>>>> issue, or the Disk/Spilling Queue, or HAMA-644; please
>>>>>>> > >>>>>>>>> stay on topic).
>>>>>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP
>>>>>>> > >>>>>>>>> framework level, and it applies to every Hama job, not
>>>>>>> > >>>>>>>>> only graph jobs.
>>>>>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and
>>>>>>> > >>>>>>>>> NoSQL inputs should also be considered.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> The current problem is that every input of a graph job
>>>>>>> > >>>>>>>>> must be rewritten on HDFS. If you have a good idea,
>>>>>>> > >>>>>>>>> please let me know.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> --
>>>>>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> > >>>>>>>>> @eddieyoon
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> --
>>>>>>> > >>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> > >>>>>>>> @eddieyoon
>>>>>>> > >>>>>>
>>>>>>> > >>>>>>
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> --
>>>>>>> > >>>>>> Best Regards, Edward J. Yoon
>>>>>>> > >>>>>> @eddieyoon
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> --
>>>>>>> > >>>>> Best Regards, Edward J. Yoon
>>>>>>> > >>>>> @eddieyoon
>>>>>>> > >>>>>
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> --
>>>>>>> > >> Best Regards, Edward J. Yoon
>>>>>>> > >> @eddieyoon
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > --
>>>>>>> > > Best Regards, Edward J. Yoon
>>>>>>> > > @eddieyoon
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Best Regards, Edward J. Yoon
>>>>>>> > @eddieyoon
>>>>>>> >
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards, Edward J. Yoon
>>>>> @eddieyoon
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
