hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Issues about Partitioning and Record converter
Date Tue, 07 May 2013 02:26:25 GMT
I've noted Suraj's suggestion and added my own opinions as well:
http://wiki.apache.org/hama/Partitioning

In this thread, please focus on the problem of integration with
NoSQLs. Because PartitioningRunner converts the input records, and
GraphJobRunner reads only converted records from the partition files,
table input is forced to go through PartitioningRunner unnecessarily.
That's the problem with the current "Partitioning and Record converter"
design.
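
To illustrate the idea of routing raw records and parsing only at load
time, here is a minimal plain-Java sketch. It does not use any real Hama
classes: RawRecordPartitioningSketch, SimpleVertex, partitionRaw and
loadVertices are hypothetical names chosen for illustration, not the actual
PartitioningRunner or GraphJobRunner code, and the tab-separated record
layout ("vertexId<TAB>neighbour<TAB>neighbour...") is an assumption as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; none of these names are Hama APIs. */
class RawRecordPartitioningSketch {

  /** Minimal stand-in for a vertex: an ID plus the IDs of its neighbours. */
  static final class SimpleVertex {
    final String id;
    final List<String> neighbours;

    SimpleVertex(String id, List<String> neighbours) {
      this.id = id;
      this.neighbours = neighbours;
    }
  }

  /**
   * Partitioning step: reads only the key (first tab-separated field)
   * and routes the record unchanged. No vertex objects are created here.
   */
  static List<List<String>> partitionRaw(List<String> rawRecords, int numPartitions) {
    List<List<String>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (String record : rawRecords) {
      String key = record.split("\t", 2)[0];
      // floorMod keeps the partition ID non-negative for negative hash codes.
      int partitionId = Math.floorMod(key.hashCode(), numPartitions);
      partitions.get(partitionId).add(record);
    }
    return partitions;
  }

  /** Load step: the only place where a raw record is parsed into a vertex. */
  static List<SimpleVertex> loadVertices(List<String> rawRecords) {
    List<SimpleVertex> vertices = new ArrayList<>();
    for (String record : rawRecords) {
      String[] fields = record.split("\t");
      List<String> neighbours =
          new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
      vertices.add(new SimpleVertex(fields[0], neighbours));
    }
    return vertices;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a\tb\tc", "b\tc", "c\ta");

    // File-like input: route raw records first, then parse each partition.
    List<List<String>> partitions = partitionRaw(input, 2);
    for (int i = 0; i < partitions.size(); i++) {
      int count = loadVertices(partitions.get(i)).size();
      System.out.println("partition " + i + ": " + count + " vertices");
    }

    // Table-like input that is already split: skip partitioning, parse directly.
    System.out.println("direct load: " + loadVertices(input).size() + " vertices");
  }
}
```

With this split, input that is already partitioned (a table, for example)
can call only the load step, so it never has to pass through the
partitioning step at all.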


On Tue, May 7, 2013 at 9:50 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> Also, using the Superstep API is an improvement to, or one approach for,
> partition processing. So the main question is whether we parse vertices
> in the BSP core or in the graph job runner. Please don't shake.
>
> On Tue, May 7, 2013 at 9:45 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> Hello all,
>>
>> A GSoC student who wants to try integrating NoSQLs with Graph is looking
>> at this thread. My suggestion is not a quick-fix solution. It's a
>> must. Please let me know whether you understand my suggestion or not.
>>
>> On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>> Do you also need a separate Wiki page? :-) If not, please feel free to
>>> describe your ideas on the Wiki, divided into short-term and long-term plans.
>>>
>>> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> 1. Graph/Matrix data is small, but Graph/Matrix algorithms require huge
>>>> computations. Hence, the number of BSP processors should be adjustable
>>>> (it need not equal the number of file blocks).
>>>>
>>>> 2. I'm -1 on using local disk to store partitions. HDFS is costly,
>>>> but reuse of partitions should be considered.
>>>>
>>>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
>>>> <tommaso.teofili@gmail.com> wrote:
>>>>> 2013/5/6 Suraj Menon <surajsmenon@apache.org>
>>>>>
>>>>>> I am assuming that the storage of vertices (NoSQL or any other format)
>>>>>> need not be updated after every iteration.
>>>>>>
>>>>>> Based on the above assumption, I have the following suggestions:
>>>>>>
>>>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>>>> before the first superstep of the job. (This has a dependency on the
>>>>>> Superstep API)
>>>>>>
>>>>>
>>>>> could we do that without introducing that dependency? I mean would that
>>>>> work also if not using the Superstep API on the client side?
>>>>>
>>>>>
>>>>>> - The partitions, instead of being written to HDFS, which creates a copy
>>>>>> of the input files in the HDFS cluster (too costly, I believe), should be
>>>>>> written to and read from local files.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>> - For graph jobs, we can configure this partitioning superstep with a
>>>>>> graph-specific partitioning class that partitions and loads vertices.
>>>>>>
>>>>>
>>>>> this seems to be in line with the above assumption, so it probably makes
>>>>> sense.
>>>>>
>>>>>
>>>>>>
>>>>>> This sure has some dependencies, but it would be a graceful solution
>>>>>> and can tackle every problem. This is what I want to achieve in the end.
>>>>>> Please proceed if you have any intermediate ways to get there faster.
>>>>>>
>>>>>
>>>>> Your solution sounds good to me generally; better if we can avoid the
>>>>> dependency, but still OK if not.
>>>>> Let's also collect others' opinions and try to reach a shared consensus.
>>>>>
>>>>> Tommaso
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Suraj
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>>
>>>>>> > P.S. The same applies to BSPJob (with table input). It's not only for
>>>>>> > GraphJob.
>>>>>> >
>>>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > wrote:
>>>>>> > > All,
>>>>>> > >
>>>>>> > > I've also roughly described the design of the Graph APIs[1]. To
>>>>>> > > reduce misunderstandings (please read the Partitioning and
>>>>>> > > GraphModuleInternals documents first):
>>>>>> > >
>>>>>> > >  * In the NoSQL case, there's obviously no need to hash-partition or
>>>>>> > > rewrite partition files on HDFS. So, for these inputs, I think the
>>>>>> > > vertex structure should be parsed in the
>>>>>> > > GraphJobRunner.loadVertices() method.
>>>>>> > >
>>>>>> > > Here we face two options: 1) The current implementation of
>>>>>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>>>>>> > > partition files, and GraphJobRunner reads only Vertex Writable
>>>>>> > > objects. If the input is a table, we may have to skip the
>>>>>> > > partitioning job and parse the vertex structure in the
>>>>>> > > loadVertices() method after checking some conditions.
>>>>>> > > 2) PartitioningRunner just writes raw records to the proper
>>>>>> > > partition files after checking their partition IDs, and
>>>>>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>>>>>> > >
>>>>>> > > I meant that I prefer the latter, and that there's no need to write
>>>>>> > > VertexWritable files. This is unrelated to whether graph will
>>>>>> > > support only the Seq format or not. Hope my explanation is enough!
>>>>>> > >
>>>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>> > >
>>>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > > wrote:
>>>>>> > >> I've described my big picture here:
>>>>>> > >> http://wiki.apache.org/hama/Partitioning
>>>>>> > >>
>>>>>> > >> Please review it and give feedback on whether this is acceptable.
>>>>>> > >>
>>>>>> > >>
>>>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <edward@udanax.org> wrote:
>>>>>> > >>> P.S. I think there's a misunderstanding. It doesn't mean that graph
>>>>>> > >>> will support only the sequence file format. The main question is
>>>>>> > >>> whether to convert at the partitioning stage or the loadVertices stage.
>>>>>> > >>>
>>>>>> > >>> Sent from my iPhone
>>>>>> > >>>
>>>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <menonsuraj5@gmail.com> wrote:
>>>>>> > >>>
>>>>>> > >>>> Sure, Please go ahead.
>>>>>> > >>>>
>>>>>> > >>>>
>>>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > >>>> wrote:
>>>>>> > >>>>
>>>>>> > >>>>>>> Please let me know before this is changed, I would like to work on
>>>>>> > >>>>>>> a separate branch.
>>>>>> > >>>>>
>>>>>> > >>>>> Personally, I think we have to focus on high-priority tasks, and on
>>>>>> > >>>>> getting more feedback and contributions from users. So, if changes are
>>>>>> > >>>>> made, I'll release periodically. If you want to work elsewhere, please
>>>>>> > >>>>> do. I don't want to wait for your patches.
>>>>>> > >>>>>
>>>>>> > >>>>>
>>>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > >>>>> wrote:
>>>>>> > >>>>>> To prepare for integration with NoSQLs, of course, a condition
>>>>>> > >>>>>> check (whether the records are converted or not) could perhaps be
>>>>>> > >>>>>> used without removing the record converter.
>>>>>> > >>>>>>
>>>>>> > >>>>>> We need to discuss everything.
>>>>>> > >>>>>>
>>>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <surajsmenon@apache.org>
>>>>>> > >>>>>> wrote:
>>>>>> > >>>>>>> I am still -1 if this means our graph module can work only on
>>>>>> > >>>>>>> sequential file format.
>>>>>> > >>>>>>> Please note that you can set record converter to null and make
>>>>>> > >>>>>>> changes to loadVertices for what you desire here.
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> If we came to this design because TextInputFormat is inefficient,
>>>>>> > >>>>>>> would this work for Avro or Thrift input format?
>>>>>> > >>>>>>> Please let me know before this is changed, I would like to work on
>>>>>> > >>>>>>> a separate branch.
>>>>>> > >>>>>>> You may proceed as you wish.
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> Regards,
>>>>>> > >>>>>>> Suraj
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > >>>>>>> wrote:
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>> I think the 'record converter' should be removed. It's not a good
>>>>>> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep the vertex input
>>>>>> > >>>>>>>> reader, we can move the related classes into the common module.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Let's go with my original plan.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>>>> > >>>>>>>> wrote:
>>>>>> > >>>>>>>>> Hi all,
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> I'm reading our old discussions about record converter, superstep
>>>>>> > >>>>>>>>> injection, and common module:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> To clarify goals and objectives:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining the
>>>>>> > >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel processing
>>>>>> > >>>>>>>>> (it's not a memory issue, or Disk/Spilling Queue, or HAMA-644.
>>>>>> > >>>>>>>>> Please don't shake).
>>>>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework level,
>>>>>> > >>>>>>>>> and it is for all Hama jobs, not only for Graph jobs.
>>>>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL input
>>>>>> > >>>>>>>>> should also be considered.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> The current problem is that every input of graph jobs has to be
>>>>>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> --
>>>>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>>>>> > >>>>>>>>> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
