hama-user mailing list archives

From Leonidas Fegaras <fega...@cse.uta.edu>
Subject Re: [ANNOUNCEMENT] A query system for BSP processing
Date Tue, 11 Sep 2012 19:13:42 GMT
I created a project on Github:
https://github.com/fegaras/mrql.git

Thank you for your help
Leonidas Fegaras

On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:

> Yep, a subproject would be the alternative.
> In this case we would give you PMC and committer rights so you can
> actively work on that.
> However this would make the mapreduce part more or less useless, so if
> you want to go the hybrid way, feel free to submit an incubation request.
>
> 2012/9/7 Suraj Menon <surajsmenon@apache.org>
>
>> I think Thomas has a point. How about making it a sub-module/sub-project
>> of Hama for now? If/when it gains enough community support to make it a
>> top-level project, you can fork it as a separate project.
>> I am not completely aware of the procedures and requirements for getting
>> an external project accepted as a sub-project.
>> We can look into it if you are ready to take this route.
>>
>>> Could you please send me a link for setting up an open-source Apache
>>> project?
>> If I am right this is what you are looking for -
>> http://incubator.apache.org/guides/proposal.html
>> http://incubator.apache.org/sitemap.html
>>
>> Good luck,
>> Suraj
>>
>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>> <thomas.jungblut@gmail.com> wrote:
>>
>>> Although I think this is a great project, I think that you will not
>>> meet the requirements.
>>> You need a community and a charter to get it into the incubation.
>>>
>>> What about hosting it on Github?
>>>
>>> 2012/9/7 Leonidas Fegaras <fegaras@cse.uta.edu>
>>>
>>>> Yes, this is a great idea. I have used GIT on my own server but I
>>>> don't know how to do this for ASF. Could you please send me a link for
>>>> setting up an open-source Apache project?
>>>>
>>>>
>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>
>>>>> If you can open source this then I'm sure the ASF community can help
>>>>> you and make this software better.
>>>>>
>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>
>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras
>>>>> <fegaras@cse.uta.edu> wrote:
>>>>>
>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but
>>>>>> I have only tested pagerank on my small cluster. I still need to fix
>>>>>> the k-means clustering (it's a special case because you improve a
>>>>>> fixed number of points).
>>>>>> Leonidas
>>>>>>
>>>>>>
>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>
>>>>>>> Shall we work together?
>>>>>>>
>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras
>>>>>>> <fegaras@cse.uta.edu> wrote:
>>>>>>>
>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>> It seems that my release was premature: it worked for some random
>>>>>>>> data but didn't for some others. It's a minor logical error that I
>>>>>>>> will try to fix in the next few days. The problem is with the
>>>>>>>> stopping condition of the repeat expression that calculates the new
>>>>>>>> pagerank from the old. It must stop only if ALL peers reach the
>>>>>>>> specified precision. This is done by having those peers that need to
>>>>>>>> continue send a message to the others telling them to continue. It
>>>>>>>> seems that now, when all peers agree at the same time, the program
>>>>>>>> works fine. But if one finishes sooner, instead of continuing the
>>>>>>>> repeat loop, it runs away to the next BSP step that follows the
>>>>>>>> repeat, then exits prematurely and the system hangs. The casting
>>>>>>>> errors are due to the run-away peers executing the wrong BSP steps
>>>>>>>> and reading the wrong messages. Queries without repeat are OK,
>>>>>>>> though.
>>>>>>>> By the way, I had a problem exchanging large amounts of data during
>>>>>>>> sync (I discussed this with Thomas). My solution was to break a BSP
>>>>>>>> superstep into multiple substeps so that each substep can handle a
>>>>>>>> max number of messages. Of course my program has to collect all
>>>>>>>> messages in a vector in memory. When the vector is too big, it is
>>>>>>>> spilled to a local file. This moved the problem from the Hama side
>>>>>>>> to my side and allowed me to handle larger data, especially in
>>>>>>>> joins. I think this problem of exchanging large amounts of data
>>>>>>>> during a superstep is currently a weakness of Hama.
>>>>>>>> Leonidas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>
>>>>>>>>> 2012/8/24 Thomas Jungblut <thomas.jungblut@gmail.com>
>>>>>>>>>
>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>
>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>
>>>>>>>>>> Of course this is not as fast as the native BSP code; Hive and Pig
>>>>>>>>>> suffer from the same problems in MR.
>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>> code in production with just a minor time expense.
>>>>>>>>>>
>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Edward J. Yoon <edwardyoon@apache.org>
>>>>>>>>>>
>>>>>>>>>>> Here are a few test results of mine on Oracle BDA (40G/s InfiniBand
>>>>>>>>>>> network).
>>>>>>>>>>>
>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>
>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>
>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>
>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>>>>>>>>>>> 2383611 bytes of input data.
>>>>>>>>>>>
>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>
>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>>>>>>>>>>> 1191805 bytes of input data.
>>>>>>>>>>>
>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>> <edwardyoon@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>> cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>> <fegaras@cse.uta.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system can
>>>>>>>>>>>>> now evaluate SQL-like queries on a Hama cluster. MRQL is available
>>>>>>>>>>>>> at:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>
>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language
>>>>>>>>>>>>> for large-scale, distributed data analysis. MRQL is powerful enough
>>>>>>>>>>>>> to express most common data analysis tasks over many different
>>>>>>>>>>>>> kinds of raw data, including hierarchical data and nested
>>>>>>>>>>>>> collections, such as XML data. MRQL can run in two modes: in MR
>>>>>>>>>>>>> (Map-Reduce) mode using Apache Hadoop and in BSP (Bulk Synchronous
>>>>>>>>>>>>> Parallel) mode using Apache Hama. Both modes use Apache's HDFS to
>>>>>>>>>>>>> read and write their data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not fine-tuned
>>>>>>>>>>>>> yet) and lacks any fault-tolerance (if an error occurs, the entire
>>>>>>>>>>>>> job must be restarted). Due to our limited resources, MRQL has only
>>>>>>>>>>>>> been tested on a small cluster (7 nodes/28 cores). We compared the
>>>>>>>>>>>>> BSP mode with the MR mode by evaluating a pagerank query over a
>>>>>>>>>>>>> small graph (100K nodes, 1M edges) and found that BSP mode is about
>>>>>>>>>>>>> 4.5 times faster than the MR mode. Please let me know if you'd like
>>>>>>>>>>>>> to contribute to this project by testing MRQL on a larger cluster.
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>
>>>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> @eddieyoon
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
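
The stopping rule described in the quoted Aug 24 message (the repeat loop
ends only when ALL peers reach the specified precision, and peers that need
another pass tell the others to continue) can be sketched as a small
single-process simulation. This is an illustrative sketch in Python, not
MRQL or Hama code. The function name, the halving of a per-peer error as a
stand-in for one pagerank refinement, and the choice to have a peer also
send the "continue" vote to itself are all assumptions made for the example.

```python
from typing import List


def repeat_until_all_converged(initial_errors: List[float],
                               precision: float,
                               max_steps: int = 1000) -> List[int]:
    """Simulate BSP peers running a repeat loop; return iterations per peer."""
    n = len(initial_errors)
    errors = list(initial_errors)
    steps = [0] * n
    for _ in range(max_steps):
        # Local computation phase of one superstep.
        inboxes: List[List[int]] = [[] for _ in range(n)]
        for p in range(n):
            errors[p] /= 2.0  # hypothetical stand-in for one refinement pass
            steps[p] += 1
            if errors[p] > precision:
                # This peer needs another pass: vote "continue" to every
                # peer, itself included, so all peers see the same votes.
                for q in range(n):
                    inboxes[q].append(p)
        # Barrier: after the sync, each peer decides from its own inbox.
        decisions = [len(inboxes[p]) > 0 for p in range(n)]
        # Every inbox holds the same votes, so no peer can run ahead.
        assert all(d == decisions[0] for d in decisions)
        if not decisions[0]:
            break
    return steps


if __name__ == "__main__":
    print(repeat_until_all_converged([1.0, 8.0, 0.5], precision=0.1))
```

Because a peer that has already converged still receives the votes of the
slower peers, it stays in the loop for the same number of supersteps instead
of running ahead to the step that follows the repeat.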

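
The workaround for large exchanges described in the same quoted message
(break a superstep into substeps that each carry a bounded number of
messages, collect received messages in memory, and spill to a local file
when the in-memory vector grows too big) might look roughly like the sketch
below. It is a hedged illustration only: `substeps`, `SpillingBuffer`, and
the `max_in_memory` parameter are hypothetical names, and pickling to a
temporary file is an assumed spill format, not what MRQL actually does.

```python
import pickle
import tempfile
from typing import Any, Iterator, List


def substeps(messages: List[Any], max_per_substep: int) -> Iterator[List[Any]]:
    """Split one superstep's outgoing messages into bounded substeps."""
    for i in range(0, len(messages), max_per_substep):
        yield messages[i:i + max_per_substep]


class SpillingBuffer:
    """Collect messages in memory; spill to a local temp file when full."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory
        self.memory: List[Any] = []
        self.spill = None      # temp file, created on the first spill
        self.spilled = 0       # number of messages written to the file

    def add(self, msg: Any) -> None:
        self.memory.append(msg)
        if len(self.memory) >= self.max_in_memory:
            self._flush()

    def _flush(self) -> None:
        if self.spill is None:
            self.spill = tempfile.TemporaryFile()
        self.spill.seek(0, 2)  # always append at the end of the file
        for m in self.memory:
            pickle.dump(m, self.spill)
        self.spilled += len(self.memory)
        self.memory.clear()

    def __iter__(self) -> Iterator[Any]:
        # Replay spilled messages first, then those still in memory,
        # which preserves arrival order.
        if self.spill is not None:
            self.spill.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill)
        yield from self.memory


if __name__ == "__main__":
    buf = SpillingBuffer(max_in_memory=3)
    for chunk in substeps(list(range(10)), max_per_substep=4):
        for m in chunk:  # in a real job, a sync would separate the chunks
            buf.add(m)
    print(list(buf))
```

The trade-off is the one the message names: memory pressure moves from the
messaging layer to the receiving program, which pays for it with local disk
I/O.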
