hama-user mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [ANNOUNCEMENT] A query system for BSP processing
Date Thu, 13 Sep 2012 11:20:37 GMT
Just curious, is there a plan to support sophisticated queries for
unstructured spatial datasets?

On Wed, Sep 12, 2012 at 4:13 AM, Leonidas Fegaras <fegaras@cse.uta.edu> wrote:
> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras
>
>
> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>
>> Yep, a subproject would be the alternative.
>> In this case we would give you PMC and committer rights so you can actively
>> work on that.
>> However this would make the mapreduce part more or less useless, so if you
>> want to go the hybrid way, feel free to submit an incubation request.
>>
>> 2012/9/7 Suraj Menon <surajsmenon@apache.org>
>>
>>> I think Thomas has a point. How about making it a sub-module/sub-project of
>>> Hama for now? If/when it gains enough community support to make it a
>>> top-level project, you can fork it as a separate project.
>>> I am not completely aware of the procedures and requirements for getting an
>>> external project accepted as a sub-project.
>>> We can look into it if you are ready to take this route.
>>>
>>>> Could you please send me a link for setting up an open-source Apache
>>>> project?
>>>
>>> If I am right, this is what you are looking for:
>>> http://incubator.apache.org/guides/proposal.html
>>> http://incubator.apache.org/sitemap.html
>>>
>>> Good luck,
>>> Suraj
>>>
>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>>> <thomas.jungblut@gmail.com> wrote:
>>>
>>>> Although I think this is a great project, I think that you will not meet
>>>> the requirements.
>>>> You need a community and a charter to get it into the incubation.
>>>>
>>>> What about hosting it on Github?
>>>>
>>>> 2012/9/7 Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>
>>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>>> know how to do this for ASF. Could you please send me a link for setting
>>>>> up an open-source Apache project?
>>>>>
>>>>>
>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>
>>>>>> If you can open source this then I'm sure the ASF community can help
>>>>>> you and make this software better.
>>>>>>
>>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but I
>>>>>>> have only tested pagerank on my small cluster. I still need to fix the
>>>>>>> k-means clustering (it's a special case because you improve a fixed
>>>>>>> number of points).
>>>>>>> Leonidas
>>>>>>>
>>>>>>>
>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>
>>>>>>>> Shall we work together?
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>>> It seems that my release was premature: It worked for some random
>>>>>>>>> data but didn't for some others. It's a minor logical error that I
>>>>>>>>> will try to fix in the next few days. The problem is with the
>>>>>>>>> stopping condition of the repeat expression that calculates the new
>>>>>>>>> pagerank from the old. It must stop if ALL peers reach the specified
>>>>>>>>> precision. This is done by having those peers that need to continue
>>>>>>>>> send a message to others to continue. It seems that now when all
>>>>>>>>> peers agree at the same time, the program works fine. But if one
>>>>>>>>> finishes sooner, instead of continuing the repeat loop, it runs away
>>>>>>>>> to the next BSP step that follows the repeat, then exits prematurely
>>>>>>>>> and the system hangs. The casting errors are due to the run-away
>>>>>>>>> peers executing the wrong BSP steps reading wrong messages. Queries
>>>>>>>>> without repeat though are OK.
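
The stopping rule described above (leave the repeat loop only when ALL
peers have reached the precision, with unfinished peers telling the others
to continue) can be sketched as a small simulation. This is only an
illustration in plain Python, not MRQL or Hama code: run_repeat, the
per-peer error lists, and max_steps are hypothetical names made up for the
example. The point it shows is that every peer decides after the barrier,
based on the "continue" messages it received, so no peer can leave the loop
a superstep earlier than the others.

```python
def run_repeat(errors_per_peer, precision, max_steps=1000):
    """Simulate a BSP repeat loop; return the superstep at which each peer exits.

    errors_per_peer[p] is a hypothetical, non-empty list of the local errors
    peer p observes in successive supersteps (the last value repeats).
    """
    n = len(errors_per_peer)
    exit_step = [None] * n
    for step in range(1, max_steps + 1):
        # Local phase: each peer checks only its own convergence.
        wants_more = [errs[min(step - 1, len(errs) - 1)] > precision
                      for errs in errors_per_peer]
        # Communication phase: a peer that must continue sends a
        # "continue" message to every peer, itself included.
        inbox = [0] * n
        for sender in range(n):
            if wants_more[sender]:
                for receiver in range(n):
                    inbox[receiver] += 1
        # Barrier (sync): decisions are taken only after all messages
        # have arrived, so every peer sees the same vote.
        keep_going = [count > 0 for count in inbox]
        for peer in range(n):
            if not keep_going[peer]:
                exit_step[peer] = step
        if not any(keep_going):
            return exit_step
    raise RuntimeError("no convergence within max_steps")


# Peer 0 converges one superstep before peer 1, but both leave together.
print(run_repeat([[0.5, 0.05, 0.001], [0.5, 0.2, 0.08, 0.001]], 0.01))  # [4, 4]
```

A peer that decided to exit from its local error alone, before the barrier,
would reproduce the run-away behaviour described in the message.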
>>>>>>>>> By the way, I had a problem exchanging large amounts of data during
>>>>>>>>> sync (I discussed this with Thomas). My solution was to break a BSP
>>>>>>>>> superstep into multiple substeps so that each substep can handle a
>>>>>>>>> max number of messages. Of course my program has to collect all
>>>>>>>>> messages in a vector in memory. When the vector is too big, it is
>>>>>>>>> spilled to a local file. This moved the problem from the Hama side to
>>>>>>>>> my side and allowed me to handle larger data, especially in joins. I
>>>>>>>>> think this problem of exchanging large amounts of data during a
>>>>>>>>> superstep is currently a weakness of Hama.
>>>>>>>>> Leonidas
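
The workaround described above (split one superstep into substeps with a
bounded number of messages, collect what arrives in memory, and spill to a
local file when the in-memory vector grows too large) might look roughly
like the following. It is a minimal sketch in plain Python under stated
assumptions, not the MRQL implementation: substeps, SpillingBuffer,
max_per_substep and max_in_memory are hypothetical names, and pickle plus a
temporary file stand in for whatever serialization and local storage the
real system uses.

```python
import pickle
import tempfile


def substeps(messages, max_per_substep):
    """Yield successive batches; each batch would be sent in its own sync."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillingBuffer:
    """Collect received messages; spill them to a local file when too many."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []                       # the in-memory "vector"
        self.spill = tempfile.TemporaryFile()  # local scratch file
        self.spilled = 0                       # number of messages on disk

    def extend(self, batch):
        self.memory.extend(batch)
        if len(self.memory) > self.max_in_memory:
            self.spill.seek(0, 2)              # append at the end of the file
            for msg in self.memory:
                pickle.dump(msg, self.spill)
            self.spilled += len(self.memory)
            self.memory.clear()

    def __iter__(self):
        # Replay spilled messages first, then the ones still in memory,
        # which preserves arrival order. Do not call extend() mid-iteration.
        self.spill.seek(0)
        for _ in range(self.spilled):
            yield pickle.load(self.spill)
        yield from self.memory


buf = SpillingBuffer(max_in_memory=3)
for batch in substeps(list(range(10)), max_per_substep=4):
    buf.extend(batch)                          # one substep's worth of messages
print(list(buf) == list(range(10)))            # True
```

The per-substep cap bounds what the messaging layer has to carry in one
sync, while the spill threshold bounds what the receiving program keeps in
memory; the two limits are independent.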
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Thomas Jungblut <thomas.jungblut@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>
>>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>>
>>>>>>>>>>> Of course this is not as fast as the native BSP code; Hive and Pig
>>>>>>>>>>> suffer from the same problems in MR.
>>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>>> code into production with just a minor time expense.
>>>>>>>>>>>
>>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Edward J. Yoon <edwardyoon@apache.org>
>>>>>>>>>>>
>>>>>>>>>>>> Here are my few test results on Oracle BDA (40G/s infiniband
>>>>>>>>>>>> network).
>>>>>>>>>>>>
>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>
>>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>>
>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle
>>>>>>>>>>>> about 2383611 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle
>>>>>>>>>>>> about 1191805 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>>> <edwardyoon@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>>> <fegaras@cse.uta.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system
>>>>>>>>>>>>>> can now evaluate SQL-like queries on a Hama cluster. MRQL is
>>>>>>>>>>>>>> available at:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query
>>>>>>>>>>>>>> language for large-scale, distributed data analysis. MRQL is
>>>>>>>>>>>>>> powerful enough to express most common data analysis tasks over
>>>>>>>>>>>>>> many different kinds of raw data, including hierarchical data
>>>>>>>>>>>>>> and nested collections, such as XML data. MRQL can run in two
>>>>>>>>>>>>>> modes: in MR (Map-Reduce) mode using Apache Hadoop and in BSP
>>>>>>>>>>>>>> (Bulk Synchronous Parallel) mode using Apache Hama. Both modes
>>>>>>>>>>>>>> use Apache's HDFS to read and write their data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not
>>>>>>>>>>>>>> fine-tuned yet) and lacks any fault-tolerance (if an error
>>>>>>>>>>>>>> occurs, the entire job must be restarted). Due to our limited
>>>>>>>>>>>>>> resources, MRQL has only been tested on a small cluster
>>>>>>>>>>>>>> (7-nodes/28-cores). We compared the BSP mode with the MR mode by
>>>>>>>>>>>>>> evaluating a pagerank query over a small graph (100K nodes, 1M
>>>>>>>>>>>>>> edges) and found that BSP mode is about 4.5 times faster than
>>>>>>>>>>>>>> the MR mode. Please let me know if you'd like to contribute to
>>>>>>>>>>>>>> this project by testing MRQL on a larger cluster.
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
