airavata-dev mailing list archives

From Supun Nakandala <supun.nakand...@gmail.com>
Subject Re: Airavata Registry Considerations
Date Mon, 08 Jun 2015 10:29:46 GMT
Hi Shameera,

As you mentioned, this work is only to evaluate the use of a
document-oriented database (MongoDB) for Airavata use cases.

On Mon, Jun 8, 2015 at 8:29 AM, Shameera Rathnayaka <shameerainfo@gmail.com>
wrote:

> Hi Supun,
>
> I feel something fundamentally wrong here, as per your evaluation doc, we
> have something to fix in JPA and data model layer and improve it. Beside
> you are suggesting to have a new database with new implementation. What are
> we trying to do here?
>

The data model has also been changed. If you check the experiment document
now it contains all the other related entities as embedded documents.
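
A minimal sketch of what "embedded documents" means for updates, assuming
hypothetical field names (experimentId, tasks, taskId, status) rather than the
actual Airavata model. It contrasts rewriting the whole experiment document
with building a MongoDB-style update that touches only one embedded task.

```python
import copy

# Hypothetical experiment document; field names are illustrative only.
experiment = {
    "experimentId": "exp-1",
    "name": "demo",
    "tasks": [
        {"taskId": "task-1", "status": "CREATED"},
        {"taskId": "task-2", "status": "CREATED"},
    ],
}


def update_whole_document(doc, task_id, status):
    """Read-modify-write: copy the full document and change one task in it."""
    updated = copy.deepcopy(doc)
    for task in updated["tasks"]:
        if task["taskId"] == task_id:
            task["status"] = status
    return updated


def targeted_update_spec(experiment_id, task_id, status):
    """Build a (filter, update) pair that changes only the matching task.

    "tasks.$.status" uses MongoDB's positional operator, which refers to
    the first array element matched by the filter.
    """
    query = {"experimentId": experiment_id, "tasks.taskId": task_id}
    update = {"$set": {"tasks.$.status": status}}
    return query, update


new_doc = update_whole_document(experiment, "task-2", "COMPLETED")
query, update = targeted_update_spec("exp-1", "task-2", "COMPLETED")
```

With a driver such as pymongo, the first variant would typically be sent with
replace_one and the second with update_one(query, update); the sketch only
builds the data so that it runs without a database.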

> finding a good database and good implementation from one attempt? I would
> like to suggest to solve one by one. First try to fix or improve the
> current implementation.
>

I completely agree with you on this. Chathuri and I have done some
modifications to the Registry and PGA and now most of the slow operations
are fixed and they have good performance.
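
For readers unfamiliar with the select N+1 issue mentioned further down in
this thread, here is a small illustration using Python's sqlite3 module. The
tables and columns are hypothetical, not the Airavata schema, and this is not
necessarily the change that was made in the Registry; it only shows why one
joined query is cheaper than one query per experiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, created INTEGER);
    CREATE TABLE task (id INTEGER PRIMARY KEY, experiment_id INTEGER, status TEXT);
""")
for i in range(1, 31):
    conn.execute("INSERT INTO experiment VALUES (?, ?, ?)", (i, f"exp-{i}", i))
    conn.execute(
        "INSERT INTO task (experiment_id, status) VALUES (?, ?)", (i, "CREATED")
    )


def fetch_n_plus_one(conn, limit=20):
    """One query for the experiments, then one extra query per experiment."""
    queries = 1
    rows = conn.execute(
        "SELECT id FROM experiment ORDER BY created DESC LIMIT ?", (limit,)
    ).fetchall()
    result = {}
    for (exp_id,) in rows:
        tasks = conn.execute(
            "SELECT status FROM task WHERE experiment_id = ?", (exp_id,)
        ).fetchall()
        queries += 1
        result[exp_id] = [status for (status,) in tasks]
    return result, queries


def fetch_single_join(conn, limit=20):
    """Fetch the same data with a single joined query."""
    rows = conn.execute(
        """
        SELECT e.id, t.status
        FROM (SELECT id, created FROM experiment
              ORDER BY created DESC LIMIT ?) AS e
        LEFT JOIN task AS t ON t.experiment_id = e.id
        ORDER BY e.created DESC
        """,
        (limit,),
    ).fetchall()
    result = {}
    for exp_id, status in rows:
        result.setdefault(exp_id, [])
        if status is not None:
            result[exp_id].append(status)
    return result, 1


slow, slow_queries = fetch_n_plus_one(conn)
fast, fast_queries = fetch_single_join(conn)
```

Both functions return the same data for the 20 most recent experiments, but
the first issues 21 statements and the second issues one.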


> Then try to evaluate which database is good or bad based on Airavata use
> cases, instead of selecting one of the handy DBs in the industry. People still
> use trains even though there are a lot of quality cars, because cars don't
> meet all requirements. I have experience of using a NoSQL database in an
> environment that didn't suit it; we suffered a lot in production
> and moved back to SQL.
>
> I don't see a considerable reason to move from an SQL database to NoSQL;
> maybe I am wrong here. But looking at the details, we have a strongly typed
> schema and not a considerably large number of insert requests (a few inserts
> per experiment; at, say, 100 experiments per second, we will have a few
> hundred inserts per second, which going by the MySQL benchmarks is not a
> problem for MySQL at all:
> https://www.mysql.com/why-mysql/benchmarks/ ). As the whole backend depends
> on the state of an experiment, task or job, consistency is a major
> requirement, which is also lacking in NoSQL environments. After an experiment
> is created we only update a few of its parameters, so we can reduce the
> database read count using proper caching in the database access layer.
>

Consistency is something that I have not yet evaluated, but I agree with
your point.


> It is worth evaluating new technologies and evolving with them, but we need
> to make a wise decision in the selection process.
>
>
One other drawback of the MongoDB approach is that, since it is schemaless,
existing data is not migrated for us when the schema changes. We can easily
change the models and start on a fresh database, but MongoDB has no built-in
support for data migrations. The only way to achieve this is to write code
that reads the data from the existing db, changes the models in the
application layer, and re-inserts them.
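
A rough sketch of such an application-layer migration, using plain Python
dicts in place of a real collection. The "owner" to "userName" rename and the
schemaVersion field are hypothetical examples, not part of the Airavata model.

```python
def migrate_v1_to_v2(doc):
    """Return a copy of a v1 document reshaped to a hypothetical v2 model."""
    migrated = dict(doc)
    # Hypothetical model change: the field "owner" is renamed to "userName".
    if "owner" in migrated:
        migrated["userName"] = migrated.pop("owner")
    migrated["schemaVersion"] = 2
    return migrated


def migrate_collection(documents):
    """Read every document, migrate the old ones, pass current ones through."""
    migrated_docs = []
    for doc in documents:
        # Documents without a version marker are treated as version 1.
        if doc.get("schemaVersion", 1) < 2:
            migrated_docs.append(migrate_v1_to_v2(doc))
        else:
            migrated_docs.append(doc)
    return migrated_docs


old_docs = [
    {"experimentId": "exp-1", "owner": "alice"},
    {"experimentId": "exp-2", "userName": "bob", "schemaVersion": 2},
]
new_docs = migrate_collection(old_docs)
```

Against a real database the loop would read from the existing collection and
write the migrated documents back; a version marker like the one assumed here
makes it possible to re-run the migration safely.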


> Thanks,
> Shameera.
>
> On Sun, Jun 7, 2015 at 2:42 PM, Pierce, Marlon <marpierc@iu.edu> wrote:
>
>>  Thanks, Supun, great to see your concrete evaluations of this
>> approach.  Is it also possible to link documents so that frequently
>> changing parts of the data model can be separate docs and don’t need to be
>> embedded in larger documents?
>>
>>  What other ways can we concretely compare this to the JPA/MySQL
>> approach we use now?
>>
>>  Marlon
>>
>>
>>   From: Supun Nakandala <supun.nakandala@gmail.com>
>> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
>> Date: Sunday, June 7, 2015 at 1:49 PM
>> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
>> Subject: Re: Airavata Registry Considerations
>>
>>   Hi All,
>>
>>  I did the initial POC on the subject and would like to summarize the
>> findings here.
>>
>>  I used the same Registry CPI and RegistryImpl class and tried a
>> MongoDB back end instead of the JPA one. The architecture of the module
>> is as follows
>> [image: Inline image 1]
>>  MongoDB stores its data as JSON-like documents (BSON internally) and we
>> can simply insert JSON data directly. Therefore I have used a Thrift to JSON
>> conversion layer and got rid of the additional DB models. The conversion
>> layer is based on a generic serializer/deserializer and therefore it is
>> easy to make data model changes without changing the registry. But if we
>> make changes to ID fields and indexed fields, those changes have to be
>> applied to the MongoDB indexes. I see getting rid of the additional DB model
>> classes layer as a plus point. This also reduces the developer effort
>> required when incorporating changes.
>>
>>  One major difference between the current registry data modelling and
>> this approach is the Experiment model. In the MongoDB based model all
>> data related to an experiment is stored in the same experiment document. A
>> sample JSON which gets stored in the database would be similar to this
>> https://gist.github.com/scnakandala/19fe3c6edf3be3354439. It is possible
>> to change sub-document contents and retrieve sub-documents only, e.g.
>> retrieving a specific task. But in this POC I have not used that function.
>> When updating, I have retrieved the entire document, updated the required
>> fields in the application logic, and written back the entire document. The
>> reason for this is to keep the DAO objects as simple as possible and to
>> do the implementation quickly. However, the sub-document approach is said
>> to have slight performance advantages over the whole-document approach
>> used here.
>>
>>  I have pushed the changes to Airavata git repository under the branch
>> name "mongo-registry". The implementation is not 100% complete yet but
>> captures most of the idea.
>>
>>  Things I have not investigated yet
>>
>>  1. Read/Write performance
>> 2. Cluster deployment of MongoDB and Consistency of data
>>
>>  Thanks
>> Supun
>>
>>
>>
>>
>> On Fri, May 22, 2015 at 10:33 PM, Suresh Marru <smarru@apache.org> wrote:
>>
>>> Hi Supun & Supun :)
>>>
>>>  This is a good discussion. I think we need to balance both aspects here.
>>> I am not at all in favor of shoehorning into MongoDB and again spending a
>>> few months addressing the unknowns. On the other hand, GSoC is the right
>>> time to explore alternatives.
>>>
>>>  My expectation from this document was not so much of criticizing the
>>> current JPA based implementation. Back then the focus was to adopt thrift
>>> for the data models (thanks Supun K for the recommendation). Among other
>>> things, thrift helped us to keep the focus on airavata’s core capabilities
>>> and quickly unify all the legacy interfaces. The current JPA registry was
>>> developed from scratch in a hurry to help with thrift adoption. I think it
>>> did well and exceeded initial expectations.
>>>
>>>  We have now slowly circled through all the components and made tremendous
>>> progress. We reduced the internal footprint significantly (RabbitMQ in
>>> place of WS Messenger, work queues in place of custom co-ordination in
>>> workflow interpreter and so forth). I think it's time to step back and
>>> re-look at the metadata management needs.
>>>
>>>  How about we not worry about the implementation costs and focus on what
>>> criteria we should use to assess potential solutions and how to profile
>>> them? We should also include the full JPA based implementation as one of
>>> the candidates. As both of you said, it's important to identify the
>>> profiling criteria. Chathuri has early work on this, in both the survey
>>> paper and performance measurements; we probably should revisit them and
>>> build from there.
>>>
>>>  Thanks,
>>> Suresh
>>>
>>>  On May 22, 2015, at 12:28 PM, Supun Nakandala <
>>> supun.nakandala@gmail.com> wrote:
>>>
>>>  Hi Supun,
>>>
>>> On Fri, May 22, 2015 at 9:42 PM, Supun Kamburugamuve <supun06@gmail.com>
>>>  wrote:
>>>
>>>> Hi Supun,
>>>>
>>>>  In normal software development, it is normal to have this kind of
>>>> slowness. We cannot foresee everything when we develop. The solution is
>>>> to improve the performance of important operations rather than re-writing
>>>> everything from the beginning. For example, for these particular select
>>>> operations you can directly use SQL rather than going through JPA.
>>>>
>>>>  I'm pretty sure you'll encounter more problems, if you implement this
>>>> in MongoDB than in the current MySQL. If that happens, do you think
>>>> abandoning that technology and going for a new database will be a good
>>>> solution? Now you have more experience with MySQL than MongoDB as well.
>>>>
>>>>  Rather than going to abandon everything you have because of one
>>>> problem, trying to fix it may be better for you in the long run.
>>>>
>>>>  Thanks,
>>>> Supun..
>>>>
>>>  I completely agree with you. Writing things from scratch will need
>>> more development effort and proper testing, and has the potential to
>>> introduce new unknown issues. It is completely possible to fix these
>>> issues in the current registry and I have mentioned that in the doc also.
>>>
>>>  In addition to that I also checked several other alternatives and
>>> found MongoDB interesting. I am not saying that we should completely
>>> rewrite registry using MongoDB. But I think it is worth exploring it at a
>>> POC level.
>>>
>>>
>>>>   On Fri, May 22, 2015 at 11:49 AM, Supun Nakandala <
>>>> supun.nakandala@gmail.com> wrote:
>>>>
>>>>> Hi Supun,
>>>>>
>>>>>  I haven't done profiling of registry based operations. Here
>>>>> what I mean by slow performance is mainly the slowness of the SELECT
>>>>> operations in the PHP Reference Gateway, e.g. fetching projects or
>>>>> fetching experiments. Even a simple query to fetch the 20 most recent
>>>>> experiments is embarrassingly slow in PGA.
>>>>>
>>>>>  Even though I didn't do a proper profiling of operations, I did a
>>>>> query log analysis for a SELECT experiment query. This was a simple query
>>>>> to fetch the 20 most recent experiments. I found that the JPA layer is
>>>>> generating an enormous number of queries for this task rather than one
>>>>> single query (due to the select N+1 issue). The issue is the same when
>>>>> fetching a single experiment by specifying the id.
>>>>>
>>>>>  I think it is ok to say that the current registry has become a
>>>>> bottleneck for most of the PGA specific operations. But I don't have
>>>>> evidence to show how it has become a bottleneck for the Orchestrator or
>>>>> GFac specific operations. For that, as you have mentioned, we need to
>>>>> profile the operations. But I think the argument is still valid even for
>>>>> GFac and Orchestrator based operations.
>>>>>
>>>>>  I have attached the query log for the above mentioned select
>>>>> operation here with. If you observe the query log you can see that every
>>>>> associated entity is fetched separately using complex join operations.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 22, 2015 at 8:05 PM, Supun Kamburugamuve <
>>>>> supun06@gmail.com> wrote:
>>>>>
>>>>>> Hi Supun,
>>>>>>
>>>>>>  In your report it says "Slow performance". Do you have any data about
>>>>>> this slow performance? For a typical request, by what percentage does
>>>>>> the registry slow down the processing compared to the overall time it
>>>>>> takes to execute that request?
>>>>>>
>>>>>>  Do you have a use case where registry is the bottleneck?
>>>>>>
>>>>>>  Thanks,
>>>>>> Supun..
>>>>>>
>>>>>> On Fri, May 22, 2015 at 9:45 AM, Suresh Marru <smarru@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Supun,
>>>>>>>
>>>>>>>  This is very good analysis, you have nicely embraced the problem.
>>>>>>> Before we jump into the solution, we may want to do small POCs to
>>>>>>> validate your claims.
>>>>>>>
>>>>>>>  Thank you for getting a head start; this also cuts into the GSoC
>>>>>>> goals of Douglas’s project. So let's work on this collaboratively.
>>>>>>>
>>>>>>>  Hi Madhu,
>>>>>>>
>>>>>>>  Can you please provide guidance on how to academically approach
>>>>>>> the data management challenges of Airavata? The students might
>>>>>>> appreciate insights on how to profile and benchmark any
>>>>>>> possible solutions.
>>>>>>>
>>>>>>>  Cheers,
>>>>>>> Suresh
>>>>>>>
>>>>>>>  On May 22, 2015, at 9:18 AM, Supun Nakandala <
>>>>>>> supun.nakandala@gmail.com> wrote:
>>>>>>>
>>>>>>>  Hi Devs,
>>>>>>>
>>>>>>>  I have compiled a document based on the analysis I did on current
>>>>>>> registry architecture/technology and possible modification and
>>>>>>> alternatives. You can find the document at
>>>>>>> https://docs.google.com/document/d/1XWAQLtdtCf9nTigAz6r5JINHR99bP0oeYaTgeEIVr4w/edit#
>>>>>>>
>>>>>>>  Thanks
>>>>>>> Supun
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>   --
>>>>>> Supun Kamburugamuva
>>>>>> Member, Apache Software Foundation; http://www.apache.org
>>>>>> E-mail: supun06@gmail.com;  Mobile: +1 812 369 6762
>>>>>> Blog: http://supunk.blogspot.com
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Thank you
>>>>> Supun Nakandala
>>>>> Dept. Computer Science and Engineering
>>>>> University of Moratuwa
>>>>>
>>>>
>>>>
>>>>
>>>>  --
>>>> Supun Kamburugamuva
>>>> Member, Apache Software Foundation; http://www.apache.org
>>>> E-mail: supun06@gmail.com;  Mobile: +1 812 369 6762
>>>> Blog: http://supunk.blogspot.com
>>>>
>>>>
>>>
>>>
>>>  --
>>> Thank you
>>> Supun Nakandala
>>> Dept. Computer Science and Engineering
>>> University of Moratuwa
>>>
>>>
>>>
>>
>>
>>  --
>> Thank you
>> Supun Nakandala
>> Dept. Computer Science and Engineering
>> University of Moratuwa
>>
>
>
>
> --
> Best Regards,
> Shameera Rathnayaka.
>
> email: shameera AT apache.org , shameerainfo AT gmail.com
> Blog : http://shameerarathnayaka.blogspot.com/
>



-- 
Thank you
Supun Nakandala
Dept. Computer Science and Engineering
University of Moratuwa
