airavata-dev mailing list archives

From Lahiru Gunathilake <glah...@gmail.com>
Subject Re: Profiling the current Airavata registry
Date Thu, 14 Aug 2014 14:09:40 GMT
Hi Sachith,

I think we should use MySQL, which is our recommended production database. I
think we should do the performance test with the production scenario.

Lahiru
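A production-scenario timing run over the search methods discussed below could be scripted roughly like this (a minimal sketch: `time_call` and the stub functions are hypothetical stand-ins for the real Airavata Thrift client calls, not the actual API):

```python
import statistics
import time

def time_call(fn, *args, repeat=5):
    """Time fn(*args) over several runs; return (median, max) in seconds.

    Repeating matters because of OpenJPA caching: the first query is
    much slower than subsequent ones, as noted later in this thread.
    """
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)

# Stubs standing in for the real registry calls (getExperiment(),
# searchExperimentByName, ...) -- swap in Thrift client calls here.
def get_experiment(experiment_id):
    time.sleep(0.001)  # simulated round trip

def search_experiment_by_name(user, name):
    time.sleep(0.002)  # simulated round trip

for label, fn, args in [
    ("getExperiment", get_experiment, ("exp_1",)),
    ("searchExperimentByName", search_experiment_by_name, ("user1", "echo")),
]:
    med, worst = time_call(fn, *args)
    print(f"{label}: median={med * 1000:.1f} ms, max={worst * 1000:.1f} ms")
```

Reporting both the median and the worst sample separates the cold (first-query) cost from the cached steady state.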


On Thu, Aug 14, 2014 at 7:35 PM, Sachith Withana <swsachith@gmail.com>
wrote:

> The Derby one.
>
>
> On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <kamalasini@gmail.com>
> wrote:
>
>> Hi Sachith,
>>
>> Which DB are you using to do the profiling?
>>
>>
>> On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <swsachith@gmail.com>
>> wrote:
>>
>>> Here's how I've written the script to do it.
>>>
>>> Experiments loaded:
>>> 10 users, 4 projects per user;
>>> each user would have 1,000 to 100,000 experiments (1,000, 10,000, 100,000),
>>> containing experiments like Echo and Amber.
>>>
>>> Methods tested:
>>>
>>> getExperiment()
>>> searchExperimentByName
>>> searchExperimentByApplication
>>> searchExperimentByDescription
>>>
>>> WDYT?
>>>
>>>
>>> On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <marpierc@iu.edu> wrote:
>>>
>>>> You can start with the API search functions that we have now: by name,
>>>> by application, by description.
>>>>
>>>> Marlon
>>>>
>>>>
>>>> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>>>>
>>>>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <marpierc@iu.edu>
>>>>> wrote:
>>>>>
>>>>>> A single user may have O(100) to O(1000) experiments, so 10K is too
>>>>>> small as an upper bound on the registry for many users.
>>>>>>
>>>>> +1
>>>>>
>>>>> I agree with Marlon. We have the most basic search method, but the
>>>>> reality is we need search criteria like Marlon suggests, and I am sure
>>>>> content-based search will be pretty slow with a large number of
>>>>> experiments. So we have to use a search platform like Solr to improve
>>>>> the performance.
>>>>>
>>>>> I think first you can do the performance test without content-based
>>>>> search; then we can implement that feature and do the performance
>>>>> analysis. If it's too bad (more likely), then we can integrate a search
>>>>> platform to improve the performance.
>>>>>
>>>>> Lahiru
>>>>>
>>>>>> We should really test until things break.  A plot implying infinite
>>>>>> scaling (by extrapolation) is not useful.  A plot showing OK scaling
>>>>>> up to a certain point before things decay is useful.
>>>>>>
>>>>>> I suggest you post more carefully a set of experiments, starting with
>>>>>> Lahiru's suggestion. How many users? How many experiments per user?
>>>>>> What kind of searches?  Probably the most common will be "get all my
>>>>>> experiments that match this string", "get all experiments that have
>>>>>> state FAILED", and "get all my experiments from the last 30 days".
>>>>>> But the API may not have the latter two yet.
>>>>>>
>>>>>> So to start, you should specify a prototype user.  For example, each
>>>>>> user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc.
>>>>>> Each user will have a unique but human-readable name (user1, user2,
>>>>>> ...). Each experiment will have a unique human-readable description
>>>>>> (AMBER job 1 for user 1, AMBER job 2 for user 1, ...), etc., that is
>>>>>> suitable for searching.
>>>>>>
>>>>>> Post these details first, and then you can create experiment
>>>>>> registries of any size via scripts. Each experiment is different but
>>>>>> suitable for pattern searching.
>>>>>>
>>>>>> This is 10 minutes' worth of thought while waiting for my tea to
>>>>>> brew, so hopefully this is the right start, but I encourage you to
>>>>>> not take this as fixed instructions.
>>>>>>
>>>>>> Marlon
>>>>>>
>>>>>>
>>>>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>>>>
>>>>>>> Hi Sachith,
>>>>>>>
>>>>>>> How did you test this? What database did you use?
>>>>>>>
>>>>>>> I think 1000 experiments is a very low number. I think the most
>>>>>>> important part is when there are a large number of experiments: how
>>>>>>> expensive the search is, and how expensive a single experiment
>>>>>>> retrieval is.
>>>>>>>
>>>>>>> If we support getting a defined number of experiments in the API (I
>>>>>>> think this is the practical scenario: among 10k experiments, get
>>>>>>> 100), we have to test the performance of that too.
>>>>>>>
>>>>>>> Regards
>>>>>>> Lahiru
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <swsachith@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm testing the registry with 10, 1,000, and 10,000 experiments, and
>>>>>>>> I've tested the database performance executing the getAllExperiments
>>>>>>>> method. I'll post the complete analysis.
>>>>>>>>
>>>>>>>> What are the other methods that I should test using?
>>>>>>>>
>>>>>>>> getExperiment(experiment_id)
>>>>>>>> searchExperiment
>>>>>>>>
>>>>>>>> Any pointers?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <marpierc@iu.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks, Sachith. Did you look at scaling also?  That is, will the
>>>>>>>> operations below still be the slowest if the DB is 10x, 100x, 1000x
>>>>>>>> bigger?
>>>>>>>>>
>>>>>>>>> Marlon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>>>>
>>>>>>>>>> I looked into the database operations, and I've listed the
>>>>>>>>>> operations that take the most amount of time.
>>>>>>>>>>
>>>>>>>>>> 1. Getting the status of an experiment (takes around 10% of the
>>>>>>>>>> overall time spent)
>>>>>>>>>>        Has to go through the hierarchy of the datamodel to get to
>>>>>>>>>> the actual experiment status (node, tasks, etc.)
>>>>>>>>>>
>>>>>>>>>> 2. Dealing with the Application Inputs
>>>>>>>>>>        Strangely, it takes a long time for the queries regarding
>>>>>>>>>> the ApplicationInputs to complete.
>>>>>>>>>>        This is a part of the new Application Catalog.
>>>>>>>>>>
>>>>>>>>>> 3. Getting all the experiments (using the * wildcard)
>>>>>>>>>>        This takes the maximum amount of time when queried at
>>>>>>>>>> first, but thanks to the OpenJPA caching, it flattens out as we
>>>>>>>>>> keep querying.
>>>>>>>>>>
>>>>>>>>>> To reduce the first issue, I would suggest having a separate
>>>>>>>>>> table for Experiment Summaries, where the status (both the state
>>>>>>>>>> and the state update time) would be the only varying entity, and
>>>>>>>>>> using that to improve the query time for experiment summaries.
>>>>>>>>>>
>>>>>>>>>> It would also help improve the performance of getting all the
>>>>>>>>>> experiments (experiment summaries).
>>>>>>>>>>
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>> ToDos: look into memory consumption (in terms of memory leakage,
>>>>>>>>>> etc.)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Any more suggestions?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> Sachith Withana
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>>  Sachith Withana
>>>
>>>
>>
>
>
> --
> Thanks,
> Sachith Withana
>
>


-- 
System Analyst Programmer
PTI Lab
Indiana University
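Sachith's Experiment Summaries idea from the thread above — keeping the status (state plus state-update time) in one flat table instead of walking the experiment/node/task hierarchy — could be prototyped along these lines (a sketch using SQLite as a stand-in for the production MySQL; the table and column names are hypothetical, not Airavata's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiment_summary (
        experiment_id  TEXT PRIMARY KEY,
        user_name      TEXT NOT NULL,
        name           TEXT NOT NULL,
        application    TEXT NOT NULL,
        state          TEXT NOT NULL,  -- the only frequently-varying
        state_updated  TEXT NOT NULL   -- columns: state + its update time
    )
""")

# On every status change, only this flat row is touched -- no joins
# through the experiment -> node -> task hierarchy.
def update_status(experiment_id, state, updated):
    conn.execute(
        "UPDATE experiment_summary SET state = ?, state_updated = ? "
        "WHERE experiment_id = ?",
        (state, updated, experiment_id),
    )

conn.execute(
    "INSERT INTO experiment_summary VALUES (?, ?, ?, ?, ?, ?)",
    ("exp_1", "user1", "echo run 1", "Echo", "CREATED", "2014-08-12T09:00"),
)
update_status("exp_1", "FAILED", "2014-08-12T09:05")

# A query like "get all my experiments that have state FAILED" becomes
# a single-table scan instead of a multi-join:
rows = conn.execute(
    "SELECT experiment_id FROM experiment_summary "
    "WHERE user_name = ? AND state = ?",
    ("user1", "FAILED"),
).fetchall()
print(rows)  # → [('exp_1',)]
```

The trade-off is the usual denormalization one: status writes must update this table in the same transaction as the detailed records, in exchange for cheap summary reads.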
