airavata-dev mailing list archives

From Sachith Withana <swsach...@gmail.com>
Subject Re: Profiling the current Airavata registry
Date Thu, 14 Aug 2014 14:05:03 GMT
The Derby one.


On Thu, Aug 14, 2014 at 7:06 PM, Chathuri Wimalasena <kamalasini@gmail.com>
wrote:

> Hi Sachith,
>
> Which DB are you using to do the profiling?
>
>
> On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <swsachith@gmail.com>
> wrote:
>
>> Here's how I've written the script to do it.
>>
>> Experiments loaded:
>> 10 users, 4 projects per user;
>> each user would have 1,000 to 100,000 experiments (1,000, 10,000, 100,000),
>> containing experiments such as Echo and Amber.
>>
>> Methods tested:
>>
>> getExperiment()
>> searchExperimentByName
>> searchExperimentByApplication
>> searchExperimentByDescription
>>
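>> A rough sketch of how the script times each of these (the lambda bodies
>> below are placeholders for the real client calls, e.g. getExperiment and
>> the three search methods):
>>
>> import java.util.concurrent.TimeUnit;
>> import java.util.function.Supplier;
>>
>> /** Minimal wall-clock timer for the registry calls under test (sketch). */
>> public class RegistryProfiler {
>>
>>     /** Runs a call, prints the elapsed time, and returns its result. */
>>     static <T> T time(String label, Supplier<T> call) {
>>         long start = System.nanoTime();
>>         T result = call.get();
>>         long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
>>         System.out.println(label + ": " + ms + " ms");
>>         return result;
>>     }
>>
>>     public static void main(String[] args) {
>>         // Placeholder lambdas -- substitute the real registry/API calls.
>>         time("getExperiment", () -> "replace with client.getExperiment(expId)");
>>         time("searchExperimentByName", () -> "replace with the real call");
>>         time("searchExperimentByApplication", () -> "replace with the real call");
>>         time("searchExperimentByDescription", () -> "replace with the real call");
>>     }
>> }
>>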
>> WDYT?
>>
>>
>> On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <marpierc@iu.edu> wrote:
>>
>>> You can start with the API search functions that we have now: by name,
>>> by application, by description.
>>>
>>> Marlon
>>>
>>>
>>> On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
>>>
>>>> On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <marpierc@iu.edu> wrote:
>>>>
>>>>  A single user may have O(100) to O(1000) experiments, so 10K is too
>>>>> small
>>>>> as an upper bound on the registry for many users.
>>>>>
>>>> +1
>>>>
>>>> I agree with Marlon. We have the most basic search method, but the
>>>> reality is we need search criteria like Marlon suggests, and I am sure
>>>> content-based search will be pretty slow with a large number of
>>>> experiments, so we have to use a search platform like Solr to improve
>>>> the performance.
>>>>
>>>> I think you can first do the performance test without content-based
>>>> search; then we can implement that feature and do the performance
>>>> analysis again. If it is too slow (which is likely), we can integrate a
>>>> search platform such as Solr to improve the performance.
>>>>
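>>>> Just to make the Solr idea concrete, indexing the searchable experiment
>>>> fields and querying them with SolrJ would look roughly like this (the
>>>> core name, field names and sample values are made up for illustration):
>>>>
>>>> import org.apache.solr.client.solrj.SolrQuery;
>>>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>>> import org.apache.solr.client.solrj.response.QueryResponse;
>>>> import org.apache.solr.common.SolrInputDocument;
>>>>
>>>> public class ExperimentIndexSketch {
>>>>     public static void main(String[] args) throws Exception {
>>>>         // "experiments" is a hypothetical Solr core for this sketch.
>>>>         HttpSolrServer solr =
>>>>                 new HttpSolrServer("http://localhost:8983/solr/experiments");
>>>>
>>>>         // Index one experiment; a real loader would loop over the registry.
>>>>         SolrInputDocument doc = new SolrInputDocument();
>>>>         doc.addField("id", "exp-001");
>>>>         doc.addField("name", "Amber job 1 for user1");
>>>>         doc.addField("application", "Amber");
>>>>         doc.addField("description", "Amber job 1 for user1, profiling data");
>>>>         solr.add(doc);
>>>>         solr.commit();
>>>>
>>>>         // Content-based search becomes a Solr query, not a SQL LIKE scan.
>>>>         QueryResponse rsp = solr.query(new SolrQuery("description:amber"));
>>>>         System.out.println("hits: " + rsp.getResults().getNumFound());
>>>>         solr.shutdown();
>>>>     }
>>>> }
>>>>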
>>>> Lahiru
>>>>
>>>>  We should really test until things break.  A plot implying infinite
>>>>> scaling (by extrapolation) is not useful.  A plot showing OK scaling
>>>>> up to
>>>>> a certain point before things decay is useful.
>>>>>
>>>>> I suggest you post more carefully a set of experiments, starting with
>>>>> Lahiru's suggestion. How many users? How many experiments per user?
>>>>>  What
>>>>> kind of searches?  Probably the most common will be "get all my
>>>>> experiments
>>>>> that match this string", "get all experiments that have state FAILED",
>>>>> and
>>>>> "get all my experiments from the last 30 days".  But the API may not
>>>>> have
>>>>> the latter two yet.
>>>>>
>>>>> So to start, you should specify a prototype user.  For example, each
>>>>> user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc.
>>>>> Each user will have a unique but human-readable name (user1, user2,
>>>>> ...). Each experiment will have a unique, human-readable description
>>>>> (AMBER job 1 for user 1, AMBER job 2 for user 1, ...) that is suitable
>>>>> for searching.
>>>>>
>>>>> Post these details first, and then you can create, via scripts,
>>>>> experiment registries of any size. Each experiment is different but
>>>>> suitable for pattern searching.
>>>>>
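>>>>> For example, the generator script could be shaped roughly like this
>>>>> (just a sketch; saveExperiment stands in for whatever registry/API call
>>>>> the script actually uses to create an experiment):
>>>>>
>>>>> import java.util.Locale;
>>>>>
>>>>> /** Sketch of a registry-population script following the prototype above. */
>>>>> public class ExperimentGenerator {
>>>>>     // Example application mix; adjust the counts per user as needed.
>>>>>     private static final String[] APPLICATIONS = {"AMBER", "LAMMPS", "Echo"};
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>         int users = 10;
>>>>>         int experimentsPerUser = 1000;
>>>>>         for (int u = 1; u <= users; u++) {
>>>>>             String userName = "user" + u;            // unique, human readable
>>>>>             for (int e = 1; e <= experimentsPerUser; e++) {
>>>>>                 String application = APPLICATIONS[e % APPLICATIONS.length];
>>>>>                 String name = String.format(Locale.US, "%s job %d for %s",
>>>>>                         application, e, userName);
>>>>>                 String description = name + ", generated for profiling";
>>>>>                 saveExperiment(userName, application, name, description);
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>>     /** Placeholder for the real call that creates an experiment. */
>>>>>     private static void saveExperiment(String user, String app, String name,
>>>>>                                        String description) {
>>>>>         System.out.printf("would create: %s / %s%n", user, name);
>>>>>     }
>>>>> }
>>>>>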
>>>>> This is 10 minutes' worth of thought while waiting for my tea to brew,
>>>>> so hopefully this is the right start, but I encourage you not to take
>>>>> this as fixed instructions.
>>>>>
>>>>> Marlon
>>>>>
>>>>>
>>>>> On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:
>>>>>
>>>>>  Hi Sachith,
>>>>>>
>>>>>> How did you test this? What database did you use?
>>>>>>
>>>>>> I think 1000 experiments is a very low number. The most important part
>>>>>> is, when there are a large number of experiments, how expensive the
>>>>>> search is and how expensive a single experiment retrieval is.
>>>>>>
>>>>>> If we support fetching a defined number of experiments through the API
>>>>>> (I think this is the practical scenario: among 10k experiments, get
>>>>>> 100), we have to test the performance of that too.
>>>>>>
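>>>>>> Since the registry sits on JPA/OpenJPA, that paginated case can be
>>>>>> exercised directly with setFirstResult/setMaxResults; a rough sketch
>>>>>> (the persistence-unit and entity names here are just placeholders):
>>>>>>
>>>>>> import java.util.List;
>>>>>> import javax.persistence.EntityManager;
>>>>>> import javax.persistence.EntityManagerFactory;
>>>>>> import javax.persistence.Persistence;
>>>>>>
>>>>>> public class PagedFetchProfiler {
>>>>>>     public static void main(String[] args) {
>>>>>>         // Placeholder persistence-unit name.
>>>>>>         EntityManagerFactory emf =
>>>>>>                 Persistence.createEntityManagerFactory("registry_test");
>>>>>>         EntityManager em = emf.createEntityManager();
>>>>>>
>>>>>>         long start = System.nanoTime();
>>>>>>         // "Experiment" is a placeholder entity name.
>>>>>>         List<?> page = em.createQuery("SELECT e FROM Experiment e")
>>>>>>                 .setFirstResult(0)    // offset into the result set
>>>>>>                 .setMaxResults(100)   // "among 10k experiments get 100"
>>>>>>                 .getResultList();
>>>>>>         long ms = (System.nanoTime() - start) / 1000000;
>>>>>>
>>>>>>         System.out.println("fetched " + page.size() + " rows in " + ms + " ms");
>>>>>>         em.close();
>>>>>>         emf.close();
>>>>>>     }
>>>>>> }
>>>>>>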
>>>>>> Regards
>>>>>> Lahiru
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <swsachith@gmail.com
>>>>>> >
>>>>>> wrote:
>>>>>>
>>>>>>   Hi all,
>>>>>>
>>>>>>> I'm testing the registry with 10, 1,000 and 10,000 experiments, and
>>>>>>> I've tested the database performance by executing the
>>>>>>> getAllExperiments method. I'll post the complete analysis.
>>>>>>>
>>>>>>> What are the other methods that I should test?
>>>>>>>
>>>>>>> getExperiment(experiment_id)
>>>>>>> searchExperiment
>>>>>>>
>>>>>>> Any pointers?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <marpierc@iu.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>   Thanks, Sachith. Did you look at scaling also?  That is, will the
>>>>>>>> operations below still be the slowest if the DB is 10x, 100x, 1000x
>>>>>>>> bigger?
>>>>>>>>
>>>>>>>> Marlon
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/23/14, 8:22 AM, Sachith Withana wrote:
>>>>>>>>
>>>>>>>>   Hi all,
>>>>>>>>
>>>>>>>>> I'm profiling the current registry in a few different aspects.
>>>>>>>>>
>>>>>>>>> I looked into the database operations and I've listed the operations
>>>>>>>>> that take the most amount of time.
>>>>>>>>>
>>>>>>>>> 1. Getting the Status of an Experiment (takes around 10% of the
>>>>>>>>> overall time spent)
>>>>>>>>>        It has to go through the hierarchy of the data model (nodes,
>>>>>>>>> tasks ...etc) to get to the actual experiment status.
>>>>>>>>>
>>>>>>>>> 2. Dealing with the Application Inputs
>>>>>>>>>        Strangely, the queries regarding the ApplicationInputs take a
>>>>>>>>> long time to complete. This is a part of the new Application Catalog.
>>>>>>>>>
>>>>>>>>> 3. Getting all the Experiments (using the * wildcard)
>>>>>>>>>        This takes the maximum amount of time when first queried, but
>>>>>>>>> thanks to the OpenJPA caching it flattens out as we keep querying.
>>>>>>>>>
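>>>>>>>>> For reference, the OpenJPA data and query caches behind that
>>>>>>>>> flattening can also be switched on explicitly; something like this
>>>>>>>>> (the persistence-unit name is a placeholder):
>>>>>>>>>
>>>>>>>>> import java.util.HashMap;
>>>>>>>>> import java.util.Map;
>>>>>>>>> import javax.persistence.EntityManagerFactory;
>>>>>>>>> import javax.persistence.Persistence;
>>>>>>>>>
>>>>>>>>> public class CachedFactory {
>>>>>>>>>     static EntityManagerFactory create() {
>>>>>>>>>         Map<String, String> props = new HashMap<String, String>();
>>>>>>>>>         props.put("openjpa.DataCache", "true");   // cache loaded entities
>>>>>>>>>         props.put("openjpa.QueryCache", "true");  // cache query results
>>>>>>>>>         // single-JVM commit provider, needed when the data cache is on
>>>>>>>>>         props.put("openjpa.RemoteCommitProvider", "sjvm");
>>>>>>>>>         // "registry_test" is a placeholder persistence-unit name.
>>>>>>>>>         return Persistence.createEntityManagerFactory("registry_test", props);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>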
>>>>>>>>> To reduce the first issue, I would suggest having a separate table
>>>>>>>>> for Experiment Summaries, where the status (both the state and the
>>>>>>>>> state update time) would be the only varying entity, and using that
>>>>>>>>> to improve the query time for experiment summaries.
>>>>>>>>>
>>>>>>>>> It would also help improve the performance of getting all the
>>>>>>>>> Experiments (experiment summaries).
>>>>>>>>>
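>>>>>>>>> Roughly what I have in mind for that table, expressed as a JPA
>>>>>>>>> entity (the column set here is only a sketch, not a final schema):
>>>>>>>>>
>>>>>>>>> import java.sql.Timestamp;
>>>>>>>>> import javax.persistence.Entity;
>>>>>>>>> import javax.persistence.Id;
>>>>>>>>> import javax.persistence.Table;
>>>>>>>>>
>>>>>>>>> /** One denormalized summary row per experiment (sketch). */
>>>>>>>>> @Entity
>>>>>>>>> @Table(name = "EXPERIMENT_SUMMARY")
>>>>>>>>> public class ExperimentSummary {
>>>>>>>>>     @Id
>>>>>>>>>     private String experimentId;
>>>>>>>>>
>>>>>>>>>     private String userName;
>>>>>>>>>     private String projectId;
>>>>>>>>>     private String experimentName;
>>>>>>>>>     private String applicationId;
>>>>>>>>>     private String description;
>>>>>>>>>
>>>>>>>>>     // The only frequently varying columns, updated on each status change.
>>>>>>>>>     private String state;
>>>>>>>>>     private Timestamp stateUpdateTime;
>>>>>>>>>
>>>>>>>>>     // getters and setters omitted in this sketch
>>>>>>>>> }
>>>>>>>>>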
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>>> ToDos: look into memory consumption (in terms of memory leakage
>>>>>>>>> ...etc).
>>>>>>>>>
>>>>>>>>> Any more suggestions?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  --
>>>>>>> Thanks,
>>>>>>> Sachith Withana
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks,
>>  Sachith Withana
>>
>>
>


-- 
Thanks,
Sachith Withana
