From: Chathuri Wimalasena
Date: Thu, 14 Aug 2014 09:36:58 -0400
Subject: Re: Profiling the current Airavata registry
To: dev@airavata.apache.org
Hi Sachith,

Which DB are you using to do the profiling?

On Wed, Aug 13, 2014 at 11:51 PM, Sachith Withana <swsachith@gmail.com> wrote:
Here's how I've written the script to do it.

Experiments loaded:
10 users, 4 projects per user,
each user would have 1,000 to 100,000 experiments (1,000, 10,000, 100,000), containing experiments like echo and Amber

Methods tested:

getExperiment()
searchExperimentByName
searchExperimentByApplication
searchExperimentByDescription

WDYT?
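A small harness for timing the methods listed above might look like the following sketch. `get_experiment` here is a hypothetical stand-in for the real registry call, not the Airavata API itself; only the timing logic is the point:

```python
import statistics
import time

def benchmark(fn, *args, runs=20):
    """Call fn(*args) `runs` times and return (mean, p95) latency in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95 = timings[int(len(timings) * 0.95) - 1]
    return statistics.mean(timings), p95

# Placeholder for a real registry client call such as getExperiment().
def get_experiment(experiment_id):
    return {"id": experiment_id, "name": "echo_test"}

mean_ms, p95_ms = benchmark(get_experiment, "exp_001", runs=50)
print(f"getExperiment: mean={mean_ms:.3f} ms, p95={p95_ms:.3f} ms")
```

Reporting a percentile alongside the mean matters here, since a few slow cold queries can hide behind a good average.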


On Tue, Aug 12, 2014 at 6:58 PM, Marlon Pierce <marpierc@iu.edu> wrote:
You can start with the API search functions that we have now: by name, by application, by description.

Marlon


On 8/12/14, 9:25 AM, Lahiru Gunathilake wrote:
On Tue, Aug 12, 2014 at 6:42 PM, Marlon Pierce <marpierc@iu.edu> wrote:

A single user may have O(100) to O(1000) experiments, so 10K is too small as an upper bound on the registry for many users.
+1

I agree with Marlon. We have the most basic search method, but the reality is we need search criteria like Marlon suggests, and I am sure content-based search will be pretty slow with a large number of experiments. So we have to use a search platform like Solr to improve the performance.

I think you can first do the performance test without content-based search; then we can implement that feature and do the performance analysis again. If it's too bad (more likely), we can integrate a search platform to improve the performance.

Lahiru

We should really test until things break. A plot implying infinite scaling (by extrapolation) is not useful. A plot showing OK scaling up to a certain point before things decay is useful.

I suggest you post a more carefully specified set of experiments, starting with Lahiru's suggestion. How many users? How many experiments per user? What kind of searches? Probably the most common will be "get all my experiments that match this string", "get all experiments that have state FAILED", and "get all my experiments from the last 30 days". But the API may not have the latter two yet.

So to start, you should specify a prototype user. For example, each user will have 1000 experiments: 100 AMBER jobs, 100 LAMMPS jobs, etc. Each user will have a unique but human-readable name (user1, user2, ...). Each experiment will have a unique, human-readable description (AMBER job 1 for user 1, AMBER job 2 for user 1, ...) that is suitable for searching.

Post these details first, and then you can create experiment registries of any size via scripts. Each experiment is different but suitable for pattern searching.
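The prototype-user scheme above can be scripted directly. This is a minimal sketch assuming a flat record format; the field names and application mix are illustrative, not the actual Airavata data model:

```python
import itertools

APPLICATIONS = ["AMBER", "LAMMPS", "Echo", "Gaussian"]  # example mix

def generate_experiments(num_users, experiments_per_user):
    """Yield experiment records with human-readable, pattern-searchable names."""
    for u in range(1, num_users + 1):
        user = f"user{u}"
        apps = itertools.cycle(APPLICATIONS)  # round-robin applications per user
        for e in range(1, experiments_per_user + 1):
            app = next(apps)
            yield {
                "user": user,
                "name": f"{app} job {e} for {user}",
                "application": app,
                "description": f"{app} job {e} for {user}",
            }

experiments = list(generate_experiments(num_users=10, experiments_per_user=1000))
print(len(experiments))  # 10000 records, ready to load via the registry API
```

Because every name and description follows a fixed pattern, the same generator parameters tell you exactly how many hits each search query should return, which makes the search benchmarks verifiable.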

This is 10 minutes' worth of thought while waiting for my tea to brew, so hopefully this is the right start, but I encourage you not to take this as fixed instructions.

Marlon


On 8/12/14, 8:54 AM, Lahiru Gunathilake wrote:

Hi Sachith,

How did you test this? What database did you use?

I think 1000 experiments is a very low number. I think the most important part is when there are a large number of experiments: how expensive is the search, and how expensive is a single experiment retrieval.

If we support getting a defined number of experiments in the API (I think this is the practical scenario: among 10k experiments, get 100), we have to test the performance of that too.
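The "among 10k experiments, get 100" case is ordinary pagination. A minimal sketch using an in-memory sqlite3 database as a stand-in for the registry store (the real registry sits behind OpenJPA, so this shows only the query pattern, not the actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO experiment (name) VALUES (?)",
    [(f"experiment_{i}",) for i in range(10_000)],
)

def get_experiments(limit, offset):
    """Fetch one page instead of the whole table."""
    return conn.execute(
        "SELECT id, name FROM experiment ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()

page = get_experiments(limit=100, offset=0)
print(len(page))  # 100 rows out of 10,000
```

A stable ORDER BY is essential here; without it, LIMIT/OFFSET pages can overlap or skip rows between calls.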

Regards
Lahiru


On Tue, Aug 12, 2014 at 4:59 PM, Sachith Withana <swsachith@gmail.com> wrote:

Hi all,
I'm testing the registry with 10, 1,000, and 10,000 experiments, and I've tested the database performance by executing the getAllExperiments method.
I'll post the complete analysis.

What are the other methods that I should test?

getExperiment(experiment_id)
searchExperiment

Any pointers?



On Wed, Jul 23, 2014 at 6:07 PM, Marlon Pierce <marpierc@iu.edu> wrote:

Thanks, Sachith. Did you look at scaling also? That is, will the
operations below still be the slowest if the DB is 10x, 100x, 1000x
bigger?

Marlon
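The 10x/100x/1000x question above can be turned into a small scaling harness. A sketch, again using in-memory sqlite3 as a stand-in for the registry and timing a getAllExperiments-style full scan at each size:

```python
import sqlite3
import time

def time_get_all(n_rows):
    """Load n_rows experiments into a fresh in-memory DB and time a full scan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO experiment (name) VALUES (?)",
        [(f"exp_{i}",) for i in range(n_rows)],
    )
    start = time.perf_counter()
    rows = conn.execute("SELECT * FROM experiment").fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return len(rows), elapsed

for size in (1_000, 10_000, 100_000):  # the 10x steps in question
    count, secs = time_get_all(size)
    print(f"{count:>7} rows: {secs * 1000:.1f} ms")
```

Plotting elapsed time against size over several decades is what shows whether the operation decays gracefully or hits a knee, which is the point Marlon makes about extrapolated plots.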


On 7/23/14, 8:22 AM, Sachith Withana wrote:

Hi all,
I'm profiling the current registry in a few different aspects.

I looked into the database operations, and I've listed the operations that take the most time.

1. Getting the status of an experiment (takes around 10% of the overall time spent)
       Has to go through the hierarchy of the data model (nodes, tasks, etc.) to get to the actual experiment status.

2. Dealing with the application inputs
       Strangely, it takes a long time for the queries regarding the ApplicationInputs to complete.
       This is a part of the new Application Catalog.

3. Getting all the experiments (using the * wildcard)
       This takes the maximum amount of time when queried at first, but thanks to the OpenJPA caching, it flattens out as we keep querying.
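The flattening effect described in point 3 is generic query caching. As an illustration only (functools.lru_cache is not the OpenJPA data cache, but it shows the same cold-versus-warm shape):

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def get_all_experiments():
    """Simulate an expensive first query; later identical calls hit the cache."""
    time.sleep(0.05)  # stand-in for the initial DB round trip
    return tuple(f"exp_{i}" for i in range(1000))

start = time.perf_counter()
get_all_experiments()
cold = time.perf_counter() - start

start = time.perf_counter()
get_all_experiments()
warm = time.perf_counter() - start

print(f"cold: {cold * 1000:.1f} ms, warm: {warm * 1000:.1f} ms")
```

This is also why a benchmark must report the cold first query separately from warmed-up averages, or the cache will hide the number that matters most.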

To reduce the first issue, I would suggest having a separate table for experiment summaries, where the status (both the state and the state-update time) would be the only varying entity, and using that to improve the query time for experiment summaries.
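The proposed summary table could look something like this sketch (sqlite3 stand-in; the table and column names are illustrative, not the actual Airavata schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One denormalized row per experiment: the status columns are the only
# ones that change, so status reads skip the node/task hierarchy entirely.
conn.execute("""
    CREATE TABLE experiment_summary (
        experiment_id TEXT PRIMARY KEY,
        name          TEXT,
        status        TEXT,
        status_time   INTEGER
    )
""")
conn.execute(
    "INSERT INTO experiment_summary VALUES (?, ?, ?, ?)",
    ("exp_001", "echo test", "EXECUTING", 1408023439),
)
# A status update touches exactly one row in one table...
conn.execute(
    "UPDATE experiment_summary SET status = ?, status_time = ? "
    "WHERE experiment_id = ?",
    ("COMPLETED", 1408023500, "exp_001"),
)
# ...and a summary read is a single-table lookup with no joins.
status, = conn.execute(
    "SELECT status FROM experiment_summary WHERE experiment_id = ?",
    ("exp_001",),
).fetchone()
print(status)  # COMPLETED
```

The trade-off is the usual one for denormalization: every status change must now be written twice (hierarchy plus summary), in exchange for cheap summary reads.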

It would also help improve the performance of getting all the experiments (experiment summaries).

WDYT?

To-dos: look into memory consumption (memory leaks, etc.)


Any more suggestions?


--
Thanks,
Sachith Withana








--
Thanks,
Sachith Withana

