incubator-olio-user mailing list archives

From Shanti Subramanyam <shanti.subraman...@gmail.com>
Subject Re: Olio Scaling
Date Mon, 08 Feb 2010 18:03:18 GMT
On Mon, Feb 8, 2010 at 1:18 AM, Vasileios Kontorinis
<bkontorinis@gmail.com> wrote:

> Akara and Shanti,
>   I managed to fix a very subtle issue with Xen. There was an issue with
> the checksum that reduced the network throughput from 1 Gb/s to 1 Mb/s.
>

Wow! Is this a generic issue with Xen?

> When that was fixed I managed to scale to 1800 concurrent users.
> However, the only metric failing now is the
>
> Average images loaded per Home Page 2.65   >=3       FAILED
>
> Actually I managed to get a passing result for 25 users.
>
>
We need to look into this issue - I suspect that something subtle has
changed in 0.2 which hasn't been accounted for in the expected number of
images loaded. Could you please file a JIRA on this?

> I also had some questions regarding Memcached. In the MemcachedStats output
> log I get:
>
> Server              Time  items  cache_MB  conns  sets/s  gets/s  get_hits/s  get_misses/s  evicts/s  rB/s    wB/s
> --------------  --------  -----  --------  -----  ------  ------  ----------  ------------  --------  ----  ------
> olio-mem:11211  04:20:47      3      0.05     34    1.70   13.10       10.30          2.80         0  5709  216402
>
>
> Server              Time  items  cache_MB  conns  sets/s  gets/s  get_hits/s  get_misses/s  evicts/s  rB/s  wB/s
> --------------  --------  -----  --------  -----  ------  ------  ----------  ------------  --------  ----  ----
> olio-mem:11211  04:20:47      3      0.05     34    0.00    0.00        0.00          0.00         0     0    48
>
>
> Server              Time  items  cache_MB  conns  sets/s  gets/s  get_hits/s  get_misses/s  evicts/s  rB/s  wB/s
> --------------  --------  -----  --------  -----  ------  ------  ----------  ------------  --------  ----  ----
> olio-mem:11211  04:20:47      3      0.05     34    0.00    0.00        0.00          0.00         0     0    48
>
>
> Does this mean that I only use 0.05 MB of the memcached memory?
> I am pretty sure that the memcached command has -m 256, which means that
> I should reach close to 256 MB when running with a high number of users.
> Is cache_MB something different?
>


Your cache_MB size is correct - we actually cache very little in memcached.
However, the number of 'conns' you are seeing is worrisome. I have typically
seen at least as many connections as the number of concurrent users (so you
should see around 1800). Your first entry looks good for the other stats, but
you should see similar numbers (for rB/s, wB/s, get_hits/s etc.) in the other
entries as well. Depending on the frequency you are sampling at, you will see
some entries with zero values (like the ones you have).
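
To make the comparison between entries concrete, here is a minimal Python sketch that reads one MemcachedStats row and derives the hit ratio. The column names are taken from the log header quoted above; the helper names (parse_row, hit_ratio, is_idle) are only illustrative assumptions and are not part of Faban or Olio.

```python
# Column order as printed in the MemcachedStats log header quoted above.
COLUMNS = ["server", "time", "items", "cache_MB", "conns", "sets/s", "gets/s",
           "get_hits/s", "get_misses/s", "evicts/s", "rB/s", "wB/s"]


def parse_row(text):
    """Turn one whitespace-separated stats row (wrapped or not) into a dict."""
    fields = text.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
    row = dict(zip(COLUMNS, fields))
    for name in COLUMNS[2:]:          # everything after server and time is numeric
        row[name] = float(row[name])
    return row


def hit_ratio(row):
    """Fraction of gets that were hits, or None if there were no gets."""
    gets = row["gets/s"]
    return row["get_hits/s"] / gets if gets else None


def is_idle(row):
    """True for an interval in which memcached saw no gets and no sets."""
    return row["gets/s"] == 0 and row["sets/s"] == 0


# The first and second entries from the log above.
busy = parse_row("olio-mem:11211 04:20:47 3 0.05 34 1.70 13.10 10.30 2.80 0 5709 216402")
idle = parse_row("olio-mem:11211 04:20:47 3 0.05 34 0.00 0.00 0.00 0.00 0 0 48")

print("hit ratio of first entry: %.2f" % hit_ratio(busy))
print("second entry idle:", is_idle(idle))
```

Running something like this over the whole log would show how many sampling intervals are idle, which is the thing to watch here.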

Shanti


> Thanks again
> -------------------------------------------------------------------
> Kontorinis Vasileios
> Phd student, University of California San Diego
> San Diego, CA 92122
> Cell. phone: (858) 717 6899
> bkontorinis@gmail.com, vkontori@ucsd.edu
> -------------------------------------------------------------------
>
>
> 2010/1/27 Shanti Subramanyam <shanti.subramanyam@gmail.com>
>
>> Yes - these are problems that I'm already aware of.
>> The best solution to the filestore issue is to change ownership of the
>> directory to the same user/group as the apache process. We could have the
>> fileloader.sh change write access I guess, but since that's a big security
>> hole, we may not want to do that automatically without letting the user know
>> about it.
>>
>> The fact that your response times are so high indicates that you're running
>> a far larger load than the system can handle and/or you still need some
>> tuning.
>> I suggest you start over from say 100 users and see at what point your
>> response times start getting really large. The apache error log should be
>> pulled in as part of the 'Statistics' tab, so do keep monitoring that.
>>
>> Shanti
>>
>>
>> On Wed, Jan 27, 2010 at 1:34 AM, Vasileios Kontorinis <
>> bkontorinis@gmail.com> wrote:
>>
>>> Shanti hi again,
>>>    I checked my apache logs and there were a bunch of errors.
>>> It looks like there are some issues with
>>> webapp/php/trunk/classes/ImageUtil.php in the last release of olio. (I
>>> downloaded
>>> http://www.alliedquotes.com/mirrors/apache/incubator/olio/0.2/apache-olio-php-src-0.2.tar.gz)
>>> 1) There is a line that needs to be commented out. php complains ("1.5. Must
>>> be greater than zero.").
>>> 2) Then, it was complaining that it cannot find the function
>>> fastimagecopyresampled. To work around that, I moved the function
>>> fastimagecopyresampled above createThumb (this might not be required) and
>>> declared it static.
>>>     Finally, I call the function from createThumb with
>>> self::fastimagecopyresampled.
>>> 3) Then, it started complaining because it could not write to the
>>> filestore. The problem is that it wants to write the new images as www-data
>>> from apache, while the filestore does not have write permission for
>>> others. Manually
>>>     giving access solves the problem (chmod -R o+w <path>/filestore) but
>>> since the directories in filestore are generated automatically, maybe the
>>> chmod command should be added to fileloader.sh
>>>
>>> Funnily enough, after fixing those issues, I still cannot pass the:
>>> Average images loaded per Home Page 2.65   >=3       FAILED
>>>
>>> and on top of that I also have:
>>> Response Times (secs)
>>> AddPerson     5.190  13.194  3.387 8.800     3.000 FAILED
>>> AddEvent       5.904  16.784  3.159 10.400   4.000 FAILED
>>>
>>> Think times for AddPerson and AddEvent fail as well.
>>>
>>> Any insights are welcome .... :-(
>>>
>>> -------------------------------------------------------------------
>>> Kontorinis Vasileios
>>> Phd student, University of California San Diego
>>> San Diego, CA 92122
>>> Cell. phone: (858) 717 6899
>>> bkontorinis@gmail.com, vkontori@ucsd.edu
>>> -------------------------------------------------------------------
>>>
>>>
>>> 2010/1/26 Shanti Subramanyam <shanti.subramanyam@gmail.com>
>>>
>>>> Yes - 0.2 requires a lot more disk space as we changed the ratio of
>>>> concurrent users to registered users to 1:100. If you haven't already,
>>>> please check out our published Blueprints for detailed performance
>>>> characteristics of the workload:
>>>> Deploying Web 2.0 Applications on Sun Servers and the OpenSolaris
>>>> Operating System<http://wikis.sun.com/display/BluePrints/Deploying+Web+2.0+Applications+on+Sun+Servers+and+the+OpenSolaris+Operating+System>
>>>> <http://wikis.sun.com/display/BluePrints/Deploying+Web+2.0+Applications+on+Sun+Servers+and+the+OpenSolaris+Operating+System>
>>>> If you run for long enough, you should get passing runs. Have you
>>>> verified that there are no errors in the run logs when you see the 'Avg.
>>>> images loaded per home page' fail ?
>>>>
>>>> On to your open files error  - you may have to tune your networking tier
>>>> and/or #open file descriptors. I don't believe we have ever seen as many
>>>> files open as you are seeing. Can you determine whether these are from the
>>>> file store or network ? We also typically run the filestore on a different
>>>> system and nfs-mount it on the webserver box.
>>>> You will have to tune your system to ensure good performance since you
>>>> will need memory for both apache and files.
>>>>
>>>> Shanti
>>>>
>>>>
>>>> On Mon, Jan 25, 2010 at 5:06 PM, Vasileios Kontorinis <
>>>> bkontorinis@gmail.com> wrote:
>>>>
>>>>> Akara and Shanti hi,
>>>>>    I did migrate to Olio 0.2. With the last version of Olio I came
>>>>> across some new interesting things.
>>>>>
>>>>> Scaling issues:
>>>>>   - I am still getting the:
>>>>> Average images loaded per Home Page 2.55   >=3       FAILED
>>>>>  - additionally, when I scale the concurrent users to 800 I run out of
>>>>> disk space since my filestore occupies more than 62GB.
>>>>> Actually for 600 users it occupies 50GB. I was curious if that makes
>>>>> sense. How much space will I need to reach 1000 users?
>>>>> php_setup.html suggests that we will need 50GB but apparently
>>>>> we need way more for a large number of users.
>>>>>
>>>>>  - Finally and most importantly, for 600 users many of the operations
>>>>> fail with the exception:
>>>>> Message: java.net.SocketException: Too many open files
>>>>> Stack Trace:
>>>>>  Class Method Line
>>>>>  java.net.PlainSocketImpl socketAccept
>>>>>  java.net.PlainSocketImpl accept 390
>>>>>  java.net.ServerSocket implAccept 453
>>>>>  java.net.ServerSocket accept 421
>>>>>  sun.rmi.transport.tcp.TCPTransport$AcceptLoop executeAcceptLoop 369
>>>>>  sun.rmi.transport.tcp.TCPTransport$AcceptLoop run 341
>>>>>  java.lang.Thread run 619
>>>>> or
>>>>>
>>>>> java.net.SocketException: Too many open files
>>>>> Stack Trace:
>>>>>  Class Method Line
>>>>>  java.net.Socket createImpl 394
>>>>>  java.net.Socket getImpl 457
>>>>>  java.net.Socket bind 571
>>>>>  com.sun.faban.driver.transport.hc3.ProtocolTimedSocketFactory createSocket 60
>>>>>  org.apache.commons.httpclient.HttpConnection open 707
>>>>>  org.apache.commons.httpclient.HttpMethodDirector executeWithRetry 387
>>>>>  org.apache.commons.httpclient.HttpMethodDirector executeMethod 171
>>>>>  org.apache.commons.httpclient.HttpClient executeMethod 397
>>>>>  org.apache.commons.httpclient.HttpClient executeMethod 323
>>>>>  com.sun.faban.driver.transport.hc3.ApacheHC3Transport readURL 274
>>>>>  org.apache.olio.workload.driver.UIDriver doLogin 398
>>>>>  org.apache.olio.workload.driver.UIDriver doLogin 424
>>>>>  sun.reflect.GeneratedMethodAccessor8 invoke
>>>>>  sun.reflect.DelegatingMethodAccessorImpl invoke 25
>>>>>  java.lang.reflect.Method invoke 597
>>>>>  com.sun.faban.driver.engine.TimeThread doRun 169
>>>>>  com.sun.faban.driver.engine.AgentThrea
>>>>>
>>>>> I am monitoring the number of open files on the web server with
>>>>> `watch "lsof | wc"` and olio starts failing when around 65,000-70,000
>>>>> files are open. lsof shows that for each apache2 thread there are around
>>>>> 100 files open. Therefore there are around 650-700 different apache2
>>>>> threads that create the bulk of those open file descriptors.
>>>>> The soft and hard limit is set to 403238, which means that there should
>>>>> be many more open files before it starts failing.
>>>>> (Actually, I verified the limit by opening a bunch of files with a
>>>>> python script and it does reach the limit of 403238.)
>>>>> Any insights?  Is there any chance that the file descriptors take more
>>>>> time than usual to be reclaimed after being closed in the xen vm I use
>>>>> for my web server? Does it make sense for olio in the first place to
>>>>> have so many files open at the same time?
>>>>>
>>>>> Thanks again.
>>>>>
>>>>>
>>>>> -------------------------------------------------------------------
>>>>> Kontorinis Vasileios
>>>>> Phd student, University of California San Diego
>>>>> San Diego, CA 92122
>>>>> Cell. phone: (858) 717 6899
>>>>> bkontorinis@gmail.com, vkontori@ucsd.edu
>>>>> -------------------------------------------------------------------
>>>>>
>>>>>
>>>>> 2010/1/16 Shanti Subramanyam <shanti.subramanyam@gmail.com>
>>>>>
>>>>>> I would really recommend that you migrate to Olio 0.2. In addition to
>>>>>> bug fixes, there are some major feature changes in it. See Olio
>>>>>> 0.2 released<http://perfwork.wordpress.com/2010/01/13/olio-0-2-relesed/>
>>>>>>
>>>>>>
>>>>>> Shanti
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 16, 2010 at 4:49 PM, Vasileios Kontorinis <
>>>>>> bkontorinis@gmail.com> wrote:
>>>>>>
>>>>>>> Akara hi again,
>>>>>>>    Below I have comments on your suggestions and at the end some
>>>>>>> bonus questions... Thanks again.
>>>>>>>
>>>>>>> 2010/1/13 Akara Sucharitakul <Akara.Sucharitakul@sun.com>
>>>>>>>
>>>>>>>> With your permission, I'd like to copy the Olio and Faban user
>>>>>>>> aliases going forward. I feel it will help a much wider audience.
>>>>>>>> Please see below for answers/comments:
>>>>>>>>
>>>>>>> Sure. I cc'ed the olio user alias. I am not sure which is the right
>>>>>>> faban list.
>>>>>>>
>>>>>>>
>>>>>>>> Vasileios Kontorinis wrote:
>>>>>>>>
>>>>>>>>> Akara hi,
>>>>>>>>>   I am a grad student at UCSD and I use Olio for a research project
>>>>>>>>> where we want to measure olio performance under live virtual machine
>>>>>>>>> migration. We use ubuntu 8.04 on nehalem servers.
>>>>>>>>> I have checked out the last version of olio from the online svn
>>>>>>>>> repository and downloaded the last version of faban (faban-kit-101509.tar.gz
>>>>>>>>> <http://faban.sunsource.net/nightly/faban-kit-101509.tar.gz>)
>>>>>>>>>
>>>>>>>>
>>>>>>>> 101509 is fairly recent. But the latest on the web site is 111109
>>>>>>>> (Faban 1.0). There were just bug fixes between those releases.
>>>>>>>
>>>>>>>
>>>>>>> I have upgraded to Faban 1.0, still using olio1.0 though (the
>>>>>>> release of 2.0 was announced, will switch to it if I run into bugs
>>>>>>> that have been fixed)
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> So far, I employed a bunch of hacks to get most of it to work and I
>>>>>>>>> am almost there. In the process I got a bunch of questions.
>>>>>>>>>
>>>>>>>>> Questions (some of them might be just faban related, not olio, so
>>>>>>>>> bear with me):
>>>>>>>>> 1) Is there any way to deploy OlioDriver.jar through the command
>>>>>>>>> line? Firefox through ssh forwarding is dead slow and I'd rather
>>>>>>>>> avoid it if I can.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Just drop the jar into faban/benchmarks/ and it will deploy itself.
>>>>>>>> This is documented at
>>>>>>>> http://faban.sunsource.net/1.0/docs/guide/harnessdev/deploybenchmark.html
>>>>>>>> under "Alternate Deployment Methods."
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) Should the services ApacheHttpdService, MemcachedService, MySQLService
>>>>>>>>> that come with Faban be deployed before running Olio?
>>>>>>>>>    I was getting some very weird errors, e.g.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, you should. Olio will search for those.
>>>>>>>>
>>>>>>>> Done
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> 03:50:27:WARNING:UIDriverAgent[0]: Forcefully terminating benchmark run
>>>>>>>>> 03:50:27:WARNING:UIDriverAgent[0]: 25 threads forcefully terminated.
>>>>>>>>> java.lang.Throwable: Stack of non-terminating thread.
>>>>>>>>>    at java.net.SocketInputStream.socketRead0 (null)
>>>>>>>>>    at java.net.SocketInputStream.read (129)
>>>>>>>>>    at java.io.FilterInputStream.read (116)
>>>>>>>>>    at com.sun.faban.driver.transport.util.TimedInputStream.read (139)
>>>>>>>>>    at java.io.BufferedInputStream.fill (218)
>>>>>>>>>    at java.io.BufferedInputStream.read (237)
>>>>>>>>>    at org.apache.commons.httpclient.HttpParser.readRawLine (78)
>>>>>>>>>    at org.apache.commons.httpclient.HttpParser.readLine (106)
>>>>>>>>>    at org.apache.commons.httpclient.HttpConnection.readLine (1116)
>>>>>>>>>    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine (1973)
>>>>>>>>>    at org.apache.commons.httpclient.HttpMethodBase.readResponse (1735)
>>>>>>>>>    at org.apache.commons.httpclient.HttpMethodBase.execute (1098)
>>>>>>>>>    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry (398)
>>>>>>>>>    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod (171)
>>>>>>>>>    at org.apache.commons.httpclient.HttpClient.executeMethod (397)
>>>>>>>>>    at org.apache.commons.httpclient.HttpClient.executeMethod (323)
>>>>>>>>>    at com.sun.faban.driver.transport.hc3.ApacheHC3Transport.fetchURL (529)
>>>>>>>>>    at com.sun.faban.driver.transport.hc3.ApacheHC3Transport.fetchURL (552)
>>>>>>>>>    at org.apache.olio.workload.driver.UIDriver.doHomePage (355)
>>>>>>>>>    at sun.reflect.NativeMethodAccessorImpl.invoke0 (null)
>>>>>>>>>    at sun.reflect.NativeMethodAccessorImpl.invoke (39)
>>>>>>>>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke (25)
>>>>>>>>>    at java.lang.reflect.Method.invoke (597)
>>>>>>>>>    at com.sun.faban.driver.engine.TimeThread.doRun (169)
>>>>>>>>>    at com.sun.faban.driver.engine.AgentThread.run (202)
>>>>>>>>>
>>>>>>>>> and afterwards the master was waiting forever for threads to
>>>>>>>>> join... (I attached gdb to verify that something was wrong) and
>>>>>>>>> hence I had to kill the benchmark.
>>>>>>>>>
>>>>>>>>
>>>>>>>> These threads are hanging reading the server responses, which never
>>>>>>>> came.
>>>>>>>>
>>>>>>>>
>>>>>>> Building the services from Faban probably fixes it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> In the Olio log there are WARNINGS complaining about not deploying
>>>>>>>>> those. After building those and manually copying them to
>>>>>>>>> /faban/services (ant deploy did not place them there... :-(  )
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. But ant deploy should get them there. If not, can you please
>>>>>>>> let me know the ant messages?
>>>>>>>
>>>>>>>
>>>>>>> Ant was deploying them indeed. I had a mistake in
>>>>>>> building.properties.
>>>>>>> I had:  faban.url=http://<hostname>:9980/   instead of
>>>>>>> faban.url=http://localhost:9980/
>>>>>>> After I changed that it started working...
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>  it worked. (mostly worked)
>>>>>>>>>
>>>>>>>>> 3) I still have warnings like:
>>>>>>>>> 01:38:08:INFO:Time difference to host olio-web is 269 ms. Attempting to set clock.
>>>>>>>>> 01:38:08:INFO:Time difference to host olio-db is 263 ms. Attempting to set clock.
>>>>>>>>>
>>>>>>>>
>>>>>>>> These two are OK. Just trying to do a clock sync between the
>>>>>>>> systems.
>>>>>>>>
>>>>>>>>
>>>>>>>>> 01:38:08:WARNING:olio-web wakeup-before time reached 700ms limit.
>>>>>>>>> System is too busy. Giving up.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is one of Faban's clock-setting calibrations. If the system is
>>>>>>>> too busy or you run on some virtualization architectures, the lag time
>>>>>>>> between an intended end of sleep and the actual time when the thread
>>>>>>>> really wakes up (gets scheduled/executed) is too high, and the
>>>>>>>> calibrations will fail.
>>>>>>>>
>>>>>>>>
>>>>>>>>> 01:38:08:INFO:Time difference to host olio-mem is 262 ms. Attempting to set clock.
>>>>>>>>> 01:38:10:WARNING:olio-db wakeup-before time reached 700ms limit. System is too busy. Giving up.
>>>>>>>>> 09:38:09:WARNING:[date, -u, 011309382010.10]
>>>>>>>>> stderr:
>>>>>>>>> date: cannot set date: Operation not permitted
>>>>>>>>> 09:38:09:WARNING:Error on "[date, -u, 011309382010.10]" command trying to set the date. Exit value: 1
>>>>>>>>> 09:38:10:WARNING:[date, -u, 011309382010.11]
>>>>>>>>> stderr:
>>>>>>>>> date: cannot set date: Operation not permitted
>>>>>>>>> 09:38:10:WARNING:Error on "[date, -u, 011309382010.11]" command trying to set the date. Exit value: 1
>>>>>>>>> 09:38:09:WARNING:[date, -u, 011309382010.10]
>>>>>>>>> stderr:
>>>>>>>>> date: cannot set date: Operation not permitted
>>>>>>>>> 09:38:09:WARNING:Error on "[date, -u, 011309382010.10]" command trying to set the date. Exit value: 1
>>>>>>>>> 09:38:10:WARNING:[date, -u, 011309382010.11]
>>>>>>>>> stderr:
>>>>>>>>> date: cannot set date: Operation not permitted
>>>>>>>>>
>>>>>>>>> Letting faban change the vm clock sounds like a bad idea from the
>>>>>>>>> beginning.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK. So it is xen. Yes, this is what Faban is trying to solve. You
>>>>>>>> can certainly turn it off. Please see:
>>>>>>>>   http://faban.sunsource.net/1.0/docs/howdoi/physclocksync.html
>>>>>>>>
>>>>>>>>
>>>>>>> I added <fh:timeSync>false</fh:timeSync> to my run.xml file (btw,
>>>>>>> there is a mistake in the link above: <fh:timeSync>false</fh:timeSync>
>>>>>>> is correct; the second <fh:timeSync> there needs to be a closing
>>>>>>> tag, the "/" is missing).
>>>>>>> That made the warnings go away.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Unfortunately, xen is really bad at maintaining an accurate clock.
>>>>>>>>> As a result there is usually a time difference of more than 10ms
>>>>>>>>> between the different virtual machines. I went over the setTime
>>>>>>>>> function in the Faban source
>>>>>>>>> (/faban/com/sun/faban/harness/agent/CmdAgentImpl.java), it's big and
>>>>>>>>> ugly (very ugly)
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the compliments! I think you mean
>>>>>>>> CmdService.setClockTask. Time sensitive code ain't pretty. It is the
>>>>>>>> complexities dealing with the clock and trying to achieve good
>>>>>>>> accuracy. If you think you can simplify this, I'm listening (without
>>>>>>>> losing the accuracy, of course). In comparison, CmdAgentImpl has
>>>>>>>> nothing.
>>>>>>>>
>>>>>>>>
>>>>>>> Yes, you're right, it is CmdService.setClockTask. The previous email
>>>>>>> was composed at 3am ... :-)
>>>>>>> I am still a little confused. The setClockTask is used to set the
>>>>>>> clock so that all the machines are synchronized with the master. From
>>>>>>> what you mentioned, the physical clock sync is only used for the logs.
>>>>>>> Why do we need to do that, since 1) it requires root privileges (which
>>>>>>> might not always be available) and 2) I could imagine an alternative
>>>>>>> that uses deltas from the actual physical clock without having to set
>>>>>>> it? (I am probably missing something... :-)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Why is there this strict requirement for a 10ms difference? Any
>>>>>>>>> ideas?
>>>>>>>>>
>>>>>>>>
>>>>>>>> It is easily achievable in most cases. May not be true for VMs.
>>>>>>>>
>>>>>>>> On some VM architectures, the OS however does not get scheduled till
>>>>>>>> way after that, thus causing problems. You may be able to measure
>>>>>>>> performance on those VMs. But you don't want to use such VMs as a
>>>>>>>> driver. Your response time measurements will be way off.
>>>>>>>>
>>>>>>>> The physical clock sync is not really rigorous. And you can turn it
>>>>>>>> off. It is more to keep the systems in good time sync. If your VM
>>>>>>>> stands in the way, just turn it off. The driver's virtual clock sync
>>>>>>>> is much more picky in comparison. This is because the start time for
>>>>>>>> the steady state should be the same (with a very small tolerance) no
>>>>>>>> matter how many drivers are driving. Otherwise the measurement period
>>>>>>>> won't be the same when viewed from different drivers and the results
>>>>>>>> won't be reliable.
>>>>>>>>
>>>>>>>>
>>>>>>>>  Even with ntp it's hard to provide the 10ms guarantee.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That's why we don't use ntp ;-)
>>>>>>>
>>>>>>>
>>>>>>> Just out of curiosity, the physical clocks are set only once at the
>>>>>>> beginning (right?), therefore for long runs the 10ms difference will
>>>>>>> not be guaranteed. Nope? Especially under VMs I've seen significant
>>>>>>> clock differences within a few minutes.
>>>>>>> At least ntp can periodically resync (of course doing so might screw
>>>>>>> up the logs with time going backwards etc)
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I am thinking of modifying this function to always return that the
>>>>>>>>> time difference is less than 10ms (so that I do not have to wait
>>>>>>>>> all the time for the timeouts.)
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why bother. Don't like it, just turn it off. It has good use in most
>>>>>>>> configurations we're dealing with. And, it avoids ntp inaccuracies.
>>>>>>>>
>>>>>>>>
>>>>>>>>  Will this break anything in Olio?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Nope. Except the times in your logs will appear out of sequence.
>>>>>>>> They rely on the local time on the originating systems.
>>>>>>>>
>>>>>>>>
>>>>>>>>> 4) Warnings like:
>>>>>>>>> 09:39:48:WARNING:Image at
>>>>>>>>> http://olio-web:80/fileService.php?cache=false&file=e168t.jpg
>>>>>>>>> size of 249 bytes is too small. Image may not exist
>>>>>>>>> can be ignored, right?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, something is wrong. We don't have images that small. Check
>>>>>>>> whether e168t.jpg is really that small. That's why we have that
>>>>>>>> warning.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> It's kind of funny, my problem was that I had the olio webkit version
>>>>>>> installed and then I downloaded the version from the online svn
>>>>>>> repository. I built the driver but forgot to update the webpage for
>>>>>>> my apache server, which, as expected, was the source of many of my
>>>>>>> issues.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> 5) Last and most important.
>>>>>>>>> I can run the benchmark and all the operations succeed except for
>>>>>>>>> login.
>>>>>>>>> I get a bunch of:
>>>>>>>>>
>>>>>>>>> 09:39:25:WARNING:UIDriverAgent[0].2.doLogin: Found login prompt at
>>>>>>>>> index 2926, Login as at786o08x, 2178 failed.
>>>>>>>>> Note: Error not counted in result.
>>>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>>> java.lang.RuntimeException: Found login prompt at index 2926, Login
>>>>>>>>> as at786o08x, 2178 failed.
>>>>>>>>>    at org.apache.olio.workload.driver.UIDriver.doLogin (404)
>>>>>>>>>    at sun.reflect.NativeMethodAccessorImpl.invoke0 (null)
>>>>>>>>>    at sun.reflect.NativeMethodAccessorImpl.invoke (39)
>>>>>>>>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke (25)
>>>>>>>>>    at java.lang.reflect.Method.invoke (597)
>>>>>>>>>    at com.sun.faban.driver.engine.TimeThread.doRun (169)
>>>>>>>>>    at com.sun.faban.driver.engine.AgentThread.run (202)
>>>>>>>>>
>>>>>>>>> Any ideas? I do get
>>>>>>>>>
>>>>>>>>
>>>>>>>> You likely have cookie issues. It can't seem to hold on to a
>>>>>>>> session.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Well, there was a permission issue with the http_session dir. I could
>>>>>>> not write to it. chmod 777 on it fixed this.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> (I've found online:
>>>>>>>>> http://www.mail-archive.com/olio-dev@incubator.apache.org/msg00647.html
>>>>>>>>> which is similar, but when I added
>>>>>>>>>
>>>>>>>>> com.sun.faban.driver.transport.sunhttp.ThreadCookieHandler.level=FINER
>>>>>>>>>  in build.properties
>>>>>>>>> I did not see any cookie related warnings. Those should appear in
>>>>>>>>> the olio run log or the apache log, right? Am I just looking in the
>>>>>>>>> wrong place? )
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, that's applicable only to the Sun Http Transport. The version
>>>>>>>> of Olio you're using is based on the Apache Http Transport (Apache
>>>>>>>> HttpClient 3.1). The ThreadCookieHandler is not used for the Apache
>>>>>>>> transport and that's why you don't see any logs. Try upgrading to
>>>>>>>> Faban 1.0 before looking at other things.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's a long email I know. Your feedback would be most appreciated.
>>>>>>>>>
>>>>>>>>> -Regards
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> Kontorinis Vasileios
>>>>>>>>> Phd student, University of California San Diego
>>>>>>>>> San Diego, CA 92122
>>>>>>>>> Cell. phone: (858) 717 6899
>>>>>>>>> bkontorinis@gmail.com, vkontori@ucsd.edu
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for all the questions/comments.
>>>>>>>>
>>>>>>>> -Akara
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> And now some more questions/ comments:
>>>>>>> 1) I get the following error:
>>>>>>>
>>>>>>> 15:13:05:SEVERE:CmdService: Getting - exception reading
>>>>>>> /usr/data/olio-db.err
>>>>>>> java.io.FileNotFoundException: File /usr/data/olio-db.err does not
>>>>>>> exist.
>>>>>>>     at com.sun.faban.common.FileTransfer.<init> (70)
>>>>>>>     at com.sun.faban.harness.agent.FileAgentImpl.get (315)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0 (null)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke (39)
>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke (25)
>>>>>>>     at java.lang.reflect.Method.invoke (597)
>>>>>>>     at sun.rmi.server.UnicastServerRef.dispatch (305)
>>>>>>>     at sun.rmi.transport.Transport$1.run (159)
>>>>>>>     at java.security.AccessController.doPrivileged (null)
>>>>>>>     at sun.rmi.transport.Transport.serviceCall (155)
>>>>>>>     at sun.rmi.transport.tcp.TCPTransport.handleMessages (535)
>>>>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0 (790)
>>>>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run (649)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask (885)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run (907)
>>>>>>>     at java.lang.Thread.run (619)
>>>>>>>     at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer (255)
>>>>>>>     at sun.rmi.transport.StreamRemoteCall.executeCall (233)
>>>>>>>     at sun.rmi.server.UnicastRef.invoke (142)
>>>>>>>     at com.sun.faban.harness.agent.FileAgentImpl_Stub.get (null)
>>>>>>>     at com.sun.faban.harness.engine.CmdService.get (1334)
>>>>>>>     at com.sun.faban.harness.RunContext.getFile (346)
>>>>>>>     at com.sun.services.MySQLService.getLogs (197)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0 (null)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke (39)
>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke (25)
>>>>>>>     at java.lang.reflect.Method.invoke (597)
>>>>>>>     at com.sun.faban.harness.util.Invoker.invoke (98)
>>>>>>>     at com.sun.faban.harness.services.ServiceWrapper.getLogs (200)
>>>>>>>     at com.sun.faban.harness.services.ServiceManager.getLogs (642)
>>>>>>>     at com.sun.faban.harness.engine.GenericBenchmark.start (323)
>>>>>>>     at com.sun.faban.harness.engine.RunDaemon.run (338)
>>>>>>>     at java.lang.Thread.run (619)
>>>>>>> 15:13:05:WARNING:Could not copy /usr/data/olio-db.err to
>>>>>>> /home/gdhiman/faban.1.0/faban/output/OlioDriver.2D/mysql_err.log.olio-db
>>>>>>>
>>>>>>> Apparently something is misconfigured in my db-server. Any ideas?
>>>>>>>
>>>>>>> 2) I get the following error:
>>>>>>> 15:13:16:WARNING:[/home/gdhiman/faban.1.0/faban/bin/fenxi, process,
>>>>>>> /home/gdhiman/faban.1.0/faban/output/OlioDriver.2D/,
>>>>>>> /home/gdhiman/faban.1.0/faban/output/OlioDriver.2D//post/, OlioDriver.2D]
>>>>>>> stderr:
>>>>>>> Error in executing perl
>>>>>>> /home/gdhiman/faban.1.0/faban/master/webapps/fenxi/txt2db/mpstat.pl
>>>>>>> Error in executing perl
>>>>>>> /home/gdhiman/faban.1.0/faban/master/webapps/fenxi/txt2db/mpstat.pl
>>>>>>>
>>>>>>> Actually I traced this one back. The problem is the difference in
>>>>>>> output format between Sun's mpstat and the default GNU mpstat.
>>>>>>> This is the output of my mpstat:
>>>>>>>
>>>>>>> gdhiman@olio-client00:~/faban.1.0/faban/output/OlioDriver.2D$ mpstat 1
>>>>>>> Linux 2.6.18.8-xen (olio-client00)     01/16/10
>>>>>>>
>>>>>>> 16:25:06     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>>>>>>> 16:25:07     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     52.48
>>>>>>> 16:25:08     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     50.50
>>>>>>> 16:25:09     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     79.21
>>>>>>> 16:25:10     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     45.54
>>>>>>> 16:25:11     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     55.45
>>>>>>>
>>>>>>> The first line as well as the time at the beginning of each entry
>>>>>>> are messing up the parsing in mpstat.pl (the fields are different too).
>>>>>>>   Any plans to support this??
>>>>>>>
>>>>>>> 3) Scaling questions.
>>>>>>> - So far I did not have a single experiment passing. Some are pretty
>>>>>>> close with only one metric check failing.
>>>>>>>
>>>>>>> Average images loaded per Home Page 2.79   >=3       FAILED
>>>>>>> Any ideas? Is it the case that the disk is not fast enough? I am just
>>>>>>> using the local filesystem for the filestore.
>>>>>>>
>>>>>>> - As I double the number of concurrent users I observe linear scaling
>>>>>>> in the throughput.
>>>>>>> Con Users         Throughput
>>>>>>>  25                        4.967
>>>>>>>  50                       10.06
>>>>>>> 100                      19.375
>>>>>>> 200                      40.21
>>>>>>> 400                      75.818
>>>>>>> 800                       0.383
>>>>>>> 1000                     0.483
>>>>>>>
>>>>>>> The linear scaling stops at 400 concurrent users (only one agent).
>>>>>>> Actually it would be exactly linear (value of ~80) but almost half of
>>>>>>> the login operations failed. I am looking into it.
>>>>>>> Any insights on what might be the first thing failing?
>>>>>>>
>>>>>>> For the 800 and 1000 experiments there are no failed operations
>>>>>>> logged. It looks like those are being discarded... (?)
>>>>>>>
>>>>>>> Bonus question:
>>>>>>> In the runtime statistics
>>>>>>> <runtimeStats enabled="true">
>>>>>>>          <interval>30</interval>
>>>>>>>  </runtimeStats>
>>>>>>>
>>>>>>> only the 90% response time is reported. Is there an easy way to also
>>>>>>> report the 99%? (or do I need to add code for that?)
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot again in advance.
>>>>>>> -VK
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
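
On the "Too many open files" exceptions quoted further up: the traces are java.net.SocketException, so they are raised inside a Java process, and the open-file limit (RLIMIT_NOFILE) applies per process, whereas `lsof | wc` counts lines for the whole system. It may therefore be worth checking the limit from inside the environment that launches the failing process. Below is a minimal Python sketch of such a check, along the lines of the script mentioned in the thread. The function names and the default cap of 64 are only illustrative assumptions, and the resource module is Unix-only.

```python
import errno
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_openable(cap=64):
    """Open /dev/null up to `cap` times and report how many opens succeeded.

    Stops early if the process (EMFILE) or the system (ENFILE) runs out of
    descriptors. All descriptors opened here are closed before returning.
    """
    fds = []
    try:
        for _ in range(cap):
            try:
                fds.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as exc:
                if exc.errno in (errno.EMFILE, errno.ENFILE):
                    break
                raise
    finally:
        for fd in fds:
            os.close(fd)
    return len(fds)


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
print("descriptors opened in probe:", count_openable())
```

Running this from the same shell or init script that starts the process in question shows the limit that process would inherit, which can differ from the limit seen in an interactive login.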
