accumulo-user mailing list archives

From Steven Troxell <steven.trox...@gmail.com>
Subject Re: Accumulo Caching for benchmarking
Date Mon, 06 Aug 2012 18:41:50 GMT
For anyone else curious about this, it seems the OS caching played a much
larger role for me than TServer caching. I actually measured a performance
increase after just stopping and restarting the TServers to clear their cache
(though that could also have been biased by being a weekend run on the cluster).

However, I noticed an immediate difference when clearing the OS cache with
Eric's command: the first few queries, which had generally been returning in
tenths of seconds, were now up in the minutes range.
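
For anyone who wants to script the cache drop between runs, here is a minimal
sketch built around Eric's command. The function name drop_os_caches and the
DROP_CACHES_FILE variable are my own hypothetical names, not anything from
Accumulo. It assumes a Linux box and root privileges. Per the kernel
documentation, writing 1 frees the page cache, 2 frees dentries and inodes,
and 3 frees both. Running sync first flushes dirty pages so more of the cache
can actually be released.

```shell
#!/bin/sh
# Hypothetical helper for clearing the Linux OS caches between benchmark runs.
# DROP_CACHES_FILE is an assumed override hook; the default is the real
# kernel interface used earlier in this thread.
DROP_CACHES_FILE="${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}"

drop_os_caches() {
    # mode 1 = page cache, 2 = dentries and inodes, 3 = both (default: 1)
    mode="${1:-1}"
    case "$mode" in
        1|2|3) ;;
        *) echo "mode must be 1, 2, or 3" >&2; return 1 ;;
    esac
    # Flush dirty pages first; dirty objects cannot be freed.
    sync
    # Writing to the real file requires root.
    echo "$mode" > "$DROP_CACHES_FILE"
}
```

Calling `drop_os_caches 1` as root on every node before each run should
reproduce what I did by hand. Note that this only touches the OS caches; the
TServer caches still need a TServer restart to clear.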



On Sat, Aug 4, 2012 at 1:21 PM, Steven Troxell <steven.troxell@gmail.com> wrote:

> Thanks everyone, that should definitely help me out. While I feel silly
> for ignoring this issue at first, it should be interesting to see how much
> this influences the results.
>
>
>
> On Sat, Aug 4, 2012 at 7:19 AM, Eric Newton <eric.newton@gmail.com> wrote:
>
>> You can drop the OS caches between runs:
>>
>> # echo 1 > /proc/sys/vm/drop_caches
>>
>>
>> On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs <ctubbsii@gmail.com> wrote:
>>
>>> Steve-
>>>
>>> I would probably design the experiment to test different cluster sizes
>>> as completely independent. That means, taking the entire thing down
>>> and back up again (possibly even rebooting the boxes, and/or
>>> re-initializing the cluster at the new size). I'd also do several runs
>>> while it is up at a particular cluster size, to capture any
>>> performance difference between the first and a later run due to OS or
>>> TServer caching, for analysis later.
>>>
>>> Essentially, when in doubt, take more data...
>>>
>>> --L
>>>
>>>
>>> On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell <steven.troxell@gmail.com>
>>> wrote:
>>> > Hi all,
>>> >
>>> > I am running a benchmarking project on Accumulo looking at RDF queries
>>> > for clusters with different node sizes. While I intend to look at
>>> > caching for optimizing each individual run, I do NOT want caching to
>>> > interfere, for example, between runs involving the use of 10 and 8
>>> > tablet servers.
>>> >
>>> > Up to now I'd just been killing nodes via the bin/stop-here.sh script,
>>> > but I realize that may have allowed caching from previous runs with
>>> > different node sizes to influence my results. It seemed weird to me,
>>> > for example, when I realized dropping nodes actually increased
>>> > performance (as measured by query return times) in some cases (though
>>> > I acknowledge the code I'm working with has some serious issues with
>>> > how ineffectively it is actually utilizing Accumulo, but that's an
>>> > issue I intend to address later).
>>> >
>>> > I suppose one way would be, between changes of node sizes, to stop and
>>> > restart ALL nodes (as opposed to what I'd been doing in just killing 2
>>> > nodes, for example, in transitioning from a 10 to 8 node test). Will
>>> > this be sure to clear the influence of caching across runs, and is
>>> > there any cleaner way to do this?
>>> >
>>> > thanks,
>>> > Steve
>>>
>>
>>
>
