accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Error stressing with pyaccumulo app
Date Tue, 11 Feb 2014 16:04:44 GMT
The other thing I thought about: what's the distribution of Key-Values
that you're writing? Specifically, do many of the Keys sort "near" each
other? Similarly, do you notice excessive load on some tservers, but not
all (the "Tablet Servers" page on the Monitor is a good check)?

Consider the following: you have 10 tservers and you have 10 proxy 
servers. The first thought is that 10 tservers should be plenty to 
balance the load of those 10 proxy servers. However, a problem arises 
if the data that each of those proxy servers is writing happens to 
reside on a _small number of tablet servers_. Thus, your 10 proxy 
servers might only be writing to one or two tablet servers.

If you notice that you're getting skew like this (or even just know that 
you're apt to have a situation where multiple clients might write data 
that sorts close to one another), it would be a good idea to add splits 
to your table before starting your workload.

e.g. if you consider that your Key-space is the numbers from 1 to 10, 
and you have ten tservers, it would be a good idea to add splits 1, 2, 
... 10, so that each tserver hosts at least one tablet (e.g. [1,2), 
[2,3) ... [10,+inf)). Having at least 5 or 10 tablets per tserver per 
table (split according to the distribution of your data) might help ease 
the load.
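To illustrate, here is a small sketch of how split points could be derived from a sample of row keys before starting the workload. pick_splits is a hypothetical helper, not part of pyaccumulo or Accumulo; the chosen points would then be handed to the shell's `addsplits` command or the Java `TableOperations.addSplits` API.

```python
def pick_splits(sample_rows, num_splits):
    """Choose num_splits evenly spaced split points from a sorted,
    de-duplicated sample of row keys, so tablets spread across tservers."""
    rows = sorted(set(sample_rows))
    if num_splits >= len(rows):
        return rows
    step = len(rows) / float(num_splits + 1)
    return [rows[int(step * (i + 1))] for i in range(num_splits)]

# Ten tservers, keys "01".."10": nine interior splits yield ten tablets,
# roughly one per tserver once the balancer spreads them out.
splits = pick_splits(["%02d" % i for i in range(1, 11)], 9)
print(splits)  # ['02', '03', '04', '05', '06', '07', '08', '09', '10']
```

The sampling step matters more than the exact arithmetic: splits should follow the distribution of the data actually being written, not an even slice of the theoretical key space.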

On 2/11/14, 10:47 AM, Diego Woitasen wrote:
> Same results with 2G tserver.memory.maps.max.
>
> Maybe we just reached the limit :)
>
> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen
> <diego.woitasen@vhgroup.net> wrote:
>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>> I assume you're running a datanode alongside the tserver on that node? That
>>> may be stretching the capabilities of that node (not to mention ec2 nodes
>>> tend to be a little flaky in general). 2G for the tserver.memory.maps.max
>>> might be a little safer.
>>>
>>> You got an error in a tserver log about that IOException in internalRead.
>>> After that, the tserver was still alive? And the proxy client was dead -
>>> quit normally?
>>
>> Yes, everything is still alive.
>>
>>>
>>> If that's the case, the proxy might just be disconnecting in a noisy manner?
>>
>> Right!
>>
>> I'll try with 2G  tserver.memory.maps.max.
>>>
>>>
>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>
>>>> Hi,
>>>>    I tried increasing tserver.memory.maps.max to 3G and it failed
>>>> again, but with another error. I have a heap size of 3G and 7.5 GB of
>>>> total RAM.
>>>>
>>>> The error that I've found in the crashed tserver is:
>>>>
>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an
>>>> IOException in internalRead!
>>>>
>>>> The tserver hasn't crashed, but the client was disconnected during the
>>>> test.
>>>>
>>>> Another hint is welcome :)
>>>>
>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>
>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>
>>>>> The "commits are held" exception is thrown when the tserver is running
>>>>> low on memory. The tserver will block new mutations coming in until it
>>>>> can process the ones it already has and free up some memory. It makes
>>>>> sense that you would see this more often when you have more proxy
>>>>> servers, as the total amount of Mutations you can send to your Accumulo
>>>>> instance is increased. With one proxy server, your tserver had enough
>>>>> memory to process the incoming data. With many proxy servers, your
>>>>> tservers would likely fall over eventually because they'll get bogged
>>>>> down in JVM garbage collection.
>>>>>
>>>>> If you have more memory that you can give the tservers, that would help.
>>>>> Also, you should make sure that you're using the Accumulo native maps,
>>>>> as this will use off-JVM-heap space instead of JVM heap, which should
>>>>> help tremendously with your ingest rates.
>>>>>
>>>>> Native maps should be on by default unless you turned them off using
>>>>> the property 'tserver.memory.maps.native.enabled' in accumulo-site.xml.
>>>>> Additionally, you can try increasing the size of the native maps using
>>>>> 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware that with
>>>>> the native maps, you need to ensure that total_ram > JVM_heap +
>>>>> tserver.memory.maps.max
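As a back-of-the-envelope check of that inequality (memory_headroom_gb is a hypothetical helper for illustration, not an Accumulo API; the figures match the 7.5 GB node discussed above):

```python
def memory_headroom_gb(total_ram_gb, jvm_heap_gb, native_maps_gb):
    """RAM left for the OS, datanode, etc. after the tserver's JVM heap
    and off-heap native maps (tserver.memory.maps.max) are accounted for.
    A negative result means the box is over-committed."""
    return total_ram_gb - (jvm_heap_gb + native_maps_gb)

# 7.5 GB RAM, 3 GB heap: 3 GB of native maps leaves only 1.5 GB for
# everything else, while 2 GB of native maps leaves 2.5 GB.
print(memory_headroom_gb(7.5, 3, 3))  # 1.5
print(memory_headroom_gb(7.5, 3, 2))  # 2.5
```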
>>>>>
>>>>> - Josh
>>>>>
>>>>>
>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>
>>>>>>
>>>>>> I've launched the cluster again and I was able to reproduce the error:
>>>>>>
>>>>>> In the proxy I had the same error that I mentioned in one of my previous
>>>>>> messages, about a failure in a tablet server. I checked the log of that
>>>>>> tablet server and I found:
>>>>>>
>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal error
>>>>>> processing update
>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are
>>>>>> held
>>>>>>
>>>>>> A lot of times.
>>>>>>
>>>>>> Full log, if someone wants to have a look:
>>>>>>
>>>>>>
>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>
>>>>>> Regards,
>>>>>>      Diego
>>>>>>
>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <josh.elser@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I would assume that that proxy service would become a bottleneck fairly
>>>>>>> quickly and your throughput would benefit from running multiple proxies,
>>>>>>> but I don't have substantive numbers to back up that assertion.
>>>>>>>
>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>
>>>>>>>
>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have to run the tests again because they were ec2 instances and I've
>>>>>>>> destroyed them. It's easy to reproduce, BTW.
>>>>>>>>
>>>>>>>> My question is, does it make sense to run multiple proxies? Is there
>>>>>>>> a limit? Right now I'm trying with 10 nodes and 10 proxies (running on
>>>>>>>> every node). Maybe that doesn't make sense or it's a buggy
>>>>>>>> configuration.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <josh.elser@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When you had multiple proxies, what were the failures on that tablet
>>>>>>>>> server (10.202.6.46:9997)?
>>>>>>>>>
>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple did.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've reproduced the error and I've found this in the proxy logs:
>>>>>>>>>>
>>>>>>>>>>          2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an
>>>>>>>>>> IOException in internalRead!
>>>>>>>>>>          java.io.IOException: Connection reset by peer
>>>>>>>>>>              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>              at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>              at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>              at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>              at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>              at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>              at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>              at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>          2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN :
>>>>>>>>>> Server 10.202.6.46:9997:9997 (30000) had 20 failures in a short time
>>>>>>>>>> period, will not complain anymore
>>>>>>>>>>
>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>
>>>>>>>>>> I tried the same stress test against one proxy and I was able to
>>>>>>>>>> increase the load without getting any error.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>        Diego
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <keith@deenlo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Do you see more information in the proxy logs? "# exceptions 1"
>>>>>>>>>>> indicates an unexpected exception occurred in the batch writer client
>>>>>>>>>>> code. The proxy uses this client code, so maybe there will be a more
>>>>>>>>>>> detailed stack trace in its logs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen
>>>>>>>>>>> <diego.woitasen@vhgroup.net>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>       I'm testing with a ten node cluster with the proxy enabled
>>>>>>>>>>>> in all the nodes. I'm doing a stress test balancing the connection
>>>>>>>>>>>> between the proxies using round robin. When I increase the load
>>>>>>>>>>>> (400 workers writing) I get this error:
>>>>>>>>>>>>
>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>> # constraint violations : 0  security codes: []  # server errors 0
>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>
>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>
>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>> # constraint violations : 0  security codes: []  # server errors 0
>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>        File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>          self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>        File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>          return method(*args, **kwargs)
>>>>>>>>>>>>        File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>          batch_writer.close()
>>>>>>>>>>>>        File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>          self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>        File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>          self.recv_closeWriter()
>>>>>>>>>>>>        File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>          raise result.ouch2
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using the
>>>>>>>>>>>> cluster with multiple proxies; maybe I should use one.
>>>>>>>>>>>>
>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>        Diego
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>> www.vhgroup.net
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Diego Woitasen
>> VHGroup - Linux and Open Source solutions architect
>> www.vhgroup.net
>
>
>
