accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Error stressing with pyaccumulo app
Date Tue, 11 Feb 2014 17:16:16 GMT
Ok. Even so, try adding some split points to the tables before you begin 
(if you aren't already) as it will *greatly* smooth the startup.

Something like [00, 01, 02, ... 10, 11, 12, ... 97, 98, 99] would be 
good. You can easily dump this to a file on local disk and run the 
`addsplits` command in the Accumulo shell, pointing it at that file with 
the -sf option.
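
For example, a quick way to generate that splits file (an untested sketch; 
the table name and file path are placeholders):

    # Write two-digit split points 00..99 to a local file.
    with open('/tmp/splits.txt', 'w') as f:
        for i in range(100):
            f.write('%02d\n' % i)

    # Then, in the Accumulo shell:
    #   addsplits -t mytable -sf /tmp/splits.txt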

On 2/11/14, 12:00 PM, Diego Woitasen wrote:
> I'm using random keys for these tests. They are uuid4 keys.
>
> On Tue, Feb 11, 2014 at 1:04 PM, Josh Elser <josh.elser@gmail.com> wrote:
>> The other thing I thought about... what's the distribution of Key-Values
>> that you're writing? Specifically, do many of the Keys sort "near" each
>> other? Similarly, do you notice excessive load on some tservers, but not all
>> (the "Tablet Servers" page on the Monitor is a good check)?
>>
>> Consider the following: you have 10 tservers and you have 10 proxy servers.
>> The first thought is that 10 tservers should be plenty to balance the load
>> of those 10 proxy servers. However, a problem arises when the data that
>> each of those proxy servers is writing happens to reside on a _small number
>> of tablet servers_. Thus, your 10 proxy servers might only be writing to one
>> or two tablet servers.
>>
>> If you notice that you're getting skew like this (or even just know that
>> you're apt to have a situation where multiple clients might write data that
>> sorts close to one another), it would be a good idea to add splits to your
>> table before starting your workload.
>>
>> e.g. if you consider that your Key-space is the numbers from 1 to 10, and
>> you have ten tservers, it would be a good idea to add splits 1, 2, ... 10,
>> so that each tserver hosts at least one tablet (e.g. [1,2), [2,3)...
>> [10,+inf)). Having at least 5 or 10 tablets per tserver per table (split
>> according to the distribution of your data) might help ease the load.
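
(A sketch of doing that pre-split from the pyaccumulo side, since these tests
use uuid4 row keys; this is not from the original thread and is untested. The
host, port, credentials and table name are placeholders, and addSplits is the
raw Thrift proxy call, so check the exact names against your pyaccumulo/proxy
version:)

    from pyaccumulo import Accumulo

    # Connect through one of the proxy servers (placeholder host/credentials).
    conn = Accumulo(host='proxy-host', port=42424, user='root', password='secret')
    if not conn.table_exists('t1'):
        conn.create_table('t1')

    # uuid4 keys are hex strings, so splitting on the 256 two-character hex
    # prefixes should spread tablets (and therefore writes) across tservers.
    splits = set('%02x' % i for i in range(256))
    conn.client.addSplits(conn.login, 't1', splits)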
>>
>>
>> On 2/11/14, 10:47 AM, Diego Woitasen wrote:
>>>
>>> Same results with 2G tserver.memory.maps.max.
>>>
>>> Maybe we just reached the limit :)
>>>
>>> On Mon, Feb 10, 2014 at 7:08 PM, Diego Woitasen
>>> <diego.woitasen@vhgroup.net> wrote:
>>>>
>>>> On Mon, Feb 10, 2014 at 6:21 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>
>>>>> I assume you're running a datanode alongside the tserver on that node?
>>>>> That may be stretching the capabilities of that node (not to mention
>>>>> ec2 nodes tend to be a little flaky in general). 2G for the
>>>>> tserver.memory.maps.max might be a little safer.
>>>>>
>>>>> You got an error in a tserver log about that IOException in
>>>>> internalRead. After that, the tserver was still alive? And the proxy
>>>>> client was dead - quit normally?
>>>>
>>>>
>>>> Yes, everything is still alive.
>>>>
>>>>>
>>>>> If that's the case, the proxy might just be disconnecting in a noisy
>>>>> manner?
>>>>
>>>>
>>>> Right!
>>>>
>>>> I'll try with 2G tserver.memory.maps.max.
>>>>>
>>>>>
>>>>>
>>>>> On 2/10/14, 3:38 PM, Diego Woitasen wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>     I tried increasing the tserver.memory.maps.max to 3G and it
>>>>>> failed again, but with a different error. I have a heap size of 3G
>>>>>> and 7.5 GB of total RAM.
>>>>>>
>>>>>> The error that I've found in the crashed tserver is:
>>>>>>
>>>>>> 2014-02-08 03:37:35,497 [util.TServerUtils$THsHaServer] WARN : Got an
>>>>>> IOException in internalRead!
>>>>>>
>>>>>> The tserver hasn't crashed, but the client was disconnected during
>>>>>> the test.
>>>>>>
>>>>>> Another hint is welcome :)
>>>>>>
>>>>>> On Mon, Feb 3, 2014 at 3:58 PM, Josh Elser <josh.elser@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Oh, ok. So that isn't quite as bad as it seems.
>>>>>>>
>>>>>>> The "commits are held" exception is thrown when the tserver is
running
>>>>>>> low
>>>>>>> on memory. The tserver will block new mutations coming in until
it can
>>>>>>> process the ones it already has and free up some memory. This
makes
>>>>>>> sense
>>>>>>> that you would see this more often when you have more proxy servers
as
>>>>>>> the
>>>>>>> total amount of Mutations you can send to your Accumulo instance
is
>>>>>>> increased. With one proxy server, your tserver had enough memory
to
>>>>>>> process
>>>>>>> the incoming data. With many proxy servers, your tservers would
likely
>>>>>>> fall
>>>>>>> over eventually because they'll get bogged down in JVM garbage
>>>>>>> collection.
>>>>>>>
>>>>>>> If you have more memory that you can give the tservers, that
would
>>>>>>> help.
>>>>>>> Also, you should make sure that you're using the Accumulo native
maps
>>>>>>> as
>>>>>>> this will use off-JVM-heap space instead of JVM heap which should
help
>>>>>>> tremendously with your ingest rates.
>>>>>>>
>>>>>>> Native maps should be on by default unless you turned them off
using
>>>>>>> the
>>>>>>> property 'tserver.memory.maps.native.enabled' in accumulo-site.xml.
>>>>>>> Additionally, you can try increasing the size of the native maps
using
>>>>>>> 'tserver.memory.maps.max' in accumulo-site.xml. Just be aware
that
>>>>>>> with
>>>>>>> the
>>>>>>> native maps, you need to ensure that total_ram > JVM_heap
+
>>>>>>> tserver.memory.maps.max
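
(To make the sizing concrete: the nodes in this thread have 7.5 GB of RAM, so
a 3 GB JVM heap plus a 2 GB native map leaves some headroom for the OS and the
datanode. A rough accumulo-site.xml sketch, using the two properties named
above with example values only:)

    <property>
      <name>tserver.memory.maps.native.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>tserver.memory.maps.max</name>
      <value>2G</value>
    </property>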
>>>>>>>
>>>>>>> - Josh
>>>>>>>
>>>>>>>
>>>>>>> On 2/3/14, 1:33 PM, Diego Woitasen wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I've launched the cluster again and I was able to reproduce the
>>>>>>>> error:
>>>>>>>>
>>>>>>>> In the proxy I had the same error that I mentioned in one of my
>>>>>>>> previous messages, about a failure in a tablet server. I checked
>>>>>>>> the log of that tablet server and I found:
>>>>>>>>
>>>>>>>> 2014-02-03 18:02:24,065 [thrift.ProcessFunction] ERROR: Internal
>>>>>>>> error processing update
>>>>>>>> org.apache.accumulo.server.tabletserver.HoldTimeoutException:
>>>>>>>> Commits are held
>>>>>>>>
>>>>>>>> A lot of times.
>>>>>>>>
>>>>>>>> Full log if someone wants to have a look:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://www.vhgroup.net/diegows/tserver_matrix-slave-07.accumulo-ec2-test.com.debug.log
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>       Diego
>>>>>>>>
>>>>>>>> On Mon, Feb 3, 2014 at 12:11 PM, Josh Elser <josh.elser@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would assume that that proxy service would become a bottleneck
>>>>>>>>> fairly quickly and your throughput would benefit from running
>>>>>>>>> multiple proxies, but I don't have substantive numbers to back up
>>>>>>>>> that assertion.
>>>>>>>>>
>>>>>>>>> I'll put this on my list and see if I can reproduce something.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/3/14, 7:42 AM, Diego Woitasen wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have to run the tests again because they were ec2 instances
>>>>>>>>>> and I've destroyed them. It's easy to reproduce BTW.
>>>>>>>>>>
>>>>>>>>>> My question is, does it make sense to run multiple proxies? Is
>>>>>>>>>> there a limit? Right now I'm trying with 10 nodes and 10 proxies
>>>>>>>>>> (running on every node). Maybe that doesn't make sense or it's a
>>>>>>>>>> buggy configuration.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 31, 2014 at 7:29 PM, Josh Elser <josh.elser@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When you had multiple proxies, what were the failures on that
>>>>>>>>>>> tablet server (10.202.6.46:9997)?
>>>>>>>>>>>
>>>>>>>>>>> I'm curious why using one proxy didn't cause errors but multiple
>>>>>>>>>>> did.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 1/31/14, 4:44 PM, Diego Woitasen wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've reproduced the error and I've found this in the proxy
>>>>>>>>>>>> logs:
>>>>>>>>>>>>
>>>>>>>>>>>>     2014-01-31 19:47:50,430 [server.THsHaServer] WARN : Got an
>>>>>>>>>>>> IOException in internalRead!
>>>>>>>>>>>>     java.io.IOException: Connection reset by peer
>>>>>>>>>>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>>>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>>>>>>>>>>>>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
>>>>>>>>>>>>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>>>>>>>>>>>>     2014-01-31 19:51:13,185 [impl.ThriftTransportPool] WARN :
>>>>>>>>>>>> Server 10.202.6.46:9997:9997 (30000) had 20 failures in a short
>>>>>>>>>>>> time period, will not complain anymore
>>>>>>>>>>>>
>>>>>>>>>>>> A lot of these messages appear in all the proxies.
>>>>>>>>>>>>
>>>>>>>>>>>> I tried the same stress tests against one proxy and I was able
>>>>>>>>>>>> to increase the load without getting any error.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>         Diego
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 30, 2014 at 2:47 PM, Keith Turner <keith@deenlo.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you see more information in the proxy logs?  "# exceptions 1"
>>>>>>>>>>>>> indicates an unexpected exception occurred in the batch writer
>>>>>>>>>>>>> client code. The proxy uses this client code, so maybe there
>>>>>>>>>>>>> will be a more detailed stack trace in its logs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jan 30, 2014 at 9:46 AM, Diego Woitasen
>>>>>>>>>>>>> <diego.woitasen@vhgroup.net> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>        I'm testing with a ten node cluster with the proxy
>>>>>>>>>>>>>> enabled on all the nodes. I'm doing a stress test, balancing
>>>>>>>>>>>>>> the connections between the proxies using round robin. When I
>>>>>>>>>>>>>> increase the load (400 workers writing) I get this error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>>> # constraint violations : 0  security codes: []  # server errors 0
>>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The complete message is:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> AccumuloSecurityException:
>>>>>>>>>>>>>> AccumuloSecurityException(msg='org.apache.accumulo.core.client.MutationsRejectedException:
>>>>>>>>>>>>>> # constraint violations : 0  security codes: []  # server errors 0
>>>>>>>>>>>>>> # exceptions 1')
>>>>>>>>>>>>>> kvlayer-test client failed!
>>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>>   File "tests/kvlayer/test_accumulo_throughput.py", line 64, in __call__
>>>>>>>>>>>>>>     self.client.put('t1', ((u,), self.one_mb))
>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_decorators.py", line 26, in wrapper
>>>>>>>>>>>>>>     return method(*args, **kwargs)
>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/kvlayer-0.2.7-py2.7.egg/kvlayer/_accumulo.py", line 154, in put
>>>>>>>>>>>>>>     batch_writer.close()
>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/__init__.py", line 126, in close
>>>>>>>>>>>>>>     self._conn.client.closeWriter(self._writer)
>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3149, in closeWriter
>>>>>>>>>>>>>>     self.recv_closeWriter()
>>>>>>>>>>>>>>   File "/home/ubuntu/kvlayer-env/local/lib/python2.7/site-packages/pyaccumulo_dev-1.5.0.2-py2.7.egg/pyaccumulo/proxy/AccumuloProxy.py", line 3172, in recv_closeWriter
>>>>>>>>>>>>>>     raise result.ouch2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure if the error is produced by the way I'm using
>>>>>>>>>>>>>> the cluster with multiple proxies, maybe I should use one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ideas are welcome.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>         Diego
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Diego Woitasen
>>>>>>>>>>>>>> VHGroup - Linux and Open Source solutions architect
>>>>>>>>>>>>>> www.vhgroup.net
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Diego Woitasen
>>>> VHGroup - Linux and Open Source solutions architect
>>>> www.vhgroup.net
>>>
>>>
>>>
>>>
>>
>
>
>
