Subject: Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
From: Keith Turner
To: user@accumulo.apache.org
Date: Mon, 20 Aug 2012 13:34:10 -0400

Arjumand

It sounds like you are joining two data sets? Reading one data set from
Accumulo and then doing a lookup into the other data set for each input to
the mapper? This is a good approach if one data set is small and fits into
memory. If this is not the case, then you may want to consider another
join strategy. Below are some of the options I know of for joining two
data sets, D1 and D2:

 * Read D1 into memory on each mapper, and do lookups into the D1 map as
   each item is read from D2. Good approach when D1 is small.
 * Set D1 and D2 as inputs to the map reduce job and join in the reduce
   phase. Good approach when D1 and D2 are large.
 * If D1 is an Accumulo table, use the batch scanner to look up elements
   in D2. Good approach when D1 is already in a table and D2 is relatively
   small. If D2 is larger, you could do batch scans in reducers or maybe
   mappers (see the sketch below).
 * Compute a bloom filter for D1, and load the bloom filter into memory on
   each mapper. Check existence in the D1 bloom filter as each item is
   read from D2. This can work well when lots of things in D2 do not exist
   in D1. However, this is just a filter step that cuts down on what you
   need to join; a join will still need to be done.

The approach you choose for joining depends on the relative size of your
two data sets. The goal is to batch work and avoid single lookups into
Accumulo or a MapFile if possible.

Keith
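Keith's batch-scanner option can be made concrete with a short sketch. The following is a minimal, hypothetical example against the Accumulo 1.4 client API; the instance name, credentials, table name, and what gets emitted are placeholders rather than anything from this thread. The idea is simply to queue lookups and resolve them in batches instead of issuing one scanner call per input record:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: D2 is the MapReduce input, D1 is an Accumulo table.
// Lookups are buffered and resolved with a BatchScanner in batches of 1000.
public class BatchLookupMapper extends Mapper<Key, Value, Text, Text> {
  private static final int BATCH_SIZE = 1000;
  private Connector conn;
  private final List<Range> pending = new ArrayList<Range>();

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Placeholder connection details; a real job would read these from
      // the job configuration.
      conn = new ZooKeeperInstance("dev", "localhost")
          .getConnector("user", "password".getBytes());
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // Queue a lookup for this record's row instead of scanning immediately.
    pending.add(new Range(key.getRow()));
    if (pending.size() >= BATCH_SIZE)
      flush(context);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    if (!pending.isEmpty())
      flush(context);
  }

  private void flush(Context context) throws IOException, InterruptedException {
    try {
      // One parallel, batched lookup against D1 for the whole buffer.
      BatchScanner bs = conn.createBatchScanner("d1_table", new Authorizations(), 4);
      bs.setRanges(pending);
      for (Entry<Key, Value> e : bs)
        context.write(e.getKey().getRow(), new Text(e.getValue().get()));
      bs.close();
      pending.clear();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

A real join would also keep the buffered D2 records alongside the queued ranges so matches can be paired up; the sketch only shows the batching pattern.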
On Mon, Aug 20, 2012 at 1:00 PM, Arjumand Bonhomme wrote:
> Hey guys,
>
> I'm back with some additional information and I'm hoping that you might be
> able to help me finally get past this issue.
>
> So I changed my config from the 512MB version to the 1GB version without
> any noticeable improvement. The job still became stuck at roughly 10K
> input records, and accumulo's zookeeper session timed out as usual. After
> thinking about this for a while, it occurred to me that I had previously
> used accumulo as an input source on over 30K records with no issues at all
> (it was writing back out to accumulo as well). So I tried to figure out
> why this new job could never reliably get past 10K records. The only real
> difference was that the new job was using a Hadoop MapFile as a lookup
> table for record values. So after commenting out the use of the MapFile
> and using a single hard-coded value, I ran the job again and it made it
> through the entire input very quickly, ~3.5M entries in just over a
> minute. So after this I assumed the MapFile was slowing my job down too
> much for the accumulo input scanner to keep its connection with zookeeper.
> The MapFile was generated by reading some data from one of my tables in
> accumulo, so the next thing I tried (which I figured would be very bad in
> practice) was to read the lookup data directly from accumulo; I re-wrote
> the mapper to do a scanner lookup against the other accumulo table for
> each input record. This immediately worked much better than the MapFile.
> It was pretty slow, but it managed to get up to ~92K records before it
> failed. Accumulo once again lost its zookeeper session due to timeout.
>
> I recognize that this is not exactly an accumulo issue, but I was hoping
> that you might be able to provide me some guidance as to how to ultimately
> get past this issue. I'm using this against a small sample of what the
> actual input will be, and I have ~3.5M input records and ~30K values in my
> lookup table. Both of these values will likely increase in size
> substantially when this is run against the actual input.
>
> Any suggestions about how to approach this problem will be greatly
> appreciated.
>
> Thanks,
> -Arjumand
>
> BTW, while I recognize this would not be the appropriate way to address my
> problem, I was wondering if there is a reason why the
> org.apache.accumulo.core.client.ZooKeeperInstance constructor allows you
> to specify/request a specific session timeout, but the same thing is not
> available on the .setZooKeeperInstance() methods of the
> AccumuloInputFormat/AccumuloOutputFormat classes?
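For reference, the constructor in question can be used like the sketch below; this assumes the 1.4 ZooKeeperInstance constructor that takes a session timeout in milliseconds, and the instance name, quorum, and credentials are placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;

public class TimeoutExample {
  public static void main(String[] args) throws Exception {
    // Request a 60-second ZooKeeper session timeout at construction time;
    // "dev" and "localhost" stand in for the instance name and ZK quorum.
    ZooKeeperInstance instance = new ZooKeeperInstance("dev", "localhost", 60000);
    Connector conn = instance.getConnector("user", "password".getBytes());
    System.out.println(conn.whoami());
  }
}

As noted above, the static setZooKeeperInstance() configuration methods take only the instance name and quorum, so a job configured through AccumuloInputFormat ends up with the default timeout.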
> On Thu, Aug 16, 2012 at 10:10 PM, Arjumand Bonhomme wrote:
>>
>> Thanks for the input, no worries about the offline scanner bit. Although
>> if I can ever get past my current issue, I'd love to learn why the
>> offline scanner failed as well.
>>
>> So I took a look at my various other settings, and watched the paging
>> while everything was running, and it doesn't appear (at least not to me)
>> to be related to swapping.
>> I'm using the 512MB accumulo configuration, and I'm running the hadoop
>> task with a 256MB heap. I'm running hadoop with the standard
>> pseudo-distributed config, except for bumping the
>> dfs.datanode.max.xcievers property to 4096. I'm running zookeeper with
>> the out-of-the-box config, except for increasing maxClientCnxns in
>> zoo.cfg to 100.
>>
>> I did a reboot of my machine, then began another series of tests. This
>> time I was watching vm_stat while everything was running.
>>
>> I noticed the same behavior I've been witnessing all along. I'm
>> including the stacktrace that was emitted to the tserver log when it
>> hung, as it usually does:
>>
>> 16 20:58:12,311 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>> 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for
>> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/tables/1/conf/table.compaction.minor.idle
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
>> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
>> at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:120)
>> at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:108)
>> at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:70)
>> at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:95)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:312)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.access$200(TabletServerResourceManager.java:228)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:252)
>> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>> at java.lang.Thread.run(Thread.java:680)
>> 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for
>> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/config/table.scan.max.memory
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
>> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
>> at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:109)
>> at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:76)
>> at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:103)
>> at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:47)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:960)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager.executeReadAhead(TabletServerResourceManager.java:699)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1316)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1284)
>> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>> at $Proxy1.startMultiScan(Unknown Source)
>> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.process(TabletClientService.java:2164)
>> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2037)
>> at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>> at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>> at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>> at java.lang.Thread.run(Thread.java:680)
>> 16 20:58:13,379 [zookeeper.ZooLock] DEBUG: event null None Expired
>> 16 20:58:13,393 [tabletserver.TabletServer] FATAL: Lost tablet server lock
>> (reason = SESSION_EXPIRED), exiting.
>>
>> Below is the matching stacktrace emitted by the hadoop job.
>> 12/08/16 20:57:14 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@47124746 >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/fe80:0:0:0:0:0:0:1%1:2181 >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = >> 0x13931fecee3012e, negotiated timeout = 30000 >> 12/08/16 20:57:14 INFO mapred.JobClient: Running job: job_local_0001 >> 12/08/16 20:57:14 INFO mapred.Task: Using ResourceCalculatorPlugin : null >> 12/08/16 20:57:14 INFO mapred.MapTask: io.sort.mb = 100 >> 12/08/16 20:57:15 INFO mapred.MapTask: data buffer = 79691776/99614720 >> 12/08/16 20:57:15 INFO mapred.MapTask: record buffer = 262144/327680 >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO mapred.JobClient: map 0% reduce 0% >> 001000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@feel >> 002000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@perhap >> 12/08/16 20:57:20 INFO mapred.LocalJobRunner: >> 003000 >> 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@univers >> 12/08/16 20:57:23 INFO mapred.LocalJobRunner: >> 004000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@doubt >> 005000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@morwai >> 12/08/16 20:57:26 INFO mapred.LocalJobRunner: >> 006000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@tgame >> 007000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@colleg >> 12/08/16 20:57:29 INFO mapred.LocalJobRunner: >> 008000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@led >> 12/08/16 20:57:32 INFO mapred.LocalJobRunner: >> 009000 >> 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@stephanopoulo >> 010000 >> 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@brother >> 12/08/16 20:57:35 INFO mapred.LocalJobRunner: >> 011000 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@inabl >> 12/08/16 20:57:38 INFO mapred.LocalJobRunner: >> 12/08/16 20:57:41 INFO mapred.LocalJobRunner: >> 12/08/16 20:58:12 INFO zookeeper.ClientCnxn: Client session timed out, >> have not heard from server in 37008ms for sessionid 0x13931fecee3012e, >> closing socket connection and attempting reconnect >> 012000 >> 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@saturdai >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/0:0:0:0:0:0:0:1:2181 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/0:0:0:0:0:0:0:1:2181, initiating session >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Unable to reconnect to >> ZooKeeper service, session 0x13931fecee3012e has expired, closing socket >> connection >> 12/08/16 20:58:13 WARN zookeeper.ZooCache: Zookeeper error, will retry >> org.apache.zookeeper.KeeperException$SessionExpiredException: >> 
KeeperErrorCode = Session expired for >> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7 >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) >> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815) >> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208) >> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130) >> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233) >> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188) >> at >> org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:147) >> at >> org.apache.accumulo.core.client.impl.TabletLocator.getInstance(TabletLocator.java:96) >> at >> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:245) >> at >> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:94) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> at java.lang.Thread.run(Thread.java:680) >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: EventThread shut down >> 12/08/16 20:58:13 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@7f434cb2 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/0:0:0:0:0:0:0:1:2181 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/0:0:0:0:0:0:0:1:2181, initiating session >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >> 0x13931fecee30135, negotiated timeout = 30000 >> 013000 >> 2012-06-11T09-05-19/abc/06a7b8b2de18727faa635b841b0dd043.txt@anthoni >> 12/08/16 20:58:17 INFO mapred.LocalJobRunner: >> >> >> And finally here is the output from "vm_stat 1". Doesn't look like >> anything was really getting swapped out to disk. While there was a flurry >> of paging (reading pages in) as soon as I started up the job, that paging >> ceased very quickly after that. Then very shortly it hung again as usual. >> When looking at the output, it may help to know that I started the hadoop >> job just after the 3rd banner/header was printed out, and you can see the >> initial flurry of paging and then it stops after roughly 12 seconds. 
>> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 595157 944925 81264 89580 385586 319170K 8990003 159901K 3094025 >> 9227006 1546121 >> 595491 944110 81292 89605 386400 1909 0 1417 0 >> 4 0 >> 595204 944911 81292 89579 385625 708 0 551 0 >> 0 0 >> 595429 944652 81293 89579 385879 1941 0 1314 0 >> 0 0 >> 595379 945101 81293 89579 385407 616 0 521 0 >> 0 0 >> 595114 944636 81294 89579 385879 1824 0 1329 0 >> 0 0 >> 595664 945080 81295 89578 385406 5365 6 4032 0 >> 1 0 >> 595221 944078 81295 89578 386410 2645 0 1840 0 >> 0 0 >> 595338 945075 81295 89578 385406 767 2 505 0 >> 0 0 >> 595061 944463 81295 89578 386059 2113 0 1433 0 >> 1 0 >> 595194 944936 81295 89578 385587 674 0 565 0 >> 0 0 >> 595267 944458 81296 89578 386059 1731 0 1293 0 >> 0 0 >> 595418 944925 81297 89578 385587 780 0 581 0 >> 0 0 >> 595124 944200 81296 89579 386419 2111 84 1397 0 >> 2 0 >> 595037 944686 81295 89579 385948 657 2 548 0 >> 2 0 >> 595054 944213 81295 89579 386420 1785 0 1350 0 >> 1 0 >> 595289 944687 81295 89579 385948 758 0 565 0 >> 1 0 >> 595052 944221 81296 89579 386420 1987 0 1356 0 >> 0 0 >> 595017 944698 81296 89579 385948 768 0 558 0 >> 0 0 >> 595272 944217 81296 89579 386420 1957 0 1334 0 >> 0 0 >> 594954 944686 81296 89579 385948 704 0 535 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 594997 944166 81296 89579 386420 319204K 8990097 159926K 3094025 >> 9227018 1546121 >> 594977 944641 81298 89579 385948 784 0 569 0 >> 0 0 >> 595282 944160 81298 89579 386420 1874 0 1368 0 >> 1 0 >> 595252 944633 81298 89579 385948 735 0 561 0 >> 1 0 >> 595057 944186 81298 89579 386420 1972 0 1391 0 >> 0 0 >> 595039 944481 81298 89579 386129 929 0 564 0 >> 1 0 >> 595512 943232 81346 89579 387188 3109 31 2542 0 >> 1 0 >> 595395 944491 81346 89579 385948 742 0 626 0 >> 0 0 >> 595134 944016 81346 89579 386420 1844 1 1335 0 >> 0 0 >> 595329 944501 81346 89579 385948 694 0 546 0 >> 0 0 >> 595329 944016 81346 89579 386431 1801 0 1330 0 >> 0 0 >> 595471 944458 81347 89579 385949 797 0 669 0 >> 0 0 >> 595433 943918 81347 89579 386431 2380 0 1806 0 >> 0 0 >> 595118 944433 81347 89579 385948 678 0 567 0 >> 0 0 >> 595142 943989 81347 89571 386420 1863 0 1378 0 >> 1 0 >> 595097 944462 81347 89571 385948 654 0 540 0 >> 1 0 >> 595126 944060 81347 89571 386420 1874 0 1430 0 >> 0 0 >> 595068 944674 81348 89571 385948 787 0 678 0 >> 0 0 >> 595110 944174 81348 89571 386420 1863 0 1345 0 >> 0 0 >> 595362 944627 81348 89571 385948 768 2 575 0 >> 0 0 >> 595115 944131 81348 89572 386420 1755 0 1323 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 595142 944657 81348 89571 385948 319233K 8990131 159947K 3094025 >> 9227025 1546121 >> 591569 946751 81348 89571 387281 9240 786 4725 0 >> 8 0 >> 579748 958406 81358 90168 386840 20057 287 13388 0 >> 492 0 >> 563598 959835 88682 97236 387319 20388 931 4220 0 >> 0 0 >> 535840 986697 88692 98643 386906 59619 1497 32467 0 >> 164 0 >> 534240 987263 88696 99134 387394 42744 0 6762 0 >> 98 0 >> 529479 992037 88723 100123 386187 39552 3 9006 0 >> 10 0 >> 529065 991283 88760 100107 387430 32014 153 7500 0 >> 41 0 >> 528147 992276 88784 100376 386966 37719 2 6050 0 >> 15 0 >> 527155 993536 88818 100380 386672 34999 0 6928 0 >> 1 0 >> 526748 992973 88851 100380 
387747 34379 1 5240 0 >> 0 0 >> 526835 994071 88902 100404 386674 34698 2 7063 0 >> 8 0 >> 524993 994621 89207 100582 387027 28374 460 6018 0 >> 182 0 >> 524182 995184 89286 100582 387544 42552 226 7313 0 >> 0 0 >> 523704 995987 89328 100629 387076 42731 1 6357 0 >> 0 0 >> 523154 995939 89367 100629 387554 41100 0 7338 0 >> 0 0 >> 522867 996724 89405 100629 387307 39246 0 6089 0 >> 0 0 >> 522160 996665 89451 100639 387563 37614 1 7463 0 >> 0 0 >> 522414 997920 89501 100639 386324 40044 0 6995 0 >> 0 0 >> 521524 997171 89539 100639 387569 39447 0 7048 0 >> 1 0 >> 520258 998993 89575 100645 387107 37476 0 6789 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 519291 999119 89614 100645 387771 319989K 8994481 160119K 3094025 >> 9228045 1546121 >> 518546 1000K 89653 100647 387672 39985 81 6722 0 >> 0 0 >> 518354 1000K 89701 100647 387196 33629 0 7757 0 >> 0 0 >> 516218 1002K 89737 100647 387192 39403 0 7368 0 >> 1 0 >> 513388 1004K 89764 100648 388210 36737 0 14479 0 >> 0 0 >> 507991 1011K 89784 100648 386465 32460 1 13133 0 >> 1 0 >> 506346 1012K 89816 100648 387917 39257 2 7246 0 >> 1 0 >> 506298 1012K 89816 100648 387356 626 0 313 0 >> 0 0 >> 506070 1013K 89816 100648 386259 217 1 102 0 >> 0 0 >> 506101 1013K 89816 100648 386259 386 0 99 0 >> 0 0 >> 505974 1012K 89769 100618 387548 1417 21 1250 0 >> 9 0 >> 506078 1013K 89769 100618 386259 297 0 191 0 >> 0 0 >> 506349 1013K 89769 100618 386259 200 0 100 0 >> 0 0 >> 506105 1013K 89769 100618 386259 211 0 104 0 >> 0 0 >> 506351 1013K 89769 100618 386259 187 0 89 0 >> 0 0 >> 506114 1013K 89769 100618 386599 269 0 185 0 >> 0 0 >> 506228 1013K 89769 100618 386258 228 0 129 0 >> 0 0 >> 506231 1013K 89769 100618 386258 147 0 83 0 >> 0 0 >> 506486 1013K 89769 100618 386258 306 0 119 0 >> 0 0 >> 506215 1013K 89769 100619 386258 174 0 62 0 >> 0 0 >> 505927 1014K 89769 100619 386258 593 0 524 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 506437 1013K 89769 100619 386700 320215K 8994587 160180K 3094025 >> 9228057 1546121 >> 506670 1013K 89769 100619 386225 128 0 45 0 >> 0 0 >> 506347 1013K 89769 100619 386225 268 0 58 0 >> 0 0 >> 506033 1013K 89769 100619 386225 127 0 89 0 >> 0 0 >> 506221 1013K 89769 100619 386225 86 0 41 0 >> 0 0 >> 506157 1013K 89769 100619 386225 81 0 38 0 >> 0 0 >> 506096 1013K 89769 100619 386225 345 0 186 0 >> 0 0 >> 506044 1013K 89769 100619 386236 291 0 224 0 >> 0 0 >> 505980 1013K 89769 100619 386225 81 0 39 0 >> 0 0 >> 505916 1013K 89769 100619 386225 99 0 50 0 >> 0 0 >> 505839 1013K 89769 100619 386225 79 0 36 0 >> 0 0 >> 506028 1013K 89769 100619 386225 102 0 61 0 >> 0 0 >> 505760 1014K 89769 100619 386225 286 0 229 0 >> 0 0 >> 506151 1013K 89769 100619 386406 140 0 88 0 >> 0 0 >> 506097 1013K 89769 100619 386406 101 2 47 0 >> 0 0 >> 506028 1014K 89769 100619 386225 175 0 62 0 >> 0 0 >> 506230 1014K 89769 100619 386225 227 0 39 0 >> 0 0 >> 505939 1014K 89769 100619 386225 550 0 512 0 >> 0 0 >> 505682 1014K 89769 100619 386225 90 0 50 0 >> 0 0 >> 506772 1011K 89782 100619 388439 6973 15 1797 0 >> 1 0 >> 550258 968746 89827 99598 388175 40677 7 8407 0 >> 1 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 585656 935138 89879 98682 387121 320304K 8994611 160203K 3094025 >> 9228060 
1546121
>> 585477 935589 89892 98682 386906 10654 1 2648 0 0 0
>> 586236 935269 89898 98683 386610 2120 2 1473 0 1 0
>> 586650 935141 89903 98683 386138 833 0 638 0 2 0
>> 586647 933883 89903 98683 387378 3028 174 1553 0 0 0
>> 587883 932971 89909 98683 387427 882 0 638 0 0 0
>> 588259 932579 89915 98683 386994 1988 0 1383 0 0 0
>> 589697 931139 89923 98683 387427 1174 4 709 0 0 0
>> 589306 931423 89925 98684 387052 2432 0 1743 0 0 0
>> 589128 931606 89926 98688 387043 2309 0 1632 0 5 0
>> 590286 930456 89904 98689 387413 3506 7 2453 0 16 0
>> 590411 931982 89905 98689 385674 1801 0 1384 0 0 0
>> 590385 931329 89905 98689 386327 2954 1 2187 0 0 0
>> 590551 931731 89910 98664 385920 2060 0 1543 0 0 0
>> 589471 931873 89911 98664 386856 5511 252 3120 0 0 0
>> 589397 931941 89911 98659 386769 1351 0 1090 0
>>
>> So while I'm not 100% certain that this isn't related to swapping and
>> garbage collection, it seems like something else is probably the cause.
>>
>> Can you guys give me some more pointers/guidance on what I can check to
>> try to get past this issue?
>> Would seeing any of the configuration or hadoop driver code help you get
>> a better idea of what I might be dealing with?
>> If there's any more due diligence I can/should be doing, just let me know.
>>
>> I'm sure it's more than likely a configuration issue on my end; I'm
>> really grasping at straws at this point.
>>
>> Thanks again for all the help,
>> -Arjumand
>>
>> On Thu, Aug 16, 2012 at 3:54 PM, John Vines wrote:
>>>
>>> I'm sorry, I missed that the offline scanner was being used.
>>>
>>> Issues with locks being lost like you described typically occur from
>>> swapping. You said you used the 512MB footprint Accumulo configuration,
>>> but you should also make sure that the memory settings for your various
>>> hadoop processes, taking into account the number of map and reduce tasks
>>> running, do not exceed your physical memory (plus any other processes
>>> you may be running on your system). Swapping + garbage collection have
>>> proven to be quite effective at causing lost locks.
>>>
>>> John
>>
>> On Thu, Aug 16, 2012 at 3:48 PM, Arjumand Bonhomme wrote:
>>>
>>> This isn't my original concern/problem, so it's not terribly important,
>>> but I'm not sure I understand and I'd like to learn as much as possible.
>>>
>>> Why wouldn't it work with the cloned table offline?
>>> I followed the example laid out in
>>> org.apache.accumulo.examples.simple.mapreduce.UniqueColumns
>>> using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(),
>>> true);" setup.
>>>
>>> Can you give me some detail as to why it definitely would not have
>>> worked offline? Is this method no longer supported?
>>>
>>> Thanks!
>>> -Arjumand
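For readers following the offline-scanner discussion, the clone-and-scan-offline pattern looks roughly like the sketch below. The setScanOffline call is the one quoted above; the clone/offline calls and all names are illustrative 1.4-era usage, not code from the thread:

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapreduce.Job;

public class OfflineScanSetup {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("dev", "localhost")
        .getConnector("user", "password".getBytes());

    // Clone the source table and take the clone offline. The clone shares
    // the original's files until the original changes, so this is cheap.
    conn.tableOperations().clone("mytable", "mytable_clone", true,
        Collections.<String, String>emptyMap(), Collections.<String>emptySet());
    conn.tableOperations().offline("mytable_clone");

    Job job = new Job();
    AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(), "dev", "localhost");
    AccumuloInputFormat.setInputInfo(job.getConfiguration(), "user",
        "password".getBytes(), "mytable_clone", new Authorizations());
    // Read the offline clone's files directly instead of going through
    // the tablet servers.
    AccumuloInputFormat.setScanOffline(job.getConfiguration(), true);
  }
}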
>>> On Thu, Aug 16, 2012 at 3:24 PM, John Vines wrote:
>>>>
>>>> Your work on the clone table failed because you took it offline. The
>>>> table needs to be online in order to do a job against it.
>>>>
>>>> John
>>>>
>>>> On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme wrote:
>>>>>
>>>>> Jim, William, Adam: Thanks for your help with this, I'm running out of
>>>>> ideas to google search for answers.
>>>>>
>>>>> I'm going to try to answer your questions:
>>>>>
>>>>> 1) I had already adjusted maxClientCnxns in zoo.cfg to be 100. I
>>>>> didn't see anything in the log indicating that it was shutting down
>>>>> connections due to reaching a connection limit. However, throughout the
>>>>> zookeeper logs, even before the hadoop job was run, I did see lots of
>>>>> lines like the one below. Up til this point, I've assumed it's
>>>>> innocuous and unrelated to my issue.
>>>>> INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
>>>>> client /127.0.0.1:59158 (no session established for client)
>>>>>
>>>>> 2) I had also already adjusted the dfs.datanode.max.xcievers property
>>>>> to 4096 in hdfs-site.xml. An investigation of the log shows that this
>>>>> limit is not being reached. It had been reached at some point before I
>>>>> increased the value from the default. I had increased the value while
>>>>> troubleshooting, but prior to posting to this list; it didn't appear to
>>>>> have a noticeable effect on the behavior of the hadoop job.
>>>>>
>>>>> 3) I'm writing out to a sequence file, so accumulo is only being used
>>>>> for input. As a side note, at one point during my troubleshooting, I
>>>>> compacted, cloned, and then took the cloned table offline and tried to
>>>>> use that instead. That failed immediately, without processing any
>>>>> records. From the stacktrace, it appeared as though the iterator was
>>>>> trying to use one of the files for the original table (from what I
>>>>> understand about cloning, this is normal b/c no changes had been made
>>>>> to the original table) but said it did not exist. I was, however, able
>>>>> to find the file on hdfs. So I just gave up on that. Also, under the
>>>>> normal case, i.e. using the original table online, nothing is writing
>>>>> to the table while the hadoop job is running.
>>>>>
>>>>> 4) My original open file limit was the os default of 256. So I upped
>>>>> it to 1024, and performed another attempt. The behavior was the same
>>>>> as before.
>>>>>
>>>>> I'm including a snippet from the tserver debug log.
It looks like an >>>>> expired session might be the root of the problem, but I'm not sure what >>>>> would cause that: >>>>> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:49719 45 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s >>>>> ct=0.001s) >>>>> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1] >>>>> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1] >>>>> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.07(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=111,234,160(+14,477,728) totalmem=132,055,040 >>>>> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >>>>> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1] >>>>> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE >>>>> minor compaction chosen >>>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: >>>>> COMPACTING !0<;~ total = 120,840 ingestMemory = 120,840 >>>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: >>>>> chosenMem = 2,416 chosenIT = 300.23 load 3,044 >>>>> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 >>>>> secs >>>>> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor >>>>> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~ >>>>> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >>>>> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE >>>>> compactionThreshold = 0.605 maxObserved = 120,840 >>>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER >>>>> compactionThreshold = 0.666 >>>>> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 >>>>> read | 4 written | 4,800 entries/sec | 0.005 secs >>>>> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: >>>>> !0<;~ 
127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c >>>>> 16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: >>>>> writeTime:1ms >>>>> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] >>>>> -> /default_tablet/F000051i.rf >>>>> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs >>>>> !0<;~ >>>>> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:51480 43 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s >>>>> ct=0.001s) >>>>> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >>>>> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB >>>>> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, >>>>> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss >>>>> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0 >>>>> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >>>>> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032), >>>>> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0, >>>>> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%, >>>>> Evicted/Run=NaN, Duplicate Reads=1 >>>>> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >>>>> secs, wait 0.00 secs >>>>> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >>>>> secs, wait 0.00 secs >>>>> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0< >>>>> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >>>>> /table_info/A000051j.rf_tmp >>>>> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< >>>>> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >>>>> /root_tablet/A000051k.rf_tmp >>>>> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 >>>>> read | 21 written | 276 entries/sec | 0.141 secs >>>>> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 >>>>> read | 51 written | 323 entries/sec | 0.173 secs >>>>> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs >>>>> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC >>>>> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >>>>> /root_tablet/A000051k.rf >>>>> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for >>>>> extent !0<;~ as alias 2 >>>>> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs >>>>> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC >>>>> [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >>>>> /table_info/A000051j.rf >>>>> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.09(+0.02) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=97,183,880(-1,627,536) totalmem=132,055,040 >>>>> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:51480 146 in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s >>>>> ct=0.004s) >>>>> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1] >>>>> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.09(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=98,415,928(+12,106,904) totalmem=132,055,040 >>>>> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess >>>>> 127.0.0.1:65352 2 entries in 
0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1)
>>>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1)
>>>>> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>>>> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>>>>> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
>>>>> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
>>>>>
>>>>> I'm also including a snippet from the zookeeper log. Around the same
>>>>> time that the tserver logs show the loss of the tablet server lock due
>>>>> to an expired session, the zookeeper log mentions the expiration of
>>>>> several sessions due to timeout.
>>>>> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:58633 (no session established for client)
>>>>> 2012-08-16 13:48:08,371 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59158
>>>>> 2012-08-16 13:48:08,372 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /127.0.0.1:59158
>>>>> 2012-08-16 13:48:08,378 - INFO [Thread-371:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:08,379 - INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59158 (no session established for client)
>>>>> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59635
>>>>> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /127.0.0.1:59635
>>>>> 2012-08-16 13:48:09,399 - INFO [Thread-372:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:09,400 - INFO [Thread-372:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59635 (no session established for client)
>>>>> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80005
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80004
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80001
>>>>> 2012-08-16 13:48:32,003 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:65339 which had sessionid 0x1393086dfb80005
>>>>> 2012-08-16 13:48:34,001 - INFO
[SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:36,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:38,000 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:38,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:40,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:44,456 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had >>>>> sessionid 0x1393086dfb80004 >>>>> 2012-08-16 13:48:44,458 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /127.0.0.1:65316 which had sessionid >>>>> 0x1393086dfb80001 >>>>> 2012-08-16 13:48:44,461 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80002 >>>>> 2012-08-16 13:48:44,482 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80007 >>>>> 2012-08-16 13:48:44,484 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception >>>>> causing close of session 0x1393086dfb80003 due to java.io.IOException: >>>>> Connection reset by peer >>>>> 2012-08-16 13:48:44,486 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had >>>>> sessionid 0x1393086dfb80002 >>>>> 2012-08-16 13:48:44,491 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb8004d >>>>> 2012-08-16 13:48:44,493 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80006 >>>>> 2012-08-16 13:48:44,495 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid >>>>> 0x1393086dfb80003 >>>>> 2012-08-16 13:48:44,495 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - >>>>> EndOfStreamException: Unable to read additional data from client sessionid >>>>> 0x1393086dfb80007, likely client has closed socket >>>>> 2012-08-16 13:48:44,496 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /0:0:0:0:0:0:0:1%0:65343 which had sessionid >>>>> 0x1393086dfb80007 >>>>> 2012-08-16 13:48:44,496 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - >>>>> EndOfStreamException: Unable to read additional data from client sessionid >>>>> 0x1393086dfb8004d, likely client has closed socket >>>>> 2012-08-16 13:48:44,497 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /127.0.0.1:49770 which had sessionid >>>>> 0x1393086dfb8004d >>>>> 2012-08-16 13:48:44,498 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80003 >>>>> 2012-08-16 13:48:44,500 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had 
>>>>> sessionid 0x1393086dfb80006
>>>>> 2012-08-16 13:48:44,613 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /0:0:0:0:0:0:0:1%0:59789
>>>>> 2012-08-16 13:48:44,615 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
>>>>> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
>>>>> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001
>>>>>
>>>>> Sorry for the huge post of log drivel, but if you'd like to see more I
>>>>> can post the full logs (or via pastebin). I made sure to shut down
>>>>> everything, wipe out the logs, then start everything back up so the
>>>>> logs would be as clean as possible.
>>>>>
>>>>> I looked for a way to increase the session timeout on the
>>>>> AccumuloInputFormat, but there doesn't seem to be one. I'm sure that's
>>>>> not the right fix anyway, but figured I'd mention it.
>>>>>
>>>>> Do you guys have any other suggestions? I'm sure this is due to some
>>>>> configuration error on my part, but I'm running out of ideas.
>>>>>
>>>>> Thanks a ton,
>>>>> -Arjumand
>>>>>
>>>>> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum wrote:
>>>>>>
>>>>>> What does your TServer debug log say? Also, are you writing back out
>>>>>> to Accumulo?
>>>>>>
>>>>>> To follow up on what Jim said, you can check the zookeeper log to see
>>>>>> if max connections is being hit. You may also want to check what your
>>>>>> max xceivers is set to for HDFS, and check your Accumulo and HDFS logs
>>>>>> to see if it is mentioned.
>>>>>>
>>>>>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem
>>>>>>> may be due to poor configuration on my part, but I'm running out of
>>>>>>> ideas.
>>>>>>>
>>>>>>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from
>>>>>>> cdh3u4) in pseudo-distributed mode, zookeeper version
>>>>>>> zookeeper-3.3.5 from cdh3u4, and the 1.4.1 release of accumulo with
>>>>>>> a configuration copied from "conf/examples/512MB/standalone".
>>>>>>>
>>>>>>> I've got a Map task that is using an accumulo table as the input.
>>>>>>> I'm fetching all rows, but just a single column family that has
>>>>>>> hundreds or even thousands of different column qualifiers.
>>>>>>> The table has a SummingCombiner installed for the given column
>>>>>>> family.
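As background on that last point, a SummingCombiner is typically attached once at table-setup time; a hypothetical 1.4-style sketch (table name, column family, and credentials are placeholders) looks like this:

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class AttachSummingCombiner {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("dev", "localhost")
        .getConnector("user", "password".getBytes());

    // Sum long-encoded values for every qualifier in the "counts" family,
    // at scan, minor compaction, and major compaction scopes.
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    SummingCombiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("counts")));
    conn.tableOperations().attachIterator("mytable", setting);
  }
}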
>>>>>>> >>>>>>> The task runs fine at first, but after ~9-15K records (I print the >>>>>>> record count to the console every 1K records), it hangs and the following >>>>>>> messages are printed to the console where I'm running the job: >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read >>>>>>> additional data from server sessionid 0x1392cc35b460d1c, likely server has >>>>>>> closed socket, closing socket connection and attempting reconnect >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>>>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket >>>>>>> connection >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client >>>>>>> connection, connectString=localhost sessionTimeout=30000 >>>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/0:0:0:0:0:0:0:1:2181 >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment >>>>>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >>>>>>> 0x1392cc35b460d25, negotiated timeout = 30000 >>>>>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner: >>>>>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner: >>>>>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner: >>>>>>> >>>>>>> Sometimes the messages contain a stacktrace like this below: >>>>>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session >>>>>>> 0x1392cc35b460b40 for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected >>>>>>> error, closing socket connection and attempting reconnect >>>>>>> java.io.IOException: Connection reset by peer >>>>>>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) >>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) >>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166) >>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) >>>>>>> at >>>>>>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) >>>>>>> at >>>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/127.0.0.1:2181 >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/127.0.0.1:2181, initiating session >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>>>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket >>>>>>> connection >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down >>>>>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client >>>>>>> connection, connectString=localhost sessionTimeout=30000 >>>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 >>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server 
localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>>>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>>>>>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>>>>>>> 0x1392cc35b460b46, negotiated timeout = 30000
>>>>>>>
>>>>>>> I've poked through the logs in accumulo, and I've noticed that when
>>>>>>> it hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>>>>>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>>>>>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
>>>>>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>>>>>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper registration at null
>>>>>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>>>>>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>>>>>>
>>>>>>> I've noticed that if I make the map task print out the record count
>>>>>>> more frequently (i.e. every 10 records), it seems to be able to get
>>>>>>> through more records than when I only print every 1K records. My
>>>>>>> assumption was that this had something to do with more time being
>>>>>>> spent in the map task, and not fetching data from accumulo. There was
>>>>>>> at least one occasion where I printed to the console for every record,
>>>>>>> and in that situation it managed to process 47K records, although I
>>>>>>> have been unable to repeat that behavior.
>>>>>>>
>>>>>>> I've also noticed that if I stop and start accumulo, the map-reduce
>>>>>>> job will pick up where it left off, but seems to fail more quickly.
>>>>>>>
>>>>>>> Could someone make some suggestions as to what my problem might be?
>>>>>>> It would be greatly appreciated. If you need any additional
>>>>>>> information from me, just let me know. I'd paste my config files,
>>>>>>> driver setup, and example data into this post, but I think it's
>>>>>>> probably long enough already.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> -Arjumand
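Arjumand's driver isn't included in the thread, but a minimal 1.4-style driver for the kind of job he describes, reading a single column family of an Accumulo table in a map-only job and writing a SequenceFile, might look roughly like the following. The instance, credentials, table, family, and output path are placeholders:

import java.util.Collections;

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.util.Pair;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ScanTableDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(ScanTableDriver.class);
    job.setInputFormatClass(AccumuloInputFormat.class);

    // Placeholder instance/credentials; the static setters mirror the
    // UniqueColumns example referenced earlier in the thread.
    AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(), "dev", "localhost");
    AccumuloInputFormat.setInputInfo(job.getConfiguration(), "user",
        "password".getBytes(), "mytable", new Authorizations());
    // Fetch only the single column family being aggregated
    // (a null qualifier means the whole family).
    AccumuloInputFormat.fetchColumns(job.getConfiguration(),
        Collections.singleton(new Pair<Text, Text>(new Text("counts"), null)));

    // Identity mapper for the sketch; a real job would do its work here.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/tmp/scan-output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}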