accumulo-user mailing list archives

From: Arjumand Bonhomme <jum...@gmail.com>
Subject: Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Date: Fri, 17 Aug 2012 02:10:47 GMT
Thanks for the input; no worries about the offline scanner bit.  That said,
if I can ever get past my current issue, I'd love to learn why the offline
scanner failed as well.

So I took a look at my various other settings, and watched the paging while
everything was running, and it doesn't appear (at least not to me) to be
related to swapping.
I'm using the 512MB accumulo configuration, and I'm running the hadoop task
with a 256MB heap.  I'm running hadoop with the standard pseudo-distributed
config, except for bumping the dfs.datanode.max.xcievers property to 4096.
I'm running zookeeper with the out-of-the-box config, except that
maxClientCnxns in zoo.cfg has been increased to 100.
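
For concreteness, these are the only two overrides as I understand them
(the property names are the standard ones; the file locations are the stock
ZooKeeper and CDH3 layouts):

  # conf/zoo.cfg
  maxClientCnxns=100

  <!-- conf/hdfs-site.xml -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>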

I did a reboot of my machine, then began another series of tests.  This
time I was watching vm_stat while everything was running.

I noticed the same behavior I've been witnessing all along.  I'm including
the stacktrace that was emitted to the tserver log when it hung, as it
usually does.
16 20:58:12,311 [zookeeper.ZooLock] DEBUG: event null None Disconnected
16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/tables/1/conf/table.compaction.minor.idle
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
    at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
    at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
    at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:120)
    at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:108)
    at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:70)
    at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:95)
    at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:312)
    at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.access$200(TabletServerResourceManager.java:228)
    at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:252)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:680)
16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/config/table.scan.max.memory
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
    at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
    at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
    at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:109)
    at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:76)
    at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:103)
    at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:47)
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:960)
    at org.apache.accumulo.server.tabletserver.TabletServerResourceManager.executeReadAhead(TabletServerResourceManager.java:699)
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1316)
    at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1284)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
    at $Proxy1.startMultiScan(Unknown Source)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.process(TabletClientService.java:2164)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2037)
    at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
    at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
    at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:680)
16 20:58:13,379 [zookeeper.ZooLock] DEBUG: event null None Expired
16 20:58:13,393 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.


Below is the matching stacktrace emitted by the hadoop job.
12/08/16 20:57:14 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost sessionTimeout=30000
watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@47124746
12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Session establishment complete
on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
0x13931fecee3012e, negotiated timeout = 30000
12/08/16 20:57:14 INFO mapred.JobClient: Running job: job_local_0001
12/08/16 20:57:14 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/08/16 20:57:14 INFO mapred.MapTask: io.sort.mb = 100
12/08/16 20:57:15 INFO mapred.MapTask: data buffer = 79691776/99614720
12/08/16 20:57:15 INFO mapred.MapTask: record buffer = 262144/327680
12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor
12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor
12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor
12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor
12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor
12/08/16 20:57:15 INFO mapred.JobClient:  map 0% reduce 0%
001000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@feel
002000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@perhap
12/08/16 20:57:20 INFO mapred.LocalJobRunner:
003000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@univers
12/08/16 20:57:23 INFO mapred.LocalJobRunner:
004000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@doubt
005000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@morwai
12/08/16 20:57:26 INFO mapred.LocalJobRunner:
006000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@tgame
007000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@colleg
12/08/16 20:57:29 INFO mapred.LocalJobRunner:
008000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@led
12/08/16 20:57:32 INFO mapred.LocalJobRunner:
009000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@stephanopoulo
010000 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@brother
12/08/16 20:57:35 INFO mapred.LocalJobRunner:
011000 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@inabl
12/08/16 20:57:38 INFO mapred.LocalJobRunner:
12/08/16 20:57:41 INFO mapred.LocalJobRunner:
12/08/16 20:58:12 INFO zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 37008ms for sessionid 0x13931fecee3012e, closing
socket connection and attempting reconnect
012000 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@saturdai
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/0:0:0:0:0:0:0:1:2181
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/0:0:0:0:0:0:0:1:2181, initiating session
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Unable to reconnect to
ZooKeeper service, session 0x13931fecee3012e has expired, closing socket
connection
12/08/16 20:58:13 WARN zookeeper.ZooCache: Zookeeper error, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
    at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
    at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
    at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
    at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:147)
    at org.apache.accumulo.core.client.impl.TabletLocator.getInstance(TabletLocator.java:96)
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:245)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:94)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 20:58:13 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost sessionTimeout=30000
watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@7f434cb2
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/0:0:0:0:0:0:0:1:2181
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/0:0:0:0:0:0:0:1:2181, initiating session
12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Session establishment complete
on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x13931fecee30135,
negotiated timeout = 30000
013000 2012-06-11T09-05-19/abc/06a7b8b2de18727faa635b841b0dd043.txt@anthoni
12/08/16 20:58:17 INFO mapred.LocalJobRunner:


And finally, here is the output from "vm_stat 1".  It doesn't look like
anything was really getting swapped out to disk.  While there was a flurry
of paging (reading pages in) as soon as I started the job, that paging
ceased very quickly, and shortly after that it hung again as usual.
When looking at the output, it may help to know that I started the hadoop
job just after the 3rd banner/header was printed, and you can see the
initial flurry of paging stop after roughly 12 seconds.
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
595157 944925  81264    89580 385586  319170K  8990003  159901K  3094025  9227006  1546121
595491 944110  81292    89605 386400     1909        0     1417        0        4        0
595204 944911  81292    89579 385625      708        0      551        0        0        0
595429 944652  81293    89579 385879     1941        0     1314        0        0        0
595379 945101  81293    89579 385407      616        0      521        0        0        0
595114 944636  81294    89579 385879     1824        0     1329        0        0        0
595664 945080  81295    89578 385406     5365        6     4032        0        1        0
595221 944078  81295    89578 386410     2645        0     1840        0        0        0
595338 945075  81295    89578 385406      767        2      505        0        0        0
595061 944463  81295    89578 386059     2113        0     1433        0        1        0
595194 944936  81295    89578 385587      674        0      565        0        0        0
595267 944458  81296    89578 386059     1731        0     1293        0        0        0
595418 944925  81297    89578 385587      780        0      581        0        0        0
595124 944200  81296    89579 386419     2111       84     1397        0        2        0
595037 944686  81295    89579 385948      657        2      548        0        2        0
595054 944213  81295    89579 386420     1785        0     1350        0        1        0
595289 944687  81295    89579 385948      758        0      565        0        1        0
595052 944221  81296    89579 386420     1987        0     1356        0        0        0
595017 944698  81296    89579 385948      768        0      558        0        0        0
595272 944217  81296    89579 386420     1957        0     1334        0        0        0
594954 944686  81296    89579 385948      704        0      535        0        0        0
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
594997 944166  81296    89579 386420  319204K  8990097  159926K  3094025  9227018  1546121
594977 944641  81298    89579 385948      784        0      569        0        0        0
595282 944160  81298    89579 386420     1874        0     1368        0        1        0
595252 944633  81298    89579 385948      735        0      561        0        1        0
595057 944186  81298    89579 386420     1972        0     1391        0        0        0
595039 944481  81298    89579 386129      929        0      564        0        1        0
595512 943232  81346    89579 387188     3109       31     2542        0        1        0
595395 944491  81346    89579 385948      742        0      626        0        0        0
595134 944016  81346    89579 386420     1844        1     1335        0        0        0
595329 944501  81346    89579 385948      694        0      546        0        0        0
595329 944016  81346    89579 386431     1801        0     1330        0        0        0
595471 944458  81347    89579 385949      797        0      669        0        0        0
595433 943918  81347    89579 386431     2380        0     1806        0        0        0
595118 944433  81347    89579 385948      678        0      567        0        0        0
595142 943989  81347    89571 386420     1863        0     1378        0        1        0
595097 944462  81347    89571 385948      654        0      540        0        1        0
595126 944060  81347    89571 386420     1874        0     1430        0        0        0
595068 944674  81348    89571 385948      787        0      678        0        0        0
595110 944174  81348    89571 386420     1863        0     1345        0        0        0
595362 944627  81348    89571 385948      768        2      575        0        0        0
595115 944131  81348    89572 386420     1755        0     1323        0        0        0
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
595142 944657  81348    89571 385948  319233K  8990131  159947K  3094025  9227025  1546121
591569 946751  81348    89571 387281     9240      786     4725        0        8        0
579748 958406  81358    90168 386840    20057      287    13388        0      492        0
563598 959835  88682    97236 387319    20388      931     4220        0        0        0
535840 986697  88692    98643 386906    59619     1497    32467        0      164        0
534240 987263  88696    99134 387394    42744        0     6762        0       98        0
529479 992037  88723   100123 386187    39552        3     9006        0       10        0
529065 991283  88760   100107 387430    32014      153     7500        0       41        0
528147 992276  88784   100376 386966    37719        2     6050        0       15        0
527155 993536  88818   100380 386672    34999        0     6928        0        1        0
526748 992973  88851   100380 387747    34379        1     5240        0        0        0
526835 994071  88902   100404 386674    34698        2     7063        0        8        0
524993 994621  89207   100582 387027    28374      460     6018        0      182        0
524182 995184  89286   100582 387544    42552      226     7313        0        0        0
523704 995987  89328   100629 387076    42731        1     6357        0        0        0
523154 995939  89367   100629 387554    41100        0     7338        0        0        0
522867 996724  89405   100629 387307    39246        0     6089        0        0        0
522160 996665  89451   100639 387563    37614        1     7463        0        0        0
522414 997920  89501   100639 386324    40044        0     6995        0        0        0
521524 997171  89539   100639 387569    39447        0     7048        0        1        0
520258 998993  89575   100645 387107    37476        0     6789        0        0        0
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
519291 999119  89614   100645 387771  319989K  8994481  160119K  3094025  9228045  1546121
518546  1000K  89653   100647 387672    39985       81     6722        0        0        0
518354  1000K  89701   100647 387196    33629        0     7757        0        0        0
516218  1002K  89737   100647 387192    39403        0     7368        0        1        0
513388  1004K  89764   100648 388210    36737        0    14479        0        0        0
507991  1011K  89784   100648 386465    32460        1    13133        0        1        0
506346  1012K  89816   100648 387917    39257        2     7246        0        1        0
506298  1012K  89816   100648 387356      626        0      313        0        0        0
506070  1013K  89816   100648 386259      217        1      102        0        0        0
506101  1013K  89816   100648 386259      386        0       99        0        0        0
505974  1012K  89769   100618 387548     1417       21     1250        0        9        0
506078  1013K  89769   100618 386259      297        0      191        0        0        0
506349  1013K  89769   100618 386259      200        0      100        0        0        0
506105  1013K  89769   100618 386259      211        0      104        0        0        0
506351  1013K  89769   100618 386259      187        0       89        0        0        0
506114  1013K  89769   100618 386599      269        0      185        0        0        0
506228  1013K  89769   100618 386258      228        0      129        0        0        0
506231  1013K  89769   100618 386258      147        0       83        0        0        0
506486  1013K  89769   100618 386258      306        0      119        0        0        0
506215  1013K  89769   100619 386258      174        0       62        0        0        0
505927  1014K  89769   100619 386258      593        0      524        0        0        0
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
506437  1013K  89769   100619 386700  320215K  8994587  160180K  3094025  9228057  1546121
506670  1013K  89769   100619 386225      128        0       45        0        0        0
506347  1013K  89769   100619 386225      268        0       58        0        0        0
506033  1013K  89769   100619 386225      127        0       89        0        0        0
506221  1013K  89769   100619 386225       86        0       41        0        0        0
506157  1013K  89769   100619 386225       81        0       38        0        0        0
506096  1013K  89769   100619 386225      345        0      186        0        0        0
506044  1013K  89769   100619 386236      291        0      224        0        0        0
505980  1013K  89769   100619 386225       81        0       39        0        0        0
505916  1013K  89769   100619 386225       99        0       50        0        0        0
505839  1013K  89769   100619 386225       79        0       36        0        0        0
506028  1013K  89769   100619 386225      102        0       61        0        0        0
505760  1014K  89769   100619 386225      286        0      229        0        0        0
506151  1013K  89769   100619 386406      140        0       88        0        0        0
506097  1013K  89769   100619 386406      101        2       47        0        0        0
506028  1014K  89769   100619 386225      175        0       62        0        0        0
506230  1014K  89769   100619 386225      227        0       39        0        0        0
505939  1014K  89769   100619 386225      550        0      512        0        0        0
505682  1014K  89769   100619 386225       90        0       50        0        0        0
506772  1011K  89782   100619 388439     6973       15     1797        0        1        0
550258 968746  89827    99598 388175    40677        7     8407        0        1        0
Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%)
  free active   spec inactive   wire   faults     copy    0fill reactive  pageins  pageout
585656 935138  89879    98682 387121  320304K  8994611  160203K  3094025  9228060  1546121
585477 935589  89892    98682 386906    10654        1     2648        0        0        0
586236 935269  89898    98683 386610     2120        2     1473        0        1        0
586650 935141  89903    98683 386138      833        0      638        0        2        0
586647 933883  89903    98683 387378     3028      174     1553        0        0        0
587883 932971  89909    98683 387427      882        0      638        0        0        0
588259 932579  89915    98683 386994     1988        0     1383        0        0        0
589697 931139  89923    98683 387427     1174        4      709        0        0        0
589306 931423  89925    98684 387052     2432        0     1743        0        0        0
589128 931606  89926    98688 387043     2309        0     1632        0        5        0
590286 930456  89904    98689 387413     3506        7     2453        0       16        0
590411 931982  89905    98689 385674     1801        0     1384        0        0        0
590385 931329  89905    98689 386327     2954        1     2187        0        0        0
590551 931731  89910    98664 385920     2060        0     1543        0        0        0
589471 931873  89911    98664 386856     5511      252     3120        0        0        0
589397 931941  89911    98659 386769     1351        0     1090        0        0        0


So while I'm not 100% certain that this isn't related to swapping and
garbage collection, it seems like something else is probably the cause.

Can you guys give me some more pointers/guidance on what I can check to try
to get past this issue?
Would seeing any of the configuration or hadoop driver code help you get a
better idea of what I might be dealing with?
If there's any more due diligence I can/should be doing, just let me know.

It's more than likely a configuration issue on my end; I'm really
grasping at straws at this point.

Thanks again for all the help,
-Arjumand



On Thu, Aug 16, 2012 at 3:54 PM, John Vines <vines@apache.org> wrote:

> I'm sorry, I missed that the offline scanner was being used.
>
> Issues with locks being lost like you described typically stem from
> swapping. You said you used the 512MB-footprint Accumulo configuration,
> but you should also make sure that the memory settings for your various
> hadoop processes, taking into account the number of map and reduce tasks
> running, do not exceed your physical memory (together with whatever other
> processes you may be running on your system). Swapping plus garbage
> collection have proven to be quite effective at causing lost locks.
>
> John

===

On Thu, Aug 16, 2012 at 3:48 PM, Arjumand Bonhomme <jumand@gmail.com> wrote:

> This isn't my original concern/problem, so it's not terribly important,
> but I'm not sure I understand, and I'd like to learn as much as possible.
>
> Why wouldn't it work with the cloned table offline?
> I followed the example laid out in
> org.apache.accumulo.examples.simple.mapreduce.UniqueColumns,
> using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(),
> true);" setup.
>
> Can you give me some detail as to why it definitely would not have worked
> offline?  Is this method no longer supported?
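>
> For reference, the relevant part of my driver looks roughly like this
> (paraphrased from memory, so the instance/user/table names are just
> placeholders):
>
>   import org.apache.accumulo.core.client.Connector;
>   import org.apache.accumulo.core.client.ZooKeeperInstance;
>   import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
>
>   Connector conn = new ZooKeeperInstance("myinstance", "localhost")
>       .getConnector("root", "secret".getBytes());
>   // flush and compact first, so the clone's file set is complete
>   conn.tableOperations().compact("mytable", null, null, true, true);
>   conn.tableOperations().clone("mytable", "mytable_clone", true, null, null);
>   // take the clone offline, which is what the offline scanner expects
>   conn.tableOperations().offline("mytable_clone");
>
>   // ... and then in the job setup, point the input format at the clone:
>   AccumuloInputFormat.setScanOffline(job.getConfiguration(), true);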
>
> Thanks!
> -Arjumand
>
>
> On Thu, Aug 16, 2012 at 3:24 PM, John Vines <vines@apache.org> wrote:
>
>> Your work on the clone table failed because you took it offline. The
>> table needs to be online in order to run a job against it.
>>
>> John
>>
>>
>> On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <jumand@gmail.com> wrote:
>>
>>> Jim, William, Adam: thanks for your help with this; I'm running out of
>>> ideas to google for answers.
>>>
>>> I'm going to try to answer your questions:
>>>
>>> 1) I had already adjusted maxClientCnxns in zoo.cfg to 100.  I didn't
>>> see anything in the log indicating that it was shutting down connections
>>> due to reaching a connection limit.  However, throughout the zookeeper
>>> logs, even before the hadoop job was run, I did see lots of lines like
>>> the one below.  Up to this point, I've assumed it's innocuous and
>>> unrelated to my issue.
>>> INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
>>> client /127.0.0.1:59158 (no session established for client)
>>>
>>> 2) I had also already increased the dfs.datanode.max.xcievers property to
>>> 4096 in hdfs-site.xml.  An investigation of the log shows that this limit
>>> is not being reached.  It had been reached at some point before I raised
>>> the value from the default; I did that while troubleshooting, prior to
>>> posting to this list, but it didn't appear to have a noticeable effect on
>>> the behavior of the hadoop job.
>>>
>>> 3) I'm writing out to a sequence file, so accumulo is only being used
>>> for input.  As a side note, at one point during my troubleshooting I
>>> compacted, cloned, and then took the cloned table offline and tried to use
>>> that instead.  That failed immediately, without processing any records.
>>> From the stacktrace, it appeared as though the iterator was trying to use
>>> one of the files from the original table (from what I understand about
>>> cloning, this is normal because no changes had been made to the original
>>> table) but said it did not exist.  I was, however, able to find the file
>>> on hdfs, so I just gave up on that.  Also, in the normal case, i.e. using
>>> the original table online, nothing is writing to the table while the
>>> hadoop job is running.
>>>
>>> 4) My original open file limit was the OS default of 256, so I upped it
>>> to 1024 (via "ulimit -n 1024") and tried again.  The behavior was the
>>> same as before.
>>>
>>> I'm including a snippet from the tserver debug log.  It looks like an
>>> expired session might be the root of the problem, but I'm not sure what
>>> would cause that:
>>> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess
>>> 127.0.0.1:49719 45 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s
>>> lt=0.002s ct=0.001s)
>>> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
>>> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
>>> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
>>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
>>> totalmem=132,055,040
>>> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>>> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
>>> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1]
>>> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1]
>>> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE
>>> minor compaction chosen
>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG:
>>> COMPACTING !0<;~  total = 120,840 ingestMemory = 120,840
>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG:
>>> chosenMem = 2,416 chosenIT = 300.23 load 3,044
>>> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs
>>> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor
>>> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~
>>> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>>> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
>>> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE
>>> compactionThreshold = 0.605 maxObserved = 120,840
>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER
>>> compactionThreshold = 0.666
>>> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read
>>> | 4 written |  4,800 entries/sec |  0.005 secs
>>> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
>>> !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c
>>>  16 13:47:51,956 [log.TabletServerLogger] DEBUG:  wrote MinC finish  35:
>>> writeTime:1ms
>>> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory]
>>> -> /default_tablet/F000051i.rf
>>> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs
>>> !0<;~
>>> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess
>>> 127.0.0.1:51480 43 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s
>>> lt=0.003s ct=0.001s)
>>> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
>>> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB
>>> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0,
>>> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss
>>> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0
>>> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
>>> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032),
>>> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0,
>>> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%,
>>> Evicted/Run=NaN, Duplicate Reads=1
>>> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00
>>> secs, wait 0.00 secs
>>> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00
>>> secs, wait 0.00 secs
>>> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0<
>>> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
>>> /table_info/A000051j.rf_tmp
>>> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<<
>>> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
>>> /root_tablet/A000051k.rf_tmp
>>> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39
>>> read | 21 written |    276 entries/sec |  0.141 secs
>>> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56
>>> read | 51 written |    323 entries/sec |  0.173 secs
>>> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs
>>> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC
>>> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
>>> /root_tablet/A000051k.rf
>>> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for
>>> extent !0<;~ as alias 2
>>> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs
>>> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC
>>> [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
>>> /table_info/A000051j.rf
>>> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02)
>>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536)
>>> totalmem=132,055,040
>>> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess
>>> 127.0.0.1:51480 146 in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s
>>> lt=0.008s ct=0.004s)
>>> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1]
>>> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01)
>>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904)
>>> totalmem=132,055,040
>>> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
>>> 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
>>> ranges:1)
>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
>>> 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
>>> ranges:1)
>>> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid
>>> 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>>> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
>>> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server
>>> lock (reason = SESSION_EXPIRED), exiting.
>>>
>>>
>>> I'm also including a snippet from the zookeeper log.  Around the same
>>> time the tserver logs show the loss of the tablet server lock due to the
>>> expired session, the zookeeper log mentions the expiration of several
>>> sessions due to timeout.
>>> 2012-08-16 13:48:07,368 - INFO
>>>  [Thread-370:NIOServerCnxn$StatCommand@1153] - Stat command output
>>> 2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn@1435] -
>>> Closed socket connection for client /127.0.0.1:58633 (no session
>>> established for client)
>>> 2012-08-16 13:48:08,371 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>>> connection from /127.0.0.1:59158
>>> 2012-08-16 13:48:08,372 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from
>>> /127.0.0.1:59158
>>> 2012-08-16 13:48:08,378 - INFO
>>>  [Thread-371:NIOServerCnxn$StatCommand@1153] - Stat command output
>>> 2012-08-16 13:48:08,379 - INFO  [Thread-371:NIOServerCnxn@1435] -
>>> Closed socket connection for client /127.0.0.1:59158 (no session
>>> established for client)
>>> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>>> connection from /127.0.0.1:59635
>>> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from
>>> /127.0.0.1:59635
>>> 2012-08-16 13:48:09,399 - INFO
>>>  [Thread-372:NIOServerCnxn$StatCommand@1153] - Stat command output
>>> 2012-08-16 13:48:09,400 - INFO  [Thread-372:NIOServerCnxn@1435] -
>>> Closed socket connection for client /127.0.0.1:59635 (no session
>>> established for client)
>>> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:32,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:32,001 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80005
>>> 2012-08-16 13:48:32,001 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80004
>>> 2012-08-16 13:48:32,001 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80001
>>> 2012-08-16 13:48:32,003 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>>> Closed socket connection for client /127.0.0.1:65339 which had
>>> sessionid 0x1393086dfb80005
>>> 2012-08-16 13:48:34,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:36,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:38,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:38,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:40,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>>> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded
>>> 2012-08-16 13:48:44,456 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had
>>> sessionid 0x1393086dfb80004
>>> 2012-08-16 13:48:44,458 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>>> Closed socket connection for client /127.0.0.1:65316 which had
>>> sessionid 0x1393086dfb80001
>>> 2012-08-16 13:48:44,461 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80002
>>> 2012-08-16 13:48:44,482 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80007
>>> 2012-08-16 13:48:44,484 - WARN  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of
>>> session 0x1393086dfb80003 due to java.io.IOException: Connection reset by
>>> peer
>>> 2012-08-16 13:48:44,486 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had
>>> sessionid 0x1393086dfb80002
>>> 2012-08-16 13:48:44,491 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb8004d
>>> 2012-08-16 13:48:44,493 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80006
>>> 2012-08-16 13:48:44,495 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>>> client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003
>>> 2012-08-16 13:48:44,495 - WARN  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable
>>> to read additional data from client sessionid 0x1393086dfb80007, likely
>>> client has closed socket
>>> 2012-08-16 13:48:44,496 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>>> client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007
>>> 2012-08-16 13:48:44,496 - WARN  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable
>>> to read additional data from client sessionid 0x1393086dfb8004d, likely
>>> client has closed socket
>>> 2012-08-16 13:48:44,497 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>>> client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d
>>> 2012-08-16 13:48:44,498 - INFO
>>>  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session
>>> termination for sessionid: 0x1393086dfb80003
>>> 2012-08-16 13:48:44,500 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had
>>> sessionid 0x1393086dfb80006
>>> 2012-08-16 13:48:44,613 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>>> connection from /0:0:0:0:0:0:0:1%0:59789
>>> 2012-08-16 13:48:44,615 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
>>> session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
>>> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session
>>> 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
>>> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>>> client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001
>>>
>>>
>>> Sorry for the huge post of log drivel, but if you'd like to see more, I
>>> can post the full logs (here or via pastebin).  I made sure to shut down
>>> everything, wipe out the logs, then start everything back up, so the logs
>>> would be as clean as possible.
>>>
>>>
>>> I looked for a way to increase the session timeout on the
>>> AccumuloInputFormat, but there doesn't seem to be one.  I'm sure that's not
>>> the right fix anyway, but figured I'd mention it.
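>>>
>>> (The closest knob I've found so far is the instance.zookeeper.timeout
>>> property in accumulo-site.xml, sketched below, though I haven't confirmed
>>> whether the MapReduce client side honors it.)
>>>
>>>   <property>
>>>     <name>instance.zookeeper.timeout</name>
>>>     <value>60s</value>
>>>   </property>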
>>>
>>>
>>> Do you guys have any other suggestions?  I'm sure this is due to some
>>> configuration error on my part, but I'm running out of ideas.
>>>
>>>
>>> Thanks a ton,
>>> -Arjumand
>>>
>>>
>>>
>>> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum
>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>
>>>> What does your TServer debug log say? Also, are you writing back out to
>>>> Accumulo?
>>>>
>>>> To follow up on what Jim said, you can check the zookeeper log to see if
>>>> max connections is being hit. You may also want to check what your max
>>>> xceivers is set to for HDFS, and check your Accumulo and HDFS logs to see
>>>> if that limit is mentioned.
>>>>
>>>>
>>>>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <jumand@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may
>>>>> be due to poor configuration on my part, but I'm running out of ideas.
>>>>>
>>>>> I'm running this on a Mac laptop, with hadoop (hadoop-0.20.2 from
>>>>> cdh3u4) in pseudo-distributed mode, and zookeeper-3.3.5, also from
>>>>> cdh3u4.
>>>>> I'm using the 1.4.1 release of accumulo with a configuration copied
>>>>> from "conf/examples/512MB/standalone".
>>>>>
>>>>> I've got a Map task that is using an accumulo table as its input.
>>>>> I'm fetching all rows, but just a single column family that has
>>>>> hundreds or even thousands of different column qualifiers.
>>>>> The table has a SummingCombiner installed for the given column
>>>>> family.
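>>>>>
>>>>> In case the exact input setup matters, it looks roughly like this
>>>>> (paraphrased; the instance/user/table/family names are placeholders):
>>>>>
>>>>>   import java.util.Collections;
>>>>>   import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
>>>>>   import org.apache.accumulo.core.security.Authorizations;
>>>>>   import org.apache.accumulo.core.util.Pair;
>>>>>   import org.apache.hadoop.io.Text;
>>>>>
>>>>>   job.setInputFormatClass(AccumuloInputFormat.class);
>>>>>   AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(),
>>>>>       "myinstance", "localhost");
>>>>>   AccumuloInputFormat.setInputInfo(job.getConfiguration(), "root",
>>>>>       "secret".getBytes(), "mytable", new Authorizations());
>>>>>   // all rows, one whole column family (a null qualifier in the pair
>>>>>   // means the entire family)
>>>>>   AccumuloInputFormat.fetchColumns(job.getConfiguration(),
>>>>>       Collections.singleton(new Pair<Text,Text>(new Text("fam"), null)));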
>>>>>
>>>>> The task runs fine at first, but after ~9-15K records (I print the
>>>>> record count to the console every 1K records), it hangs and the following
>>>>> messages are printed to the console where I'm running the job:
>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
>>>>> data from server sessionid 0x1392cc35b460d1c, likely server has closed
>>>>> socket, closing socket connection and attempting reconnect
>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection
>>>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection
>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
>>>>> connection
>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client
>>>>> connection, connectString=localhost sessionTimeout=30000
>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection
>>>>> to server localhost/0:0:0:0:0:0:0:1:2181
>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection
>>>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
>>>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
>>>>> 0x1392cc35b460d25, negotiated timeout = 30000
>>>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
>>>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
>>>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner:
>>>>>
>>>>> Sometimes the messages contain a stacktrace like the one below:
>>>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40
>>>>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
>>>>> socket connection and attempting reconnect
>>>>> java.io.IOException: Connection reset by peer
>>>>>  at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>>>>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166)
>>>>>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>>>>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
>>>>> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection
>>>>> to server localhost/127.0.0.1:2181
>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection
>>>>> established to localhost/127.0.0.1:2181, initiating session
>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
>>>>> connection
>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client
>>>>> connection, connectString=localhost sessionTimeout=30000
>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection
>>>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>>>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>>>>> 0x1392cc35b460b46, negotiated timeout = 30000
>>>>>
>>>>>
>>>>> I've poked through the logs in accumulo, and I've noticed that when it
>>>>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>>>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>>>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state
>>>>> of current session : Expired
>>>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>>>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
>>>>> registration at null
>>>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>>>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>>>>
>>>>>
>>>>> I've noticed that if I make the map task print out the record count
>>>>> more frequently (i.e. every 10 records), it seems to be able to get
>>>>> through more records than when I only print every 1K records. My
>>>>> assumption was that this had something to do with more time being spent
>>>>> in the map task rather than fetching data from accumulo.  There was at
>>>>> least one occasion where I printed to the console for every record, and
>>>>> in that situation it managed to process 47K records, although I have
>>>>> been unable to repeat that behavior.
>>>>>
>>>>> I've also noticed that if I stop and start accumulo, the map-reduce
>>>>> job will pick up where it left off, but it seems to fail more quickly.
>>>>>
>>>>>
>>>>>
>>>>> Could someone make some suggestions as to what my problem might be? It
>>>>> would be greatly appreciated.  If you need any additional information from
>>>>> me, just let me know.  I'd paste my config files, driver setup, and example
>>>>> data into this post, but I think it's probably long enough already.
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>> -Arjumand
>>>>>
>>>>>
>>>>
>>>
>>
>
