Subject: Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
From: Keith Turner
To: user@accumulo.apache.org
Date: Mon, 20 Aug 2012 13:34:10 -0400

Arjumand

It sounds like you are joining two data sets? Reading one data set from
Accumulo and then doing a lookup into the other data set for each input to
the mapper? This is a good approach if one data set is small and fits into
memory. If this is not the case, then you may want to consider another
join strategy. Below are some of the options I know of for joining two
data sets, D1 and D2:

 * Read D1 into memory on each mapper, and do lookups into the D1 map as
   each item is read from D2. Good approach when D1 is small.
 * Set D1 and D2 as inputs to the map reduce job and join in the reduce
   phase. Good approach when D1 and D2 are large.
 * If D1 is an Accumulo table, use the batch scanner to look up elements
   in D2. Good approach when D1 is already in a table and D2 is relatively
   small. If D2 is larger, you could do batch scans in reducers or maybe
   mappers (see the sketch below).
 * Compute a bloom filter for D1, and load the bloom filter into memory on
   each mapper. Check existence in the D1 bloom filter as each item is
   read from D2. This can work well when lots of things in D2 do not exist
   in D1. However, this is just a filter step that cuts down on what you
   need to join; a join will still need to be done.

The approach you choose for joining depends on the relative size of your
two data sets. The goal is to batch work and avoid single lookups into
Accumulo or a MapFile if possible.

Keith
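Keith's batch-scanner option can be made concrete with a short sketch. The following is a minimal, hypothetical example against the Accumulo 1.4 client API; the instance name, credentials, table name, and what gets emitted are placeholders rather than anything from this thread. The idea is simply to queue lookups and resolve them in batches instead of issuing one scanner call per input record:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: D2 is the MapReduce input, D1 is an Accumulo table.
// Lookups are buffered and resolved with a BatchScanner in batches of 1000.
public class BatchLookupMapper extends Mapper<Key, Value, Text, Text> {
  private static final int BATCH_SIZE = 1000;
  private Connector conn;
  private final List<Range> pending = new ArrayList<Range>();

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Placeholder connection details; a real job would read these from
      // the job configuration.
      conn = new ZooKeeperInstance("dev", "localhost")
          .getConnector("user", "password".getBytes());
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // Queue a lookup for this record's row instead of scanning immediately.
    pending.add(new Range(key.getRow()));
    if (pending.size() >= BATCH_SIZE)
      flush(context);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    if (!pending.isEmpty())
      flush(context);
  }

  private void flush(Context context) throws IOException, InterruptedException {
    try {
      // One parallel, batched lookup against D1 for the whole buffer.
      BatchScanner bs = conn.createBatchScanner("d1_table", new Authorizations(), 4);
      bs.setRanges(pending);
      for (Entry<Key, Value> e : bs)
        context.write(e.getKey().getRow(), new Text(e.getValue().get()));
      bs.close();
      pending.clear();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

A real join would also keep the buffered D2 records alongside the queued ranges so matches can be paired up; the sketch only shows the batching pattern.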
On Mon, Aug 20, 2012 at 1:00 PM, Arjumand Bonhomme wrote:
> Hey guys,
>
> I'm back with some additional information and I'm hoping that you might be
> able to help me finally get past this issue.
>
> So I changed my config from the 512MB version to the 1GB version without
> any noticeable improvement. The job still became stuck at roughly 10K
> input records, and accumulo's zookeeper session timed out as usual. After
> thinking about this for a while, it occurred to me that I had previously
> used accumulo as an input source on over 30K records with no issues at all
> (it was writing back out to accumulo as well). So I tried to figure out
> why this new job could never reliably get past 10K records. The only real
> difference was that the new job was using a Hadoop MapFile as a lookup
> table for record values. So after commenting out the use of the MapFile
> and using a single hard-coded value, I ran the job again and it made it
> through the entire input very quickly, ~3.5M entries in just over a
> minute. So after this I assumed the MapFile was slowing my job down too
> much for the accumulo input scanner to keep its connection with zookeeper.
> The MapFile was generated by reading some data from one of my tables in
> accumulo, so the next thing I tried (which I figured would be very bad in
> practice) was to read the lookup data directly from accumulo; I re-wrote
> the mapper to do a scanner lookup against the other accumulo table for
> each input record. This immediately worked much better than the MapFile.
> It was pretty slow, but it managed to get up to ~92K records before it
> failed. Accumulo once again lost its zookeeper session due to timeout.
>
> I recognize that this is not exactly an accumulo issue, but I was hoping
> that you might be able to provide me some guidance as to how to ultimately
> get past this issue. I'm using this against a small sample of what the
> actual input will be, and I have ~3.5M input records and ~30K values in my
> lookup table. Both of these values will likely increase in size
> substantially when this is run against the actual input.
>
> Any suggestions about how to approach this problem will be greatly
> appreciated.
>
> Thanks,
> -Arjumand
>
> BTW, while I recognize this would not be the appropriate way to address my
> problem, I was wondering if there is a reason why the
> org.apache.accumulo.core.client.ZooKeeperInstance constructor allows you
> to specify/request a specific session timeout, but the same thing is not
> available on the .setZooKeeperInstance() methods of the
> AccumuloInputFormat/AccumuloOutputFormat classes?
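For reference, the constructor in question can be used like the sketch below; this assumes the 1.4 ZooKeeperInstance constructor that takes a session timeout in milliseconds, and the instance name, quorum, and credentials are placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;

public class TimeoutExample {
  public static void main(String[] args) throws Exception {
    // Request a 60-second ZooKeeper session timeout at construction time;
    // "dev" and "localhost" stand in for the instance name and ZK quorum.
    ZooKeeperInstance instance = new ZooKeeperInstance("dev", "localhost", 60000);
    Connector conn = instance.getConnector("user", "password".getBytes());
    System.out.println(conn.whoami());
  }
}

As noted above, the static setZooKeeperInstance() configuration methods take only the instance name and quorum, so a job configured through AccumuloInputFormat ends up with the default timeout.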
> On Thu, Aug 16, 2012 at 10:10 PM, Arjumand Bonhomme wrote:
>>
>> Thanks for the input, no worries about the offline scanner bit. Although
>> if I can ever get past my current issue, I'd love to learn why the
>> offline scanner failed as well.
>>
>> So I took a look at my various other settings, and watched the paging
>> while everything was running, and it doesn't appear (at least not to me)
>> to be related to swapping.
>> I'm using the 512MB accumulo configuration, and I'm running the hadoop
>> task with a 256MB heap. I'm running hadoop with the standard
>> pseudo-distributed config, except for bumping the
>> dfs.datanode.max.xcievers property to 4096. I'm running zookeeper with
>> the out-of-the-box config, except for increasing maxClientCnxns in
>> zoo.cfg to 100.
>>
>> I did a reboot of my machine, then began another series of tests. This
>> time I was watching vm_stat while everything was running.
>>
>> I noticed the same behavior I've been witnessing all along. I'm
>> including the stacktrace that was emitted to the tserver log when it
>> hung, as it usually does:
>>
>> 16 20:58:12,311 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>> 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for
>> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/tables/1/conf/table.compaction.minor.idle
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
>> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
>> at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:120)
>> at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:108)
>> at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:70)
>> at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:95)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:312)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.access$200(TabletServerResourceManager.java:228)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:252)
>> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>> at java.lang.Thread.run(Thread.java:680)
>> 16 20:58:13,375 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for
>> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7/config/table.scan.max.memory
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208)
>> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
>> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
>> at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:109)
>> at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:76)
>> at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:103)
>> at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:47)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:960)
>> at org.apache.accumulo.server.tabletserver.TabletServerResourceManager.executeReadAhead(TabletServerResourceManager.java:699)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1316)
>> at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1284)
>> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>> at $Proxy1.startMultiScan(Unknown Source)
>> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.process(TabletClientService.java:2164)
>> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2037)
>> at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>> at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>> at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>> at java.lang.Thread.run(Thread.java:680)
>> 16 20:58:13,379 [zookeeper.ZooLock] DEBUG: event null None Expired
>> 16 20:58:13,393 [tabletserver.TabletServer] FATAL: Lost tablet server lock
>> (reason = SESSION_EXPIRED), exiting.
>>
>> Below is the matching stacktrace emitted by the hadoop job.
>> 12/08/16 20:57:14 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@47124746 >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/fe80:0:0:0:0:0:0:1%1:2181 >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >> 12/08/16 20:57:14 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = >> 0x13931fecee3012e, negotiated timeout = 30000 >> 12/08/16 20:57:14 INFO mapred.JobClient: Running job: job_local_0001 >> 12/08/16 20:57:14 INFO mapred.Task: Using ResourceCalculatorPlugin : null >> 12/08/16 20:57:14 INFO mapred.MapTask: io.sort.mb = 100 >> 12/08/16 20:57:15 INFO mapred.MapTask: data buffer = 79691776/99614720 >> 12/08/16 20:57:15 INFO mapred.MapTask: record buffer = 262144/327680 >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO compress.CodecPool: Got brand-new decompressor >> 12/08/16 20:57:15 INFO mapred.JobClient: map 0% reduce 0% >> 001000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@feel >> 002000 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@perhap >> 12/08/16 20:57:20 INFO mapred.LocalJobRunner: >> 003000 >> 2012-06-11T09-05-19/abc/00f9a47cc64c2ee929f1b9e81ee31e71.txt@univers >> 12/08/16 20:57:23 INFO mapred.LocalJobRunner: >> 004000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@doubt >> 005000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@morwai >> 12/08/16 20:57:26 INFO mapred.LocalJobRunner: >> 006000 2012-06-11T09-05-19/abc/027e57027b4117a427c4313280bc8448.txt@tgame >> 007000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@colleg >> 12/08/16 20:57:29 INFO mapred.LocalJobRunner: >> 008000 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@led >> 12/08/16 20:57:32 INFO mapred.LocalJobRunner: >> 009000 >> 2012-06-11T09-05-19/abc/04602aad24f50293c701c319561f5c74.txt@stephanopoulo >> 010000 >> 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@brother >> 12/08/16 20:57:35 INFO mapred.LocalJobRunner: >> 011000 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@inabl >> 12/08/16 20:57:38 INFO mapred.LocalJobRunner: >> 12/08/16 20:57:41 INFO mapred.LocalJobRunner: >> 12/08/16 20:58:12 INFO zookeeper.ClientCnxn: Client session timed out, >> have not heard from server in 37008ms for sessionid 0x13931fecee3012e, >> closing socket connection and attempting reconnect >> 012000 >> 2012-06-11T09-05-19/abc/05a8d74773ac2379f1e960fb036fcc24.txt@saturdai >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/0:0:0:0:0:0:0:1:2181 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/0:0:0:0:0:0:0:1:2181, initiating session >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Unable to reconnect to >> ZooKeeper service, session 0x13931fecee3012e has expired, closing socket >> connection >> 12/08/16 20:58:13 WARN zookeeper.ZooCache: Zookeeper error, will retry >> org.apache.zookeeper.KeeperException$SessionExpiredException: >> 
KeeperErrorCode = Session expired for >> /accumulo/56cdf8d0-4881-4ec1-8bf2-728a2b8d0da7 >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) >> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815) >> at org.apache.accumulo.core.zookeeper.ZooCache$2.run(ZooCache.java:208) >> at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:130) >> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233) >> at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188) >> at >> org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:147) >> at >> org.apache.accumulo.core.client.impl.TabletLocator.getInstance(TabletLocator.java:96) >> at >> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:245) >> at >> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:94) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> at java.lang.Thread.run(Thread.java:680) >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: EventThread shut down >> 12/08/16 20:58:13 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@7f434cb2 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/0:0:0:0:0:0:0:1:2181 >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Socket connection established >> to localhost/0:0:0:0:0:0:0:1:2181, initiating session >> 12/08/16 20:58:13 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >> 0x13931fecee30135, negotiated timeout = 30000 >> 013000 >> 2012-06-11T09-05-19/abc/06a7b8b2de18727faa635b841b0dd043.txt@anthoni >> 12/08/16 20:58:17 INFO mapred.LocalJobRunner: >> >> >> And finally here is the output from "vm_stat 1". Doesn't look like >> anything was really getting swapped out to disk. While there was a flurry >> of paging (reading pages in) as soon as I started up the job, that paging >> ceased very quickly after that. Then very shortly it hung again as usual. >> When looking at the output, it may help to know that I started the hadoop >> job just after the 3rd banner/header was printed out, and you can see the >> initial flurry of paging and then it stops after roughly 12 seconds. 
>> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 595157 944925 81264 89580 385586 319170K 8990003 159901K 3094025 >> 9227006 1546121 >> 595491 944110 81292 89605 386400 1909 0 1417 0 >> 4 0 >> 595204 944911 81292 89579 385625 708 0 551 0 >> 0 0 >> 595429 944652 81293 89579 385879 1941 0 1314 0 >> 0 0 >> 595379 945101 81293 89579 385407 616 0 521 0 >> 0 0 >> 595114 944636 81294 89579 385879 1824 0 1329 0 >> 0 0 >> 595664 945080 81295 89578 385406 5365 6 4032 0 >> 1 0 >> 595221 944078 81295 89578 386410 2645 0 1840 0 >> 0 0 >> 595338 945075 81295 89578 385406 767 2 505 0 >> 0 0 >> 595061 944463 81295 89578 386059 2113 0 1433 0 >> 1 0 >> 595194 944936 81295 89578 385587 674 0 565 0 >> 0 0 >> 595267 944458 81296 89578 386059 1731 0 1293 0 >> 0 0 >> 595418 944925 81297 89578 385587 780 0 581 0 >> 0 0 >> 595124 944200 81296 89579 386419 2111 84 1397 0 >> 2 0 >> 595037 944686 81295 89579 385948 657 2 548 0 >> 2 0 >> 595054 944213 81295 89579 386420 1785 0 1350 0 >> 1 0 >> 595289 944687 81295 89579 385948 758 0 565 0 >> 1 0 >> 595052 944221 81296 89579 386420 1987 0 1356 0 >> 0 0 >> 595017 944698 81296 89579 385948 768 0 558 0 >> 0 0 >> 595272 944217 81296 89579 386420 1957 0 1334 0 >> 0 0 >> 594954 944686 81296 89579 385948 704 0 535 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 594997 944166 81296 89579 386420 319204K 8990097 159926K 3094025 >> 9227018 1546121 >> 594977 944641 81298 89579 385948 784 0 569 0 >> 0 0 >> 595282 944160 81298 89579 386420 1874 0 1368 0 >> 1 0 >> 595252 944633 81298 89579 385948 735 0 561 0 >> 1 0 >> 595057 944186 81298 89579 386420 1972 0 1391 0 >> 0 0 >> 595039 944481 81298 89579 386129 929 0 564 0 >> 1 0 >> 595512 943232 81346 89579 387188 3109 31 2542 0 >> 1 0 >> 595395 944491 81346 89579 385948 742 0 626 0 >> 0 0 >> 595134 944016 81346 89579 386420 1844 1 1335 0 >> 0 0 >> 595329 944501 81346 89579 385948 694 0 546 0 >> 0 0 >> 595329 944016 81346 89579 386431 1801 0 1330 0 >> 0 0 >> 595471 944458 81347 89579 385949 797 0 669 0 >> 0 0 >> 595433 943918 81347 89579 386431 2380 0 1806 0 >> 0 0 >> 595118 944433 81347 89579 385948 678 0 567 0 >> 0 0 >> 595142 943989 81347 89571 386420 1863 0 1378 0 >> 1 0 >> 595097 944462 81347 89571 385948 654 0 540 0 >> 1 0 >> 595126 944060 81347 89571 386420 1874 0 1430 0 >> 0 0 >> 595068 944674 81348 89571 385948 787 0 678 0 >> 0 0 >> 595110 944174 81348 89571 386420 1863 0 1345 0 >> 0 0 >> 595362 944627 81348 89571 385948 768 2 575 0 >> 0 0 >> 595115 944131 81348 89572 386420 1755 0 1323 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 595142 944657 81348 89571 385948 319233K 8990131 159947K 3094025 >> 9227025 1546121 >> 591569 946751 81348 89571 387281 9240 786 4725 0 >> 8 0 >> 579748 958406 81358 90168 386840 20057 287 13388 0 >> 492 0 >> 563598 959835 88682 97236 387319 20388 931 4220 0 >> 0 0 >> 535840 986697 88692 98643 386906 59619 1497 32467 0 >> 164 0 >> 534240 987263 88696 99134 387394 42744 0 6762 0 >> 98 0 >> 529479 992037 88723 100123 386187 39552 3 9006 0 >> 10 0 >> 529065 991283 88760 100107 387430 32014 153 7500 0 >> 41 0 >> 528147 992276 88784 100376 386966 37719 2 6050 0 >> 15 0 >> 527155 993536 88818 100380 386672 34999 0 6928 0 >> 1 0 >> 526748 992973 88851 100380 
387747 34379 1 5240 0 >> 0 0 >> 526835 994071 88902 100404 386674 34698 2 7063 0 >> 8 0 >> 524993 994621 89207 100582 387027 28374 460 6018 0 >> 182 0 >> 524182 995184 89286 100582 387544 42552 226 7313 0 >> 0 0 >> 523704 995987 89328 100629 387076 42731 1 6357 0 >> 0 0 >> 523154 995939 89367 100629 387554 41100 0 7338 0 >> 0 0 >> 522867 996724 89405 100629 387307 39246 0 6089 0 >> 0 0 >> 522160 996665 89451 100639 387563 37614 1 7463 0 >> 0 0 >> 522414 997920 89501 100639 386324 40044 0 6995 0 >> 0 0 >> 521524 997171 89539 100639 387569 39447 0 7048 0 >> 1 0 >> 520258 998993 89575 100645 387107 37476 0 6789 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 519291 999119 89614 100645 387771 319989K 8994481 160119K 3094025 >> 9228045 1546121 >> 518546 1000K 89653 100647 387672 39985 81 6722 0 >> 0 0 >> 518354 1000K 89701 100647 387196 33629 0 7757 0 >> 0 0 >> 516218 1002K 89737 100647 387192 39403 0 7368 0 >> 1 0 >> 513388 1004K 89764 100648 388210 36737 0 14479 0 >> 0 0 >> 507991 1011K 89784 100648 386465 32460 1 13133 0 >> 1 0 >> 506346 1012K 89816 100648 387917 39257 2 7246 0 >> 1 0 >> 506298 1012K 89816 100648 387356 626 0 313 0 >> 0 0 >> 506070 1013K 89816 100648 386259 217 1 102 0 >> 0 0 >> 506101 1013K 89816 100648 386259 386 0 99 0 >> 0 0 >> 505974 1012K 89769 100618 387548 1417 21 1250 0 >> 9 0 >> 506078 1013K 89769 100618 386259 297 0 191 0 >> 0 0 >> 506349 1013K 89769 100618 386259 200 0 100 0 >> 0 0 >> 506105 1013K 89769 100618 386259 211 0 104 0 >> 0 0 >> 506351 1013K 89769 100618 386259 187 0 89 0 >> 0 0 >> 506114 1013K 89769 100618 386599 269 0 185 0 >> 0 0 >> 506228 1013K 89769 100618 386258 228 0 129 0 >> 0 0 >> 506231 1013K 89769 100618 386258 147 0 83 0 >> 0 0 >> 506486 1013K 89769 100618 386258 306 0 119 0 >> 0 0 >> 506215 1013K 89769 100619 386258 174 0 62 0 >> 0 0 >> 505927 1014K 89769 100619 386258 593 0 524 0 >> 0 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 506437 1013K 89769 100619 386700 320215K 8994587 160180K 3094025 >> 9228057 1546121 >> 506670 1013K 89769 100619 386225 128 0 45 0 >> 0 0 >> 506347 1013K 89769 100619 386225 268 0 58 0 >> 0 0 >> 506033 1013K 89769 100619 386225 127 0 89 0 >> 0 0 >> 506221 1013K 89769 100619 386225 86 0 41 0 >> 0 0 >> 506157 1013K 89769 100619 386225 81 0 38 0 >> 0 0 >> 506096 1013K 89769 100619 386225 345 0 186 0 >> 0 0 >> 506044 1013K 89769 100619 386236 291 0 224 0 >> 0 0 >> 505980 1013K 89769 100619 386225 81 0 39 0 >> 0 0 >> 505916 1013K 89769 100619 386225 99 0 50 0 >> 0 0 >> 505839 1013K 89769 100619 386225 79 0 36 0 >> 0 0 >> 506028 1013K 89769 100619 386225 102 0 61 0 >> 0 0 >> 505760 1014K 89769 100619 386225 286 0 229 0 >> 0 0 >> 506151 1013K 89769 100619 386406 140 0 88 0 >> 0 0 >> 506097 1013K 89769 100619 386406 101 2 47 0 >> 0 0 >> 506028 1014K 89769 100619 386225 175 0 62 0 >> 0 0 >> 506230 1014K 89769 100619 386225 227 0 39 0 >> 0 0 >> 505939 1014K 89769 100619 386225 550 0 512 0 >> 0 0 >> 505682 1014K 89769 100619 386225 90 0 50 0 >> 0 0 >> 506772 1011K 89782 100619 388439 6973 15 1797 0 >> 1 0 >> 550258 968746 89827 99598 388175 40677 7 8407 0 >> 1 0 >> Mach Virtual Memory Statistics: (page size of 4096 bytes, cache hits 0%) >> free active spec inactive wire faults copy 0fill reactive >> pageins pageout >> 585656 935138 89879 98682 387121 320304K 8994611 160203K 3094025 >> 9228060 
1546121
>> 585477 935589 89892 98682 386906 10654 1 2648 0 0 0
>> 586236 935269 89898 98683 386610 2120 2 1473 0 1 0
>> 586650 935141 89903 98683 386138 833 0 638 0 2 0
>> 586647 933883 89903 98683 387378 3028 174 1553 0 0 0
>> 587883 932971 89909 98683 387427 882 0 638 0 0 0
>> 588259 932579 89915 98683 386994 1988 0 1383 0 0 0
>> 589697 931139 89923 98683 387427 1174 4 709 0 0 0
>> 589306 931423 89925 98684 387052 2432 0 1743 0 0 0
>> 589128 931606 89926 98688 387043 2309 0 1632 0 5 0
>> 590286 930456 89904 98689 387413 3506 7 2453 0 16 0
>> 590411 931982 89905 98689 385674 1801 0 1384 0 0 0
>> 590385 931329 89905 98689 386327 2954 1 2187 0 0 0
>> 590551 931731 89910 98664 385920 2060 0 1543 0 0 0
>> 589471 931873 89911 98664 386856 5511 252 3120 0 0 0
>> 589397 931941 89911 98659 386769 1351 0 1090 0
>>
>> So while I'm not 100% certain that this isn't related to swapping and
>> garbage collection, it seems like something else is probably the cause.
>>
>> Can you guys give me some more pointers/guidance on what I can check to
>> try to get past this issue?
>> Would seeing any of the configuration or hadoop driver code help you get
>> a better idea of what I might be dealing with?
>> If there's any more due diligence I can/should be doing, just let me know.
>>
>> I'm sure it's more than likely a configuration issue on my end; I'm
>> really grasping at straws at this point.
>>
>> Thanks again for all the help,
>> -Arjumand
>>
>> On Thu, Aug 16, 2012 at 3:54 PM, John Vines wrote:
>>>
>>> I'm sorry, I missed that the offline scanner was being used.
>>>
>>> Issues with locks being lost like you described typically occur from
>>> swapping. You said you used the 512MB footprint Accumulo configuration,
>>> but you should also make sure that the memory settings for your various
>>> hadoop processes, taking into account the number of map and reduce tasks
>>> running, do not exceed your physical memory (plus any other processes
>>> you may be running on your system). Swapping + garbage collection have
>>> proven to be quite effective at causing lost locks.
>>>
>>> John
>>
>> On Thu, Aug 16, 2012 at 3:48 PM, Arjumand Bonhomme wrote:
>>>
>>> This isn't my original concern/problem, so it's not terribly important,
>>> but I'm not sure I understand and I'd like to learn as much as possible.
>>>
>>> Why wouldn't it work with the cloned table offline?
>>> I followed the example laid out in
>>> org.apache.accumulo.examples.simple.mapreduce.UniqueColumns
>>> using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(),
>>> true);" setup.
>>>
>>> Can you give me some detail as to why it definitely would not have
>>> worked offline? Is this method no longer supported?
>>>
>>> Thanks!
>>> -Arjumand
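For readers following the offline-scanner discussion, the clone-and-scan-offline pattern looks roughly like the sketch below. The setScanOffline call is the one quoted above; the clone/offline calls and all names are illustrative 1.4-era usage, not code from the thread:

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapreduce.Job;

public class OfflineScanSetup {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("dev", "localhost")
        .getConnector("user", "password".getBytes());

    // Clone the source table and take the clone offline. The clone shares
    // the original's files until the original changes, so this is cheap.
    conn.tableOperations().clone("mytable", "mytable_clone", true,
        Collections.<String, String>emptyMap(), Collections.<String>emptySet());
    conn.tableOperations().offline("mytable_clone");

    Job job = new Job();
    AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(), "dev", "localhost");
    AccumuloInputFormat.setInputInfo(job.getConfiguration(), "user",
        "password".getBytes(), "mytable_clone", new Authorizations());
    // Read the offline clone's files directly instead of going through
    // the tablet servers.
    AccumuloInputFormat.setScanOffline(job.getConfiguration(), true);
  }
}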
>>> On Thu, Aug 16, 2012 at 3:24 PM, John Vines wrote:
>>>>
>>>> Your work on the clone table failed because you took it offline. The
>>>> table needs to be online in order to do a job against it.
>>>>
>>>> John
>>>>
>>>> On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme wrote:
>>>>>
>>>>> Jim, William, Adam: Thanks for your help with this, I'm running out of
>>>>> ideas to google search for answers.
>>>>>
>>>>> I'm going to try to answer your questions:
>>>>>
>>>>> 1) I had already adjusted maxClientCnxns in zoo.cfg to be 100. I
>>>>> didn't see anything in the log indicating that it was shutting down
>>>>> connections due to reaching a connection limit. However, throughout the
>>>>> zookeeper logs, even before the hadoop job was run, I did see lots of
>>>>> lines like the one below. Up til this point, I've assumed it's
>>>>> innocuous and unrelated to my issue.
>>>>> INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
>>>>> client /127.0.0.1:59158 (no session established for client)
>>>>>
>>>>> 2) I had also already adjusted the dfs.datanode.max.xcievers property
>>>>> to 4096 in hdfs-site.xml. An investigation of the log shows that this
>>>>> limit is not being reached. It had been reached at some point before I
>>>>> increased the value from the default. I had increased the value while
>>>>> troubleshooting, but prior to posting to this list; it didn't appear to
>>>>> have a noticeable effect on the behavior of the hadoop job.
>>>>>
>>>>> 3) I'm writing out to a sequence file, so accumulo is only being used
>>>>> for input. As a side note, at one point during my troubleshooting, I
>>>>> compacted, cloned, and then took the cloned table offline and tried to
>>>>> use that instead. That failed immediately, without processing any
>>>>> records. From the stacktrace, it appeared as though the iterator was
>>>>> trying to use one of the files for the original table (from what I
>>>>> understand about cloning, this is normal b/c no changes had been made
>>>>> to the original table) but said it did not exist. I was, however, able
>>>>> to find the file on hdfs. So I just gave up on that. Also, under the
>>>>> normal case, i.e. using the original table online, nothing is writing
>>>>> to the table while the hadoop job is running.
>>>>>
>>>>> 4) My original open file limit was the os default of 256. So I upped
>>>>> it to 1024, and performed another attempt. The behavior was the same
>>>>> as before.
>>>>>
>>>>> I'm including a snippet from the tserver debug log.
It looks like an >>>>> expired session might be the root of the problem, but I'm not sure what >>>>> would cause that: >>>>> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:49719 45 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s >>>>> ct=0.001s) >>>>> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1] >>>>> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1] >>>>> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >>>>> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.07(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=111,234,160(+14,477,728) totalmem=132,055,040 >>>>> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >>>>> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1] >>>>> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE >>>>> minor compaction chosen >>>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: >>>>> COMPACTING !0<;~ total = 120,840 ingestMemory = 120,840 >>>>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: >>>>> chosenMem = 2,416 chosenIT = 300.23 load 3,044 >>>>> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 >>>>> secs >>>>> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor >>>>> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~ >>>>> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >>>>> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE >>>>> compactionThreshold = 0.605 maxObserved = 120,840 >>>>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER >>>>> compactionThreshold = 0.666 >>>>> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 >>>>> read | 4 written | 4,800 entries/sec | 0.005 secs >>>>> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: >>>>> !0<;~ 
127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c >>>>> 16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: >>>>> writeTime:1ms >>>>> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] >>>>> -> /default_tablet/F000051i.rf >>>>> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs >>>>> !0<;~ >>>>> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:51480 43 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s >>>>> ct=0.001s) >>>>> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >>>>> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB >>>>> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, >>>>> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss >>>>> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0 >>>>> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >>>>> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032), >>>>> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0, >>>>> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%, >>>>> Evicted/Run=NaN, Duplicate Reads=1 >>>>> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >>>>> secs, wait 0.00 secs >>>>> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >>>>> secs, wait 0.00 secs >>>>> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0< >>>>> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >>>>> /table_info/A000051j.rf_tmp >>>>> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< >>>>> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >>>>> /root_tablet/A000051k.rf_tmp >>>>> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 >>>>> read | 21 written | 276 entries/sec | 0.141 secs >>>>> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 >>>>> read | 51 written | 323 entries/sec | 0.173 secs >>>>> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs >>>>> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC >>>>> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >>>>> /root_tablet/A000051k.rf >>>>> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for >>>>> extent !0<;~ as alias 2 >>>>> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs >>>>> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC >>>>> [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >>>>> /table_info/A000051j.rf >>>>> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.09(+0.02) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=97,183,880(-1,627,536) totalmem=132,055,040 >>>>> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess >>>>> 127.0.0.1:51480 146 in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s >>>>> ct=0.004s) >>>>> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1] >>>>> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc >>>>> ParNew=0.09(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs >>>>> freemem=98,415,928(+12,106,904) totalmem=132,055,040 >>>>> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid >>>>> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >>>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess >>>>> 127.0.0.1:65352 2 entries in 
0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1)
>>>>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1)
>>>>> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>>>>> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>>>>> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
>>>>> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
>>>>>
>>>>> I'm also including a snippet from the zookeeper log. Around the same
>>>>> time that the tserver logs show the loss of the tablet server lock due
>>>>> to an expired session, the zookeeper log mentions the expiration of
>>>>> several sessions due to timeout.
>>>>> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:58633 (no session established for client)
>>>>> 2012-08-16 13:48:08,371 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59158
>>>>> 2012-08-16 13:48:08,372 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /127.0.0.1:59158
>>>>> 2012-08-16 13:48:08,378 - INFO [Thread-371:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:08,379 - INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59158 (no session established for client)
>>>>> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59635
>>>>> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /127.0.0.1:59635
>>>>> 2012-08-16 13:48:09,399 - INFO [Thread-372:NIOServerCnxn$StatCommand@1153] - Stat command output
>>>>> 2012-08-16 13:48:09,400 - INFO [Thread-372:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59635 (no session established for client)
>>>>> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80005
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80004
>>>>> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80001
>>>>> 2012-08-16 13:48:32,003 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:65339 which had sessionid 0x1393086dfb80005
>>>>> 2012-08-16 13:48:34,001 - INFO
[SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:36,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:38,000 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:38,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:40,001 - INFO [SessionTracker:ZooKeeperServer@316] - >>>>> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded >>>>> 2012-08-16 13:48:44,456 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had >>>>> sessionid 0x1393086dfb80004 >>>>> 2012-08-16 13:48:44,458 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /127.0.0.1:65316 which had sessionid >>>>> 0x1393086dfb80001 >>>>> 2012-08-16 13:48:44,461 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80002 >>>>> 2012-08-16 13:48:44,482 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80007 >>>>> 2012-08-16 13:48:44,484 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception >>>>> causing close of session 0x1393086dfb80003 due to java.io.IOException: >>>>> Connection reset by peer >>>>> 2012-08-16 13:48:44,486 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had >>>>> sessionid 0x1393086dfb80002 >>>>> 2012-08-16 13:48:44,491 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb8004d >>>>> 2012-08-16 13:48:44,493 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80006 >>>>> 2012-08-16 13:48:44,495 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid >>>>> 0x1393086dfb80003 >>>>> 2012-08-16 13:48:44,495 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - >>>>> EndOfStreamException: Unable to read additional data from client sessionid >>>>> 0x1393086dfb80007, likely client has closed socket >>>>> 2012-08-16 13:48:44,496 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /0:0:0:0:0:0:0:1%0:65343 which had sessionid >>>>> 0x1393086dfb80007 >>>>> 2012-08-16 13:48:44,496 - WARN >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - >>>>> EndOfStreamException: Unable to read additional data from client sessionid >>>>> 0x1393086dfb8004d, likely client has closed socket >>>>> 2012-08-16 13:48:44,497 - INFO >>>>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed >>>>> socket connection for client /127.0.0.1:49770 which had sessionid >>>>> 0x1393086dfb8004d >>>>> 2012-08-16 13:48:44,498 - INFO >>>>> [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination >>>>> for sessionid: 0x1393086dfb80003 >>>>> 2012-08-16 13:48:44,500 - INFO [SyncThread:0:NIOServerCnxn@1435] - >>>>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had 
>>>>> sessionid 0x1393086dfb80006
>>>>> 2012-08-16 13:48:44,613 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /0:0:0:0:0:0:0:1%0:59789
>>>>> 2012-08-16 13:48:44,615 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
>>>>> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
>>>>> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001
>>>>>
>>>>> Sorry for the huge post of log drivel, but if you'd like to see more I
>>>>> can post the full logs (or via pastebin). I made sure to shut down
>>>>> everything, wipe out the logs, then start everything back up so the
>>>>> logs would be as clean as possible.
>>>>>
>>>>> I looked for a way to increase the session timeout on the
>>>>> AccumuloInputFormat, but there doesn't seem to be one. I'm sure that's
>>>>> not the right fix anyway, but figured I'd mention it.
>>>>>
>>>>> Do you guys have any other suggestions? I'm sure this is due to some
>>>>> configuration error on my part, but I'm running out of ideas.
>>>>>
>>>>> Thanks a ton,
>>>>> -Arjumand
>>>>>
>>>>> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum wrote:
>>>>>>
>>>>>> What does your TServer debug log say? Also, are you writing back out
>>>>>> to Accumulo?
>>>>>>
>>>>>> To follow up on what Jim said, you can check the zookeeper log to see
>>>>>> if max connections is being hit. You may also want to check what your
>>>>>> max xceivers is set to for HDFS, and check your Accumulo and HDFS logs
>>>>>> to see if it is mentioned.
>>>>>>
>>>>>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem
>>>>>>> may be due to poor configuration on my part, but I'm running out of
>>>>>>> ideas.
>>>>>>>
>>>>>>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from
>>>>>>> cdh3u4) in pseudo-distributed mode, zookeeper version
>>>>>>> zookeeper-3.3.5 from cdh3u4, and the 1.4.1 release of accumulo with
>>>>>>> a configuration copied from "conf/examples/512MB/standalone".
>>>>>>>
>>>>>>> I've got a Map task that is using an accumulo table as the input.
>>>>>>> I'm fetching all rows, but just a single column family that has
>>>>>>> hundreds or even thousands of different column qualifiers.
>>>>>>> The table has a SummingCombiner installed for the given column
>>>>>>> family.
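As background on that last point, a SummingCombiner is typically attached once at table-setup time; a hypothetical 1.4-style sketch (table name, column family, and credentials are placeholders) looks like this:

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class AttachSummingCombiner {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("dev", "localhost")
        .getConnector("user", "password".getBytes());

    // Sum long-encoded values for every qualifier in the "counts" family,
    // at scan, minor compaction, and major compaction scopes.
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    SummingCombiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("counts")));
    conn.tableOperations().attachIterator("mytable", setting);
  }
}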
>>>>>>> >>>>>>> The task runs fine at first, but after ~9-15K records (I print the >>>>>>> record count to the console every 1K records), it hangs and the following >>>>>>> messages are printed to the console where I'm running the job: >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read >>>>>>> additional data from server sessionid 0x1392cc35b460d1c, likely server has >>>>>>> closed socket, closing socket connection and attempting reconnect >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>>>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket >>>>>>> connection >>>>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client >>>>>>> connection, connectString=localhost sessionTimeout=30000 >>>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/0:0:0:0:0:0:0:1:2181 >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session >>>>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment >>>>>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >>>>>>> 0x1392cc35b460d25, negotiated timeout = 30000 >>>>>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner: >>>>>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner: >>>>>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner: >>>>>>> >>>>>>> Sometimes the messages contain a stacktrace like this below: >>>>>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session >>>>>>> 0x1392cc35b460b40 for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected >>>>>>> error, closing socket connection and attempting reconnect >>>>>>> java.io.IOException: Connection reset by peer >>>>>>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) >>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) >>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166) >>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) >>>>>>> at >>>>>>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) >>>>>>> at >>>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server localhost/127.0.0.1:2181 >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection >>>>>>> established to localhost/127.0.0.1:2181, initiating session >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>>>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket >>>>>>> connection >>>>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down >>>>>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client >>>>>>> connection, connectString=localhost sessionTimeout=30000 >>>>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 >>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket >>>>>>> connection to server 
localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>>>>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>>>>>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>>>>>>> 0x1392cc35b460b46, negotiated timeout = 30000
>>>>>>>
>>>>>>> I've poked through the logs in accumulo, and I've noticed that when
>>>>>>> it hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>>>>>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>>>>>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
>>>>>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>>>>>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper registration at null
>>>>>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>>>>>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>>>>>>
>>>>>>> I've noticed that if I make the map task print out the record count
>>>>>>> more frequently (i.e. every 10 records), it seems to be able to get
>>>>>>> through more records than when I only print every 1K records. My
>>>>>>> assumption was that this had something to do with more time being
>>>>>>> spent in the map task, and not fetching data from accumulo. There was
>>>>>>> at least one occasion where I printed to the console for every record,
>>>>>>> and in that situation it managed to process 47K records, although I
>>>>>>> have been unable to repeat that behavior.
>>>>>>>
>>>>>>> I've also noticed that if I stop and start accumulo, the map-reduce
>>>>>>> job will pick up where it left off, but seems to fail more quickly.
>>>>>>>
>>>>>>> Could someone make some suggestions as to what my problem might be?
>>>>>>> It would be greatly appreciated. If you need any additional
>>>>>>> information from me, just let me know. I'd paste my config files,
>>>>>>> driver setup, and example data into this post, but I think it's
>>>>>>> probably long enough already.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> -Arjumand
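Arjumand's driver isn't included in the thread, but a minimal 1.4-style driver for the kind of job he describes, reading a single column family of an Accumulo table in a map-only job and writing a SequenceFile, might look roughly like the following. The instance, credentials, table, family, and output path are placeholders:

import java.util.Collections;

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.util.Pair;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ScanTableDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(ScanTableDriver.class);
    job.setInputFormatClass(AccumuloInputFormat.class);

    // Placeholder instance/credentials; the static setters mirror the
    // UniqueColumns example referenced earlier in the thread.
    AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(), "dev", "localhost");
    AccumuloInputFormat.setInputInfo(job.getConfiguration(), "user",
        "password".getBytes(), "mytable", new Authorizations());
    // Fetch only the single column family being aggregated
    // (a null qualifier means the whole family).
    AccumuloInputFormat.fetchColumns(job.getConfiguration(),
        Collections.singleton(new Pair<Text, Text>(new Text("counts"), null)));

    // Identity mapper for the sketch; a real job would do its work here.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/tmp/scan-output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}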