accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Accumulo on MapR - BatchScanSplitTest
Date Wed, 11 Apr 2012 17:12:45 GMT
Your tablet server crashed (the master reported the loss of the server).
My guess is that it hit a stop-the-world GC pause that lasted longer than
the ZooKeeper session timeout.
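
To make that guess concrete, here is a rough sketch of how a long pause can be
spotted from inside a JVM: sleep for a fixed interval, measure how late the
thread wakes up, and compare the worst overshoot with the session timeout. The
class name, method names, and the 30-second timeout are assumptions for
illustration only; use whatever timeout your instance is actually configured
with. It only observes the JVM it runs in, so it is a way to see the effect,
not a diagnosis of your tablet server.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class PauseWatch {
    // Assumed session timeout in milliseconds; replace with your configured value.
    static final long ASSUMED_ZK_TIMEOUT_MS = 30_000;

    /** True if a stall of pauseMs would outlast a session timeout of timeoutMs. */
    static boolean outlastsTimeout(long pauseMs, long timeoutMs) {
        return pauseMs > timeoutMs;
    }

    /** Sleeps intervalMs repeatedly and returns the largest overshoot seen, in ms. */
    static long maxStallMs(int rounds, long intervalMs) throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Anything beyond the requested sleep is time the thread was stalled.
            worst = Math.max(worst, elapsedMs - intervalMs);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        long stall = maxStallMs(50, 100);
        System.out.println("largest stall observed: " + stall + " ms");

        // Cumulative collector statistics for this JVM, for comparison.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }

        if (outlastsTimeout(stall, ASSUMED_ZK_TIMEOUT_MS)) {
            System.out.println("a stall this long could expire the ZooKeeper session");
        }
    }
}
```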

Is there some reason why you aren't using the suggested

-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75

JVM arguments?
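
If it helps to confirm what a JVM was actually started with, something like
the following sketch lists which of the two flags above are absent from a
given argument list. The class and method names are made up for this example,
and main only inspects the JVM that runs the sketch itself, so for a tablet
server you would have to feed it that process's arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JvmFlagCheck {
    // The two flags suggested above.
    static final List<String> SUGGESTED = Arrays.asList(
            "-XX:+UseConcMarkSweepGC",
            "-XX:CMSInitiatingOccupancyFraction=75");

    /** Returns the suggested flags that do not appear in the given argument list. */
    static List<String> missing(List<String> actual) {
        List<String> out = new ArrayList<String>();
        for (String flag : SUGGESTED) {
            if (!actual.contains(flag)) {
                out.add(flag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM running this sketch (not the tablet server).
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("missing suggested flags: " + missing(actual));
    }
}
```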

-Eric

On Tue, Apr 10, 2012 at 2:21 PM, Keys Botzum <kbotzum@maprtech.com> wrote:

> At this point all but two of the Accumulo test/system/auto tests have
> completed successfully. This test is failing and I'm not quite sure why:
> simple.batchScanSplit.BatchScanSplitTest
>
> When I run it, this is the output I see:
> ./run.py -t batchandsplittest -d -v10
> ….
> DEBUG:test.auto:localhost: /opt/accumulo-1.4.0/bin/accumulo
> org.apache.accumulo.server.test.functional.FunctionalTest -m localhost -u
> root -p secret -i SE-test-04-17205
> org.apache.accumulo.server.test.functional.BatchScanSplitTest run
> DEBUG:test.auto:Waiting for /opt/accumulo-1.4.0/bin/accumulo
> org.apache.accumulo.server.test.functional.FunctionalTest -m localhost -u
> root -p secret -i SE-test-04-17205
> org.apache.accumulo.server.test.functional.BatchScanSplitTest run to stop
> in 240 secs
> DEBUG:test.auto:out: 10 11:01:17,382 [admin.TableOperations] INFO :
> Problem with metadata table, first entry for table 1- 1<;00002 - has non
> null prev end row ... retrying ...
> DEBUG:test.auto:out: splits : [0000019cd, 0000026b4, 000004082, 000005a50,
> 000006737, 00000741e, 000007a0f, 000008, 000008dec, 000009ad3, 00000b4a1,
> 00000ba5, 00000c, 00000ce6f, 00000db56, 00000f524, 00000fa92, 00001,
> 000010ef2, 000011bd9, 0000135a7, 000013ad4, 000014, 000014f75, 000015c5c,
> 00001762a, 000017b15, 000018, 000018ff8, 000019cdf, 00001b6ad, 00001bb57,
> 00001c, 00001d07b, 00001dd62, 00001f730, 00001fb98, 00002, 0000210fe,
> 000021de5, 0000237b3, 000023bda, 000024, 000025181, 000025e68, 000027836,
> 000027c1b, 000028, 000029204, 000029eeb, 00002b8b9, 00002bc5d, 00002c,
> 00002d287, 00002df6e, 00002f93c, 00002fc9e, 00003, 00003130a, 000031ff1,
> 0000339bf, 000033ce, 000034, 00003538d, 000036074, 0000374b5, 000037a5b,
> 000038, 000038e83, 000039b6a, 00003b538, 00003c, 00003dbed, 00003e8d4]
> DEBUG:test.auto:out:
> DEBUG:test.auto:out: rate :
> DEBUG:test.auto:out: 205.34
> DEBUG:test.auto:out: 10 11:03:01,983 [impl.ThriftScanner] WARN : Security
> Violation in scan request to 10.250.99.204:39256:
> ThriftSecurityException(user:root, code:null)
> DEBUG:test.auto:err: Thread
> "org.apache.accumulo.server.test.functional.FunctionalTest" died null
> DEBUG:test.auto:err: java.lang.reflect.InvocationTargetException
> DEBUG:test.auto:err:    at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> DEBUG:test.auto:err:    at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.accumulo.start.Main$1.run(Main.java:89)
>        at java.lang.Thread.run(Thread.java:662)
> DEBUG:test.auto:err: Caused by: java.lang.NullPointerException
> DEBUG:test.auto:err:    at
> org.apache.accumulo.core.client.AccumuloSecurityException.getDefaultErrorMessage(AccumuloSecurityException.java:30)
>        at
> org.apache.accumulo.core.client.AccumuloSecurityException.<init>(AccumuloSecurityException.java:70)
>        at
> org.apache.accumulo.core.client.impl.ThriftScanner.getBatchFromServer(ThriftScanner.java:153)
>        at
> org.apache.accumulo.core.client.impl.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:88)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:393)
> DEBUG:test.auto:err:    at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:536)
>        at
> org.apache.accumulo.core.client.impl.TabletLocator$1._locateTablet(TabletLocator.java:115)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:370)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:390)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:536)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:215)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:288)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:236)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:301)
> DEBUG:test.auto:err:    at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$900(TabletServerBatchReaderIterator.java:73)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:374)
>        at
> org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        ... 1 more
> DEBUG:test.auto:Output from command: 10 11:01:17,382
> [admin.TableOperations] INFO : Problem with metadata table, first entry for
> table 1- 1<;00002 - has non null prev end row ... retrying ...
> splits : [0000019cd, 0000026b4, 000004082, 000005a50, 000006737,
> 00000741e, 000007a0f, 000008, 000008dec, 000009ad3, 00000b4a1, 00000ba5,
> 00000c, 00000ce6f, 00000db56, 00000f524, 00000fa92, 00001, 000010ef2,
> 000011bd9, 0000135a7, 000013ad4, 000014, 000014f75, 000015c5c, 00001762a,
> 000017b15, 000018, 000018ff8, 000019cdf, 00001b6ad, 00001bb57, 00001c,
> 00001d07b, 00001dd62, 00001f730, 00001fb98, 00002, 0000210fe, 000021de5,
> 0000237b3, 000023bda, 000024, 000025181, 000025e68, 000027836, 000027c1b,
> 000028, 000029204, 000029eeb, 00002b8b9, 00002bc5d, 00002c, 00002d287,
> 00002df6e, 00002f93c, 00002fc9e, 00003, 00003130a, 000031ff1, 0000339bf,
> 000033ce, 000034, 00003538d, 000036074, 0000374b5, 000037a5b, 000038,
> 000038e83, 000039b6a, 00003b538, 00003c, 00003dbed, 00003e8d4]
> rate : 205.34
> 10 11:03:01,983 [impl.ThriftScanner] WARN : Security Violation in scan
> request to 10.250.99.204:39256: ThriftSecurityException(user:root,
> code:null)
> ERROR:test.auto:This looks like a stack trace: Thread
> "org.apache.accumulo.server.test.functional.FunctionalTest" died null
> java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.accumulo.start.Main$1.run(Main.java:89)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.NullPointerException
>        at
> org.apache.accumulo.core.client.AccumuloSecurityException.getDefaultErrorMessage(AccumuloSecurityException.java:30)
>        at
> org.apache.accumulo.core.client.AccumuloSecurityException.<init>(AccumuloSecurityException.java:70)
>        at
> org.apache.accumulo.core.client.impl.ThriftScanner.getBatchFromServer(ThriftScanner.java:153)
>        at
> org.apache.accumulo.core.client.impl.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:88)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:393)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:536)
>        at
> org.apache.accumulo.core.client.impl.TabletLocator$1._locateTablet(TabletLocator.java:115)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:370)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:390)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:536)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:215)
>        at
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:288)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:236)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:301)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$900(TabletServerBatchReaderIterator.java:73)
>        at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:374)
>        at
> org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        ... 1 more
>
> FAIL
> ======================================================================
> FAIL: runTest (simple.batchScanSplit.BatchScanSplitTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "/opt/accumulo-1.4.0/test/system/auto/JavaTest.py", line 57, in
> runTest
>    self.waitForStop(handle, self.maxRuntime)
>  File "/opt/accumulo-1.4.0/test/system/auto/TestUtils.py", line 368, in
> waitForStop
>    self.assert_(self.processResult(out, err, handle.returncode))
> AssertionError: False is not true
>
>
> ======================================================================
> FAIL: runTest (simple.batchScanSplit.BatchScanSplitTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "/opt/accumulo-1.4.0/test/system/auto/JavaTest.py", line 57, in
> runTest
>    self.waitForStop(handle, self.maxRuntime)
>  File "/opt/accumulo-1.4.0/test/system/auto/TestUtils.py", line 368, in
> waitForStop
>    self.assert_(self.processResult(out, err, handle.returncode))
> AssertionError: False is not true
>
> ----------------------------------------------------------------------
> Ran 1 test in 118.172s
>
> FAILED (failures=1)
>
>
> In this case, I found nothing in the logs that seemed helpful, but perhaps
> I don't understand what I'm looking at. All I see in the
> tserver_xxx.debug.log prior to the error is the following, and the
> timestamps are nearly 2 minutes before the failure.
>
> 10 11:01:20,906 [tabletserver.Tablet] DEBUG: completeClose(saveState=true
> completeClose=false) 1;00001bb57;00001b6ad
> 10 11:01:20,908 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 10.250.99.204:58769 !0 1 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 10 11:01:20,911 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 10.250.99.204:58769 !0 1 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 10 11:01:20,911 [file.FileUtil] DEBUG: Found midPoint from indexes in
> 0.01 secs.
>
> 10 11:01:20,913 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 10.250.99.204:58769 !0 1 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 10 11:01:20,914 [file.FileUtil] WARN : Failed to find mid point using
> indexes, falling back to data files which is slower. No entries between
> 00000fa92 and 00000fd49 for
> [/user/mapr/accumulo-SE-test-04-17205/tables/1/default_tablet/F0000000.rf]
> 10 11:01:20,930 [file.FileUtil] DEBUG: Found midPoint from indexes in
> 0.02 secs.
>
> 10 11:01:20,935 [file.FileUtil] WARN : Failed to find mid point using
> indexes, falling back to data files which is slower. No entries between
> 00000fd49 and 00001 for
> [/user/mapr/accumulo-SE-test-04-17205/tables/1/default_tablet/F0000000.rf]
> 10 11:01:20,952 [file.FileUtil] DEBUG: Found midPoint from indexes in
> 0.02 secs.
>
> There is no further output. Here is the complete tar and gzip of the logs:
>
>
>
>
> I really have no idea what is wrong here. I suspect some kind of security
> issue based on the stack trace, but it's hard to tell. It looks like
> Accumulo was trying to report a security error but the error handling code
> had an issue. Is there some setting I need to provide to ensure that
> Accumulo has access to its error messages?
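
The "code:null" in the warning and the NullPointerException thrown from
getDefaultErrorMessage in the trace above are consistent with an error code
that arrived as null and was then used in a switch. That is an assumption
about the cause, not something the logs prove. The sketch below uses a
hypothetical enum and class (not the Accumulo source) to show the Java
behavior: switching on a null enum reference throws NullPointerException, and
a null check avoids it.

```java
class NullCodeDemo {
    // Hypothetical stand-in for a security error code enum.
    enum Code { BAD_CREDENTIALS, PERMISSION_DENIED }

    /** Throws NullPointerException when code is null, because of the switch. */
    static String describe(Code code) {
        switch (code) {
            case BAD_CREDENTIALS:
                return "bad credentials";
            case PERMISSION_DENIED:
                return "permission denied";
            default:
                return "unknown security error";
        }
    }

    /** Null-safe variant that reports a missing code instead of failing. */
    static String describeSafely(Code code) {
        return code == null ? "unknown security error (no code supplied)" : describe(code);
    }

    public static void main(String[] args) {
        System.out.println(describeSafely(null));
        try {
            describe(null);
        } catch (NullPointerException e) {
            System.out.println("describe(null) threw NullPointerException");
        }
    }
}
```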
>
> If anyone has suggestions on how to look into this further from the
> Accumulo side, I'd really appreciate it.
>
> Thanks,
> Keys
> ________________________________
> Keys Botzum
> Senior Principal Technologist
> WW Systems Engineering
> kbotzum@maprtech.com
> 443-718-0098
> MapR Technologies
> http://www.mapr.com
>
>
>
