phoenix-dev mailing list archives

From Sergey Soldatov <sergeysolda...@gmail.com>
Subject Re: Jenkins build failures?
Date Wed, 04 May 2016 20:31:01 GMT
James,
Regarding HivePhoenixStoreIT: are you talking about the
Phoenix-4.x-HBase-1.0 job? The last build passed it successfully.


On Wed, May 4, 2016 at 10:15 AM, James Taylor <jamestaylor@apache.org> wrote:
> Our Jenkins builds have improved, but we're seeing some issues:
> - timeouts with the new org.apache.phoenix.hive.HivePhoenixStoreIT test.
> - consistent failures with the 4.x-HBase-1.1 build. I suspect that Jenkins
> build is out-of-date, as we haven't had a 4.x-HBase-1.1 branch for quite a
> while. There are likely changes that were made to the other Jenkins build
> scripts that weren't made to this one.
> - flapping of
> the org.apache.phoenix.end2end.index.ReadOnlyIndexFailureIT.testWriteFailureReadOnlyIndex
> test in 0.98 and 1.0
> - no email sent for 0.98 build (as far as I can tell)
>
> If folks have time to look into these, that'd be much appreciated.
>
>     James
>
>
>
> On Sat, Apr 30, 2016 at 11:55 AM, James Taylor <jamestaylor@apache.org>
> wrote:
>
>> The defaults when tests are running are much lower than the standard
>> Phoenix defaults (see QueryServicesTestImpl and
>> BaseTest.setUpConfigForMiniCluster()). It's unclear to me why the
>> HashJoinIT and SortMergeJoinIT tests (I think these are the culprits) do
>> not seem to adhere to these (or maybe they override them?). They fail for me
>> on my Mac, but they pass on a Linux box. It would be awesome if someone
>> could investigate and submit a patch to fix these.
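>>
>> (To reproduce locally, something along these lines should work - module and
>> test names are from memory, so adjust as needed:
>>
>>   mvn clean verify -pl phoenix-core -am -Dit.test=HashJoinIT,SortMergeJoinIT
>>
>> and then compare the effective settings against QueryServicesTestImpl.)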
>>
>> Thanks,
>> James
>>
>> On Sat, Apr 30, 2016 at 11:47 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>
>>> The default thread pool sizes for HDFS, HBase, ZK, and the Phoenix client
>>> are all contributing to this huge thread count.
>>>
>>> A good starting point would be to take a jstack of the IT process and
>>> count threads, grouped by similar names. Reconfigure to reduce all of those
>>> groups to something like 10 each, and see if the tests still run reliably on
>>> local hardware.
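>>>
>>> Something like this would do it (untested sketch; the second sed expression
>>> just collapses numeric thread-name suffixes so similar threads group
>>> together):
>>>
>>>   jstack <pid-of-forked-IT-jvm> | grep '^"' \
>>>     | sed -e 's/^"\([^"]*\)".*/\1/' -e 's/[-_ ]*[0-9]*$//' \
>>>     | sort | uniq -c | sort -rn | head -20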
>>>
>>> On Friday, April 29, 2016, Sergey Soldatov <sergeysoldatov@gmail.com>
>>> wrote:
>>>
>>> > By the way, we need to do something with those OOMs and "unable to
>>> > create new native thread" errors in the ITs. It's quite strange to see
>>> > these kinds of failures in a 10-line test, especially when queries
>>> > against a table with fewer than 10 rows generate over 2500 threads. Does
>>> > anybody know whether it's a ZK-related issue?
>>> >
>>> > On Fri, Apr 29, 2016 at 7:51 AM, James Taylor <jamestaylor@apache.org
>>> > <javascript:;>> wrote:
>>> > > A patch would be much appreciated, Sergey.
>>> > >
>>> > > On Fri, Apr 29, 2016 at 3:26 AM, Sergey Soldatov <
>>> > sergeysoldatov@gmail.com <javascript:;>>
>>> > > wrote:
>>> > >
>>> > >> As for the flume module - flume-ng comes with commons-io 2.1, while
>>> > >> hadoop & hbase require org.apache.commons.io.Charsets, which was
>>> > >> introduced in 2.3. The easy fix is to move the dependency on flume-ng
>>> > >> after the dependencies on hbase/hadoop.
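>>> > >>
>>> > >> (To verify which version wins before and after reordering - module name
>>> > >> from memory:
>>> > >>
>>> > >>   mvn dependency:tree -pl phoenix-flume -Dincludes=commons-io
>>> > >>
>>> > >> Maven mediates conflicting versions by nearest declaration, with ties
>>> > >> broken by declaration order, which is why moving flume-ng later helps.)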
>>> > >>
>>> > >> The last thing, about ConcurrentHashMap - it definitely means that the
>>> > >> code was compiled against the 1.8 class libraries, since keySet()
>>> > >> returns a plain Set in 1.7 but a KeySetView in 1.8, and the compiler
>>> > >> bakes that return type into the call site's descriptor.
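>>> > >>
>>> > >> (A quick way to confirm which libraries a given class was compiled
>>> > >> against, using the class from the stack trace below:
>>> > >>
>>> > >>   javap -c -classpath hbase-server-1.1.3.jar \
>>> > >>       org.apache.hadoop.hbase.master.ServerManager | grep keySet
>>> > >>
>>> > >> If the call descriptor names ConcurrentHashMap$KeySetView, that class
>>> > >> was compiled against the 1.8 libraries, whatever the manifest says.)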
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Thu, Apr 28, 2016 at 4:08 PM, Josh Elser <josh.elser@gmail.com
>>> > <javascript:;>> wrote:
>>> > >> > *tl;dr*
>>> > >> >
>>> > >> > * I'm removing ubuntu-us1 from all pools
>>> > >> > * Phoenix-Flume ITs look busted
>>> > >> > * UpsertValuesIT looks busted
>>> > >> > * Something is weirdly wrong with Phoenix-4.x-HBase-1.1 in its entirety.
>>> > >> >
>>> > >> > Details below...
>>> > >> >
>>> > >> > It looks like we have a bunch of different reasons for the failures.
>>> > >> > Starting with Phoenix-master:
>>> > >> >
>>> > >> >>>>
>>> > >> > org.apache.phoenix.schema.NewerTableAlreadyExistsException: ERROR 1013
>>> > >> > (42M04): Table already exists. tableName=T
>>> > >> >         at org.apache.phoenix.end2end.UpsertValuesIT.testBatchedUpsert(UpsertValuesIT.java:476)
>>> > >> > <<<
>>> > >> >
>>> > >> > I've seen this coming out of a few different tests (I think I've also
>>> > >> > run into it on my own, but that's another thing).
>>> > >> >
>>> > >> > Some of them look like the Jenkins build host is just over-taxed:
>>> > >> >
>>> > >> >>>>
>>> > >> > Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e7600000, 331350016, 0) failed; error='Cannot allocate memory' (errno=12)
>>> > >> > #
>>> > >> > # There is insufficient memory for the Java Runtime Environment to continue.
>>> > >> > # Native memory allocation (malloc) failed to allocate 331350016 bytes for committing reserved memory.
>>> > >> > # An error report file with more information is saved as:
>>> > >> > # /home/jenkins/jenkins-slave/workspace/Phoenix-master/phoenix-core/hs_err_pid26454.log
>>> > >> > Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007ea600000, 273678336, 0) failed; error='Cannot allocate memory' (errno=12)
>>> > >> > #
>>> > >> > <<<
>>> > >> >
>>> > >> > and
>>> > >> >
>>> > >> >>>>
>>> > >> > -------------------------------------------------------
>>> > >> >  T E S T S
>>> > >> > -------------------------------------------------------
>>> > >> > Build step 'Invoke top-level Maven targets' marked build as failure
>>> > >> > <<<
>>> > >> >
>>> > >> > Both of these issues are limited to the host "ubuntu-us1". Let me just
>>> > >> > remove him from the pool (on Phoenix-master) and see if that helps at all.
>>> > >> >
>>> > >> > I also see sporadic failures in some of the Flume tests:
>>> > >> >
>>> > >> >>>>
>>> > >> > Running org.apache.phoenix.flume.PhoenixSinkIT
>>> > >> > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.004 sec
>>> > >> > <<< FAILURE! - in org.apache.phoenix.flume.PhoenixSinkIT
>>> > >> > org.apache.phoenix.flume.PhoenixSinkIT  Time elapsed: 0.004 sec  <<< ERROR!
>>> > >> > java.lang.RuntimeException: java.io.IOException: Failed to save in any
>>> > >> > storage directories while saving namespace.
>>> > >> > Caused by: java.io.IOException: Failed to save in any storage directories
>>> > >> > while saving namespace.
>>> > >> >
>>> > >> > Running org.apache.phoenix.flume.RegexEventSerializerIT
>>> > >> > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.005 sec
>>> > >> > <<< FAILURE! - in org.apache.phoenix.flume.RegexEventSerializerIT
>>> > >> > org.apache.phoenix.flume.RegexEventSerializerIT  Time elapsed: 0.004 sec  <<< ERROR!
>>> > >> > java.lang.RuntimeException: java.io.IOException: Failed to save in any
>>> > >> > storage directories while saving namespace.
>>> > >> > Caused by: java.io.IOException: Failed to save in any storage directories
>>> > >> > while saving namespace.
>>> > >> > <<<
>>> > >> >
>>> > >> > I'm not sure what the error message means at a glance.
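>>> > >> >
>>> > >> > (Total guess: that message comes from the mini-cluster's NameNode when
>>> > >> > none of its storage directories are writable, which would usually mean a
>>> > >> > full or read-only disk on the slave. Something like
>>> > >> >
>>> > >> >   df -h /home/jenkins
>>> > >> >
>>> > >> > on the affected host would confirm or rule that out.)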
>>> > >> >
>>> > >> > For Phoenix-HBase-1.1:
>>> > >> >
>>> > >> >>>>
>>> > >> > org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
>>> > >> >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>>> > >> >         at java.lang.Thread.run(Thread.java:745)
>>> > >> > Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
>>> > >> >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
>>> > >> >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
>>> > >> >         ... 4 more
>>> > >> > 2016-04-28 22:54:35,497 WARN  [RS:0;hemera:41302] org.apache.hadoop.hbase.regionserver.HRegionServer(2279): error telling master we are up
>>> > >> > com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
>>> > >> >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>>> > >> >         at java.lang.Thread.run(Thread.java:745)
>>> > >> > Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
>>> > >> >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
>>> > >> >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
>>> > >> >         ... 4 more
>>> > >> >
>>> > >> >         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
>>> > >> >         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:318)
>>> > >> >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
>>> > >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2269)
>>> > >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:893)
>>> > >> >         at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:156)
>>> > >> >         at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:108)
>>> > >> >         at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:140)
>>> > >> >         at java.security.AccessController.doPrivileged(Native Method)
>>> > >> >         at javax.security.auth.Subject.doAs(Subject.java:356)
>>> > >> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
>>> > >> >         at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:307)
>>> > >> >         at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:138)
>>> > >> >         at java.lang.Thread.run(Thread.java:745)
>>> > >> > Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
>>> > >> >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>>> > >> >         at java.lang.Thread.run(Thread.java:745)
>>> > >> > Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
>>> > >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
>>> > >> >         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
>>> > >> >         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
>>> > >> >         ... 4 more
>>> > >> >
>>> > >> >         at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1235)
>>> > >> >         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:217)
>>> > >> >         ... 13 more
>>> > >> > <<<
>>> > >> >
>>> > >> > This error message shows up hit-or-miss, and it keeps hbase:namespace from
>>> > >> > being assigned (as the RSs can never report in to the hmaster). This is
>>> > >> > happening across a couple of the nodes (ubuntu-[3,4,6]). I had tried to look
>>> > >> > into this one over the weekend (and was led to a JDK8-built jar running on
>>> > >> > JDK7), but if I look at META-INF/MANIFEST.MF in the hbase-server-1.1.3.jar
>>> > >> > from central, I see it was built with 1.7.0_80 (which I think means the JDK8
>>> > >> > theory is a red herring). I'm really confused by this one, actually.
>>> > >> > Something must be amiss here.
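>>> > >> >
>>> > >> > (For anyone who wants to double-check the manifest, that was just:
>>> > >> >
>>> > >> >   unzip -p hbase-server-1.1.3.jar META-INF/MANIFEST.MF | grep Build-Jdk
>>> > >> >
>>> > >> > Keep in mind Build-Jdk only records the JDK Maven ran under, not the javac
>>> > >> > -source/-target, so it doesn't fully settle the question.)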
>>> > >> >
>>> > >> > For Phoenix-HBase-1.0:
>>> > >> >
>>> > >> > We see the same Phoenix-Flume failures, UpsertValuesIT failure, and
>>> > >> > timeouts on ubuntu-us1. There is one crash on H10, but that might just be
>>> > >> > bad luck.
>>> > >> >
>>> > >> > For Phoenix-HBase-0.98:
>>> > >> >
>>> > >> > Same UpsertValuesIT failure and failures on ubuntu-us1.
>>> > >> >
>>> > >> >
>>> > >> > James Taylor wrote:
>>> > >> >>
>>> > >> >> Anyone know why our Jenkins builds keep failing? Is it environmental and
>>> > >> >> is there anything we can do about it?
>>> > >> >>
>>> > >> >> Thanks,
>>> > >> >> James
>>> > >> >>
>>> > >> >
>>> > >>
>>> >
>>>
>>
>>
