Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CFAAF89D7 for ; Tue, 9 Aug 2011 01:12:29 +0000 (UTC) Received: (qmail 6337 invoked by uid 500); 9 Aug 2011 01:12:27 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 6217 invoked by uid 500); 9 Aug 2011 01:12:26 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 6209 invoked by uid 99); 9 Aug 2011 01:12:26 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 09 Aug 2011 01:12:26 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of anthony.ikeda.dev@gmail.com designates 209.85.215.170 as permitted sender) Received: from [209.85.215.170] (HELO mail-ey0-f170.google.com) (209.85.215.170) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 09 Aug 2011 01:12:22 +0000 Received: by eyd10 with SMTP id 10so3360838eyd.1 for ; Mon, 08 Aug 2011 18:12:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=P7W9GLjCNq/nJNJH1GfKeOEObM/l/33+XR/fTHguFGg=; b=jvZcDwfzuNitgEhWFsXjLJvXYj6Ft2fnUn+fWe9B9Sf2c02eAAwkiV0yR+urrhuiDS B0FMX70XvrfP4NHOMt0NA8L4g2OHyn+REjtHJx6Th8wul1gBD8R3xXKTPvg0e2GxlnW2 L+b97klCE/U59yADoIcwJDx24BZUSdXlhBI60= MIME-Version: 1.0 Received: by 10.213.10.134 with SMTP id p6mr1300690ebp.41.1312852320111; Mon, 08 Aug 2011 18:12:00 -0700 (PDT) Received: by 10.213.6.208 with HTTP; Mon, 8 Aug 2011 18:11:59 -0700 (PDT) In-Reply-To: <20110804231331.050558854cd8e3604bbb7dcb1d74d208.4412c733c4.wbe@email13.secureserver.net> References: <20110804231331.050558854cd8e3604bbb7dcb1d74d208.4412c733c4.wbe@email13.secureserver.net> Date: Mon, 8 Aug 2011 18:11:59 -0700 Message-ID: Subject: Re: Trying to find the problem with a broken pipe From: Anthony Ikeda To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=000e0cd3a07a2ece6f04aa084060 --000e0cd3a07a2ece6f04aa084060 Content-Type: text/plain; charset=ISO-8859-1 Tim do you know if this is the actual reason that is causing the broken pipe? I'm having a hard time convincing my team that modifying this value will fix the issue. Jonathan, do you know if there is a valid explanation on why Tim no longer has the problem based on this change? Anthony Ikeda On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder wrote: > I no longer get the error on the loader program. The steps I took to fix > it are increasing the thrift_max_message_length_in_mb msg length, > stopping cassandra, blowing away the prior data store, and then > restarting cassandra. > > Tim > > > -------- Original Message -------- > Subject: Re: Trying to find the problem with a broken pipe > From: aaron morton > Date: Fri, August 05, 2011 12:58 am > To: user@cassandra.apache.org > > It's probably a network thing. > > The only thing I can think of in cassandra is > thrift_max_message_length_in_mb in the config. That config setting will > result in a TException thrown on the server side (i think), not sure if > that makes the server kill the socket. I would hope the error returns to > the client. > > Perhaps check the server log. > > Cheers > > ----------------- > Aaron Morton > Freelance Cassandra Developer > @aaronmorton > http://www.thelastpickle.com > > On 4 Aug 2011, at 23:05, Tim Snyder wrote: > > > I am getting the same problem (Broken Pipe) on a loader program, after > > about 8 million read, write pairs. I am pushing serialized objects into > > a column with the program, the object it seems to be doing it on is much > > larger than the prior objects, so I am wondering if it is possibly a > > column size streaming issue through the thrift api? I am using Cassandra > > 0.8.0 and Hector 0.8.0-1 > > > > Tim > > > > -------- Original Message -------- > > Subject: Re: Trying to find the problem with a broken pipe > > From: Anthony Ikeda > > Date: Tue, August 02, 2011 10:43 pm > > To: user@cassandra.apache.org > > > >> Very interesting. After the second host goes down do you see > >> "me.prettyprint.hector.api.exceptions.HectorException: All host pools > >> marked down. Retry burden pushed out to client"? > > > > No, the last message is: > > 2011-08-02 08:43:06,561 INFO > > [me.prettyprint.cassandra.connection.HConnectionManager] - Client > > CassandraClient released to inactive or dead > > pool. Closing.> Does your client recover after a period of time? > > > > > > > > The application seems to be fine for now but my concern is the > > connection pooling as well - I mean do we have one pool or multiple? > > I'll post to the Hector user group about the pooling because the > > incident seems so isolated. We also have our infrastructure team looking > > into the communication between the application server and the cassandra > > nodes. > > > > > > So far it's still a mystery. > > > > > > > > > > > > On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona wrote: > > On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda > > wrote: > > > >> The link (which I may be misreading) > >> is > > > http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e > > > > > > I hadn't found that one, but I doubt that our issue is related to that. > > > > > >> It's only started happening today and happened on 2 occassions (8:43 > > and > >> 10:21) performing the same function (querying a column family). > >> It seems to be trying to access a connection on one of the servers > >> The client accesses the first node: > >> > >> 2011-08-02 08:43:06,541 ERROR > >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush > >> transport (to be expected if the pool is shutting down) in close for > > client: > >> CassandraClient > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Broken pipe > >> > >> ... > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host > > CassandraClient > >> > >> ... > >> > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > > AS DOWN > >> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > > on > >> shutdown: > >> > > > :{cassandradevrk1(10.130.202.34):9393}; > >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > > 49 > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> triggered on > >> > > > :{cassandradevrk1(10.130.202.34):9393} > >> 2011-08-02 08:43:06,544 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> complete on > >> > > > :{cassandradevrk1(10.130.202.34):9393} > >> 2011-08-02 08:43:06,544 INFO > >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >> detected as down was added to retry queue: > >> cassandradevrk1(10.130.202.34):9393 > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host CassandraClient< > >> cassandradevrk1:9393-33> > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: > >> me.prettyprint.hector.api.exceptions.HectorTransportException: > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Connection reset > >> > >> > >> Then it appears to try the second node and fails: > >> > >> 2011-08-02 08:43:06,556 INFO > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Client > >> CassandraClient released to inactive or dead > > pool. > >> Closing. > >> 2011-08-02 08:43:06,557 ERROR > >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush > >> transport (to be expected if the pool is shutting down) in close for > > client: > >> CassandraClient > >> > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Broken pipe > >> > >> 2011-08-02 08:43:06,558 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > > AS DOWN > >> TRIGGERED for host cassandradevrk2(10.130.202.35):9393 > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > > on > >> shutdown: > >> > > > :{cassandradevrk2(10.130.202.35):9393}; > >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > > 49 > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> triggered on > >> > > > :{cassandradevrk2(10.130.202.35):9393} > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> complete on > >> > > > :{cassandradevrk2(10.130.202.35):9393} > >> 2011-08-02 08:43:06,559 INFO > >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >> detected as down was added to retry queue: > >> cassandradevrk2(10.130.202.35):9393 > >> 2011-08-02 08:43:06,560 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host > > CassandraClient > >> 2011-08-02 08:43:06,560 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: > >> me.prettyprint.hector.api.exceptions.HectorTransportException: > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Connection reset > > > > > > > > Very interesting. After the second host goes down do you see > > "me.prettyprint.hector.api.exceptions.HectorException: All host pools > > marked down. Retry burden pushed out to client"? > > > > Does your client recover after a period of time? > > > >> > >> The process is the same at 10:21. > >> Are the exceptions related to any external events (e.g. node > >> restarts, network issues...)? > >> Not that I'm aware, unless there are firewall timeouts between the > >> application and the node servers. Let me find out. The cassandra log > > files > >> have no errors reported. > >> What versions of Hector and Cassandra are you running? > >> Cassandra 0.8.1, Hector 0.8.0-1 > > > > > > Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan > > to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive. > > Are you using that? We're also using tcpdump to capture packets when > > failures occur to see if there are anomalies in the network traffic. > > > > Jim > > > > > > > >> > >> > >> > >> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona > > wrote: > >>> > >>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda > >>> wrote: > >>>> I'm not sure if this is a problem with Hector or with Cassandra. > >>>> We seem to be seeing broken pipe issues with our connections on > > the > >>>> client > >>>> side (Exception below). A bit of googling finds possibly a problem > > with > >>>> the > >>>> amount of data we are trying to store, although I'm certain our > > datasets > >>>> are > >>>> not all that large. > >>> > >>> I'm not sure what you're referring to here. Large requests could > > lead > >>> to timeouts, but that's not what you're seeing here. Could you link > > to > >>> the page you're referencing? > >>> > >>>> A nodetool ring command doesn't seem to present any downed nodes: > >>>> Address DC Rack Status State Load > >>>> Owns > >>>> Token > >>>> > >>>> 153951716904446304929228999025275230571 > >>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB > >>>> 79.19% 118538200848404459763384037192174096102 > >>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB > >>>> 20.81% 153951716904446304929228999025275230571 > >>>> > >>>> There are no errors in the cassandra server logs. > >>>> > >>>> Are there any particular timeouts on connections that we need to > > be > >>>> aware > >>>> of? Or perhaps configure on the Cassandra nodes? Is this purely > > and > >>>> issue > >>>> with the Hector API configuration? > >>> > >>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml) > >>> and a Hector client-side timeout > >>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again, > >>> the "Broken pipe" error is not a timeout, it indicates that > > something > >>> happened to the underlying network socket. For example you will see > >>> those when a server node is restarted. > >>> > >>> Some questions that might help troubleshoot this: > >>> How often are these occurring? > >>> Does this affect both nodes in the cluster or just one? > >>> Are the exceptions related to any external events (e.g. node > > restarts, > >>> network issues...)? > >>> What versions of Hector and Cassandra are you running? > >>> > >>> Keep in mind that failures like this will normally be retried by > >>> Hector, resulting in no loss of data. For that reason, I think that > >>> exception is logged as a warning in the newest Hector versions. > >>> > >>> We've seen something similar, but more catastrophic because it > > affects > >>> connectivity to the entire cluster, not just a single node. See this > >>> post for more details: http://goo.gl/hrgkw So far we haven't > >>> identified the cause. > >>> > >>> Jim > >>> > >>>> Anthony > >>>> > >>>> 2011-08-02 08:43:06,541 ERROR > >>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not > > flush > >>>> transport (to be expected if the pool is shutting down) in close > > for > >>>> client: > >>>> CassandraClient > >>>> org.apache.thrift.transport.TTransportException: > >>>> java.net.SocketException: > >>>> Broken pipe > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247) > >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method) > >>>> at > >>>> > >>>> > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > >>>> at > >>>> > >>>> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >>>> at java.lang.reflect.Method.invoke(Method.java:597) > >>>> at > >>>> > >>>> > > > org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549) > >>>> at > > javax.servlet.http.HttpServlet.service(HttpServlet.java:617) > >>>> at > > javax.servlet.http.HttpServlet.service(HttpServlet.java:717) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > >>>> at > >>>> > >>>> > > > org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77) > >>>> at > >>>> > >>>> > > > org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) > >>>> at > >>>> > >>>> > > > org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > >>>> at > >>>> > >>>> > > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > >>>> at > >>>> > >>>> > > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) > >>>> at > >>>> > > org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190) > >>>> at > >>>> > > org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291) > >>>> at > >>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774) > >>>> at > >>>> > >>>> > > > org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703) > >>>> at > >>>> > >>>> > > > org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896) > >>>> at > >>>> > >>>> > > > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> Caused by: java.net.SocketException: Broken pipe > >>>> at java.net.SocketOutputStream.socketWrite0(Native Method) > >>>> at > >>>> > > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > >>>> at > > java.net.SocketOutputStream.write(SocketOutputStream.java:136) > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145) > >>>> ... 47 more > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK > > HOST AS > >>>> DOWN > >>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool > > state on > >>>> shutdown: > >>>> > >>>> > > > :{cassandradevrk1(10.130.202.34):9393}; > >>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; > > NumBeforeExhausted: 49 > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >>>> triggered on > >>>> > >>>> > > > :{cassandradevrk1(10.130.202.34):9393} > >>>> 2011-08-02 08:43:06,544 ERROR > >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >>>> complete on > >>>> > >>>> > > > :{cassandradevrk1(10.130.202.34):9393} > >>>> 2011-08-02 08:43:06,544 INFO > >>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >>>> detected as down was added to retry queue: > >>>> cassandradevrk1(10.130.202.34):9393 > >>>> 2011-08-02 08:43:06,544 WARN > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could > > not > >>>> fullfill request on this host > > CassandraClient > >>>> > >> > >> > > > > --000e0cd3a07a2ece6f04aa084060 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Tim do you know if this is the actual reason that is causing the broken pip= e? I'm having a hard time convincing my team that modifying this value = will fix the issue.

Jonathan, do you know if there is a = valid explanation on why Tim no longer has the problem based on this change= ?

Anthony Ikeda



On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder <tim@proinnovations.com<= /a>> wrote:
I no longer get the error on the loader pro= gram. The steps I took to fix
it are increasing the thrift_max_message_length_in_mb msg length,
stopping cassandra, blowing away the prior data store, and then
restarting cassandra.

Tim


-------- Original Message --------
Subject: Re: Trying to find the problem with a broken pipe
From: aaron morton <aaron@thelastpickle.com>
Date: Fri, August 05, 2011 12:58 am
To: user@cassandra.apache.org<= /a>

It's probably a network thing.

The only thing I can think of in cassandra is
thrift_max_message_length_in_mb in the config. That config setting will
result in a TException thrown on the server side (i think), not sure if
that makes the server kill the socket. I would hope the error returns to the client.

Perhaps check the server log.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thela= stpickle.com

On 4 Aug 2011, at 23:05, Tim Snyder wrote:

> I am getting the same problem (Broken Pipe) on a loader program, after=
> about 8 million read, write pairs. I am pushing serialized objects int= o
> a column with the program, the object it seems to be doing it on is mu= ch
> larger than the prior objects, so I am wondering if it is possibly a > column size streaming issue through the thrift api? I am using Cassand= ra
> 0.8.0 and Hector 0.8.0-1
>
> Tim
>
> -------- Original Message --------
> Subject: Re: Trying to find the problem with a broken pipe
> From: Anthony Ikeda <anthony.ikeda.dev@gmail.com>
> Date: Tue, August 02, 2011 10:43 pm
> To: user@cassandra.apache= .org
>
>> Very interesting. After the second host goes down do you see
>> "me.prettyprint.hector.api.exceptions.HectorException: All ho= st pools
>> marked down. Retry burden pushed out to client"?
>
> No, the last message is:
> 2011-08-02 08:43:06,561 INFO
> [me.prettyprint.cassandra.connection.HConnectionManager] - Client
> CassandraClient<cassandradevrk2:9393-49> released to inactive or= dead
> pool. Closing.> Does your client recover after a period of time? >
>
>
> The application seems to be fine for now but my concern is the
> connection pooling as well - I mean do we have one pool or multiple? > I'll post to the Hector user group about the pooling because the > incident seems so isolated. We also have our infrastructure team looki= ng
> into the communication between the application server and the cassandr= a
> nodes.
>
>
> So far it's still a mystery.
>
>
>
>
>
> On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <jim@anconafamily.com> wrote:
> On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda
> <anthony.ikeda.dev@g= mail.com> wrote:
>
>> The link (which I may be misreading)
>> is
> http://groups.google.com/group/he= ctor-users/browse_thread/thread/8d7004b6f85a0f2e
>
>
> I hadn't found that one, but I doubt that our issue is related to = that.
>
>
>> It's only started happening today and happened on 2 occassions= (8:43
> and
>> 10:21) performing the same function (querying a column family). >> It seems to be trying to access a connection on one of the servers=
>> The client accesses the first node:
>>
>> 2011-08-02 08:43:06,541 ERROR
>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not fl= ush
>> transport (to be expected if the pool is shutting down) in close f= or
> client:
>> CassandraClient<cassandradevrk1:9393-33>
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Broken pipe
>>
>> ...
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could n= ot
>> fullfill request on this host
> CassandraClient<cassandradevrk1:9393-33>
>>
>> ...
>>
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HO= ST
> AS DOWN
>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool st= ate
> on
>> shutdown:
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393};
>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhaust= ed:
> 49
>> 2011-08-02 08:43:06,543 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> triggered on
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393}
>> 2011-08-02 08:43:06,544 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> complete on
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393}
>> 2011-08-02 08:43:06,544 INFO
>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -<= br> > Host
>> detected as down was added to retry queue:
>> cassandradevrk1(10.130.202.34):9393
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could n= ot
>> fullfill request on this host CassandraClient<
>> cassandradevrk1:9393-33>
>> 2011-08-02 08:43:06,544 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Excepti= on:
>> me.prettyprint.hector.api.exceptions.HectorTransportException:
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Connection reset
>>
>>
>> Then it appears to try the second node and fails:
>>
>> 2011-08-02 08:43:06,556 INFO
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Client<= br> >> CassandraClient<cassandradevrk1:9393-33> released to inactiv= e or dead
> pool.
>> Closing.
>> 2011-08-02 08:43:06,557 ERROR
>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not fl= ush
>> transport (to be expected if the pool is shutting down) in close f= or
> client:
>> CassandraClient<cassandradevrk2:9393-49>
>>
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Broken pipe
>>
>> 2011-08-02 08:43:06,558 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HO= ST
> AS DOWN
>> TRIGGERED for host cassandradevrk2(10.130.202.35):9393
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool st= ate
> on
>> shutdown:
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.20= 2.35):9393};
>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhaust= ed:
> 49
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> triggered on
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.20= 2.35):9393}
>> 2011-08-02 08:43:06,559 ERROR
>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> Shutdown
>> complete on
>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.20= 2.35):9393}
>> 2011-08-02 08:43:06,559 INFO
>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -<= br> > Host
>> detected as down was added to retry queue:
>> cassandradevrk2(10.130.202.35):9393
>> 2011-08-02 08:43:06,560 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could n= ot
>> fullfill request on this host
> CassandraClient<cassandradevrk2:9393-49>
>> 2011-08-02 08:43:06,560 WARN
>> [me.prettyprint.cassandra.connection.HConnectionManager] - Excepti= on:
>> me.prettyprint.hector.api.exceptions.HectorTransportException:
>> org.apache.thrift.transport.TTransportException:
> java.net.SocketException:
>> Connection reset
>
>
>
> Very interesting. After the second host goes down do you see
> "me.prettyprint.hector.api.exceptions.HectorException: All host p= ools
> marked down. Retry burden pushed out to client"?
>
> Does your client recover after a period of time?
>
>>
>> The process is the same at 10:21.
>> Are the exceptions related to any external events (e.g. node
>> restarts, network issues...)?
>> Not that I'm aware, unless there are firewall timeouts between= the
>> application and the node servers. Let me find out. The cassandra l= og
> files
>> have no errors reported.
>> What versions of Hector and Cassandra are you running?
>> Cassandra 0.8.1, Hector 0.8.0-1
>
>
> Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan=
> to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive. > Are you using that? We're also using tcpdump to capture packets wh= en
> failures occur to see if there are anomalies in the network traffic. >
> Jim
>
>
>
>>
>>
>>
>> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <jim@anconafamily.com>
> wrote:
>>>
>>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
>>> <anthony.ike= da.dev@gmail.com> wrote:
>>>> I'm not sure if this is a problem with Hector or with = Cassandra.
>>>> We seem to be seeing broken pipe issues with our connectio= ns on
> the
>>>> client
>>>> side (Exception below). A bit of googling finds possibly a= problem
> with
>>>> the
>>>> amount of data we are trying to store, although I'm ce= rtain our
> datasets
>>>> are
>>>> not all that large.
>>>
>>> I'm not sure what you're referring to here. Large requ= ests could
> lead
>>> to timeouts, but that's not what you're seeing here. C= ould you link
> to
>>> the page you're referencing?
>>>
>>>> A nodetool ring command doesn't seem to present any do= wned nodes:
>>>> Address DC Rack Status State Load
>>>> Owns
>>>> Token
>>>>
>>>> 153951716904446304929228999025275230571
>>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB
>>>> 79.19% 118538200848404459763384037192174096102
>>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB
>>>> 20.81% 153951716904446304929228999025275230571
>>>>
>>>> There are no errors in the cassandra server logs.
>>>>
>>>> Are there any particular timeouts on connections that we n= eed to
> be
>>>> aware
>>>> of? Or perhaps configure on the Cassandra nodes? Is this p= urely
> and
>>>> issue
>>>> with the Hector API configuration?
>>>
>>> There is a server side timeout (rpc_timeout_in_ms in cassandra= .yaml)
>>> and a Hector client-side timeout
>>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But = again,
>>> the "Broken pipe" error is not a timeout, it indicat= es that
> something
>>> happened to the underlying network socket. For example you wil= l see
>>> those when a server node is restarted.
>>>
>>> Some questions that might help troubleshoot this:
>>> How often are these occurring?
>>> Does this affect both nodes in the cluster or just one?
>>> Are the exceptions related to any external events (e.g. node > restarts,
>>> network issues...)?
>>> What versions of Hector and Cassandra are you running?
>>>
>>> Keep in mind that failures like this will normally be retried = by
>>> Hector, resulting in no loss of data. For that reason, I think= that
>>> exception is logged as a warning in the newest Hector versions= .
>>>
>>> We've seen something similar, but more catastrophic becaus= e it
> affects
>>> connectivity to the entire cluster, not just a single node. Se= e this
>>> post for more details: http://goo.gl/hrgkw So far we haven't
>>> identified the cause.
>>>
>>> Jim
>>>
>>>> Anthony
>>>>
>>>> 2011-08-02 08:43:06,541 ERROR
>>>> [me.prettyprint.cassandra.connection.HThriftClient] - Coul= d not
> flush
>>>> transport (to be expected if the pool is shutting down) in= close
> for
>>>> client:
>>>> CassandraClient<cassandradevrk1:9393-33>
>>>> org.apache.thrift.transport.TTransportException:
>>>> java.net.SocketException:
>>>> Broken pipe
>>>> at
>>>>
>>>>
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTranspor= t.java:147)
>>>> at
>>>>
>>>>
> org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.ja= va:156)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.= java:85)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFail= over(HConnectionManager.java:232)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailov= er(KeyspaceServiceImpl.java:131)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(Keyspace= ServiceImpl.java:289)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(= ThriftSliceQuery.java:53)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(= ThriftSliceQuery.java:49)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceA= ndMeasure(KeyspaceOperationCallback.java:20)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKe= yspace.java:85)
>>>> at
>>>>
>>>>
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftS= liceQuery.java:48)
>>>> at
>>>>
>>>>
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.= read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(Standa= rdFindRegistryPersistenceStrategy.java:237)
>>>> at
>>>>
>>>>
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.= read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPe= rsistenceStrategy.java:277)
>>>> at
>>>>
>>>>
> com.wsgc.services.registry.registryservice.impl.service.StandardRegist= ryService.getRegistriesByProfileId(StandardRegistryService.java:327)
>>>> at
>>>>
>>>>
> com.wsgc.services.registry.webapp.impl.RegistryServicesController.getR= egistriesByProfileId(RegistryServicesController.java:247)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>>>> at
>>>>
>>>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j= ava:39)
>>>> at
>>>>
>>>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess= orImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at
>>>>
>>>>
> org.springframework.web.bind.annotation.support.HandlerMethodInvoker.i= nvokeHandlerMethod(HandlerMethodInvoker.java:175)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandler= Adapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandler= Adapter.handle(AnnotationMethodHandlerAdapter.java:409)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.DispatcherServlet.doDispatch(Dispatche= rServlet.java:774)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.DispatcherServlet.doService(Dispatcher= Servlet.java:719)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.FrameworkServlet.processRequest(Framew= orkServlet.java:644)
>>>> at
>>>>
>>>>
> org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServle= t.java:549)
>>>> at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
>>>> at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli= cationFilterChain.java:290)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi= lterChain.java:206)
>>>> at
>>>>
>>>>
> org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal= (HiddenHttpMethodFilter.java:77)
>>>> at
>>>>
>>>>
> org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRe= questFilter.java:76)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli= cationFilterChain.java:235)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi= lterChain.java:206)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperVa= lve.java:233)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextVa= lve.java:191)
>>>> at
>>>>
>>>>
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(Authenticat= orBase.java:563)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.ja= va:127)
>>>> at
>>>>
>>>>
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.ja= va:102)
>>>> at
>>>>
>>>>
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValv= e.java:109)
>>>> at
>>>>
>>>>
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java= :298)
>>>> at
>>>>
> org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)<= br> >>>> at
>>>>
> org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291) >>>> at
>>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.ja= va:774)
>>>> at
>>>>
>>>>
> org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.jav= a:703)
>>>> at
>>>>
>>>>
> org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocke= t.java:896)
>>>> at
>>>>
>>>>
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPo= ol.java:690)
>>>> at java.lang.Thread.run(Thread.java:662)
>>>> Caused by: java.net.SocketException: Broken pipe
>>>> at java.net.SocketOutputStream.socketWrite0(Native Method)=
>>>> at
>>>>
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) >>>> at
> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>>> at
>>>>
>>>>
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTranspor= t.java:145)
>>>> ... 47 more
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] -= MARK
> HOST AS
>>>> DOWN
>>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] -= Pool
> state on
>>>> shutdown:
>>>>
>>>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393};
>>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
> NumBeforeExhausted: 49
>>>> 2011-08-02 08:43:06,543 ERROR
>>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool= ] -
> Shutdown
>>>> triggered on
>>>>
>>>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393}
>>>> 2011-08-02 08:43:06,544 ERROR
>>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool= ] -
> Shutdown
>>>> complete on
>>>>
>>>>
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.20= 2.34):9393}
>>>> 2011-08-02 08:43:06,544 INFO
>>>> [me.prettyprint.cassandra.connection.CassandraHostRetrySer= vice] -
> Host
>>>> detected as down was added to retry queue:
>>>> cassandradevrk1(10.130.202.34):9393
>>>> 2011-08-02 08:43:06,544 WARN
>>>> [me.prettyprint.cassandra.connection.HConnectionManager] -= Could
> not
>>>> fullfill request on this host
> CassandraClient<cassandradevrk1:9393-33>
>>>>
>>
>>
>


--000e0cd3a07a2ece6f04aa084060--