Message-ID: <415B4B13.40102@archive.org>
Date: Wed, 29 Sep 2004 16:53:55 -0700
From: stack
To: Commons HttpClient Project
Subject: Re: ATTN Open-source projects using HttpClient
In-Reply-To: <1096397079.2677.9.camel@localhost.localdomain>
List-Id: "Commons HttpClient Project"

Oleg Kalnichevski wrote:

>Thanks, Adam
>
>Should we decide to go on a spamming spree, these may also become
>potential victims ;-)
>

Let me preempt the spam (smile).

I'm part of the webgroup at the Internet Archive (archive.org). We've
been using httpclient at the heart of our open source crawler Heritrix
(crawler.archive.org) for near on a year now (I've added on the end a
submission for the httpclient applications page). I've just upgraded
HEAD to use 3.0alpha2 and was going to say a few words about the
experience and how it's running.

Heads up. The following message is a little long.

The upgrade took way longer than I anticipated, a couple of days rather
than a couple of hours.
While some of the time was spent on refactoring only slightly related
to the httpclient upgrade and on testing that all the httpclient
features we use still work post upgrade, the bulk of the time was spent
redoing our auth system to fit the redesigned httpclient auth system.
I had trouble figuring out how things work now in the absence of
examples. Our usage is a little out of the ordinary in that we manage
our own store of credentials and manage when to load them onto an
httpmethod. Previously, HttpAuthenticator would select the scheme for
me. Now it seems like I have to do it myself using AuthChallengeParser
and then iterate over the returns. In general the new auth system
changes look to be for the best. It just cost time exploring.

Thereafter, the remainder was spent undoing our own custom retry to
instead use a custom HttpMethodRetryHandler, which is now part of
httpclient core; studying the new configuration system to ensure
correct usage; undoing old ways of specifying preferences; and
exploiting the new preference granularity, particularly where it could
make the crawler more robust (UNAMBIGUOUS_STATUS_LINE,
STRICT_TRANSFER_ENCODING, STATUS_LINE_GARBAGE_LIMIT, etc).

With the new lib in place, we pass all of our little suite of selftests
(which includes auth tests of logins and of Basic and Digest auth), and
random broad crawling shows comparable performance, with the only weird
exceptions having to do with timeouts on IBM SSL sockets. Later I
should have more detailed feedback on performance and robustness.

I see the IBM SSL socket timeout issues when I get an SSLSocket with a
timeout (I set the timeout by getting a socket with the null arg
constructor then doing an SSLSocket$connect with a timeout). The
exceptions do not happen when I use SUN JVM 1.4.2. These are probably
IBM JVM issues but I'll list them here anyways:

1. The IBM JVM 141 (cxia321411-20030930) NPEs when setting TcpNoDelay.
Is anyone else seeing this?

java.lang.NullPointerException
    at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)

2. Using the IBM JVM 142, it's saying the SSL connection is not open
when we go to use the input streams.

java.net.SocketException: Socket is not connected
    at java.net.Socket.getInputStream(Socket.java:726)
    at com.ibm.jsse.bs.getInputStream(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)

By way of feedback on the 3.0 API, I'll describe the two places where
the API falls short of our requirements, forcing us to do yucky
overlays.

First some context. The crawler must record the response headers and
response content exactly as they come back over the wire, and it's
supposed to be tenacious.

To record exactly what the server sent us, we overlay HttpConnection
with a version that wraps the socket input and output streams. Here's
the diff:

+// HERITRIX import.
+import org.archive.util.HttpRecorder;
+
 /**
  * An abstraction of an HTTP {@link InputStream} and {@link OutputStream}
  * pair, together with the relevant attributes.
@@ -676,7 +679,6 @@
                highly interactive environments, such as some client/server
                situations. In such cases, nagling may be turned off through
                use of the TCP_NODELAY sockets option."
             */
-            socket.setTcpNoDelay(this.params.getTcpNoDelay());
             socket.setSoTimeout(this.params.getSoTimeout());
@@ -701,8 +703,23 @@
             if (inbuffersize > 2048) {
                 inbuffersize = 2048;
             }
-            inputStream = new BufferedInputStream(socket.getInputStream(), inbuffersize);
-            outputStream = new BufferedOutputStream(socket.getOutputStream(), outbuffersize);
+            // START HERITRIX Change
+            HttpRecorder httpRecorder = HttpRecorder.getHttpRecorder();
+            if (httpRecorder == null) {
+                inputStream = new BufferedInputStream(
+                    socket.getInputStream(), inbuffersize);
+                outputStream = new BufferedOutputStream(
+                    socket.getOutputStream(), outbuffersize);
+            } else {
+                inputStream = httpRecorder.inputWrap((InputStream)
+                    (new BufferedInputStream(socket.getInputStream(),
+                    inbuffersize)));
+                outputStream = httpRecorder.outputWrap((OutputStream)
+                    (new BufferedOutputStream(socket.getOutputStream(),
+                    outbuffersize)));
+            }
+            // END HERITRIX change.
+

The other overlay we make is of HttpParser so we can persist through a
bad header parse:

/apache/commons/httpclient/HttpParser.java src/java/org/apache/commons/httpclient/HttpParser.java
--- /home/stack/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpParser.java 2004-09-19 13:41:05.000000000 -0700
+++ src/java/org/apache/commons/httpclient/HttpParser.java 2004-09-29 14:23:03.000000000 -0700
@@ -185,11 +185,21 @@
                 // Otherwise we should have normal HTTP header line
                 // Parse the header name and value
                 int colon = line.indexOf(":");
+                // START HERITRIX Change
+                // Don't throw an exception if can't parse. We want to keep
+                // going even though header is bad. Rather, create
+                // pseudo-header.
                 if (colon < 0) {
-                    throw new ProtocolException("Unable to parse header: " + line);
+                    // throw new ProtocolException("Unable to parse header: " +
+                    //     line);
+                    name = "HttpClient-Bad-Header-Line-Failed-Parse";
+                    value = new StringBuffer(line);
+
+                } else {
+                    name = line.substring(0, colon).trim();
+                    value = new StringBuffer(line.substring(colon + 1).trim());
                 }
-                name = line.substring(0, colon).trim();
-                value = new StringBuffer(line.substring(colon + 1).trim());
+                // END HERITRIX change.
             }

I don't see ye ever making the socket streams available via the API.
I've been following the list long enough to see that exposing these
streams is a no-no -- and I can appreciate all the work done in the
software keeping them encapsulated. The second patch might be something
to consider.

Apart from these two cases, the API is most amenable.

Thanks for the great software.

Yours,
St.Ack

P.S: Here is a submission for the
http://jakarta.apache.org/commons/httpclient/applications.html page:

Heritrix (http://crawler.archive.org) is the Internet Archive's
(http://www.archive.org) open-source, extensible, web-scale,
archival-quality web crawler project.

>Oleg
>
>On Tue, 2004-09-28 at 20:37, Adam R. B. Jack wrote:
>
>>>>On Mon, 2004-09-20 at 22:50, Oleg Kalnichevski wrote:
>>>>
>>>>>As far as I know the following projects rely on HttpClient 2.0 as a
>>>>>required or optional dependency
>>>>>
>>>>>* Apache Jakarta Slide (http://jakarta.apache.org/slide/)
>>>>>* Apache Jakarta Cactus (http://jakarta.apache.org/cactus/)
>>>>>* Apache Axis (http://ws.apache.org/axis/)
>>>>>* Apache XML-RPC (http://ws.apache.org/xmlrpc/)
>>>>>* Spring Framework (http://www.springframework.org/)
>>>>>* HtmlUnit (http://htmlunit.sourceforge.net/)
>>>>>* XINS (http://xins.sourceforge.net/)
>>>>>
>>Just stumbled over this mail.
>>Does the "Dependees" list here help give you
>>other possibles?
>>
>>http://brutus.apache.org/gump/public/jakarta-commons/commons-httpclient/details.html
>>
>>regards,
>>
>>Adam
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: commons-httpclient-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: commons-httpclient-dev-help@jakarta.apache.org
>>
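
For anyone wanting to try the idea behind the HttpParser patch in the
message above without patching HttpClient, here is a minimal
stand-alone sketch. It only illustrates the behaviour the patch
describes: a header line with no colon is kept as a pseudo-header
instead of raising an exception. The class and method names
(LenientHeaderLine, parse) are hypothetical and are not part of
HttpClient; only the pseudo-header name is taken from the patch.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

/** Hypothetical helper for illustration; not part of HttpClient. */
final class LenientHeaderLine {

    /** Pseudo-header name used by the patch quoted in the message. */
    static final String BAD_HEADER_NAME =
            "HttpClient-Bad-Header-Line-Failed-Parse";

    private LenientHeaderLine() {
    }

    /**
     * Splits a raw header line into a name/value pair.
     * A line without a colon is not rejected: the whole line becomes
     * the value of a pseudo-header so the caller can keep going.
     */
    static Map.Entry<String, String> parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            // Bad header: keep the raw line rather than throwing.
            return new SimpleImmutableEntry<>(BAD_HEADER_NAME, line);
        }
        String name = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return new SimpleImmutableEntry<>(name, value);
    }

    public static void main(String[] args) {
        System.out.println(parse("Content-Type: text/html"));
        System.out.println(parse("this line has no colon"));
    }
}
```

The trade-off is the same one the patch makes: a crawler keeps the
malformed line for its records, whereas a stricter client would
probably still want the exception.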