hc-dev mailing list archives

From stack <st...@archive.org>
Subject Re: ATTN Open-source projects using HttpClient
Date Wed, 29 Sep 2004 23:53:55 GMT
Oleg Kalnichevski wrote:

>Thanks, Adam
>
>Should we decide to go on a spamming spree, these may also become
>potential victims ;-)
>
>  
>
Let me preempt the spam (smile).

I'm part of the webgroup at the Internet Archive (archive.org).  We've 
been using httpclient at the heart of our open source crawler Heritrix 
(crawler.archive.org) for near on a year now (I've added on the end a 
submission for the httpclient applications page).  I've just upgraded 
HEAD to use 3.0alpha2 and wanted to say a few words about the 
experience and how it's running.

Heads up.  The following message is a little long.

The upgrade took way longer than I anticipated, a couple of days rather 
than a couple of hours.  While some of the time was spent on refactoring 
only slightly related to the httpclient upgrade, and on testing to see that all 
the httpclient features we use still work post-upgrade, the bulk of the time 
was spent redoing our auth system to fit the redesigned httpclient 
auth system.  I had trouble figuring out how things work now in the 
absence of examples.  Our usage is a little out-of-the-ordinary in that we 
manage our own store of credentials and decide ourselves when to load them 
onto an HttpMethod.  Previously, HttpAuthenticator would select the scheme for 
me.  Now it seems I have to do it myself using AuthChallengeParser 
and then iterate over the returned challenges.  In general the new auth system 
changes look to be for the best.  It just cost time exploring.  
Thereafter, the remainder was spent undoing our own custom retry in favor 
of a custom HttpMethodRetryHandler, now part of 
httpclient core, studying the new configuration system to ensure correct 
usage, undoing old ways of specifying preferences, and exploiting the new 
preference granularity, particularly where it could make the crawler more 
robust (UNAMBIGUOUS_STATUS_LINE, STRICT_TRANSFER_ENCODING, 
STATUS_LINE_GARBAGE_LIMIT, etc.).
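For anyone facing the same migration, here is a hypothetical, standalone sketch of the scheme-selection step described above -- splitting raw challenge strings into a scheme-to-challenge map and iterating in preference order.  The class and method names are my own illustration (the real code uses httpclient's AuthChallengeParser output), and it is written against a modern JDK for brevity:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: map each challenge string to its scheme name,
// then pick the first scheme we support, in our preference order.
public class ChallengeSelector {

    // Our preference order, strongest first (illustrative).
    private static final String[] PREFERRED = {"digest", "basic"};

    // The scheme name is the token before the first space,
    // e.g. "Digest realm=..." -> "digest".
    public static Map<String, String> parse(String[] challenges) {
        Map<String, String> byScheme = new LinkedHashMap<>();
        for (String challenge : challenges) {
            int space = challenge.indexOf(' ');
            String scheme =
                (space < 0) ? challenge : challenge.substring(0, space);
            byScheme.put(scheme.toLowerCase(), challenge);
        }
        return byScheme;
    }

    // Iterate over the parsed challenges and return the first scheme
    // we know how to handle, or null if none match.
    public static String select(Map<String, String> byScheme) {
        for (String scheme : PREFERRED) {
            if (byScheme.containsKey(scheme)) {
                return scheme;
            }
        }
        return null;
    }
}
```

Given both a Basic and a Digest challenge, this picks Digest; the real credential lookup then happens against our own store.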

With the new lib in place, we pass our little suite of selftests 
(which includes auth tests of logins and of Basic and Digest auth), and random 
broad crawling shows comparable performance, the only weird 
exceptions having to do with timeouts on IBM SSL sockets.  Later I should 
have more detailed feedback on performance and robustness.

The IBM SSL socket timeout issues show up when I get an SSLSocket 
with a timeout (I set the timeout by getting a socket with the null-arg 
constructor and then doing an SSLSocket#connect with a timeout).  The 
exceptions do not happen when I use the Sun JVM 1.4.2.  These are probably 
IBM JVM issues but I'll list them here anyway:

1. The IBM JVM 141 (cxia321411-20030930) NPEs setting TcpNoDelay.  
Is anyone else seeing this?
 java.lang.NullPointerException
    at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)

2. Using the IBM JVM 142, it's saying the SSL connection is not open when we 
go to use the input streams.
 java.net.SocketException: Socket is not connected
    at java.net.Socket.getInputStream(Socket.java:726)
    at com.ibm.jsse.bs.getInputStream(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
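To make the setup concrete, here is a minimal sketch of how we obtain an SSLSocket with a connect timeout: create an unconnected socket with the no-arg createSocket(), then connect() with a timeout.  The class name, host, port, and timeout values are placeholders, and whether the no-arg factory method works can vary by JSSE provider:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Sketch: open an SSLSocket with a connect timeout by creating it
// unconnected, then calling connect() with a timeout ourselves.
public class TimedSslConnect {

    public static SSLSocket open(String host, int port, int timeoutMillis)
            throws IOException {
        SSLSocketFactory factory =
            (SSLSocketFactory) SSLSocketFactory.getDefault();
        // No-arg createSocket() returns an unconnected socket...
        SSLSocket socket = (SSLSocket) factory.createSocket();
        // ...so we can bound the TCP connect with a timeout.  (The SSL
        // handshake itself only happens later, on first I/O.)
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        // Also bound subsequent reads.
        socket.setSoTimeout(timeoutMillis);
        return socket;
    }
}
```

It is the timeouts set this way that seem to trip up the IBM JSSE in the two stack traces above.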

By way of feedback on the 3.0 API, I'll describe the two places where 
the API falls short of our requirements, forcing us to do yucky 
overlays.  First some context: the crawler must record the response 
headers and response content exactly as they come back over the wire, and 
it's supposed to be tenacious.

Regarding recording exactly what the server sent us, we overlay 
HttpConnection with a version that wraps the socket input and output 
streams.  Here's the diff:

+// HERITRIX import.
+import org.archive.util.HttpRecorder;
+
 /**
  * An abstraction of an HTTP {@link InputStream} and {@link OutputStream}
  * pair, together with the relevant attributes.
@@ -676,7 +679,6 @@
             highly interactive environments, such as some client/server
             situations. In such cases, nagling may be turned off through
             use of the TCP_NODELAY sockets option." */
-
             socket.setTcpNoDelay(this.params.getTcpNoDelay());
             socket.setSoTimeout(this.params.getSoTimeout());

@@ -701,8 +703,23 @@
             if (inbuffersize > 2048) {
                 inbuffersize = 2048;
             }
-            inputStream = new BufferedInputStream(socket.getInputStream(), inbuffersize);
-            outputStream = new BufferedOutputStream(socket.getOutputStream(), outbuffersize);
+            // START HERITRIX Change
+            HttpRecorder httpRecorder = HttpRecorder.getHttpRecorder();
+            if (httpRecorder == null) {
+                inputStream = new BufferedInputStream(
+                    socket.getInputStream(), inbuffersize);
+                outputStream = new BufferedOutputStream(
+                    socket.getOutputStream(), outbuffersize);
+            } else {
+                inputStream = httpRecorder.inputWrap((InputStream)
+                    (new BufferedInputStream(socket.getInputStream(),
+                    inbuffersize)));
+                outputStream = httpRecorder.outputWrap((OutputStream)
+                    (new BufferedOutputStream(socket.getOutputStream(),
+                    outbuffersize)));
+            }
+            // END HERITRIX change.
+
The other overlay we make is of HttpParser so we can persist through a 
bad header parse:

--- /home/stack/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpParser.java	2004-09-19 13:41:05.000000000 -0700
+++ src/java/org/apache/commons/httpclient/HttpParser.java	2004-09-29 14:23:03.000000000 -0700
@@ -185,11 +185,21 @@
                 // Otherwise we should have normal HTTP header line
                 // Parse the header name and value
                 int colon = line.indexOf(":");
+                // START HERITRIX Change
+                // Don't throw an exception if we can't parse.  We want
+                // to keep going even though the header is bad.  Rather,
+                // create a pseudo-header.
                 if (colon < 0) {
-                    throw new ProtocolException("Unable to parse header: " + line);
+                    // throw new ProtocolException("Unable to parse header: "
+                    //     + line);
+                    name = "HttpClient-Bad-Header-Line-Failed-Parse";
+                    value = new StringBuffer(line);
+                } else {
+                    name = line.substring(0, colon).trim();
+                    value = new StringBuffer(line.substring(colon + 1).trim());
                 }
-                name = line.substring(0, colon).trim();
-                value = new StringBuffer(line.substring(colon + 1).trim());
+                // END HERITRIX change.
             }
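Lifted out of the diff context, the lenient parse is small enough to show standalone.  The helper class below is my own framing (the patch edits HttpParser in place), but the logic is the same: a colon-less header line becomes a pseudo-header instead of a ProtocolException, so the crawl presses on:

```java
// Standalone version of the lenient header parse: instead of throwing
// on a colon-less header line, emit a pseudo-header so processing can
// continue.  Returns {name, value}.
public class LenientHeaderParse {

    public static String[] parseHeaderLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            // Bad header: keep going, recording the raw line under a
            // recognizable pseudo-header name.
            return new String[] {
                "HttpClient-Bad-Header-Line-Failed-Parse", line};
        }
        return new String[] {
            line.substring(0, colon).trim(),
            line.substring(colon + 1).trim()};
    }
}
```

Downstream code can then decide what to do with the pseudo-header, while a well-formed line parses as usual.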

I don't see ye ever making the socket streams available via the API.  
I've been following the list long enough to see that exposing these 
streams is a no-no -- and I can appreciate all the work done in the 
software keeping them encapsulated.  The second patch might be something 
to consider.  Apart from these two cases, the API is most amenable.

Thanks for the great software.
Yours,
St.Ack

P.S: Here is a submission for the 
http://jakarta.apache.org/commons/httpclient/applications.html page:

Heritrix (http://crawler.archive.org) is the Internet Archive's
(http://www.archive.org) open-source, extensible, web-scale,
archival-quality web crawler project.


>Oleg
>
>
>
>On Tue, 2004-09-28 at 20:37, Adam R. B. Jack wrote:
>  
>
>>>>On Mon, 2004-09-20 at 22:50, Oleg Kalnichevski wrote:
>>>>        
>>>>
>>>>>As far as I know the following projects rely on HttpClient 2.0 as a
>>>>>required or optional dependency
>>>>>
>>>>>* Apache Jakarta Slide (http://jakarta.apache.org/slide/)
>>>>>* Apache Jakarta Cactus (http://jakarta.apache.org/cactus/)
>>>>>* Apache Axis (http://ws.apache.org/axis/)
>>>>>* Apache XML-RPC (http://ws.apache.org/xmlrpc/)
>>>>>* Spring Framework (http://www.springframework.org/)
>>>>>* HtmlUnit (http://htmlunit.sourceforge.net/)
>>>>>* XINS (http://xins.sourceforge.net/)
>>>>>          
>>>>>
>>Just stumbled over this mail. Does the "Dependees" list here help give you
>>other possibles?
>>
>>
>>http://brutus.apache.org/gump/public/jakarta-commons/commons-httpclient/details.html
>>
>>regards,
>>
>>Adam
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: commons-httpclient-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: commons-httpclient-dev-help@jakarta.apache.org
>>
>>    
>>
>
>
>
>  
>



