hc-dev mailing list archives

From Michael Becke <be...@u.washington.edu>
Subject Re: ATTN Open-source projects using HttpClient
Date Thu, 30 Sep 2004 03:24:39 GMT
Hi St.Ack,

Many thanks for taking the time to make such a detailed response.  I  
haven't had time to fully digest (no pun intended) your message, but  
I'll take a look at it again in the morning.  In the meantime I've  
added Heritrix to the applications page in CVS, which will be added to  
the site next time it's published.

Mike

On Sep 29, 2004, at 7:53 PM, stack wrote:

> Oleg Kalnichevski wrote:
>
>> Thanks, Adam
>>
>> Should we decide to go on a spamming spree, these may also become
>> potential victims ;-)
>>
>>
> Let me preempt the spam (smile).
>
> I'm part of the webgroup at the Internet Archive (archive.org).  We've  
> been using httpclient at the heart of our open source crawler Heritrix  
> (crawler.archive.org) for near on a year now (I've added on the end a  
> submission for the httpclient applications page).  I've just upgraded  
> HEAD to use 3.0alpha2 and was going to say a few words about the  
> experience and how it's running.
>
> Heads up.  The following message is a little long.
>
> The upgrade took way longer than I anticipated, a couple of days  
> rather than a couple of hours.  While some of the time was spent on  
> refactoring only slightly related to the httpclient upgrade and  
> testing to see all httpclient used features still work post upgrade,  
> the bulk of the time was spent on redoing our auth system to fit the  
> redesigned httpclient auth system. I had trouble figuring out how  
> things work now in the absence of examples. Our usage is a little  
> out-of-the-ordinary in that we manage our own store of credentials and  
> manage when to load them onto a httpmethod.  Previously,  
> HttpAuthenticator would select the scheme for me.  Now it seems like I  
> have to do it myself using AuthChallengeParser and then iterate over  
> the returns.  In general the new auth system changes look to be for  
> the best.  It just cost time exploring.  Thereafter, the remainder was  
> spent undoing our own custom retry to instead use a custom  
> HttpMethodRetryHandler that is now part of httpclient core, study of  
> the new configuration system to ensure correct usage, undoing old ways  
> of specifying preferences, and exploiting new preference granularity  
> particularly where it could make the crawler more robust  
> (UNAMBIGUOUS_STATUS_LINE, STRICT_TRANSFER_ENCODING,  
> STATUS_LINE_GARBAGE_LIMIT, etc).
>
> With the new lib in place, we pass all of our little suite of  
> selftests (includes Auth tests of logins and Basic and Digest Auth),  
> and random broad crawling shows performance as comparable, with the  
> only weird exceptions having to do with timeouts on IBM SSL Sockets.  
> Later I should have more detailed feedback on performance and  
> robustness.
>
> I see the IBM SSL socket timeout issues when I get an SSLSocket  
> with a timeout (I set the timeout by getting a socket with the null  
> arg constructor then doing an SSLSocket$connect with a timeout).  The  
> exceptions do not happen when I use SUN JVM 1.4.2.  These are probably  
> IBM JVM issues but I'll list them here anyways:
>
> 1. The IBM JVM 141 (cxia321411-20030930) NPEs setting TcpNoDelay.   
> Is anyone else seeing this?
> java.lang.NullPointerException
>    at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source)
>    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683)
>    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
>
> 2. Using the IBM JVM 142, it's saying the SSL connection is not open  
> when we go to use the input streams.
> java.net.SocketException: Socket is not connected
>    at java.net.Socket.getInputStream(Socket.java:726)
>    at com.ibm.jsse.bs.getInputStream(Unknown Source)
>    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715)
>    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
>
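The connect-with-timeout approach described above (get an unconnected socket from the factory, then connect it with an explicit timeout) might look roughly like the sketch below. The class name `TimedConnect` and its method are hypothetical and are not part of HttpClient or Heritrix; only standard `java.net` and `javax.net` calls are used. It is a sketch under those assumptions, not the code that produced the traces above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical helper; the name is an assumption made for illustration. */
class TimedConnect {

    /**
     * Asks the factory for an unconnected socket (null-arg createSocket),
     * then connects it with a connect timeout given in milliseconds.
     * I/O failures, including a connect timeout, are rethrown unchecked.
     */
    static Socket open(SocketFactory factory, String host, int port,
            int timeoutMillis) {
        Socket socket = null;
        try {
            socket = factory.createSocket();
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return socket;
        } catch (IOException e) {
            if (socket != null) {
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if close also fails.
                }
            }
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing `javax.net.ssl.SSLSocketFactory.getDefault()` as the factory should yield an SSL socket on JVMs whose default factory supports unconnected sockets; whether the IBM JVMs mentioned above behave the same way is exactly what the two traces call into question.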
> By way of feedback on the 3.0 API, I'll describe the two places where  
> the API falls short of our requirements, forcing us to do yucky  
> overlays.  First some context.  The crawler must record the response  
> headers and response content exactly as they come back over the wire,  
> and it's supposed to be tenacious.
>
> As for recording exactly what the server sent us, we overlay  
> HttpConnection with a version that wraps the socket input and output  
> streams.  Here's the diff:
>
> +// HERITRIX import.
> +import org.archive.util.HttpRecorder;
> +
> /**
>  * An abstraction of an HTTP {@link InputStream} and {@link OutputStream}
>  * pair, together with the relevant attributes.
> @@ -676,7 +679,6 @@
>             highly interactive environments, such as some client/server
>             situations. In such cases, nagling may be turned off through
>             use of the TCP_NODELAY sockets option." */
> -
>             socket.setTcpNoDelay(this.params.getTcpNoDelay());
>             socket.setSoTimeout(this.params.getSoTimeout());
>
> @@ -701,8 +703,23 @@
>             if (inbuffersize > 2048) {
>                 inbuffersize = 2048;
>             }
> -            inputStream = new BufferedInputStream(socket.getInputStream(), inbuffersize);
> -            outputStream = new BufferedOutputStream(socket.getOutputStream(), outbuffersize);
> +            // START HERITRIX Change
> +            HttpRecorder httpRecorder = HttpRecorder.getHttpRecorder();
> +            if (httpRecorder == null) {
> +                inputStream = new BufferedInputStream(
> +                    socket.getInputStream(), inbuffersize);
> +                outputStream = new BufferedOutputStream(
> +                    socket.getOutputStream(), outbuffersize);
> +            } else {
> +                inputStream = httpRecorder.inputWrap((InputStream)
> +                    (new BufferedInputStream(socket.getInputStream(),
> +                    inbuffersize)));
> +                outputStream = httpRecorder.outputWrap((OutputStream)
> +                    (new BufferedOutputStream(socket.getOutputStream(),
> +                    outbuffersize)));
> +            }
> +            // END HERITRIX change.
> +
>
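To illustrate the idea behind the overlay above, here is a minimal sketch of an input stream wrapper that keeps a copy of every byte read through it. `RecordingInputStream` is a hypothetical stand-in written for this note; it is not Heritrix's `HttpRecorder`, whose internals are not shown in this message, and it records into memory only for the sake of a short example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical tee stream: passes reads through and keeps a copy. */
class RecordingInputStream extends FilterInputStream {
    private final ByteArrayOutputStream record = new ByteArrayOutputStream();

    RecordingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            record.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            record.write(buf, off, n);
        }
        return n;
    }

    /** Mark/reset is declared unsupported so bytes are not recorded twice. */
    @Override
    public boolean markSupported() {
        return false;
    }

    /** Copy of everything read so far. Skipped bytes are not recorded. */
    byte[] recordedBytes() {
        return record.toByteArray();
    }

    /** Demo helper: drains the given bytes through the wrapper. */
    static byte[] recordAll(byte[] wire) {
        RecordingInputStream in = new RecordingInputStream(
            new BufferedInputStream(new ByteArrayInputStream(wire), 2048));
        byte[] buf = new byte[16];
        try {
            while (in.read(buf, 0, buf.length) != -1) {
                // Keep reading until end of stream.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.recordedBytes();
    }
}
```

Wrapping the buffered socket stream, as the diff does, means the recorder sees exactly the bytes that the HTTP parsing code consumes.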
> The other overlay we make is of HttpParser so we can persist through a  
> bad header parse:
>
> /apache/commons/httpclient/HttpParser.java src/java/org/apache/commons/httpclient/HttpParser.java
> --- /home/stack/bin/commons-httpclient-3.0-alpha2/src/java/org/apache/commons/httpclient/HttpParser.java        2004-09-19 13:41:05.000000000 -0700
> +++ src/java/org/apache/commons/httpclient/HttpParser.java      2004-09-29 14:23:03.000000000 -0700
> @@ -185,11 +185,21 @@
>                 // Otherwise we should have normal HTTP header line
>                 // Parse the header name and value
>                 int colon = line.indexOf(":");
> +                // START HERITRIX Change
> +                // Don't throw an exception if can't parse.  We want to keep
> +                // going even though header is bad. Rather, create
> +                // pseudo-header.
>                 if (colon < 0) {
> -                    throw new ProtocolException("Unable to parse header: " + line);
> +                    // throw new ProtocolException("Unable to parse header: " +
> +                    //      line);
> +                    name = "HttpClient-Bad-Header-Line-Failed-Parse";
> +                    value = new StringBuffer(line);
> +
> +                } else {
> +                    name = line.substring(0, colon).trim();
> +                    value = new StringBuffer(line.substring(colon + 1).trim());
>                 }
> -                name = line.substring(0, colon).trim();
> -                value = new StringBuffer(line.substring(colon + 1).trim());
> +               // END HERITRIX change.
>             }
>
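The lenient parsing rule in the patch above can be shown in isolation. The sketch below is a standalone illustration, not HttpClient's `HttpParser`: the class name `LenientHeaderLine` and the use of `Map.Entry` are assumptions made for the example. Only the pseudo-header name and the colon-splitting logic are taken from the diff.

```java
import java.util.AbstractMap;
import java.util.Map;

/** Hypothetical standalone version of the patched header-line handling. */
class LenientHeaderLine {
    /** Pseudo-header name used in the patch for lines that fail to parse. */
    static final String BAD_HEADER_NAME =
        "HttpClient-Bad-Header-Line-Failed-Parse";

    /**
     * Splits "Name: value" on the first colon and trims both parts.
     * A line without a colon does not raise an exception; it is returned
     * whole as the value of a pseudo-header so that parsing can continue.
     */
    static Map.Entry<String, String> parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            return new AbstractMap.SimpleImmutableEntry<>(BAD_HEADER_NAME, line);
        }
        String name = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return new AbstractMap.SimpleImmutableEntry<>(name, value);
    }
}
```

Keeping the raw line as the pseudo-header's value preserves what the server actually sent, which matters for a crawler that archives responses verbatim.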
> I don't see ye ever making the socket streams available via the API.   
> I've been following the list long enough to see that exposing these  
> streams is a no-no -- and I can appreciate all the work done in the  
> software keeping them encapsulated.  The second patch might be  
> something to consider.  Apart from these two cases, the API is most  
> amenable.
>
> Thanks for the great software.
> Yours,
> St.Ack
>
> P.S: Here is a submission for the  
> http://jakarta.apache.org/commons/httpclient/applications.html page:
>
> Heritrix (http://crawler.archive.org) is the Internet Archive's
> (http://www.archive.org) open-source, extensible, web-scale,
> archival-quality web crawler project.
>
>
>> Oleg
>>
>>
>>
>> On Tue, 2004-09-28 at 20:37, Adam R. B. Jack wrote:
>>
>>>>> On Mon, 2004-09-20 at 22:50, Oleg Kalnichevski wrote:
>>>>>
>>>>>> As far as I know the following projects rely on HttpClient 2.0 as a
>>>>>> required or optional dependency
>>>>>>
>>>>>> * Apache Jakarta Slide (http://jakarta.apache.org/slide/)
>>>>>> * Apache Jakarta Cactus (http://jakarta.apache.org/cactus/)
>>>>>> * Apache Axis (http://ws.apache.org/axis/)
>>>>>> * Apache XML-RPC (http://ws.apache.org/xmlrpc/)
>>>>>> * Spring Framework (http://www.springframework.org/)
>>>>>> * HtmlUnit (http://htmlunit.sourceforge.net/)
>>>>>> * XINS (http://xins.sourceforge.net/)
>>>>>>
>>> Just stumbled over this mail. Does the "Dependees" list here help  
>>> give you
>>> other possibles?
>>>
>>>
>>> http://brutus.apache.org/gump/public/jakarta-commons/commons-httpclient/details.html
>>>
>>> regards,
>>>
>>> Adam
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:  
>>> commons-httpclient-dev-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail:  
>>> commons-httpclient-dev-help@jakarta.apache.org
>>>
>>>
>>
>>
>>
>>
>
>
>

