hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Trying to follow 301 redirects results in 404 error
Date Sun, 25 Mar 2012 19:07:23 GMT
On Sun, 2012-03-25 at 14:19 -0400, Uncle wrote:
> > It is not HttpClient reporting a wrong response status. It is the server
> > behaving incorrectly. I get the same 404 when accessing the location
> > directly.
> 
> What do you mean "directly"?
> 

Without redirect.

> > The problem is that the server does not correctly handle URI
> > fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> > state how fragments in redirect locations should be handled. So, in my
> > opinion it is a server side issue. 
> 
> In my opinion, if 5 clients (HttpURLConnection, HttpClient, Chrome, Safari, Firefox)
> try to hit the URL, and 4 of them do so successfully and one does not, the issue is with the
> one client, not with the server.  Many URLs are poorly formed or ambiguous, yet most clients
> take extra steps to access them, which makes them more useful.

HttpClient is not a browser but you are certainly entitled to have a
different opinion. 

>  I think that HttpClient should either do that or provide facilities for doing so.
> 

It does. One can handle redirects differently by implementing a custom
RedirectStrategy that rewrites malformed redirect URIs in a way that is
acceptable in the context of a specific application.
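
A minimal sketch of the fragment-stripping part of such a rewrite, using only java.net.URI. The class and method names (FragmentStripper, stripFragment, toRequestUri) are hypothetical and not part of HttpClient. One possible place to call it is an overridden createLocationURI in a DefaultRedirectStrategy subclass (that method appears in the stack trace quoted below); the wiring is an assumption and is not shown here.

```java
import java.net.URI;

// Hypothetical helper, not part of HttpClient: removes the fragment from a
// redirect Location value before it is turned into a request URI.
final class FragmentStripper {
    private FragmentStripper() {}

    // Returns the location with any "#fragment" suffix removed.
    static String stripFragment(String location) {
        int hash = location.indexOf('#');
        return hash < 0 ? location : location.substring(0, hash);
    }

    // Parses the fragment-free location. URI.create throws
    // IllegalArgumentException if the remainder is still not a valid URI.
    static URI toRequestUri(String location) {
        return URI.create(stripFragment(location));
    }
}
```

Applied to the Location header quoted below, this would request the path without the #axzz1pdAzTzT2 part, which is what the server appears to stumble over.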

> > The URL has illegal character(s), which is the reason why the redirect
> > fails. 
> 
> The Java toolkit and browsers URLEncode the URL, which avoids this problem. This seems
> like a good general approach when redirecting.
> 

See above.
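
For the illegal-character case, the rewriting could look roughly like the sketch below. It percent-encodes, as UTF-8, every character that is not a legal URI character and leaves everything else (including existing %XX escapes) untouched. LocationEncoder and encodeIllegal are hypothetical names, and the sketch assumes the Location value has already been decoded into the intended characters; it makes no attempt to repair a header whose bytes were decoded with the wrong charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient: escapes characters that
// java.net.URI would reject in a redirect location.
final class LocationEncoder {
    // Unreserved and reserved URI characters plus '%', which is kept so that
    // existing escapes are not double-encoded. A stray '%' that is not
    // followed by two hex digits is passed through unchanged.
    private static final String LEGAL =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        + "-._~:/?#[]@!$&'()*+,;=%";

    private LocationEncoder() {}

    static String encodeIllegal(String location) {
        StringBuilder out = new StringBuilder(location.length() + 16);
        location.codePoints().forEach(cp -> {
            if (cp < 128 && LEGAL.indexOf(cp) >= 0) {
                out.append((char) cp);
            } else {
                // Encode the code point as UTF-8 and escape each byte.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        });
        return out.toString();
    }
}
```

Whether escaping like this is acceptable depends on the application, which is why it belongs in a custom strategy rather than in the default one.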

Oleg

> Randy
> 
> On Mar 24, 2012, at 7:59 PM, Oleg Kalnichevski wrote:
> 
> > On Sat, 2012-03-24 at 16:46 -0400, Uncle wrote:
> >> On Mar 24, 2012, at 2:48 PM, Oleg Kalnichevski wrote:
> >> 
> >>> On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
> >>>> Apologies if this has been addressed, I searched the archives and was
> >>>> unable to find anything directly relating to this, though it seems straightforward.
> >>>> 
> >>>> I am trying to use httpclient to obtain the redirect URL for a url such
> >>>> as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code
> >>>> 301).  This code:
> >>>> 
> >>>>       String url = "http://bit.ly/GGviSv";
> >>>>       HttpGet httpget = new HttpGet(url);
> >>>>       HttpContext context = new BasicHttpContext();
> >>>>       HttpClient httpclient = new DefaultHttpClient();
> >>>> 
> >>>>       HttpResponse response = httpclient.execute(httpget, context);
> >>>> 
> >>>>       RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
> >>>> 
> >>>>       log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
> >>>>       for(Header header : response.getAllHeaders())
> >>>>           log.info("header: " + header);
> >>>> 
> >>>>       log.info("status = " + response.getStatusLine());
> >>>> 
> >>>> outputs:
> >>>> 
> >>>> isRedirected = false
> >>>> header: Server: nginx
> >>>> header: Date: Sat, 24 Mar 2012 12:38:43 GMT
> >>>> header: Content-Type: text/html; charset=UTF-8
> >>>> header: Transfer-Encoding: chunked
> >>>> header: Connection: keep-alive
> >>>> header: Vary: Cookie
> >>>> header: X-CF-Powered-By: WP 1.2.0
> >>>> header: X-Pingback: http://lavamagazine.com/xmlrpc.php
> >>>> header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
> >>>> header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
> >>>> header: Cache-Control: no-cache, must-revalidate, max-age=0
> >>>> header: Pragma: no-cache
> >>>> status = HTTP/1.1 404 Not Found
> >>>> 
> >>>> I expected 1) isRedirected to be true, 2) the response code to be 301,
> >>>> and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore
> >>>> the 404 and continue getting the URL:
> >>>> 
> >>>>       HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
> >>>>       HttpHost currentHost = (HttpHost) context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
> >>>>       String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString()
> >>>>               : (currentHost.toURI() + currentReq.getURI());
> >>>>       httpclient.getConnectionManager().shutdown();
> >>>>       log.info("Redirected URL = " + currentUrl);
> >>>> 
> >>>> This does the right thing and provides me with the correct URL.  So,
> >>>> why the 404 error?  I am processing a large quantity of URLs and need to accurately determine
> >>>> which ones are errors, redirects, etc.
> >>>> 
> >>>> Thanks for any assistance.
> >>>> 
> >>>> Randy
> >>>> 
> >>> 
> >>> As far as I can tell HttpClient correctly redirects to the new location,
> >>> but the resource is simply no longer there.
> >>> 
> >>> [DEBUG] headers - >> GET /GGviSv HTTP/1.1
> >>> [DEBUG] headers - >> Host: bit.ly
> >>> [DEBUG] headers - >> Connection: Keep-Alive
> >>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT
> >>> (java 1.5)
> >>> [DEBUG] headers - << HTTP/1.1 301 Moved
> >>> [DEBUG] headers - << Server: nginx
> >>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
> >>> [DEBUG] headers - << Content-Type: text/html; charset=utf-8
> >>> [DEBUG] headers - << Connection: keep-alive
> >>> [DEBUG] headers - << Set-Cookie:
> >>> _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20
> >>> 18:46:44 2012;path=/; HttpOnly
> >>> [DEBUG] headers - << Cache-control: private; max-age=90
> >>> [DEBUG] headers - << Location:
> >>> http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >>> [DEBUG] headers - << MIME-Version: 1.0
> >>> [DEBUG] headers - << Content-Length: 185
> >>> [DEBUG] headers - >>
> >>> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> >>> [DEBUG] headers - >> Host: lavamagazine.com
> >>> [DEBUG] headers - >> Connection: Keep-Alive
> >>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT
> >>> (java 1.5)
> >>> [DEBUG] headers - << HTTP/1.1 404 Not Found
> >>> [DEBUG] headers - << Server: nginx
> >>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
> >>> [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> >>> [DEBUG] headers - << Transfer-Encoding: chunked
> >>> [DEBUG] headers - << Connection: keep-alive
> >>> [DEBUG] headers - << Vary: Cookie
> >>> [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> >>> [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> >>> [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> >>> [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
> >>> [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> >>> [DEBUG] headers - << Pragma: no-cache
> >>> 
> >>> Oleg
> >>> 
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> >>> For additional commands, e-mail: httpclient-users-help@hc.apache.org
> >>> 
> >> 
> >> Yet, if you hit the URL: 
> >> 
> >> http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >> 
> >> with your browser, the content comes up fine.  
> >> 
> >> Hitting the redirect URL with the standard Java HttpURLConnection class does
> >> not produce the 404:
> >> 
> >>       String url = "http://bit.ly/GGviSv";
> >>        URL urlObj = new URL(url);
> >>        HttpURLConnection urlConnection = (HttpURLConnection)urlObj.openConnection();
> >>        urlConnection.setRequestMethod("GET");
> >>        urlConnection.setConnectTimeout(15000);
> >>        urlConnection.setReadTimeout(30000);
> >>        urlConnection.connect();
> >>        log.info("Response code = " + urlConnection.getResponseCode());
> >>        InputStream inputStream = urlConnection.getInputStream();
> >>        log.info("Redirected URL = " + urlConnection.getURL().toString());
> >> 
> >> This outputs:
> >> 
> >> Response code = 200
> >> Redirected URL = http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >> 
> >> So HttpClient reports a 404, but HttpURLConnection reports a 200 and my browsers
> >> (Safari, Chrome, and Firefox) all hit the link fine.
> >> 
> > 
> > It is not HttpClient reporting a wrong response status. It is the server
> > behaving incorrectly. I get the same 404 when accessing the location
> > directly. The problem is that the server does not correctly handle URI
> > fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> > state how fragments in redirect locations should be handled. So, in my
> > opinion it is a server side issue. 
> > 
> > You can work around the problem by using a custom redirect strategy that
> > rewrites the redirect location and strips away the fragment if present.
> > 
> > [DEBUG] headers - >>
> > GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> > [DEBUG] headers - >> Host: lavamagazine.com
> > [DEBUG] headers - >> Connection: Keep-Alive
> > [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT
> > (java 1.5)
> > [DEBUG] headers - << HTTP/1.1 404 Not Found
> > [DEBUG] headers - << Server: nginx
> > [DEBUG] headers - << Date: Sat, 24 Mar 2012 23:31:10 GMT
> > [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> > [DEBUG] headers - << Transfer-Encoding: chunked
> > [DEBUG] headers - << Connection: keep-alive
> > [DEBUG] headers - << Vary: Cookie
> > [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> > [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> > [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> > [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 23:31:10 GMT
> > [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> > [DEBUG] headers - << Pragma: no-cache
> > 
> > 
> >> Here is another URL that is problematic:
> >> 
> >> http://on.wsj.com/GHGlfS
> >> 
> >> this produces:
> >> 
> >> org.apache.http.client.ClientProtocolException
> >> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
> >> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
> >> ... snip ...
> >> Caused by: org.apache.http.ProtocolException: Invalid redirect URI: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonĂ¢??s-death-an-accident/?mod=e2tw
> >> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:185)
> >> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getLocationURI(DefaultRedirectStrategy.java:116)
> >> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getRedirect(DefaultRedirectStrategy.java:193)
> >> 	at org.apache.http.impl.client.DefaultRequestDirector.handleResponse(DefaultRequestDirector.java:1035)
> >> 	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:492)
> >> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
> >> 	... 28 more
> >> Caused by: java.net.URISyntaxException: Illegal character in path at index 72: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonĂ¢??s-death-an-accident/?mod=e2tw
> >> 	at java.net.URI$Parser.fail(URI.java:2809)
> >> 	at java.net.URI$Parser.checkChars(URI.java:2982)
> >> 	at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> >> 	at java.net.URI$Parser.parse(URI.java:3014)
> >> 	at java.net.URI.<init>(URI.java:578)
> >> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:183)
> >> 	... 33 more
> >> 
> >> The redirected URL has a special character in it (single quote), and the client
> >> doesn't handle that.  The Java code that I pasted above produces
> >> 
> > 
> > The URL has illegal character(s), which is the reason why the redirect
> > fails. 
> > 
> > Oleg
> > 
> >> Response code = 200
> >> Redirected URL = http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston%e2%80%99s-death-an-accident/?%3fs-death-an-accident/%3fmod=e2tw
> >> 
> >> Randy
> >> 
> >> 
> >> 
> > 
> > 
> > 
> > 
> 
> 
> 


