hc-httpclient-users mailing list archives

From Jack Hatch <jack.hatch...@gmail.com>
Subject HTTPClient - HTTP Gets broken with there is a #anchor in the Redirect (301) URL
Date Mon, 24 Oct 2011 03:15:21 GMT
Hey all,

Bit of a weird one. I'm using HTTPClient 4.1.2, and it seems that whenever
it is given a URL containing a '#' fragment, it does a full GET with
the '#...' part still in the request URI.

For example, trying to get the URL http://stks.co/eWt will redirect to the
URL
http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter
This URL is live, but the problem is that HTTPClient sends a GET request
with the request URI set to
/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter
which causes the server to send back a 404 Not Found page.

Looking at the GET sent by IE, Firefox and cURL, they all strip out the #...
from the end of the URI. For example, the cURL GET request URI is set to
/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/
with everything from the # onwards removed. This is for the exact same entry
URL of http://stks.co/eWt.

As a test, sending this raw URL into HTTPClient directly, i.e.
HttpGet httpget = new HttpGet("http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter");
gives the same 404 Not Found result.
The issue is that I don't know in advance whether the URL has an #anchor in
it, as it comes from a short URL service...

So the question is: are there any settings in HTTPClient that can be set so
that things like the trailing #... are automatically removed from URLs? Or
how would I go about manually removing this from URLs (remember that I would
need to capture all redirect URLs as well)?
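
For the manual route, a minimal sketch of the stripping step on its own might
look like the following. It uses only plain Java string handling; the class and
method names (FragmentStripper, stripFragment) are made up for illustration,
and it does not touch HTTPClient or its redirect handling, so redirect
Location URLs would still need to be intercepted and passed through it
separately.

```java
// Hypothetical helper, not part of HTTPClient: removes the "#fragment"
// part of a URL string before it is handed to an HTTP request.
final class FragmentStripper {

    // Returns the URL up to, but not including, the first '#'.
    // If there is no '#', the URL is returned unchanged (apart from trimming).
    static String stripFragment(String url) {
        String trimmed = url.trim();
        int hash = trimmed.indexOf('#');
        return hash < 0 ? trimmed : trimmed.substring(0, hash);
    }

    public static void main(String[] args) {
        String redirected =
            "http://news.ichinastock.com/2011/10/"
            + "jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/"
            + "#.Tpw-xG61XjU.twitter";
        // The printed URL ends at ".../to-acquire-yahoo/" with no fragment.
        System.out.println(stripFragment(redirected));
    }
}
```

Cutting at the first '#' should be enough, because a literal '#' in a URI can
only introduce the fragment; anywhere else it has to be percent-encoded as %23.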

Cheers!
