hc-commits mailing list archives

From s...@apache.org
Subject svn commit: r1490813 - /httpcomponents/site/httpcomponents-client-4.2.5/primer.html
Date Fri, 07 Jun 2013 20:46:32 GMT
Author: sebb
Date: Fri Jun  7 20:46:32 2013
New Revision: 1490813

URL: http://svn.apache.org/r1490813
Log:
Fix up bad anchors

Modified:
    httpcomponents/site/httpcomponents-client-4.2.5/primer.html

Modified: httpcomponents/site/httpcomponents-client-4.2.5/primer.html
URL: http://svn.apache.org/viewvc/httpcomponents/site/httpcomponents-client-4.2.5/primer.html?rev=1490813&r1=1490812&r2=1490813&view=diff
==============================================================================
--- httpcomponents/site/httpcomponents-client-4.2.5/primer.html (original)
+++ httpcomponents/site/httpcomponents-client-4.2.5/primer.html Fri Jun  7 20:46:32 2013
@@ -1,5 +1,5 @@
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-<!-- Generated by Apache Maven Doxia at Apr 23, 2013 ( $Revision$ ) -->
+<!-- Generated by Apache Maven Doxia at Jun 7, 2013 ( $Revision$ ) -->
 <!-- $HeadURL: https://svn.apache.org/repos/asf/httpcomponents/maven-skin/trunk/src/main/resources/META-INF/maven/site.vm
$ -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
@@ -11,7 +11,7 @@
       @import url("./css/site.css");
     </style>
     <link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
-    <meta name="Date-Revision-yyyymmdd" content="20130423" />
+    <meta name="Date-Revision-yyyymmdd" content="20130607" />
     <meta http-equiv="Content-Language" content="en" />
         
         </head>
@@ -31,7 +31,7 @@
             
         
                 <div class="xleft">
-        <span id="publishDate">Last Published: 2013-04-23</span>
+        <span id="publishDate">Last Published: 2013-06-07</span>
                   &nbsp;| <span id="projectVersion">Version: 4.2.5</span>
                       </div>
             <div class="xright">                    <a href="http://www.apache.org/"
class="externalLink" title="Apache">Apache</a>
@@ -150,7 +150,7 @@
     </div>
     <div id="bodyColumn">
       <div id="contentBox">
-        <!-- ==================================================================== --><!--
Licensed to the Apache Software Foundation (ASF) under one --><!-- or more contributor
license agreements.  See the NOTICE file --><!-- distributed with this work for additional
information --><!-- regarding copyright ownership.  The ASF licenses this file --><!--
to you under the Apache License, Version 2.0 (the --><!-- "License"); you may not use
this file except in compliance --><!-- with the License.  You may obtain a copy of the
License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--
 --><!-- Unless required by applicable law or agreed to in writing, --><!-- software
distributed under the License is distributed on an --><!-- "AS IS" BASIS, WITHOUT WARRANTIES
OR CONDITIONS OF ANY --><!-- KIND, either express or implied.  See the License for the
--><!-- specific language governing permissions and limitations --><!-- under
the License. --><!-- ==================================================================== --><!--  --><!-- This software
consists of voluntary contributions made by many --><!-- individuals on behalf of the
Apache Software Foundation.  For more --><!-- information on the Apache Software Foundation,
please see --><!-- <http://www.apache.org/>. --><div class="section"><h2>Client
HTTP Programming Primer<a name="Client_HTTP_Programming_Primer"></a></h2><div
class="section"><h3><a name="About">About</a></h3><p>This
document is intended for people who suddenly have to or want to implement an application that
automates something usually done with a browser, but are missing the background to understand
what they actually need to do. It provides guidance on the steps required to implement a program
that interacts with a web site which is designed to be used with a browser. It does not save
you from eventually learning the background of what you are doing, but it should help you
to get started quickly and learn the details
  later.</p><p>This document has evolved from discussions on the HttpClient mailing
lists. Although it refers to HttpClient, the concepts described here apply equally to HttpComponents
or SUN's <a class="externalLink" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/HttpURLConnection.html">HttpURLConnection</a>
or any other HTTP communication library for any programming language. So you might find it
useful even if you're not using Java and HttpClient.</p><p>The existence of this
document does not imply that the HttpClient community feels responsible for teaching you how
to program a client HTTP application. It is merely a way for us to reduce the noise on the
mailing list without just leaving the newbies out in the cold.</p></div><div
class="section"><h3><a name="Scenario">Scenario</a></h3><p>Let's
assume that you have some kind of repetitive, web-based task that you want to automate. Something
like:</p><ul><li>goto page http://xxx.yyy.zzz/login.html</li><li>enter
user
 name and password in a web form and hit the &quot;login&quot; button</li><li>navigate
to a specific page</li><li>check the number/headline/whatever shown on that page</li></ul><p>At
this time, we don't have a specific example which could be developed into a sample application.
So this document is all bla-bla, and you will have to work out the details - all the details
- yourself. Such is life.</p></div><div class="section"><h3><a
name="Caveat">Caveat</a></h3><p>This scenario describes a hobbyist usage
of HTTP, in other words: <b>a bad practice</b>. Web sites are designed for user
interaction, not as an application programming interface (API). The interface of a web site
is the user interface displayed by a browser. The HTTP communication between the browser and
the server is an internal API, subject to change without notice.</p><p>A web site
can be redesigned at any point in time. The server then sends different documents and a browser
will display the new content. The user 
 easily adjusts to click the appropriate links, and the browser communicates via HTTP as specified
by the new documents from the server. Your application that only mimicks a browser will simply
break.</p><p>Nevertheless, implementing this scenario will help you to get familiar
with HTTP communication. It is also &quot;good enough&quot; for hobbyists applications,
for example if you want to download the latest installment of your favorite daily webcomic
to install it as the screen background. There is no big damage if such an application breaks.</p><p>If
you want to implement a solid application, you should use only published APIs. For example,
to check for new mail on your webmail account, you should ask the webmail provider for POP
or IMAP access. These are standardized protocols supported my most EMail client applications.
If you want to have a newsticker, look for RSS feeds from the provider and applications that
display them.</p><p>As another example, if you want to perform
a web search, there are search companies that provide an API for using their search engines.
Unlike the examples before, such APIs are proprietary. You will still have to implement an
application, but then you are using a published API that the provider will not change without
notice.</p></div><div class="section"><h3><a name="Not_a_Browser">Not
a Browser</a></h3><p>HttpClient is not a browser. Here's the difference.</p><p><b>Browser</b></p><img
src="images/browser.png" alt="<a name="Browser">Browser</a>" /><p>The
figure shows some of the components you will find in a browser. To the left, there is the
user interface. The browser needs a rendering engine to display pages, and to interpret user
input such as mouse clicks somewhere on the displayed page. There is a layout engine which
computes how an HTML page should be displayed, including cascading style sheets and images.
A JavaScript interpreter runs JavaScript code embedded in or referenced from HTML pages. Events
from
  the user interface are passed to the JavaScript interpreter for processing. On the top,
there are interfaces for plugins that can handle Applets, embedded media objects like PDF
files, Quicktime movies and Flash animations, or ActiveX controls that can do anything.</p><p>In
the center of the figure you can find internal components. Browsers have a cache of recently
accessed documents and image files. They need to remember cookies and passwords entered by
the user. Such information can be kept in memory or stored persistently in the file system
at the bottom of the figure, to be available again when the browser is restarted. Certificates
for secure communication are almost always stored persistently. To the right of the figure
is the network. Browsers support many protocols on different levels of abstraction. There
are application protocols such as FTP and HTTP to retrieve documents from servers, and transport
layer protocols such as TLS/SSL and Socks to establish connections
for the application protocols.</p><p>One characteristic of browsers that is
not shown in the figure is tolerance for bad input. There needs to be tolerance for invalid
user input to make the browser user friendly. There also needs to be tolerance for malformed
documents retrieved from servers, and for flaws in server behavior when executing protocols,
to make as many websites as possible accessible to the user.</p><p><b>HTTP
Client</b></p><img src="images/httpclient.png" alt="<a name="HTTP_Client">HTTP
Client</a>" /><p>The figure shows some of the components you will find in a
browser, and highlights the scope of HttpClient. The primary responsibility of HttpClient
is the HTTP protocol, executed directly or through an HTTP proxy. It provides interfaces and
default implementations for cookie and password management, but not for persisting such data.
User interfacing, HTML parsing, plugins or non-HTTP application level protocols are not in
the scope of HttpClient. It does provide
interfaces to plug in transport layer protocols, but it does not implement such protocols.</p><p>All
the rest of a browser's functionality you require needs to be provided by your application.
HttpClient executes HTTP requests, but it will not and can not assemble them. Since HttpClient
does not interface with the user, nor interpret content such as HTML files, there is little
or no tolerance for bad data passed to the API. There is some tolerance for flaws in server
behavior, but there are limits to the deviations HttpClient can handle.</p></div><div
class="section"><h3><a name="Terminology">Terminology</a></h3><p>This
section introduces some important terms you have to know to understand the rest of this document.</p><p><tt><a
name="HTTP_Message">HTTP Message</a></tt></p><p>consists of a header
section and an optional entity. There are two kinds of messages, requests and responses. They
differ in the format of the first line, but both can have header fields and an optional
entity.</p><p><tt><a name="HTTP_Request">HTTP Request</a></tt>
</p><p>is sent from a client to a server. The first line includes the URI for
which the request is sent, and a method that the server should execute for the client.</p><p><tt><a
name="HTTP_Response">HTTP Response</a></tt></p><p>is sent from
a server to a client in response to a request. The first line includes a status code that
tells about success or failure of the request. HTTP defines a set of status codes, like 200
for success and 404 for not found. Other protocols based on HTTP can define additional status
codes.</p><p><tt><a name="Method">Method</a></tt></p><p>is
an operation requested from the server. HTTP defines a set of operations, the most frequent
being GET and POST. Other protocols based on HTTP can define additional methods.</p><p><tt><a
name="Header_Fields">Header Fields</a></tt></p><p>are name-value
pairs, where both name and value are text. The name of a header field is not case sensitive.
  Multiple values can be assigned to the same name. RFC 2616 defines a wide range of header
fields for handling various aspects of the HTTP protocol. Other specifications, like RFC 2617
and RFC 2965, define additional headers. Some of the defined headers are for general use,
others are meant for exclusive use with either requests or responses, still others are meant
for use only with an entity.</p><p><tt><a name="Entity">Entity</a></tt></p><p>is
data sent with an HTTP message. For example, a response can contain the page or image you
are downloading as an entity, or a request can include the parameters that you entered into
a web form. The entity of an HTTP message can have an arbitrary data format, which is usually
specified as a MIME type in a header field.</p><p><tt><a name="Session">Session</a></tt></p><p>is
a series of requests from a single source to a server. The server can keep session data, and
needs to recognize the session to which each incoming request belongs. For
example, if you execute a web search, the server will only return one page of search results.
But it keeps track of the other results and makes them available when you click on the link
to the &quot;next&quot; page. The server needs to know from the request that it is
you and your session for which more results are requested, and not me and my session. That's
because I searched for something else.</p><p><tt><a name="Cookies">Cookies</a></tt></p><p>are
the preferred way for servers to track sessions. The server supplies a piece of data, called
a cookie, in response to a request. The server expects the client to send that piece of data
in a header field with each following request of the same session. The cookie is different
for each session, so the server can identify to which session a request belongs by looking
at the cookie. If the cookie is missing from a request, the server will not respond as expected.</p></div><div
class="section"><h3><a name="Step_by_Step">Step by Step</a></h3><div
class="section"><h4><a name="GET_the_Login_Page">GET
the Login Page</a></h4><p>Create and execute a GET request for the login
page. Just use the link you would type into the browser as the URL. This is what a browser
does when you enter a URL in the address bar or when you click on a link that points to another
web page.</p><p>Inspect the response from the server:</p><ul><li>do
you get the page you expected?</li></ul><p>It should be sent as the entity
of the response to your request. The entity is also referred to as the response body.</p><ul><li>do
you get a session cookie?</li></ul><p>Cookies are sent in a header field
named Set-Cookie or Set-Cookie2. It is possible that you don't get a session cookie until
you log in. If there is no session cookie in the response, you'll have to do perform step
2 later, after you reach the point where the cookie is set.</p><p>If you do not
get the page you expect, check the URL you are requesting. If it is correct, the server
may use a browser detection. You will have to set the header field User-Agent to a value
used by a popular browser to pretend that the request is coming from that browser.</p><p>If
you can't get the login page, get the home page instead now. Get the login page in the next
step, when you establish the session.</p></div><div class="section"><h4><a
name="Establish_the_Session">Establish the Session</a></h4><p>Create
and execute another GET request for a page. You can simply request the login page again, or
some other page of which you know the URL. Do NOT try to get a page which would be returned
in response to submitting a web form. Use something you can reach simply by clicking on a
link in the browser. Something where you can see the URL in the browser status line while
the mouse pointer is hovering over the link.</p><p>This step is important when
developing the application. Once you know that your application does establish the session
correctly, you may be able to remove
it. Only if you couldn't get the login page directly and had to get the home page first,
you know you have to leave it in.</p><p>Inspect the request being sent to the
server.</p><ul><li>is the session cookie sent with the request?</li></ul><p>You
can see what is sent to the server by enabling the wire log for HttpClient. You only need
to see the request headers, not the body. The session cookie should be sent in a header field
called Cookie. There may be several of those, and other cookies might be sent as well.</p><p>Inspect
the response from the server:</p><ul><li>do you get another session cookie?</li></ul><p>You
should not get another session cookie. If you get the same session cookie as before, the server
behaves a little strange but that should not be a problem. If you get a new session cookie,
then the server did not recognize the session for the request. Usually, this happens if the
request did not contain the session cookie. But servers might use other means to 
 track sessions, or to detect session hijacking.</p><p>If the session cookie is
not sent in the request, one of two things has gone wrong. Either the cookie was not detected
in the previous response, or the cookie was not selected for being sent with the new request.</p><p>HttpClient
automatically parses cookies sent in responses and puts them into a cookie store. HttpClient
uses a configurable cookie policy to decide whether a cookie being sent from a server is correct.
The default policy complies strictly with RFC 2109, but many servers do not. Play around with
the cookie policies until the cookie is accepted and put into the cookie store.</p><p>If
the cookie is accepted from the previous response but still not sent with the new request,
make sure that HttpClient uses the same cookie store object. Unless you explicitly manage
cookie store objects (not recommended for newbies!), this will be the case if you use the
same HttpClient object to execute both requests.</p><p>If the
cookie is still not sent with the request, make sure that the URL you are requesting is
in the scope for the cookie. Cookies are only sent to the domain and path specified in the
cookie scope. A cookie for host &quot;jakarta.apache.org&quot; will not be sent to
host &quot;tomcat.apache.org&quot;. A cookie for domain &quot;.apache.org&quot;
will be sent to both. A cookie for host &quot;apache.org&quot;, without the leading
dot, will not be sent to &quot;jakarta.apache.org&quot;. The latter case can be resolved
by using a different cookie spec that adds the leading dot. In the other cases, use a URL
that in the cookie scope to establish the session.</p><p>If the session cookie
is sent with the request, but a new session cookie is set in the response anyway, check whether
there are cookies other than the session cookie in the request. Some servers are incapable
of detecting multiple cookies sent in individual header fields. HttpClient can be advised
to put all cookies into a 
 single header field.</p><p>If that doesn't help, you are in trouble. The server
may use additional means to track the session, for example the header field named Referer.
Set that field to the URL of the previous request. (<a class="externalLink" href="http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200602.mbox/%3c19b.44e04b45.31166eaa@aol.com%3e">see
this mail</a>)</p><p>If that doesn't help either, you will have to compare
the request from your application to a corresponding one generated by a browser. The instructions
in step 5 for POST requests apply for GET requests as well. It's even simpler with GET, since
you don't have an entity.</p></div><div class="section"><h4><a
name="Analyze_the_Form">Analyze the Form</a></h4><p>Now it is time to
analyze the form defined in the HTML markup of the page. A form in HTML is a set of name-value-pairs
called parameters, where some of the values can be entered in the browser. By analyzing the
HTML markup, you can learn
which parameters you have to define and how to send them to the server.</p><p>Look
for the <i>form</i> tag in the page source. There may be several forms in the
page, but they can not be nested. Locate the form you want to submit. Locate the matching
<i>/form</i> tag. Everything in between the two may be relevant. Let's start with
the <a name="attributes_of_the_form_tag">attributes of the <i>form</i> tag</a>:</p><p><tt><a
name="method">method</a>=</tt></p><p>specifies the method used
for submitting the form. If it is GET or not specified at all, then you need to create a GET
request. The parameters will be added as a query string to the URL. If the method is POST,
you need to create a POST request. The parameters will be put in the entity of the request,
also referred to as the request body. How to do that is discussed in step 5.</p><p><tt><a
name="action">action</a>=</tt></p><p>specifies the URL to which
the request has to be sent. Do not try to get this URL from the address
bar of your browser! A browser will automatically follow redirects and only displays
the final URL, which can be different from the URL in this attribute. It is possible that
the URL includes a query string that specifies some parameters. If so, keep that in mind.</p><p><tt><a
name="enctype">enctype</a>=</tt></p><p>specifies the MIME type
for the entity of the request generated by the form. The two common cases are url-encoded
(default) and multipart-mime. Note that these terms are just informally used here, the exact
values that need to be written in an HTML document are specified elsewhere. This attribute
is only used for the POST method. If the method is GET, the parameters will always be url-encoded,
but not in an entity.</p><p><tt><a name="accept-charset">accept-charset</a>=</tt></p><p>specifies
the character set that the browser should allow for user input. It will not be discussed here,
but you will have to consider this value if you experience charset related
problems.</p><p>Except for optional query parameters in the action attribute, the
parameters of a form are specified by HTML tags between <i>form</i> and <i>/form</i>.
The following is a list of tags that can be used to define parameters. Except where stated
otherwise, they have a name attribute which specifies the name of the parameter. The value
of the parameter usually depends on user input.</p><div><pre>&lt;input
type=&quot;text&quot; name=&quot;...&quot;&gt;
+        <!-- ==================================================================== --><!--
Licensed to the Apache Software Foundation (ASF) under one --><!-- or more contributor
license agreements.  See the NOTICE file --><!-- distributed with this work for additional
information --><!-- regarding copyright ownership.  The ASF licenses this file --><!--
to you under the Apache License, Version 2.0 (the --><!-- "License"); you may not use
this file except in compliance --><!-- with the License.  You may obtain a copy of the
License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--
 --><!-- Unless required by applicable law or agreed to in writing, --><!-- software
distributed under the License is distributed on an --><!-- "AS IS" BASIS, WITHOUT WARRANTIES
OR CONDITIONS OF ANY --><!-- KIND, either express or implied.  See the License for the
--><!-- specific language governing permissions and limitations --><!-- under
the License. --><!-- ==================================================================== --><!--  --><!-- This software
consists of voluntary contributions made by many --><!-- individuals on behalf of the
Apache Software Foundation.  For more --><!-- information on the Apache Software Foundation,
please see --><!-- <http://www.apache.org/>. --><div class="section"><h2>Client
HTTP Programming Primer<a name="Client_HTTP_Programming_Primer"></a></h2><div
class="section"><h3><a name="About">About</a></h3><p>This
document is intended for people who suddenly have to or want to implement an application that
automates something usually done with a browser, but are missing the background to understand
what they actually need to do. It provides guidance on the steps required to implement a program
that interacts with a web site which is designed to be used with a browser. It does not save
you from eventually learning the background of what you are doing, but it should help you
to get started quickly and learn the details
  later.</p><p>This document has evolved from discussions on the HttpClient mailing
lists. Although it refers to HttpClient, the concepts described here apply equally to HttpComponents
or SUN's <a class="externalLink" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/HttpURLConnection.html">HttpURLConnection</a>
or any other HTTP communication library for any programming language. So you might find it
useful even if you're not using Java and HttpClient.</p><p>The existence of this
document does not imply that the HttpClient community feels responsible for teaching you how
to program a client HTTP application. It is merely a way for us to reduce the noise on the
mailing list without just leaving the newbies out in the cold.</p></div><div
class="section"><h3><a name="Scenario">Scenario</a></h3><p>Let's
assume that you have some kind of repetitive, web-based task that you want to automate. Something
like:</p><ul><li>goto page http://xxx.yyy.zzz/login.html</li><li>enter
user
 name and password in a web form and hit the &quot;login&quot; button</li><li>navigate
to a specific page</li><li>check the number/headline/whatever shown on that page</li></ul><p>At
this time, we don't have a specific example which could be developed into a sample application.
So this document is all bla-bla, and you will have to work out the details - all the details
- yourself. Such is life.</p></div><div class="section"><h3><a
name="Caveat">Caveat</a></h3><p>This scenario describes a hobbyist usage
of HTTP, in other words: <b>a bad practice</b>. Web sites are designed for user
interaction, not as an application programming interface (API). The interface of a web site
is the user interface displayed by a browser. The HTTP communication between the browser and
the server is an internal API, subject to change without notice.</p><p>A web site
can be redesigned at any point in time. The server then sends different documents and a browser
will display the new content. The user 
 easily adjusts to click the appropriate links, and the browser communicates via HTTP as specified
by the new documents from the server. Your application that only mimicks a browser will simply
break.</p><p>Nevertheless, implementing this scenario will help you to get familiar
with HTTP communication. It is also &quot;good enough&quot; for hobbyists applications,
for example if you want to download the latest installment of your favorite daily webcomic
to install it as the screen background. There is no big damage if such an application breaks.</p><p>If
you want to implement a solid application, you should use only published APIs. For example,
to check for new mail on your webmail account, you should ask the webmail provider for POP
or IMAP access. These are standardized protocols supported my most EMail client applications.
If you want to have a newsticker, look for RSS feeds from the provider and applications that
display them.</p><p>As another example, if you want to perform
a web search, there are search companies that provide an API for using their search engines.
Unlike the examples before, such APIs are proprietary. You will still have to implement an
application, but then you are using a published API that the provider will not change without
notice.</p></div><div class="section"><h3><a name="Not_a_Browser">Not
a Browser</a></h3><p>HttpClient is not a browser. Here's the difference.</p><p><a
name="Browser"><b>Browser</b></a></p><img src="images/browser.png"
alt="Browser components" /><p>The figure shows some of the components you will find
in a browser. To the left, there is the user interface. The browser needs a rendering engine
to display pages, and to interpret user input such as mouse clicks somewhere on the displayed
page. There is a layout engine which computes how an HTML page should be displayed, including
cascading style sheets and images. A JavaScript interpreter runs JavaScript code embedded
in or referenced from HTML pages. 
 Events from the user interface are passed to the JavaScript interpreter for processing. On
the top, there are interfaces for plugins that can handle Applets, embedded media objects
like PDF files, Quicktime movies and Flash animations, or ActiveX controls that can do anything.</p><p>In
the center of the figure you can find internal components. Browsers have a cache of recently
accessed documents and image files. They need to remember cookies and passwords entered by
the user. Such information can be kept in memory or stored persistently in the file system
at the bottom of the figure, to be available again when the browser is restarted. Certificates
for secure communication are almost always stored persistently. To the right of the figure
is the network. Browsers support many protocols on different levels of abstraction. There
are application protocols such as FTP and HTTP to retrieve documents from servers, and transport
layer protocols such as TLS/SSL and Socks to establish
  connections for the application protocols.</p><p>One characteristic of browsers
that is not shown in the figure is tolerance for bad input. There needs to be tolerance for
invalid user input to make the browser user friendly. There also needs to be tolerance for
malformed documents retrieved from servers, and for flaws in server behavior when executing
protocols, to make as many websites as possible accessible to the user.</p><p><a
name="HTTP_Client"><b>HTTP Client</b></a></p><img src="images/httpclient.png"
alt="HTTP Client components" /><p>The figure shows some of the components you will
find in a browser, and highlights the scope of HttpClient. The primary responsibility of HttpClient
is the HTTP protocol, executed directly or through an HTTP proxy. It provides interfaces and
default implementations for cookie and password management, but not for persisting such data.
User interfacing, HTML parsing, plugins or non-HTTP application level protocols are not in
the scope of 
 HttpClient. It does provide interfaces to plug in transport layer protocols, but it does
not implement such protocols.</p><p>All the rest of a browser's functionality
you require needs to be provided by your application. HttpClient executes HTTP requests, but
it will not and can not assemble them. Since HttpClient does not interface with the user,
nor interpret content such as HTML files, there is little or no tolerance for bad data passed
to the API. There is some tolerance for flaws in server behavior, but there are limits to
the deviations HttpClient can handle.</p></div><div class="section"><h3><a
name="Terminology">Terminology</a></h3><p>This section introduces some
important terms you have to know to understand the rest of this document.</p><p><tt><a
name="HTTP_Message">HTTP Message</a></tt></p><p>consists of a header
section and an optional entity. There are two kinds of messages, requests and responses. They
differ in the format of the first line, but both can have
header fields and an optional entity.</p><p><tt><a name="HTTP_Request">HTTP
Request</a></tt> </p><p>is sent from a client to a server. The first
line includes the URI for which the request is sent, and a method that the server should execute
for the client.</p><p><tt><a name="HTTP_Response">HTTP Response</a></tt></p><p>is
sent from a server to a client in response to a request. The first line includes a status
code that tells about success or failure of the request. HTTP defines a set of status codes,
like 200 for success and 404 for not found. Other protocols based on HTTP can define additional
status codes.</p><p><tt><a name="Method">Method</a></tt></p><p>is
an operation requested from the server. HTTP defines a set of operations, the most frequent
being GET and POST. Other protocols based on HTTP can define additional methods.</p><p><tt><a
name="Header_Fields">Header Fields</a></tt></p><p>are name-value
pairs, where both name and value are text. The name of a header field is not case sensitive.
Multiple values can be assigned to the same name. RFC 2616 defines
a wide range of header fields for handling various aspects of the HTTP protocol. Other specifications,
like RFC 2617 and RFC 2965, define additional headers. Some of the defined headers are for
general use, others are meant for exclusive use with either requests or responses, still others
are meant for use only with an entity.</p><p><tt><a name="Entity">Entity</a></tt></p><p>is
data sent with an HTTP message. For example, a response can contain the page or image you
are downloading as an entity, or a request can include the parameters that you entered into
a web form. The entity of an HTTP message can have an arbitrary data format, which is usually
specified as a MIME type in a header field.</p><p><tt><a name="Session">Session</a></tt></p><p>is
a series of requests from a single source to a server. The server can keep session data, and
needs to recognize the session to which each incoming request belongs. For example, if you
execute a web search, the server will only return
one page of search results. But it keeps track of the other results and makes them available
when you click on the link to the &quot;next&quot; page. The server needs to know
from the request that it is you and your session for which more results are requested, and
not me and my session. That's because I searched for something else.</p><p><tt><a
name="Cookies">Cookies</a></tt></p><p>are the preferred way for
servers to track sessions. The server supplies a piece of data, called a cookie, in response
to a request. The server expects the client to send that piece of data in a header field with
each following request of the same session. The cookie is different for each session, so the
server can identify to which session a request belongs by looking at the cookie. If the cookie
is missing from a request, the server will not respond as expected.</p></div><div
class="section"><h3><a name="Step_by_Step">Step by Step</a></h3><div class="section"><h4><a
name="GET_the_Login_Page">GET the Login Page</a></h4><p>Create and execute
a GET request for the login page. Just use the link you would type into the browser as the
URL. This is what a browser does when you enter a URL in the address bar or when you click
on a link that points to another web page.</p><p>Inspect the response from the
server:</p><ul><li>do you get the page you expected?</li></ul><p>It
should be sent as the entity of the response to your request. The entity is also referred
to as the response body.</p><ul><li>do you get a session cookie?</li></ul><p>Cookies
are sent in a header field named Set-Cookie or Set-Cookie2. It is possible that you don't
get a session cookie until you log in. If there is no session cookie in the response, you'll
have to perform step 2 later, after you reach the point where the cookie is set.</p><p>If
you do not get the page you expect, check the URL you are requesting. If it is correct,
the server may be using browser detection. In that case, set the header field
User-Agent to a value used by a popular browser to pretend that the request is coming from
that browser.</p><p>If you can't get the login page, get the home page for now
instead. Get the login page in the next step, when you establish the session.</p></div><div
class="section"><h4><a name="Establish_the_Session">Establish the Session</a></h4><p>Create
and execute another GET request for a page. You can simply request the login page again, or
some other page of which you know the URL. Do NOT try to get a page which would be returned
in response to submitting a web form. Use something you can reach simply by clicking on a
link in the browser. Something where you can see the URL in the browser status line while
the mouse pointer is hovering over the link.</p><p>This step is important when
developing the application. Once you know that your application does establish the session
correctly, you may be able to remove it. If you couldn't get the login page directly and had to
get the home page first, however, you have to leave it in.</p><p>Inspect the request
being sent to the server.</p><ul><li>is the session cookie sent with the
request?</li></ul><p>You can see what is sent to the server by enabling
the wire log for HttpClient. You only need to see the request headers, not the body. The session
cookie should be sent in a header field called Cookie. There may be several of those, and
other cookies might be sent as well.</p><p>Inspect the response from the server:</p><ul><li>do
you get another session cookie?</li></ul><p>You should not get another session
cookie. If you get the same session cookie as before, the server is behaving a little strangely,
but that should not be a problem. If you get a new session cookie, then the server did not
recognize the session for the request. Usually, this happens if the request did not contain
the session cookie. But servers might use other means to track sessions, or to detect session hijacking.</p><p>If
the session cookie is not sent in the request, one of two things has gone wrong. Either the
cookie was not detected in the previous response, or the cookie was not selected for being
sent with the new request.</p><p>HttpClient automatically parses cookies sent
in responses and puts them into a cookie store. HttpClient uses a configurable cookie policy
to decide whether a cookie being sent from a server is correct. The default policy complies
strictly with RFC 2109, but many servers do not. Play around with the cookie policies until
the cookie is accepted and put into the cookie store.</p><p>If the cookie is accepted
from the previous response but still not sent with the new request, make sure that HttpClient
uses the same cookie store object. Unless you explicitly manage cookie store objects (not
recommended for newbies!), this will be the case if you use the same HttpClient object to
execute both requests.</p><p>If the cookie is still not sent with the request, make sure
that the URL you are requesting is in the scope for the cookie. Cookies are only sent to the
domain and path specified in the cookie scope. A cookie for host &quot;jakarta.apache.org&quot;
will not be sent to host &quot;tomcat.apache.org&quot;. A cookie for domain &quot;.apache.org&quot;
will be sent to both. A cookie for host &quot;apache.org&quot;, without the leading
dot, will not be sent to &quot;jakarta.apache.org&quot;. The latter case can be resolved
by using a different cookie spec that adds the leading dot. In the other cases, use a URL
that is in the cookie scope to establish the session.</p><p>If the session cookie
is sent with the request, but a new session cookie is set in the response anyway, check whether
there are cookies other than the session cookie in the request. Some servers are incapable
of detecting multiple cookies sent in individual header fields. HttpClient can be advised
to put all cookies into a single header field.</p><p>If that doesn't help, you are
in trouble. The server may use additional means to track the session, for example the header
field named Referer. Set that field to the URL of the previous request. (<a class="externalLink"
href="http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200602.mbox/%3c19b.44e04b45.31166eaa@aol.com%3e">see
this mail</a>)</p><p>If that doesn't help either, you will have to compare
the request from your application to a corresponding one generated by a browser. The instructions
in step 5 for POST requests apply for GET requests as well. It's even simpler with GET, since
you don't have an entity.</p></div><div class="section"><h4><a
name="Analyze_the_Form">Analyze the Form</a></h4><p>Now it is time to
analyze the form defined in the HTML markup of the page. A form in HTML is a set of name-value pairs
called parameters, where some of the values can be entered in the browser. By analyzing the
HTML markup, you can learn which parameters you have to define and how to send them to the server.</p><p>Look
for the <i>form</i> tag in the page source. There may be several forms in the
page, but they cannot be nested. Locate the form you want to submit. Locate the matching
<i>/form</i> tag. Everything in between the two may be relevant. Let's start with
the <a name="attributes_of_the_form_tag">attributes of the <i>form</i> tag</a>:</p><p><tt><a
name="method">method</a>=</tt></p><p>specifies the method used
for submitting the form. If it is GET or not specified at all, then you need to create a GET
request. The parameters will be added as a query string to the URL. If the method is POST,
you need to create a POST request. The parameters will be put in the entity of the request,
also referred to as the request body. How to do that is discussed in step 5.</p><p><tt><a
name="action">action</a>=</tt></p><p>specifies the URL to which
the request has to be sent. Do not try to get this URL from the address bar of your browser!
A browser will automatically follow redirects
and only displays the final URL, which can be different from the URL in this attribute. It
is possible that the URL includes a query string that specifies some parameters. If so, keep
that in mind.</p><p><tt><a name="enctype">enctype</a>=</tt></p><p>specifies
the MIME type for the entity of the request generated by the form. The two common cases are
url-encoded (default) and multipart-mime. These terms are used informally here; the exact
values written in an HTML document are application/x-www-form-urlencoded and multipart/form-data. This
attribute is only used for the POST method. If the method is GET, the parameters will always
be url-encoded, but not in an entity.</p><p><tt><a name="accept-charset">accept-charset</a>=</tt></p><p>specifies
the character set that the browser should allow for user input. It will not be discussed here,
but you will have to consider this value if you experience charset-related problems.</p><p>Except for optional query parameters in the
action attribute, the parameters of a form are specified by HTML tags between <i>form</i>
and <i>/form</i>. The following is a list of tags that can be used to define parameters.
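</p><p>Before going through the individual tags, it may help to see what a set of parameters looks like once it is url-encoded. The following is a minimal sketch in plain Java, not HttpClient code: the class <tt>FormEncodingSketch</tt>, its <tt>encode</tt> method and the parameter names are made up for illustration, and UTF-8 is assumed as the character set.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HttpClient: builds the url-encoded
// form of a list of name-value pairs using only the JDK.
final class FormEncodingSketch {

    /** Joins the pairs as name1=value1&name2=value2, encoding names and values. */
    static String encode(List<String[]> params) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : params) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(pair[0])).append('=').append(enc(pair[1]));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            // Assumption: the form is submitted as UTF-8.
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String[]> params = new ArrayList<>();
        params.add(new String[] {"user", "jane doe"}); // made-up parameter names
        params.add(new String[] {"lang", "en"});
        params.add(new String[] {"lang", "de"});       // the same name may occur more than once
        System.out.println(encode(params));            // prints user=jane+doe&lang=en&lang=de
    }
}
```

<p>For a GET form, a string like this would be appended to the URL as the query string; for a url-encoded POST form, it would be sent as the entity of the request. The tags described next determine which name-value pairs go into it.</p><p>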
Except where stated otherwise, they have a name attribute which specifies the name of the
parameter. The value of the parameter usually depends on user input.</p><div><pre>&lt;input
type=&quot;text&quot; name=&quot;...&quot;&gt;
 &lt;input type=&quot;password&quot; name=&quot;...&quot;&gt;</pre></div><p>specify
single-line input fields. Using the return key in one of these fields will submit the form,
so the value really is a single line of input from the user.</p><div><pre>&lt;input
type=&quot;text&quot; readonly name=&quot;...&quot; value=&quot;...&quot;&gt;
 &lt;input type=&quot;hidden&quot; name=&quot;...&quot; value=&quot;...&quot;&gt;</pre></div><p>specify
a parameter that cannot be changed by the user. The value of the parameter is given by the
value attribute.</p><div><pre>&lt;input type=&quot;radio&quot;
name=&quot;...&quot; value=&quot;...&quot;&gt;
 &lt;input type=&quot;checkbox&quot; name=&quot;...&quot; value=&quot;...&quot;&gt;</pre></div><p>specify
a parameter that can be included or omitted. There usually is more than one tag with the same
name. For radio buttons, only one can be selected and the value of the parameter is the value
of the selected radio button. For checkboxes, more than one can be selected. There will be
one name-value pair for each selected checkbox, with the same name for all of them.</p><div><pre>&lt;input
type=&quot;submit&quot; name=&quot;...&quot; value=&quot;...&quot;&gt;


