manifoldcf-commits mailing list archives

From shinich...@apache.org
Subject svn commit: r1137907 [2/8] - in /incubator/lcf/site/publish: ./ images/ skin/ skin/css/ skin/images/ skin/scripts/ skin/translations/
Date Tue, 21 Jun 2011 08:36:59 GMT
Added: incubator/lcf/site/publish/end-user-documentation.html
URL: http://svn.apache.org/viewvc/incubator/lcf/site/publish/end-user-documentation.html?rev=1137907&view=auto
==============================================================================
--- incubator/lcf/site/publish/end-user-documentation.html (added)
+++ incubator/lcf/site/publish/end-user-documentation.html Tue Jun 21 08:36:54 2011
@@ -0,0 +1,2419 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.9">
+<meta name="Forrest-skin-name" content="lucene">
+<title>ManifoldCF- End-user Documentation</title>
+<link type="text/css" href="skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
+<link type="text/css" href="skin/profile.css" rel="stylesheet">
+<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="images/favicon.ico">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://incubator.apache.org/">Incubator</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+</div>
+<!--+
+    |header
+    +-->
+<div class="header">
+<!--+
+    |start group logo
+    +-->
+<div class="grouplogo">
+<a href="http://www.apache.org"><img class="logoImage" alt="Apache" src="images/apache_feather.gif" title="Apache Software Foundation"></a>
+</div>
+<!--+
+    |end group logo
+    +-->
+<!--+
+    |start Project Logo
+    +-->
+<div class="projectlogo">
+<a href="http://incubator.apache.org/lcf"><img class="logoImage" alt="Apache ManifoldCF" src="images/ManifoldCF-logo.PNG" title="ManifoldCF"></a>
+</div>
+<!--+
+    |end Project Logo
+    +-->
+<!--+
+    |start Search
+    +-->
+<div class="searchbox">
+<form action="http://www.lucidimagination.com/search/" method="get" class="roundtopsmall">
+<input onFocus="getBlank (this, 'Search the site with Solr');" size="25" name="q" id="query" type="text" value="Search the site with Solr">&nbsp; 
+                    <input name="Search" value="Search" type="submit">
+</form>
+<div style="position: relative; top: -5px; left: -10px">Powered by <a href="http://www.lucidimagination.com" style="color: #033268">Lucid Imagination</a>
+</div>
+</div>
+<!--+
+    |end search
+    +-->
+<!--+
+    |start Tabs
+    +-->
+<ul id="tabs">
+<li class="current">
+<a class="selected" href="index.html">Main</a>
+</li>
+<li>
+<a class="unselected" href="http://cwiki.apache.org/confluence/display/CONNECTORS/Index">Wiki</a>
+</li>
+</ul>
+<!--+
+    |end Tabs
+    +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+    |start Subtabs
+    +-->
+<div id="level2tabs"></div>
+<!--+
+    |end Endtabs
+    +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>
+</div>
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+
+             &nbsp;
+           </div>
+<!--+
+    |start Menu, mainarea
+    +-->
+<!--+
+    |start Menu
+    +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">About</div>
+<div id="menu_1.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="index.html">Welcome</a>
+</div>
+<div class="menuitem">
+<a href="who.html">Who We Are</a>
+</div>
+<div class="menuitem">
+<a href="http://www.manning.com/wright/">Get the Book</a>
+</div>
+<div class="menuitem">
+<a href="http://www.cafepress.com/lucene/">Buy Stuff</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/foundation/sponsorship.html">Sponsor Apache</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/foundation/thanks.html">Sponsors of Apache</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
+<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="concepts.html">Concepts</a>
+</div>
+<div class="menuitem">
+<a href="included-connectors.html">Compatibility Matrix</a>
+</div>
+<div class="menuitem">
+<a href="faq.html">Frequently Asked Questions</a>
+</div>
+<div class="menuitem">
+<a href="programmatic-operation.html">API Documentation</a>
+</div>
+<div class="menuitem">
+<a href="javadoc.html">Javadoc</a>
+</div>
+<div class="menuitem">
+<a href="how-to-build-and-deploy.html">Building and Deploying</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">End-user Documentation (HTML)</div>
+</div>
+<div class="menuitem">
+<a href="end-user-documentation.pdf">End-user Documentation (PDF)</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="download.html">Download</a>
+</div>
+<div class="menuitem">
+<a href="mail.html">Mailing Lists</a>
+</div>
+<div class="menuitem">
+<a href="developer-resources.html">Developer/Integrator Resources</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related-Projects</div>
+<div id="menu_1.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://incubator.apache.org/droids/">Droids</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/java/">Java</a>
+</div>
+<div class="menuitem">
+<a href="http://incubator.apache.org/lucene.net/">Lucene.Net</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/lucy/">Lucy</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/mahout/">Mahout</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/nutch/">Nutch</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/openrelevance/">Open Relevance</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/pylucene/">PyLucene</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/solr/">Solr</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/tika/">Tika</a>
+</div>
+</div>
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<!--+
+  |alternative credits
+  +-->
+<div id="credit2"></div>
+</div>
+<!--+
+    |end Menu
+    +-->
+<!--+
+    |start content
+    +-->
+<div id="content">
+<h1>ManifoldCF- End-user Documentation</h1>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#overview">Overview</a>
+<ul class="minitoc">
+<li>
+<a href="#outputs">Defining Output Connections</a>
+</li>
+<li>
+<a href="#authorities">Defining Authority Connections</a>
+</li>
+<li>
+<a href="#connectors">Defining Repository Connections</a>
+</li>
+<li>
+<a href="#jobs">Creating Jobs</a>
+</li>
+<li>
+<a href="#executing">Executing Jobs</a>
+</li>
+<li>
+<a href="#statusreports">Status Reports</a>
+<ul class="minitoc">
+<li>
+<a href="#documentstatus">Document Status</a>
+</li>
+<li>
+<a href="#queuestatus">Queue Status</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#historyreports">History Reports</a>
+<ul class="minitoc">
+<li>
+<a href="#simplehistory">Simple History Reports</a>
+</li>
+<li>
+<a href="#maxactivity">Maximum Activity Reports</a>
+</li>
+<li>
+<a href="#maxbandwidth">Maximum Bandwidth Reports</a>
+</li>
+<li>
+<a href="#resulthistogram">Result Histogram Reports</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#credentials">A Note About Credentials</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#outputconnectiontypes">Output Connection Types</a>
+<ul class="minitoc">
+<li>
+<a href="#solroutputconnector">Solr Output Connection</a>
+</li>
+<li>
+<a href="#gtsoutputconnector">MetaCarta GTS Output Connection</a>
+</li>
+<li>
+<a href="#nulloutputconnector">Null Output Connection</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#authorityconnectiontypes">Authority Connection Types</a>
+<ul class="minitoc">
+<li>
+<a href="#adauthority">Active Directory Authority Connection</a>
+</li>
+<li>
+<a href="#livelinkauthority">OpenText LiveLink Authority Connection</a>
+</li>
+<li>
+<a href="#documentumauthority">EMC Documentum Authority Connection</a>
+</li>
+<li>
+<a href="#memexauthority">Memex Patriarch Authority Connection</a>
+</li>
+<li>
+<a href="#meridioauthority">Autonomy Meridio Authority Connection</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#repositoryconnectiontypes">Repository Connection Types</a>
+<ul class="minitoc">
+<li>
+<a href="#filesystemrepository">Generic File System Repository Connection</a>
+</li>
+<li>
+<a href="#rssrepository">Generic RSS Repository Connection</a>
+</li>
+<li>
+<a href="#webrepository">Generic Web Repository Connection</a>
+</li>
+<li>
+<a href="#jcifsrepository">Windows Share/DFS Repository Connection</a>
+</li>
+<li>
+<a href="#jdbcrepository">Generic Database Repository Connection</a>
+</li>
+<li>
+<a href="#filenetrepository">IBM FileNet P8 Repository Connection</a>
+</li>
+<li>
+<a href="#documentumrepository">EMC Documentum Repository Connection</a>
+</li>
+<li>
+<a href="#livelinkrepository">OpenText LiveLink Repository Connection</a>
+</li>
+<li>
+<a href="#mexexrepository">Memex Patriarch Repository Connection</a>
+</li>
+<li>
+<a href="#meridiorepository">Autonomy Meridio Repository Connection</a>
+</li>
+<li>
+<a href="#sharepointrepository">Microsoft SharePoint Repository Connection</a>
+</li>
+</ul>
+</li>
+</ul>
+</div>
+
+        
+<a name="N1000E"></a><a name="overview"></a>
+<h2 class="h3">Overview</h2>
+<div class="section">
+<p>This manual is intended for an end-user of ManifoldCF.  It is assumed that the Framework has been properly installed, either by you or by a system integrator,
+                   with all required services running and desired connection types properly registered.  If you think you need to know how to do that yourself, please visit the "Developer Resources" page.
+            </p>
+<p>Most of this manual describes how to use the ManifoldCF user interface.  On a standard ManifoldCF deployment, you would reach that interface by giving your browser
+                  a URL something like this: <span class="codefrag">http://my-server-name:8080/acf-crawler-ui</span>.  This will, of course, differ from system to system.  Please contact your system administrator
+                  to find out what URL is appropriate for your environment.
+            </p>
+<p>The ManifoldCF UI has been tested with Firefox and various incarnations of Internet Explorer.  If you use another browser, there is a small chance that the UI
+                  will not work properly.  Please let your system integrator know if you find any browser incompatibility problems.</p>
+<p>When you enter the Framework user interface for the first time, you should see a screen that looks something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Welcome Screen" src="images/welcome-screen.PNG" width="80%"></div>
+<br>
+<br>
+<p>On the left, there are menu options you can select.  The main pane on the right shows a welcome message, but depending on what you select on the left, the contents of the main pane
+                  will change.  Before you try to accomplish anything, please take a moment to read the descriptions below of the menu selections, and thus get an idea of how the Framework works
+                  as a whole.
+            </p>
+<a name="N10031"></a><a name="outputs"></a>
+<h3 class="h4">Defining Output Connections</h3>
+<p>The Framework UI's left-side menu contains a link for listing output connections.  An output connection is a connection to a system or place where documents fetched from various
+                       repositories can be written to.  This is often a search engine.</p>
+<p>All jobs must specify an output connection.  You can create an output connection by clicking the "List Output Connections" link in the left-side navigation menu.  When you do this, the
+                       following screen will appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="List Output Connections" src="images/list-output-connections.PNG" width="80%"></div>
+<br>
+<br>
+<p>On a freshly created system, there may well be no existing output connections listed.  If there are already output connections, they will be listed on this screen, along with links that allow
+                      you to view, edit, or delete them.  To create a new output connection, click the "Add new output connection" link at the bottom.  The following screen will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Output Connection, specify Name" src="images/add-new-output-connection-name.PNG" width="80%"></div>
+<br>
+<br>
+<p>The tabs across the top each present a different view of your output connection.  Each tab allows you to edit a different characteristic of that connection.  The exact set of tabs you see
+                       depends on the connection type you choose for the connection.</p>
+<p>Start by giving your connection a name and a description.  Remember that all output connection names must be unique, and cannot be changed after the connection is defined.  The name must be
+                       no more than 32 characters long.  The description can be up to 255 characters long.  When you are done, click on the "Type" tab.  The Type tab for the connection will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Output Connection, select Type" src="images/add-new-output-connection-type.PNG" width="80%"></div>
+<br>
+<br>
+<p>The list of output connection types in the pulldown box, and what they are each called, is determined by your system integrator.  The configuration tabs for each different kind of output connection
+                       type are described in separate sections below.</p>
+<p>After you choose an output connection type, click the "Continue" button at the bottom of the pane.  You will then see all the tabs appropriate for that kind of connection appear, and a
+                       "Save" button will also appear at the bottom of the pane.  You <b>must</b> click the "Save" button when you are done in order to create your connection.  If you click "Cancel" instead, the new connection
+                       will not be created.  (The same thing will happen if you click on any of the navigation links in the left-hand pane.)</p>
+<p>Every output connection has a "Throttling" tab.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Output Connection Throttling" src="images/output-throttling.PNG" width="80%"></div>
+<br>
+<br>
+<p>On this tab, you can specify only one thing: how many open connections are allowed at any given time to the system the output connection talks with.  This restriction helps prevent
+                       that system from being overloaded, or in some cases exceeding its license limitations.  Conversely, making this number larger allows for greater overall throughput.  The default
+                       value is 10, which may not be optimal for all types of output connections.  Please refer to the section of the manual describing your output connection type for more precise
+                       recommendations.
+                </p>
+<p>Please refer to the section of the manual describing your chosen output connection type for a description of the tabs appropriate for that connection type.</p>
+<p>After you save your connection, a summary screen will be displayed that describes your connection's configuration.  This summary screen contains a line where the connection's status
+                       is displayed.  If you did everything correctly, the message "Connection working" will be displayed as a status.  If there was a problem, you will see a connection-type-specific diagnostic message instead.
+                       If this happens, you will need to correct the problem, by either fixing your infrastructure, or by editing the connection configuration appropriately, before the output connection
+                       will work correctly.</p>
+<a name="N10088"></a><a name="authorities"></a>
+<h3 class="h4">Defining Authority Connections</h3>
+<p>The Framework UI's left-side menu contains a link for listing authority connections.  An authority connection is a connection to a system that defines a particular security environment.
+                       For example, if you want to index some documents that are protected by Active Directory, you would need to configure an Active Directory authority connection.</p>
+<p>You may not need an authority connection if you do not mind that all of the documents you want to index are visible to everyone.  For web crawling and RSS crawling, this might be the
+                       situation.  Most other repositories have some security mechanism, however.</p>
+<p>You should define your authority connections <b>before</b> setting up your repository connections.  While it is possible to change the relationship between a repository connection
+                       and its authority after the fact, in practice such changes may cause many documents to require reindexing.</p>
+<p>You can create an authority connection by clicking the "List Authority Connections" link in the left-side navigation menu.  When you do this, the
+                       following screen will appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="List Authority Connections" src="images/list-authority-connections.PNG" width="80%"></div>
+<br>
+<br>
+<p>On a freshly created system, there may well be no existing authority connections listed.  If there are already authority connections, they will be listed on this screen, along with links
+                       that allow you to view, edit, or delete them.  To create a new authority connection, click the "Add a new connection" link at the bottom.  The following screen will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Authority Connection, specify Name" src="images/add-new-authority-connection-name.PNG" width="80%"></div>
+<br>
+<br>
+<p>The tabs across the top each present a different view of your authority connection.  Each tab allows you to edit a different characteristic of that connection.  The exact set of tabs you see
+                       depends on the connection type you choose for the connection.</p>
+<p>Start by giving your connection a name and a description.  Remember that all authority connection names must be unique, and cannot be changed after the connection is defined.  The name must be
+                       no more than 32 characters long.  The description can be up to 255 characters long.  When you are done, click on the "Type" tab.  The Type tab for the connection will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Authority Connection, select Type" src="images/add-new-authority-connection-type.PNG" width="80%"></div>
+<br>
+<br>
+<p>The list of authority connection types in the pulldown box, and what they are each called, is determined by your system integrator.  The configuration tabs for each different kind of authority connection
+                       type are described in separate sections below.</p>
+<p>After you choose an authority connection type, click the "Continue" button at the bottom of the pane.  You will then see all the tabs appropriate for that kind of connection appear, and a
+                       "Save" button will also appear at the bottom of the pane.  You <b>must</b> click the "Save" button when you are done in order to create your connection.  If you click "Cancel" instead, the new connection
+                       will not be created.  (The same thing will happen if you click on any of the navigation links in the left-hand pane.)</p>
+<p>Every authority connection has a "Throttling" tab.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Authority Connection Throttling" src="images/authority-throttling.PNG" width="80%"></div>
+<br>
+<br>
+<p>On this tab, you can specify only one thing: how many open connections are allowed at any given time to the system the authority connection talks with.  This restriction helps prevent
+                       that system from being overloaded, or in some cases exceeding its license limitations.  Conversely, making this number larger allows for smaller average search latency.  The default
+                       value is 10, which may not be optimal for all types of authority connections.  Please refer to the section of the manual describing your authority connection type for more precise
+                       recommendations.
+                </p>
+<p>Please refer to the section of the manual describing your chosen authority connection type for a description of the tabs appropriate for that connection type.</p>
+<p>After you save your connection, a summary screen will be displayed that describes your connection's configuration.  This summary screen contains a line where the connection's status
+                       is displayed.  If you did everything correctly, the message "Connection working" will be displayed as a status.  If there was a problem, you will see a connection-type-specific diagnostic message instead.
+                       If this happens, you will need to correct the problem, by either fixing your infrastructure, or by editing the connection configuration appropriately, before the authority connection
+                       will work correctly.</p>
+<a name="N100E8"></a><a name="connectors"></a>
+<h3 class="h4">Defining Repository Connections</h3>
+<p>The Framework UI's left-hand menu contains a link for listing repository connections.  A repository connection is a connection to the repository system that contains the documents
+                       that you are interested in indexing.</p>
+<p>All jobs require you to specify a repository connection, because that is where they get their documents from.  It is therefore necessary to create a repository connection before
+                       indexing any documents.</p>
+<p>A repository connection also may have an associated authority connection.  This specified authority determines the security environment in which documents from the repository
+                       connection are placed.  While it is possible to change the specified authority for a repository connection after a crawl has been done, in practice this will require that all documents
+                       associated with that repository connection be reindexed.  Therefore, we recommend that you set up your desired authority connection before defining your repository connection.</p>
+<p>You can create a repository connection by clicking the "List Repository Connections" link in the left-side navigation menu.  When you do this, the
+                       following screen will appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="List Repository Connections" src="images/list-repository-connections.PNG" width="80%"></div>
+<br>
+<br>
+<p>On a freshly created system, there may well be no existing repository connections listed.  If there are already repository connections, they will be listed on this screen, along with links
+                       that allow you to view, edit, or delete them.  To create a new repository connection, click the "Add a new connection" link at the bottom.  The following screen will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Repository Connection, specify Name" src="images/add-new-repository-connection-name.PNG" width="80%"></div>
+<br>
+<br>
+<p>The tabs across the top each present a different view of your repository connection.  Each tab allows you to edit a different characteristic of that connection.  The exact set of tabs you see
+                       depends on the connection type you choose for the connection.</p>
+<p>Start by giving your connection a name and a description.  Remember that all repository connection names must be unique, and cannot be changed after the connection is defined.  The name must be
+                       no more than 32 characters long.  The description can be up to 255 characters long.  When you are done, click on the "Type" tab.  The Type tab for the connection will then appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Repository Connection, select Type" src="images/add-new-repository-connection-type.PNG" width="80%"></div>
+<br>
+<br>
+<p>The list of repository connection types in the pulldown box, and what they are each called, is determined by your system integrator.  The configuration tabs for each different kind of repository connection
+                       type are described in separate sections below.</p>
+<p>At this point you may also select the authority connection that will secure all documents fetched from this repository.  Bear in mind that only some authority connection types are compatible with any
+                       given repository connection type.  Read the details of your desired repository or authority connection type to understand its intentions, and how it is expected to be used.</p>
+<p>After you choose the desired repository connection type and an authority connection, click the "Continue" button at the bottom of the pane.  You will then see all the tabs appropriate for that kind of connection appear, and a
+                       "Save" button will also appear at the bottom of the pane.  You <b>must</b> click the "Save" button when you are done in order to create or update your connection.  If you click "Cancel" instead, the new connection
+                       will not be created.  (The same thing will happen if you click on any of the navigation links in the left-hand pane.)</p>
+<p>Every repository connection has a "Throttling" tab.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Repository Connection Throttling" src="images/repository-throttling.PNG" width="80%"></div>
+<br>
+<br>
+<p>On this tab, you can specify two things.  The first is how many open connections are allowed at any given time to the system the repository connection talks with.  This restriction helps prevent
+                       that system from being overloaded, or in some cases exceeding its license limitations.  Conversely, making this number larger allows for greater overall throughput.  The default
+                       value is 10, which may not be optimal for all types of repository connections.  Please refer to the section of the manual describing your repository connection type for more precise
+                       recommendations.  The second specifies how rapidly, on average, the crawler will fetch documents via this connection.
+                </p>
+<p>Each connection type has its own notion of "throttling bin".  A throttling bin is the name of a resource whose access needs to be throttled.  For example, the Web connection type uses a
+                       document's server name as the throttling bin associated with the document, since (presumably) it will be access to each individual server that will need to be throttled independently.
+                </p>
+<p>On the repository connection "Throttling" tab, you can specify an unrestricted number of throttling descriptions.  Each throttling description consists of a regular expression that describes
+                       a family of throttling bins, plus a helpful description, plus an average number of fetches per minute for each of the throttling bins that matches the regular expression.  If a given
+                       throttling bin matches more than one throttling description, the most conservative fetch rate is chosen.</p>
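+<p>The rule for choosing a fetch rate can be illustrated with a short Python sketch.  This is not ManifoldCF code: the <span class="codefrag">ThrottleDescription</span> class, the <span class="codefrag">effective_fetch_rate</span> function, and the sample server names are hypothetical, and the sketch assumes that a regular expression "matches" a bin when it is found anywhere in the bin name.</p>

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThrottleDescription:
    """Hypothetical stand-in for one row on the Throttling tab."""
    bin_regex: str             # regular expression describing a family of throttling bins
    description: str           # free-text description
    fetches_per_minute: float  # average fetch rate allowed per matching bin


def effective_fetch_rate(bin_name: str,
                         throttles: List[ThrottleDescription]) -> Optional[float]:
    """Return the most conservative (lowest) rate among matching descriptions.

    Returns None when nothing matches, meaning no fetch-rate throttling.
    Assumption: a description matches if its regex is found anywhere in the bin name.
    """
    rates = [t.fetches_per_minute
             for t in throttles
             if re.search(t.bin_regex, bin_name)]
    return min(rates) if rates else None


# The empty regular expression matches every bin, so it acts as a default policy.
throttles = [
    ThrottleDescription("", "default for all bins", 60.0),
    ThrottleDescription(r"\.example\.com$", "slower rate for one family of servers", 12.0),
]

print(effective_fetch_rate("www.example.com", throttles))
print(effective_fetch_rate("www.apache.org", throttles))
print(effective_fetch_rate("www.apache.org", []))
```

+<p>In this sketch the first server matches both descriptions and receives the lower rate, the second matches only the empty expression and receives the default rate, and an empty list of descriptions leaves the bin unthrottled.</p>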
+<p>The simplest regular expression you can use is the empty regular expression.  This will match all of the connection's throttle bins, and thus will allow you to specify a default
+                       throttling policy for the connection.  Set the desired average fetch rate, and click the "Add" button.  The throttling tab will then appear something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Repository Connection Throttling With Throttle" src="images/repository-throttling-with-throttle.PNG" width="80%"></div>
+<br>
+<br>
+<p>If no throttle descriptions are added, no fetch-rate throttling will be performed.</p>
+<p>Please refer to the section of the manual describing your chosen repository connection type for a description of the tabs appropriate for that connection type.</p>
+<p>After you save your connection, a summary screen will be displayed that describes your connection's configuration.  This summary screen contains a line where the connection's status
+                       is displayed.  If you did everything correctly, the message "Connection working" will be displayed as a status.  If there was a problem, you will see a connection-type-specific diagnostic message instead.
+                       If this happens, you will need to correct the problem, by either fixing your infrastructure, or by editing the connection configuration appropriately, before the repository connection
+                       will work correctly.</p>
+<a name="N1015F"></a><a name="jobs"></a>
+<h3 class="h4">Creating Jobs</h3>
+<p>A "job" in ManifoldCF is a description of a set of documents.  The Framework's job is to fetch this set of documents from a specific repository connection, and
+                       send them to a specific output connection.  The repository connection that is associated with the job will determine exactly how this set of documents is described, and to some
+                       degree how they are indexed.  The output connection associated with the job can also affect how each document is indexed.</p>
+<p>Every job is expected to be run more than once.  Each time a job is run, it is responsible not only for sending new or changed documents to the output connection, but also for
+                       notifying the output connection of any documents that are no longer part of the set.  Note that there are two ways for a document to no longer be part of the included set of documents:
+                       Either the document may have been deleted from the repository, or the document may no longer be included in the allowed set of documents.  The Framework handles each case properly.</p>
+<p>Deleting a job causes the output connection to be notified of deletion for all documents belonging to that job.  This makes sense because the job represents the set of documents, which would
+                       otherwise be orphaned when the job was removed.  (Some users make the assumption that a ManifoldCF job represents nothing more than a task, which is an incorrect
+                       assumption.)</p>
+<p>Note that the Framework allows jobs that describe overlapping sets of documents to be defined.  Documents that exist in more than one job are treated in the following special ways:</p>
+<ul>
+                    
+<li>When a job is deleted, the output connection is notified of deletion of documents belonging to that job only if they don't belong to another job</li>
+                    
+<li>The version of the document sent to the output connection depends on which job was run last</li>
+                
+</ul>
+<p>The subtle logic of overlapping documents means that you probably want to avoid this situation entirely, if it is at all feasible.</p>
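+<p>As a rough illustration of the deletion rule for overlapping jobs, the following Python sketch computes which documents the output connection would be told to delete when a job is removed.  The job names, document identifiers, and function are hypothetical and only model the rule as stated above.</p>

```python
def deletions_on_job_delete(job_documents, job_to_delete):
    """Return the documents to delete from the index when a job is removed.

    job_documents maps a job name to the set of document identifiers it owns.
    """
    shared = set()
    for name, documents in job_documents.items():
        if name != job_to_delete:
            shared |= documents
    # Only documents that no other job still describes are deleted.
    return sorted(job_documents[job_to_delete] - shared)


jobs = {
    "intranet": {"doc1", "doc2", "doc3"},
    "archive": {"doc3", "doc4"},
}
removed = deletions_on_job_delete(jobs, "intranet")
```

+<p>Here "doc3" survives the deletion of the "intranet" job because the "archive" job still describes it.</p>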
+<p>A typical non-continuous run of a job has the following stages of execution:</p>
+<ol>
+                    
+<li>Adding the job's new, changed, or deleted starting points to the queue ("seeding")</li>
+                    
+<li>Fetching documents, discovering new documents, and detecting deletions</li>
+                    
+<li>Removing no-longer-included documents from the queue</li>
+                
+</ol>
+<p>Jobs can also be run "continuously", which means that the job never completes, unless it is aborted.  A continuous run has different stages of execution:</p>
+<ol>
+                    
+<li>Adding the job's new, changed, or deleted starting points to the queue ("seeding")</li>
+                    
+<li>Fetching documents, discovering new documents, and detecting deletions, while reseeding periodically</li>
+                
+</ol>
+<p>Note that continuous jobs <b>cannot</b> remove no-longer-included documents from the queue.  They can only remove documents that have been deleted from the repository.</p>
+<p>A job can independently be configured to start when explicitly started by a user, or to run on a user-specified schedule.  If a job is set up to run on a schedule, it can be made to
+                      start only at the beginning of a schedule window, or to start again within any remaining schedule window when the previous job run completes.</p>
+<p>There is no restriction in ManifoldCF as to how many jobs may be running at any given time.</p>
+<p>You create a job by first clicking on the "List All Jobs" link on the left-side menu.  The following screen will appear:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="List Jobs" src="images/list-jobs.PNG" width="80%"></div>
+<br>
+<br>
+<p>You may view, edit, or delete any existing jobs by clicking on the appropriate link.  You may also create a new job that is a copy of an existing job.  But to create a brand-new job,
+                       click the "Add a new job" link at the bottom.  You will then see the following page:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Job, name tab" src="images/add-new-job-name.PNG" width="80%"></div>
+<br>
+<br>
+<p>Give your job a name.  Note that job names do <b>not</b> have to be unique, although it is probably less confusing to have a different name for each one.  Then, click the
+                       "Connection" tab:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Job, connection tab" src="images/add-new-job-connection.PNG" width="80%"></div>
+<br>
+<br>
+<p>Now, you should select both the output connection name, and the repository connection name.  Bear in mind that whatever you select cannot be changed after the job is saved
+                       the first time.</p>
+<p>You also have the opportunity to modify the job's priority and start method at this time.  The priority
+                       controls how important this job's documents are, relative to documents from any other job.  The higher the number, the more important it is considered for that job's documents to be
+                       fetched first.  The start method is as previously described; you get a choice of manual start, starting on the beginning of a scheduling window, or starting whenever possible within
+                       a scheduling window.</p>
+<p>Make your selections, and click "Continue".  The rest of the job's tabs will now appear, and a
+                       "Save" button will also appear at the bottom of the pane.  You <b>must</b> click the "Save" button when you are done in order to create or update your job.  If you click "Cancel" instead, the new job
+                       will not be created.  (The same thing will happen if you click on any of the navigation links in the left-hand pane.)</p>
+<p>All jobs have a "Scheduling" tab.  The scheduling tab allows you to set up schedule-related configuration information:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Job, scheduling tab" src="images/add-new-job-scheduling.PNG" width="80%"></div>
+<br>
+<br>
+<p>On this tab, you can specify the following parameters:</p>
+<ul>
+                    
+<li>Whether the job runs continuously, or scans every document once</li>
+                    
+<li>How long a document should remain alive before it is 'expired', and removed from the index</li>
+                    
+<li>How long an interval before a document is re-checked, to see if it has changed</li>
+                    
+<li>How long to wait before reseeding initial documents</li>
+                
+</ul>
+<br>
+<p>The last three parameters only make sense if a job is a continuously running one, as the UI indicates.</p>
+<p>The other thing you can do on this tab is to define an appropriate set of scheduling records.  Each scheduling record defines some related set of intervals during which the job can run.  The
+                       intervals are determined by the starting time (which is defined by the day of week, month, day, hour, and minute pulldowns), and the maximum run time in minutes, which determines
+                       when the interval ends.  It is, of course, possible to select multiple values for each of the pulldowns, in which case you will be describing a starting time that has to match at least <b>one</b>
+                       of the selected values for <b>each</b> of the specified fields.</p>
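+<p>The matching rule for a scheduling record can be sketched as follows.  This is a minimal Python model, not the Framework's code: the field names and the dictionary layout are hypothetical, and the sketch assumes that a field with nothing selected matches any value.</p>

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def matches_start(record, when):
    """Return True if 'when' is a valid starting time for the record.

    record maps a field name to the set of values selected in its pulldown.
    """
    actual = {
        "day_of_week": DAYS[when.weekday()],
        "month": when.month,
        "day": when.day,
        "hour": when.hour,
        "minute": when.minute,
    }
    # Every field must match at least one of its selected values.
    # Assumption: a field with no selection matches anything.
    return all(
        not record.get(field) or value in record[field]
        for field, value in actual.items()
    )


weekend_nights = {"day_of_week": {"Saturday", "Sunday"}, "hour": {2}, "minute": {0}}
saturday_2am = matches_start(weekend_nights, datetime(2011, 6, 25, 2, 0))
monday_2am = matches_start(weekend_nights, datetime(2011, 6, 27, 2, 0))
```

+<p>Selecting two days of the week and a single hour therefore describes two starting times per week, not one combined condition.</p>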
+<p>Once you have selected the schedule values you want, click the "Add Scheduled Time" button:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Add New Job, scheduling tab with record" src="images/add-new-job-scheduling-with-record.PNG" width="80%"></div>
+<br>
+<br>
+<p>The example shows a schedule where crawls are run on Saturday and Sunday nights at 2 AM, and run for no more than 4 hours.</p>
+<p>The rest of the job tabs depend on the types of the connections you selected.  Please refer to the section of the manual
+                       describing the appropriate connection types corresponding to your chosen repository and output connections for a description of the job tabs that will appear for those connections.</p>
+<a name="N10220"></a><a name="executing"></a>
+<h3 class="h4">Executing Jobs</h3>
+<p>You can follow what is going on, and control the execution of your jobs, by clicking on the "Status and Job Management" link on the left-side navigation menu.  When you do, you might
+                       see something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Job Status" src="images/job-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>From here, you can click the "Refresh" link at the bottom of the main pane to see an updated status display, or you can directly control the job using the links in the leftmost
+                       status column.  Allowed actions you may see at one point or another include:</p>
+<ul>
+                    
+<li>Start (start the job)</li>
+                    
+<li>Abort (abort the job)</li>
+                    
+<li>Pause (pause the job)</li>
+                    
+<li>Resume (resume the job)</li>
+                    
+<li>Restart (equivalent to aborting the job, and starting it all over again)</li>
+                
+</ul>
+<br>
+<p>The columns "Documents", "Active", and "Processed" have very specific meanings as far as documents in the job's queue are concerned.  The "Documents" column counts all the documents
+                       that belong to the job.  The "Active" column counts all of the documents for that job that are queued up for processing.  The "Processed" column counts all documents that are on the
+                       queue for the job that have been processed at least once in the past.</p>
+<a name="N1024F"></a><a name="statusreports"></a>
+<h3 class="h4">Status Reports</h3>
+<p>Every job in ManifoldCF describes a set of documents.  A reference to each document in the set is kept in a job-specific queue.  It is sometimes valuable for
+                       diagnostic reasons to examine this queue for information.  The Framework UI has several canned reports which do just that.</p>
+<p>Each status report allows you to select what documents you are interested in from a job's queue based on the following information:</p>
+<ul>
+                    
+<li>The job</li>
+                    
+<li>The document identifier</li>
+                    
+<li>The document's status and state</li>
+                    
+<li>When the document is scheduled to be processed next</li>
+                
+</ul>
+<a name="N1026A"></a><a name="documentstatus"></a>
+<h4>Document Status</h4>
+<p>A document status report simply lists all matching documents from within the queue, along with their state, status, and planned future activity.  You might use this report if you were
+                           trying to figure out (for example) whether a specific document had been processed yet during a job run.</p>
+<p>Click on the "Document Status" link on the left-hand menu.  You will see a screen that looks something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Document Status, select connection" src="images/document-status-select-connection.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the desired connection.  You may also select the desired document state and status, as well as specify a regular expression for the document identifier, if you want.  Then,
+                           click the "Continue" button:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Document Status, select job" src="images/document-status-select-job.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the job whose documents you want to see, and click "Continue" once again.  The results will display:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Document Status, example" src="images/document-status-example.PNG" width="80%"></div>
+<br>
+<br>
+<p>You may alter the criteria, and click "Go" again, if you so choose.  Or, you can alter the number of result rows displayed at a time, and click "Go" to redisplay.  Finally, you can page
+                           up and down through the results using the "Prev" and "Next" links.</p>
+<a name="N102A1"></a><a name="queuestatus"></a>
+<h4>Queue Status</h4>
+<p>A queue status report is an aggregate report that counts the number of occurrences of documents in specified classes.  The classes are specified as a grouping within a regular
+                           expression, which is matched against all specified document identifiers.  The results that are displayed are counts of documents.  There will be a column for each combination of
+                           document state and status.</p>
+<p>For example, a class specification of "()" will produce exactly one result row, and will provide a count of documents that are in each state/status combination.  A class description
+                           of "(.*)", on the other hand, will create one row for each document identifier, and will put a "1" in the column representing state and status of that document, with a "0" in all other
+                           column positions.</p>
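+<p>To make the role of the class expression concrete, here is a small Python sketch of the aggregation.  It is an approximation under stated assumptions: the state/status labels and sample identifiers are invented, and the sketch assumes the expression is searched for anywhere in the identifier, with the first parenthesized group naming the class.</p>

```python
import re
from collections import Counter, defaultdict


def queue_status(documents, class_pattern):
    """Count documents per class and per state/status label.

    documents is an iterable of (identifier, state_status) pairs.
    """
    regex = re.compile(class_pattern)
    rows = defaultdict(Counter)
    for identifier, state_status in documents:
        match = regex.search(identifier)
        if match:
            # The first group of the match is the class (the result row).
            rows[match.group(1)][state_status] += 1
    return {cls: dict(counts) for cls, counts in rows.items()}


docs = [
    ("http://a.example/x", "Processed"),
    ("http://a.example/y", "Pending"),
    ("http://b.example/z", "Processed"),
]
one_row = queue_status(docs, "()")
per_document = queue_status(docs, "(.*)")
per_host = queue_status(docs, r"://([^/]+)/")
```

+<p>An empty group puts every document in the same class, a group that captures the whole identifier gives each document its own row, and a group that captures only the host name yields one row per server.</p>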
+<p>Click the "Queue Status" link on the left-hand menu.  You will see a screen that looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Queue Status, select connection" src="images/queue-status-select-connection.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the desired connection.  You may also select the desired document state and status, as well as specify a regular expression for the document identifier, if you want.  You
+                           will probably want to change the document identifier class from its default value of "(.*)".  Then, click the "Continue" button:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Queue Status, select job" src="images/queue-status-select-job.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the job whose documents you want to see, and click "Continue" once again.  The results will display:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Queue Status, example" src="images/queue-status-example.PNG" width="80%"></div>
+<br>
+<br>
+<p>You may alter the criteria, and click "Go" again, if you so choose.  Or, you can alter the number of result rows displayed at a time, and click "Go" to redisplay.  Finally, you can page
+                           up and down through the results using the "Prev" and "Next" links.</p>
+<a name="N102DC"></a><a name="historyreports"></a>
+<h3 class="h4">History Reports</h3>
+<p>For every repository connection, ManifoldCF keeps a history of what has taken place involving that connection.  This history includes both events that the
+                       framework itself logs, as well as events that a repository connection or output connection will log.  These individual events are categorized by "activity type".  Some of the kinds of
+                       activity types that exist are:</p>
+<ul>
+                    
+<li>Job start</li>
+                    
+<li>Job end</li>
+                    
+<li>Job abort</li>
+                    
+<li>Various connection-type-specific read or access operations</li>
+                    
+<li>Various connection-type-specific output or indexing operations</li>
+                
+</ul>
+<p>This history can be enormously helpful in understanding how your system is behaving, and whether or not it is working properly.  For this reason, the Framework UI has the ability to
+                       generate several canned reports which query this history data and display the results.</p>
+<p>All history reports allow you to specify what history records you are interested in including.  These records are selected using the following criteria:</p>
+<ul>
+                    
+<li>The repository connection name</li>
+                    
+<li>The activity type(s) desired</li>
+                    
+<li>The start time desired</li>
+                    
+<li>The end time desired</li>
+                    
+<li>The identifier(s) involved, specified as a regular expression</li>
+                    
+<li>The result(s) produced, specified as a regular expression</li>
+                
+</ul>
+<p>The actual reports available are designed to be useful for diagnosing both access issues, and performance issues.  See below for a summary of the types available.</p>
+<a name="N10315"></a><a name="simplehistory"></a>
+<h4>Simple History Reports</h4>
+<p>As the name suggests, a simple history report does not attempt to aggregate any data, but instead just lists matching records from the repository connection's history.
+                           These records are initially presented in most-recent-first order, and include columns for the start and end time of the event, the kind of activity represented by the event,
+                           the identifier involved, the number of bytes involved, and the results of the event.  Once displayed, you may choose to display more or less data, or reorder the display by column, or page through the data.</p>
+<p>To get started, click on the "Simple History" link on the left-hand menu.  You will see a screen that looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Simple History Report, select connection" src="images/simple-history-select-connection.PNG" width="80%"></div>
+<br>
+<br>
+<p>Now, select the desired repository connection from the pulldown in the upper left-hand corner.  If you like, you can also change the specified date/time range, or specify an identifier
+                           regular expression or result code regular expression.  By default, the date/time range selects all events within the last hour, while the identifier regular expression and result code
+                           regular expression match all identifiers and result codes.</p>
+<p>Next, click the "Continue" button.  A list of pertinent activities should then appear in a pulldown in the upper right:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Simple History Report, select activities" src="images/simple-history-select-activities.PNG" width="80%"></div>
+<br>
+<br>
+<p>You may select one or more activities that you would like a report on.  When you are done, click the "Go" button.  The results will appear, ordered by time, most recent event first:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Simple History Report, example" src="images/simple-history-example.PNG" width="80%"></div>
+<br>
+<br>
+<p>You may alter the criteria, and click "Go" again, if you so choose.  Or, you can alter the number of result rows displayed at a time, and click "Go" to redisplay.  Finally, you can page
+                           up and down through the results using the "Prev" and "Next" links.</p>
+<p>Please bear in mind that the report redisplays whatever matches each time you click "Go".  So, if your time interval goes from an hour beforehand to "now", and you have activity
+                           happening, you will see different results each time "Go" is clicked.</p>
+<a name="N10352"></a><a name="maxactivity"></a>
+<h4>Maximum Activity Reports</h4>
+<p>A maximum activity report is an aggregate report used primarily to display the maximum rate at which events occur within a specified time interval.  MHL</p>
+<a name="N1035C"></a><a name="maxbandwidth"></a>
+<h4>Maximum Bandwidth Reports</h4>
+<p>A maximum bandwidth report is an aggregate report used primarily to display the maximum byte rate that pertains to events occurring within a specified time interval.  MHL</p>
+<a name="N10366"></a><a name="resulthistogram"></a>
+<h4>Result Histogram Reports</h4>
+<p>A result histogram report is an aggregate report used to count the occurrences of each kind of matching result for all matching events.  MHL</p>
+<a name="N10371"></a><a name="credentials"></a>
+<h3 class="h4">A Note About Credentials</h3>
+<p>If any of your selected connection types require credentials, you may find it necessary to approach your system administrator to obtain an appropriate set.  System administrators
+                       are often reluctant to provide accounts and credentials that have any more power than is utterly necessary, and sometimes not even that.  Great care has been taken in the
+                       development of all connection types to be sure they require no more privilege than is utterly necessary.  If a security-related warning appears when you view a connection's
+                       status, you must inform the system administrator that the credentials are inadequate to allow the connection to accomplish its task, and work with him/her to correct the problem.
+                </p>
+</div>
+        
+        
+<a name="N1037C"></a><a name="outputconnectiontypes"></a>
+<h2 class="h3">Output Connection Types</h2>
+<div class="section">
+<a name="N10382"></a><a name="solroutputconnector"></a>
+<h3 class="h4">Solr Output Connection</h3>
+<p>The Solr output connection type is designed to allow ManifoldCF to submit documents to an appropriate Solr pipeline, via the Solr
+                       HTTP ingestion API.  The configuration parameters are set to the default Solr values, which can be changed (since Solr's configuration can be changed).
+                       The Solr output connection type furthermore makes no judgment as to whether a given document is indexable or not - it accepts everything, and passes all documents
+                       on to the pipeline, where presumably the configured pipeline will decide if a document should be rejected or not.  (All of that happens without a Solr connection
+                       being aware of it in any way.)</p>
+<p>Unfortunately, this lack of specificity comes at a cost.  Unless you take care to filter documents properly in each job, large movie files or other opaque
+                       content may well be picked up and sent to Solr for indexing, which will greatly increase the dead load on the overall system.  It is therefore a good idea to review
+                       all crawls done through a Solr connection while they are underway, to be sure there isn't a misconfiguration of this kind.</p>
+<p>When you create a Solr output connection, three configuration tabs appear.  The "Server" tab allows you to configure the HTTP target of the connection:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Solr Configuration, Server tab" src="images/solr-configure-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>Fill in the fields according to your Solr configuration.  The Solr connection type supports only basic authentication at this time; if you have this enabled, supply the credentials
+                       as requested on the bottom part of the form.</p>
+<p>The second tab is the "Schema" tab, which allows you to specify the name of the Solr field to use as a document identifier.  The Solr connection type will treat
+                       this field as being a unique key for locating the indexed document for further modification or deletion:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Solr Configuration, Schema tab" src="images/solr-configure-schema.PNG" width="80%"></div>
+<br>
+<br>
+<p>The third tab is the "Arguments" tab, which allows you to specify arbitrary arguments to be sent to Solr.  This is a popular way of telling Solr how to handle
+                       specific documents, so the connection type allows you to add arguments to each Solr indexing request:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Solr Configuration, Arguments tab" src="images/solr-configure-arguments.PNG" width="80%"></div>
+<br>
+<br>
+<p>Fill in the argument name and value, and click the "Add" button.  Bear in mind that if you add an argument with the same name as an existing one, it will replace the
+                       existing one with the new specified value.  You can delete existing arguments by clicking the "Delete" button next to the argument you want to delete.</p>
+<p>When you are done, don't forget to click the "Save" button to save your changes!  When you do, a connection summary and status screen will be presented, which
+                       may look something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Solr Status" src="images/solr-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>Note that in this example, the Solr connection is not responding, which is leading to an error status message instead of "Connection working".</p>
+<p>When you configure a job to use a Solr-type output connection, the Solr connection type provides a tab called "Field Mapping".  The purpose of this tab
+                       is to allow you to map metadata fields as fetched by the job's connection type to fields that Solr is set up to receive.  This is necessary because
+                       the names of the metadata items are often determined by the repository, with no alignment to fields defined in the Solr schema.  You may also
+                       suppress specific metadata items from being sent to the index using this tab.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Solr Specification, Field Mapping tab" src="images/solr-job-field-mapping.PNG" width="80%"></div>
+<br>
+<br>
+<p>Add a new mapping by filling in the "source" with the name of the metadata item from the repository, and "target" as the name of the output field in
+                       Solr, and click the "Add" button.  Leaving the "target" field blank will result in all metadata items of that name not being sent to Solr.</p>
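+<p>The effect of a field mapping can be modeled with a short Python sketch.  The metadata names and the function are hypothetical, and the sketch assumes that metadata items with no mapping entry are passed through under their original names; only the rename and the blank-target suppression are taken from the description above.</p>

```python
def apply_field_mapping(metadata, mapping):
    """Rename or suppress metadata items before they are sent to the index.

    mapping maps a source metadata name to a target field name; a blank
    target suppresses the item.
    """
    fields = {}
    for name, value in metadata.items():
        if name not in mapping:
            # Assumption: unmapped items keep their original name.
            fields[name] = value
        elif mapping[name]:
            fields[mapping[name]] = value
        # A blank target means the item is not sent at all.
    return fields


metadata = {"Title": "Quarterly report", "Author": "jdoe", "InternalId": "42"}
mapping = {"Title": "title", "InternalId": ""}
solr_fields = apply_field_mapping(metadata, mapping)
```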
+<a name="N103E1"></a><a name="gtsoutputconnector"></a>
+<h3 class="h4">MetaCarta GTS Output Connection</h3>
+<p>The MetaCarta GTS output connection type is designed to allow ManifoldCF to submit documents to an appropriate MetaCarta GTS search
+                       appliance, via the appliance's HTTP Ingestion API.</p>
+<p>The connection type implicitly understands that GTS can only handle text, HTML, XML, RTF, PDF, and Microsoft Office documents.  All other document types will be
+                       considered to be unindexable.  This helps prevent jobs based on a GTS-type output connection from fetching data that is large, but of no particular relevance.</p>
+<p>When you configure a job to use a GTS-type output connection, two additional tabs will be presented to the user: "Collections" and "Document Templates".  These
+                       tabs allow per-job specification of these GTS-specific features.</p>
+<p>More here later</p>
+<a name="N103F4"></a><a name="nulloutputconnector"></a>
+<h3 class="h4">Null Output Connection</h3>
+<p>The null output connection type is meant primarily to function as an aid for people writing repository connection types.  It is not expected to be useful in practice.</p>
+<p>The null output connection type simply logs indexing and deletion requests, and does nothing else.  It does not have any special configuration tabs, nor does it
+                       contribute tabs to jobs defined that use it.</p>
+</div>
+        
+        
+<a name="N10402"></a><a name="authorityconnectiontypes"></a>
+<h2 class="h3">Authority Connection Types</h2>
+<div class="section">
+<a name="N10408"></a><a name="adauthority"></a>
+<h3 class="h4">Active Directory Authority Connection</h3>
+<p>An active directory authority connection is essential for enforcing security for documents from Windows shares, Microsoft SharePoint, and IBM FileNet repositories.
+                       This connection type needs to be provided with information about how to log into an appropriate Windows domain controller, with a user that has sufficient privileges to
+                       be able to look up any user's ID and group relationships.  While the connection type has some known limitations, it should function well for most straightforward Windows
+                       security architecture situations.  The cases in which it may not be adequate include:</p>
+<br>
+<ul>
+                    
+<li>when child domains are present</li>
+                    
+<li>when the expected number of requests per second is fairly high</li>
+                
+</ul>
+<br>
+<p>An active directory authority connection type has a single special tab in the authority connection editing screen: the "Domain Controller" tab:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="AD Configuration, Domain Controller tab" src="images/ad-configure-dc.PNG" width="80%"></div>
+<br>
+<br>
+<p>Fill in the requested values.  Note that the "Administrative user name" field usually requires no domain suffix, but depending on the details of how the domain
+                       controller is configured, may sometimes only accept the "name@domain" format.</p>
+<p>When you are done, click the "Save" button.  When you do, a connection
+                       summary and status screen will be presented, which
+                       may look something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="AD Status" src="images/ad-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>Note that in this example, the Active Directory connection is not responding, which is leading to an error status message instead of "Connection working".</p>
+<a name="N10441"></a><a name="livelinkauthority"></a>
+<h3 class="h4">OpenText LiveLink Authority Connection</h3>
+<p>A LiveLink authority connection is needed to enforce security for documents retrieved from LiveLink repositories.</p>
+<p>In order to function, this connection type needs to be provided with 
+                    information about the name of the LiveLink server, and credentials appropriate for retrieving a user's ACLs from that machine.  Since LiveLink operates with its own list of users, you
+                    may also want to specify a rule-based mapping between an Active Directory user and the corresponding LiveLink user.  The authority type allows you to specify such a mapping using
+                    regular expressions.</p>
+<p>A LiveLink authority connection has two special tabs you will need to configure: the "Server" tab, and the "User Mapping" tab.</p>
+<p>The "Server" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="LiveLink Authority, Server tab" src="images/livelink-authority-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the name of the desired LiveLink server, the LiveLink port, and the LiveLink credentials.</p>
+<p>The "User Mapping" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="LiveLink Authority, User Mapping tab" src="images/livelink-authority-user-mapping.PNG" width="80%"></div>
+<br>
+<br>
+<p>The purpose of the "User Mapping" tab is to allow you to map the incoming user name and domain (usually from Active Directory) to its LiveLink equivalent.
+                       The mapping consists of a match expression, which is a regular expression where parentheses ("("
+                       and ")") mark sections you are interested in, and a replace string.  The sections marked with parentheses are called "groups" in regular expression parlance.  The replace string consists of constant text plus
+                       substitutions of the groups from the match, perhaps modified.  For example, "$(1)" refers to the first group within the match, while "$(1l)" refers to the first match group
+                       mapped to lower case.  Similarly, "$(1u)" refers to the same characters, but mapped to upper case.</p>
+<p>For example, a match expression of <span class="codefrag">^(.*)\@([A-Z|a-z|0-9|_|-]*)\.(.*)$</span> with a replace string of <span class="codefrag">$(2)\$(1l)</span> would convert an AD username of
+                    <span class="codefrag">MyUserName@subdomain.domain.com</span> into the LiveLink user name <span class="codefrag">subdomain\myusername</span>.</p>
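+<p>The example mapping can be reproduced with the Python sketch below.  It is not the connector's implementation: the function name is hypothetical, the sketch handles only the three substitution forms described above, and returning nothing for a user name that does not match is an assumption.</p>

```python
import re

# Matches $(N), $(Nl) and $(Nu) in a replace string.
_SUBSTITUTION = re.compile(r"\$\((\d+)([lu]?)\)")


def map_user(match_expression, replace_string, user_name):
    """Apply a match expression and replace string to an incoming user name."""
    match = re.search(match_expression, user_name)
    if match is None:
        return None  # Assumption: no mapping is produced when nothing matches.

    def expand(substitution):
        text = match.group(int(substitution.group(1))) or ""
        modifier = substitution.group(2)
        if modifier == "l":
            return text.lower()
        if modifier == "u":
            return text.upper()
        return text

    # Everything outside a substitution is copied as constant text.
    return _SUBSTITUTION.sub(expand, replace_string)


mapped = map_user(
    r"^(.*)\@([A-Z|a-z|0-9|_|-]*)\.(.*)$",
    r"$(2)\$(1l)",
    "MyUserName@subdomain.domain.com",
)
```

+<p>With these inputs the second group captures the first domain component and the first group, lower-cased, supplies the user name, which gives the LiveLink name shown in the example.</p>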
+<p>When you are done, click the "Save" button.  You will then see a summary and status for the authority connection:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="LiveLink Authority Status" src="images/livelink-authority-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>We suggest that you examine the status carefully and correct any reported errors before proceeding.  Note that in this example, the LiveLink server would not accept connections, which
+                    leads to an error status message instead of "Connection working".</p>
+<a name="N10493"></a><a name="documentumauthority"></a>
+<h3 class="h4">EMC Documentum Authority Connection</h3>
+<p>A Documentum authority connection is required for enforcing security for documents retrieved from Documentum repositories.</p>
+<p>This connection type needs to be provided with information about what Content Server to connect to, and the credentials that should be used to retrieve a user's ACLs from that machine.
+                    You can also specify whether or not to include auto-generated ACLs in every user's list.  Auto-generated ACLs are created within Documentum for every folder
+                    object.  Because there are often a very large number of folders, including these ACLs can bloat the number of ManifoldCF access tokens returned for a user to tens of thousands, which can negatively
+                    impact performance.  Moreover, few Documentum installations make any real use of these ACLs.  Since Documentum's ACLs are purely additive (that is, there are no
+                    mechanisms for 'deny' semantics), the only impact of a missing ACL is to block a user from seeing something they otherwise could see.  It is thus safe, and often desirable, to simply ignore the
+                    existence of these auto-generated ACLs.</p>
+<p>A Documentum authority connection has three special tabs you will need to configure: the "Docbase" tab, the "User Mapping" tab, and the "System ACLs" tab.</p>
+<p>The "Docbase" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Documentum Authority, Docbase tab" src="images/documentum-authority-docbase.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the desired Content Server docbase name, and enter the appropriate credentials.  You may leave the "Domain" field blank if the Content Server you specify does not have
+                    Active Directory support enabled.</p>
+<p>The "User Mapping" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Documentum Authority, User Mapping tab" src="images/documentum-authority-user-mapping.PNG" width="80%"></div>
+<br>
+<br>
+<p>Here you can specify whether the mapping between incoming user names and Content Server user names is case sensitive or case insensitive.  No other mappings
+                    are currently permitted.  Typically, Documentum instances operate in conjunction with Active Directory, such that Documentum user names are either the same as the Active Directory user names,
+                    or are the Active Directory user names mapped to all lower case characters.  You may need to consult with your Documentum system administrator to decide what the correct setting should be for
+                    this option.</p>
+<p>The "System ACLs" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Documentum Authority, System ACLs tab" src="images/documentum-authority-system-acls.PNG" width="80%"></div>
+<br>
+<br>
+<p>Here, you can choose to ignore all auto-generated ACLs associated with a user.  We recommend that you try ignoring such ACLs, and only choose the default if you have
+                    reason to believe that your Documentum content is protected in a significant way by the use of auto-generated ACLs.  You may need to consult with your Documentum system administrator to
+                    decide what the proper setting should be for this option.</p>
+<p>When you are done, click the "Save" button.  A connection summary and status screen will then be presented:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Documentum Authority Status" src="images/documentum-authority-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>Pay careful attention to the status, and be prepared to correct any
+                    problems that are displayed.</p>
+<a name="N104E7"></a><a name="memexauthority"></a>
+<h3 class="h4">Memex Patriarch Authority Connection</h3>
+<p>A Memex authority connection is required for enforcing security for documents retrieved from Memex repositories.</p>
+<p>This connection type needs to be provided with information about what Memex Server to connect to, and what user mapping to perform.
+                    Also needed are the Memex credentials that should be used to retrieve a user's permissions from the Memex server.</p>
+<p>A Memex authority connection has the following special tabs you will need to configure: the "Memex Server" tab, and the "User Mapping" tab.  The "Memex Server" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Memex Authority, Memex Server tab" src="images/memex-authority-memex-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>You must supply the name of your Memex server, and the connection port, along with the Memex credentials for a user that has sufficient permissions to retrieve Memex user
+                    information.  You must also select the Memex server's character encoding.  If you do not know the encoding, consult your Memex system administrator.</p>
+<p>The "User Mapping" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Memex Authority, User Mapping tab" src="images/memex-authority-user-mapping.PNG" width="80%"></div>
+<br>
+<br>
+<p>The purpose of the "User Mapping" tab is to allow you to map the incoming user name and domain (usually from Active Directory) to its Memex equivalent.
+                       The mapping consists of a match expression, which is a regular expression where parentheses ("("
+                       and ")") mark sections you are interested in, and a replace string.  The sections marked with parentheses are called "groups" in regular expression parlance.  The replace string consists of constant text plus
+                       substitutions of the groups from the match, perhaps modified.  For example, "$(1)" refers to the first group within the match, while "$(1l)" refers to the first match group
+                       mapped to lower case.  Similarly, "$(1u)" refers to the same characters, but mapped to upper case.</p>
+<p>For example, a match expression of <span class="codefrag">^(.*)\@([A-Z|a-z|0-9|_|-]*)\.(.*)$</span> with a replace string of <span class="codefrag">$(2)\$(1l)</span> would convert an AD username of
+                    <span class="codefrag">MyUserName@subdomain.domain.com</span> into the Memex user name <span class="codefrag">subdomain\myusername</span>.</p>
+<p>When you are done, click the "Save" button.  You will then see a summary and status for the authority connection:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Memex Authority Status" src="images/memex-authority-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>We suggest that you examine the status carefully and correct any reported errors before proceeding.  Note that in this example, the Memex server has a license error, which
+                    leads to an error status message instead of "Connection working".</p>
+<a name="N10536"></a><a name="meridioauthority"></a>
+<h3 class="h4">Autonomy Meridio Authority Connection</h3>
+<p>A Meridio authority connection is required for enforcing security for documents retrieved from Meridio repositories.</p>
+<p>This connection type needs to be provided with information about what Document Server to connect to, what Records Server to connect to, and what User Service Server
+                    to connect to.  Also needed are the Meridio credentials that should be used to retrieve a user's ACLs from those machines.</p>
+<p>Note that the User Service is part of the Meridio Authority, and must be installed somewhere in the Meridio system in order for the Meridio Authority to function correctly.
+                    If you do not know whether this has yet been done, or on what server, please ask your system administrator.</p>
+<p>A Meridio authority connection has the following special tabs you will need to configure: the "Document Server" tab, the "Records Server" tab, the "User Service Server" tab,
+                    and the "Credentials" tab.  The "Document Server" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Meridio Authority, Document Server tab" src="images/meridio-authority-document-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the correct protocol, and enter the correct server name, port, and location to reference the Meridio document server services.  If a proxy is involved, enter the proxy host
+                    and port.  Authenticated proxies are not supported by this connection type at this time.</p>
+<p>Note that, in the Meridio system, while it is possible that different services run on different servers, this is not typically the case.  The connection type, on the other hand, makes
+                    no assumptions, and permits the most general configuration.</p>
+<p>The "Records Server" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Meridio Authority, Records Server tab" src="images/meridio-authority-records-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select the correct protocol, and enter the correct server name, port, and location to reference the Meridio records server services.  If a proxy is involved, enter the proxy host
+                    and port.  Authenticated proxies are not supported by this connection type at this time.</p>
+<p>Note that, in the Meridio system, while it is possible that different services run on different servers, this is not typically the case.  The connection type, on the other hand, makes
+                    no assumptions, and permits the most general configuration.</p>
+<p>The "User Service Server" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Meridio Authority, User Service Server tab" src="images/meridio-authority-user-service-server.PNG" width="80%"></div>
+<br>
+<br>
+<p>You will require knowledge of where the special Meridio Authority extensions have been installed in order to fill out this tab.</p>
+<p>Select the correct protocol, and enter the correct server name, port, and location to reference the Meridio user service server services.  If a proxy is involved, enter the proxy host
+                    and port.  Authenticated proxies are not supported by this connection type at this time.</p>
+<p>Note that, in the Meridio system, while it is possible that different services run on different servers, this is not typically the case.  The connection type, on the other hand, makes
+                    no assumptions, and permits the most general configuration.</p>
+<p>The "Credentials" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Meridio Authority, Credentials tab" src="images/meridio-authority-credentials.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the Meridio server credentials needed to access the Meridio system.</p>
+<p>When you are done, click the "Save" button.  You will then see a screen looking something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="Meridio Authority Status" src="images/meridio-authority-status.PNG" width="80%"></div>
+<br>
+<br>
+<p>In this example, logon has not succeeded because the server on which the Meridio Authority is running is unknown to the Windows domain under which Meridio is running.
+                    This results in an error message, instead of the "Connection working" message that you would see if the authority were working properly.</p>
+<p>Since Meridio uses Windows IIS for authentication, there are many ways in which the configuration of either IIS or the Windows domain under which Meridio runs can affect
+                    the correct functioning of the Meridio Authority.  It is beyond the scope of this manual to describe the kinds of analysis and debugging techniques that might be required to diagnose connection
+                    and authentication problems.  If you have trouble, you will almost certainly need to involve your Meridio IT personnel.  Debugging tools may include (but are not limited to):</p>
+<br>
+<ul>
+                    
+<li>Windows security event logs</li>
+                    
+<li>ManifoldCF logs (see below)</li>
+                    
+<li>Packet captures (using a tool such as Wireshark)</li>
+                
+</ul>
+<br>
+<p>If you need specific ManifoldCF logging information, contact your system integrator.</p>
+</div>
+        
+        
+<a name="N105BE"></a><a name="repositoryconnectiontypes"></a>
+<h2 class="h3">Repository Connection Types</h2>
+<div class="section">
+<a name="N105C4"></a><a name="filesystemrepository"></a>
+<h3 class="h4">Generic File System Repository Connection</h3>
+<p>The generic file system repository connection type was developed primarily as an example, demonstration, and testing tool, although it can potentially be useful for indexing local
+                       files that exist on the same machine that ManifoldCF is running on.  Bear in mind that there is no support in this connection type for any kind of
+                       security, and the options are somewhat limited.</p>
+<p>The file system repository connection type provides no configuration tabs beyond the standard ones.  However, please consider setting a "Maximum connections per
+                       JVM" value on the "Throttling" tab to at least one per worker thread, or 30, for best performance.</p>
+<p>Jobs created using a file-system-type repository connection
+                       have two tabs in addition to the standard repertoire: the "Hop Filters" tab, and the "Paths" tab.</p>
+<p>The "Hop Filters" tab allows you to restrict the document set by the number of child hops from the path root.  While this is not terribly interesting in the case of a file
+                       system, the same basic functionality is also used in the Web connection type, where it is a more important feature.  The file system connection type gives you a way to see
+                       how this feature works, in a more predictable environment:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="File System Connection, Hop Filters tab" src="images/filesystem-job-hopcount.PNG" width="80%"></div>
+<br>
+<br>
+<p>In the case of the file system connection type, there is only one variety of relationship between documents, which is called a "child" relationship.  If you want to
+                       restrict the document set by how far away a document is from the path root, enter the maximum allowed number of hops in the text box.  Leaving the box blank
+                       indicates that no such filtering will take place.</p>
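+<p>The effect of a maximum hop count can be pictured with the following Python sketch.  It is not the Framework's implementation: the function <span class="codefrag">within_hops</span>, the dictionary of "child" relationships, and the choice to count the path root as zero hops are all assumptions made for illustration.</p>

```python
from collections import deque


def within_hops(children, root, max_hops=None):
    """Return {document: hops} for documents reachable from the root.

    max_hops=None models a blank text box, meaning no filtering.
    """
    hops = {root: 0}
    queue = deque([root])
    while queue:
        doc = queue.popleft()
        if max_hops is not None and hops[doc] >= max_hops:
            continue  # children of this document would be too far away
        for child in children.get(doc, []):
            if child not in hops:
                hops[child] = hops[doc] + 1
                queue.append(child)
    return hops


# Hypothetical directory tree, expressed as parent -> children.
tree = {
    "/data": ["/data/a", "/data/b.txt"],
    "/data/a": ["/data/a/c.txt"],
}
print(sorted(within_hops(tree, "/data", max_hops=1)))
```

+<p>With a limit of one hop, the nested file in this hypothetical tree is left out; with the limit left unset, every document is kept.</p>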
+<p>On this same tab, you can tell the Framework what to do should there be changes in the distance from the root to a document.  The choice "Delete unreachable
+                       documents" requires the Framework to recalculate the distance to every potentially affected document whenever a change takes place.  This may require
+                       expensive bookkeeping, however, so you also have the option of ignoring such changes.  There are two varieties of this latter option: you can ignore the changes
+                       for now, with the option of turning the aggressive bookkeeping back on at a later time, or you can decide never to allow changes to propagate, in which case
+                       the Framework will discard the necessary bookkeeping information permanently.</p>
+<p>The "Paths" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="File System Connection, Paths tab" src="images/filesystem-job-paths.PNG" width="80%"></div>
+<br>
+<br>
+<p>This tab allows you to type in a set of paths which function as the roots of the crawl.  For each desired path, type in the path and click the "Add" button to add it to
+                       the list.  The form of the path you type in obviously needs to be meaningful for the operating system the Framework is running on.</p>
+<p>Each root path has a set of rules which determines whether a document is included or not in the set for the job.  Once you have added the root path to the list, you
+                       may then add rules to it.  Each rule has a match expression, an indication of whether the rule is intended to match files or directories, and an action (include or exclude).
+                       Rules are evaluated from top to bottom, and the first rule that matches the file name is the one that is chosen.  To add a rule, select the desired pulldowns, type in 
+                       a match file specification (e.g. "*.txt"), and click the "Add" button.</p>
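+<p>The top-to-bottom, first-match evaluation of rules can be sketched in Python as follows.  The names here are hypothetical, and two assumptions are made that the text above does not state: that shell-style wildcard matching (Python's <span class="codefrag">fnmatch</span>) approximates the match file specification, and that a name matching no rule is excluded.</p>

```python
from fnmatch import fnmatchcase


def is_included(rules, name, is_directory):
    """Return True if the first matching rule says "include".

    Each rule is a (pattern, kind, action) tuple, where kind is "file" or
    "directory" and action is "include" or "exclude".
    """
    kind = "directory" if is_directory else "file"
    for pattern, rule_kind, action in rules:
        if rule_kind == kind and fnmatchcase(name, pattern):
            return action == "include"
    return False  # assumption: nothing matched, so the document is left out


rules = [
    ("*.tmp", "file", "exclude"),
    ("*.txt", "file", "include"),
    ("*", "directory", "include"),
]
print(is_included(rules, "notes.txt", False))
print(is_included(rules, "notes.tmp", False))
```

+<p>Because the first matching rule wins, a broad rule placed above a narrower one hides the narrower one, so the order of the list matters.</p>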
+<a name="N105FC"></a><a name="rssrepository"></a>
+<h3 class="h4">Generic RSS Repository Connection</h3>
+<p>The RSS connection type is specifically designed to crawl RSS feeds.  While the Web connection type can also extract links from RSS feeds, the RSS connection type
+                       differs in the following ways:</p>
+<br>
+<ul>
+                    
+<li>Links are <b>only</b> extracted from feeds</li>
+                    
+<li>Feeds themselves are not indexed</li>
+                    
+<li>There is fine-grained control over how often feeds are refetched, and they are treated distinctly from documents in this regard</li>
+                    
+<li>The RSS connection type knows how to carry certain data down from the feeds to individual documents, as metadata</li>
+                
+</ul>
+<br>
+<p>Many users of the RSS connection type set up their jobs to run continuously, configuring their jobs to never refetch documents, but rather to expire them after about 30 days.
+                       This model works reasonably well for news, which is what RSS is often used for.</p>
+<p>An RSS connection has the following special tabs: "Email", "Robots", "Bandwidth", and "Proxy".  The "Email" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Email tab" src="images/rss-configure-email.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter an email address.  This email address will be included in all requests made by the RSS connection, so that webmasters can report any difficulties that their
+                       sites experience as the result of improper throttling, etc.</p>
+<p>This field is mandatory.  While an RSS connection makes no effort to validate the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+<p>The "Robots" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Robots tab" src="images/rss-configure-robots.PNG" width="80%"></div>
+<br>
+<br>
+<p>Select how the connection will interpret robots.txt.  Remember that you have an interest in crawling people's sites as politely as is possible.</p>
+<p>The "Bandwidth" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Bandwidth tab" src="images/rss-configure-bandwidth.PNG" width="80%"></div>
+<br>
+<br>
+<p>This tab allows you to control the <b>maximum</b> rate at which the connection fetches data, on a per-server basis, as well as the <b>maximum</b> fetches per minute,
+                       also per-server.  Finally, the maximum number of socket connections made per server at any one time is also controllable by this tab.</p>
+<p>The screen shot displays parameters that are
+                       considered reasonably polite.  The default values for this table are all blank, meaning that, by default, there is no throttling whatsoever!  Please do not make the mistake
+                       of crawling other people's sites without adequate politeness parameters in place.</p>
+<p>The "Throttle group" parameter allows you to treat multiple RSS-type connections together, for the purposes of throttling.  All RSS-type connections that have the same
+                       throttle group name will use the same pool for throttling purposes.</p>
+<p>The "Bandwidth" tab is related to the throttles that you can set on the "Throttling" tab in the following ways:</p>
+<br>
+<ul>
+                    
+<li>The "Bandwidth" tab sets the <b>maximum</b> values, while the "Throttling" tab sets the <b>average</b> values.</li>
+                    
+<li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it simply blocks documents until it is safe to go ahead, which will use up a crawler thread
+                           for the entire period that both the wait and the fetch take place.  The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                
+</ul>
+<br>
+<p>Because of the above, we suggest that you configure your RSS connection using <b>both</b> the "Bandwidth" <b>and</b> the "Throttling" tabs.  Select maximum
+                       values on the "Bandwidth" tab, and corresponding average value estimates on the "Throttling" tab.  Remember that a document identifier for an RSS connection is the
+                       document's URL, and the bin name for that URL is the server name.  Also, please note that the "Maximum number of connections per JVM" field's default value of 10 is
+                       unlikely to be correct for connections of the RSS type; you should have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter to at least 30 for normal operation.</p>
+<p>The "Proxy" tab allows you to specify a proxy that you want to crawl through.  The RSS connection type supports proxies that are secured with all forms of the NTLM
+                       authentication method.  This is quite typical of large organizations.  The tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Connection, Proxy tab" src="images/rss-configure-proxy.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the proxy server you will be proxying through in the "Proxy host" field.  Enter the proxy port in the "Proxy port" field.  If your proxy requires authentication, enter the
+                       domain, username, and password in the corresponding fields.  Leave all fields blank if you want to use no proxy whatsoever.</p>
+<p>When you save your RSS connection, you should see a status screen that looks something like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS Status" src="images/rss-status.PNG" width="80%"></div>
+<br>
+<br>
+<p></p>
+<p>Jobs created using connections of the RSS type have the following additional tabs: "URLs", "Canonicalization", "Mappings", "Time Values", "Security", "Metadata", and
+                       "Dechromed Content".  The URLs tab is where you describe the feeds that are part of the job.  It looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, URLs tab" src="images/rss-job-urls.PNG" width="80%"></div>
+<br>
+<br>
+<p>Enter the list of feed URLs you want to crawl, separated by newlines.  You may also include comments by starting a line with a "#" character.</p>
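+<p>For readers who maintain the feed list in a file before pasting it in, a Python sketch of how such a list might be read is given here.  The helper name is hypothetical, and ignoring blank lines and surrounding whitespace is an assumption rather than documented behavior.</p>

```python
def parse_feed_urls(text):
    """Hypothetical helper: return feed URLs, skipping comments and blanks."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


feeds = parse_feed_urls(
    "# news feeds\n"
    "http://example.com/news.rss\n"
    "\n"
    "http://example.org/updates.xml\n"
)
print(feeds)
```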
+<p>The "Canonicalization" tab controls how the job handles URL canonicalization.  Canonicalization refers to the fact that many different URLs may all refer to the
+                       same actual resource.  For example, arguments in URLs can often be reordered, so that <span class="codefrag">a=1&amp;b=2</span> is in fact the same as
+                       <span class="codefrag">b=2&amp;a=1</span>.  Other canonical operations include removal of session cookies, which some dynamic web sites include in the URL.</p>
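+<p>Argument reordering can be sketched in Python as follows.  This is an illustration rather than the connector's code: the helper name is hypothetical, and sorting the arguments alphabetically is an assumed choice of canonical order.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def reorder_arguments(url):
    """Hypothetical helper: sort the URL's arguments into a stable order."""
    parts = urlsplit(url)
    args = sorted(arg for arg in parts.query.split("&") if arg)
    return urlunsplit(parts._replace(query="&".join(args)))


first = reorder_arguments("http://example.com/page?a=1&b=2")
second = reorder_arguments("http://example.com/page?b=2&a=1")
print(first == second)  # True
```

+<p>Once both forms reduce to the same string, the crawler can recognize them as a single document instead of fetching and indexing it twice.</p>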
+<p>The "Canonicalization" tab looks like this:</p>
+<br>
+<br>
+<div id="" style="text-align: center;">
+<img id="" class="figure" alt="RSS job, Canonicalization tab" src="images/rss-job-canonicalization.PNG" width="80%"></div>
+<br>
+<br>
+<p>The tab displays a list of canonicalization rules.  Each rule consists of a regular expression (which is matched against a document's URL), and some switch selections.
+                       The switch selections allow you to specify whether arguments are reordered, or whether certain specific kinds of session cookies are removed.  The kinds of
+                       session cookies that are recognized and can be removed are: JSP (Java application servers), ASP (.NET), PHP, and Broadvision (BV).</p>
+<p>If a URL matches more than one rule, the first matching rule is the one selected.</p>
+<p>To add a rule, enter an appropriate regular expression, and make your checkbox selections, then click the "Add" button.</p>
+<p>The "Mappings" tab permits you to change the URL under which documents that are fetched will get indexed.  This is sometimes useful in an intranet setting because

[... 1100 lines stripped ...]

