incubator-connectors-commits mailing list archives

From kwri...@apache.org
Subject svn commit: r938188 [2/2] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/
Date Mon, 26 Apr 2010 18:48:31 GMT
Modified: incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml?rev=938188&r1=938187&r2=938188&view=diff
==============================================================================
--- incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml (original)
+++ incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml Mon Apr 26 18:48:30 2010
@@ -845,7 +845,7 @@
                        and cutoff value.  A blank value means no cutoff value at all.</p>
                 <p>For example, if you specified a maximum "link" hop count of 5, and left the "redirect" hop count blank, then any document that requires more than five links to reach from a seed
                        will be considered out-of-set.  If you specified both a maximum "link" hop count of 5, and a maximum "redirect" hop count 2, then any document that requires more than five links to
-                       reach from seed, <b>and</b> more than two redirections, will be considered out-of-set.</p>
+                       reach from a seed, <b>and</b> more than two redirections, will be considered out-of-set.</p>
                 <p>The "Hop Filters" tab looks like this:</p>
                 <br/><br/>
                 <figure src="images/web-job-hop-filters.PNG" alt="Web Job, Hop Filters tab" width="80%"/>
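
The hop-count rule in the paragraph above can be sketched as follows. This is a minimal illustration that reads the paragraph literally, not the Framework's code: the function name and parameters are hypothetical, a blank maximum is modeled as None, and a document is treated as out-of-set only when it exceeds every maximum that was specified (the emphasized "and").

```python
from typing import Optional


def is_out_of_set(link_hops: int, redirect_hops: int,
                  max_link: Optional[int], max_redirect: Optional[int]) -> bool:
    """Hypothetical helper: apply the hop-count cutoffs as the paragraph words them."""
    exceeded = []
    if max_link is not None:          # blank value means no cutoff for "link"
        exceeded.append(link_hops > max_link)
    if max_redirect is not None:      # blank value means no cutoff for "redirect"
        exceeded.append(redirect_hops > max_redirect)
    # With no maximums at all, nothing is ever cut off.
    return bool(exceeded) and all(exceeded)
```

With a "link" maximum of 5 and a blank "redirect" maximum, a document six links away is out-of-set; with both maximums set (5 and 2), the same document stays in the set unless it also needs more than two redirections.
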
@@ -855,9 +855,45 @@
                        expensive bookkeeping, however, so you also have the option of  ignoring such changes.  There are two varieties of this latter option - you can ignore the changes
                        for now, with the option of turning back on the aggressive bookkeeping at a later time, or you can decide not to ever allow changes to propagate, in which case
                        the Framework will discard the necessary bookkeeping information permanently.  This last option is the most efficient.</p>
-                       
-                <p>More here later</p>
-
+                <p>The "Seeds" tab is where you enter the starting points for your crawl.  It looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-seeds.PNG" alt="Web Job, Seeds tab" width="80%"/>
+                <br/><br/>
+                <p>Enter a list of seeds, separated by newline characters.  Blank lines, or lines that begin with a "#" character, will be ignored.</p>
+                <p>The "Canonicalization" tab controls how a web job converts URLs into a standard form.  It looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-canonicalization.PNG" alt="Web Job, Canonicalization tab" width="80%"/>
+                <br/><br/>
+                <p>The tab displays a list of canonicalization rules.  Each rule consists of a regular expression (which is matched against a document's URL), and some switch selections.
+                       The switch selections allow you to specify whether arguments are reordered, or whether certain specific kinds of session cookies are removed.  Specific kinds of
+                       session cookies that are recognized and can be removed are: JSP (Java application servers), ASP (.NET), PHP, and Broadvision (BV).</p>
+                <p>If a URL matches more than one rule, the first matching rule is the one selected.</p>
+                <p>To add a rule, enter an appropriate regular expression, and make your checkbox selections, then click the "Add" button.</p>
+                <p>The "Inclusions" tab lets you specify, by means of a set of regular expressions, exactly what URLs will be included as part of the document set for a web job.  The tab
+                       looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-inclusions.PNG" alt="Web Job, Inclusions tab" width="80%"/>
+                <br/><br/>
+                <p>You will need to provide a series of zero or more regular expressions, separated by newlines.</p>
+                <p>Remember that, by default, a web job includes <b>all</b> documents in the world that are linked to your seeds in any way that the web connection type can determine.</p>
+                <p>If you wish to restrict which documents are actually processed within your overall set of included documents, you may want to supply some regular expressions on the
+                       "Exclusions" tab, which looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-exclusions.PNG" alt="Web Job, Exclusions tab" width="80%"/>
+                <br/><br/>
+                <p>Once again you will need to provide a series of zero or more regular expressions, separated by newlines.  It is typical to use the "Exclusions" tab to remove documents from
+                       consideration which are suspected to contain content that both has no extractable links, and is not useful to the index you are trying to build, e.g. movie files.</p>
+                <p>The "Security" tab allows you to specify the access tokens that the documents in the web job get indexed with, and looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-security.PNG" alt="Web Job, Security tab" width="80%"/>
+                <br/><br/>
+                <p>You will need to know the format of the access tokens for the
+                       governing authority before you can add security to your documents in this way.  Enter the access token you desire and click the "Add" button.</p>
+                <p>The "Metadata" tab allows you to include specified metadata along with all documents belonging to a web job.  It looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-job-metadata.PNG" alt="Web Job, Metadata tab" width="80%"/>
+                <br/><br/>
+                <p>Enter the name of the desired metadata on the left, and the desired value (if any) on the right, and click the "Add" button.</p>
             </section>
 
             <section id="jcifsrepository">
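
The "Seeds", "Inclusions" and "Exclusions" paragraphs added in the diff above describe newline-separated input and regular-expression filtering. The sketch below shows one way that behavior could look. It is an illustration under assumptions, not the connector's code: the function names are hypothetical, lines are trimmed before the "#" check, a regular expression is taken to match anywhere in the URL, and an empty inclusion list is taken to include nothing.

```python
import re
from typing import List


def parse_lines(text: str) -> List[str]:
    """Split newline-separated input, ignoring blank lines and lines that begin with '#'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()           # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue
        entries.append(line)
    return entries


def in_document_set(url: str, inclusions: List[str], exclusions: List[str]) -> bool:
    """A URL is kept when some inclusion expression matches and no exclusion expression does."""
    included = any(re.search(pattern, url) for pattern in inclusions)
    excluded = any(re.search(pattern, url) for pattern in exclusions)
    return included and not excluded
```

For example, an inclusion list containing only `.*` together with an exclusion list containing `\.mov$` keeps every discovered URL except those ending in ".mov", which is the movie-file case the "Exclusions" paragraph mentions.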

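The "Canonicalization" paragraphs in the diff above say that each rule pairs a regular expression with switch selections and that the first matching rule wins. A minimal sketch of that selection logic follows. The `Rule` class, its field names, and the use of the `PHPSESSID` query argument to stand in for PHP session removal are assumptions made for illustration; the diff does not say how the connector recognizes or strips session cookies.

```python
import re
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass
class Rule:
    """Hypothetical rule: a URL regular expression plus two of the switches."""
    pattern: str
    reorder_args: bool = False
    remove_php_session: bool = False


def select_rule(url: str, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose expression matches the URL, or None."""
    for rule in rules:
        if re.search(rule.pattern, url):
            return rule
    return None


def canonicalize(url: str, rules: List[Rule]) -> str:
    """Apply the switches of the first matching rule; leave the URL alone if none matches."""
    rule = select_rule(url, rules)
    if rule is None:
        return url
    parts = urlsplit(url)
    args = parse_qsl(parts.query, keep_blank_values=True)
    if rule.remove_php_session:
        # Assumption: the PHP session id travels as a PHPSESSID query argument.
        args = [(name, value) for name, value in args if name != "PHPSESSID"]
    if rule.reorder_args:
        args = sorted(args)           # one possible standard order: sort by name, then value
    return urlunsplit(parts._replace(query=urlencode(args)))
```

Because rules are tried in list order, a specific expression placed ahead of a catch-all such as `.*` takes precedence for the URLs it matches, and the catch-all handles everything else.
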
Added: incubator/lcf/site/src/documentation/resources/images/web-job-canonicalization.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-canonicalization.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-canonicalization.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-job-exclusions.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-exclusions.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-exclusions.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-job-inclusions.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-inclusions.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-inclusions.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-job-metadata.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-metadata.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-metadata.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-job-security.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-security.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-security.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-job-seeds.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-job-seeds.PNG?rev=938188&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-job-seeds.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream


