cocoon-docs mailing list archives

From da...@cocoon.zones.apache.org
Subject [DAISY] Updated: LuceneIndexTransformer
Date Fri, 13 Jul 2007 22:20:54 GMT
A document has been updated:

http://cocoon.zones.apache.org/daisy/documentation/1104.html

Document ID: 1104
Branch: main
Language: default
Name: LuceneIndexTransformer (unchanged)
Document Type: Sitemap Component (unchanged)
Updated on: 7/13/07 10:20:04 PM
Updated by: Grzegorz Kossakowski

A new version has been created, state: publish

Parts
=====

Long description
----------------
This part has been added.
Mime type: text/xml
File name: null
Size: 10767 bytes
Content:
<html>
<body>

<h4 id="head-0b39056584778d584af2f2cdd81c6998caa13ba5">LuceneIndexTransformer is
a component that creates or updates Lucene indexes.</h4>

<p>This component only writes the index: to search the index, use the
SearchGenerator component.</p>

<h3 id="head-9b35088110dfcf121e63a9a2b67ec652d667a784">Why use it?</h3>

<p>Instead of using LuceneIndexTransformer, you could generate an index by
crawling your website. However, the LuceneIndexTransformer is <em>much,
much</em> faster than crawling.</p>

<p>The big differences for the developer are:</p>

<ul>
<li>
<p>Using the LuceneIndexTransformer requires you to write a pipeline that can
generate a <tt>lucene:index</tt> document describing your searchable URI space,
so it's necessary to have a well-defined URI space. For a site with a consistent
structure this should not be too hard. This pipeline can use aggregation and
inclusion mechanisms to produce a full list of the pages you want to search. In
this way it's also possible to generate an index for websites with forms which
are not crawlable.</p>
</li>
<li>
<p>On the other hand the crawler is a more generic solution, though far less
efficient. It doesn't require a pipeline to "document" the entire searchable URI
space. Instead, you must create a <tt>content</tt> view and a <tt>links</tt>
view for each of the searchable pipelines. The URI space is then defined by
crawling the <tt>links</tt> view.</p>
</li>
</ul>

<h3 id="head-953c351734de75a525b9777e976c0812a5618736">Declaring the
LuceneIndexTransformer</h3>

<p>The transformer must be declared in the <tt>&lt;transformers&gt;</tt>
section of your sitemap:</p>

<pre>&lt;map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0"&gt;

   &lt;map:components&gt;
      ...
      &lt;map:transformers default="xslt"&gt;
         &lt;map:transformer name="index" 
            logger="sitemap.transformer.luceneindextransformer" 
            src="org.apache.cocoon.transformation.LuceneIndexTransformer"/&gt;
      &lt;/map:transformers&gt;
      ...
   &lt;/map:components&gt;
   ...
&lt;/map:sitemap&gt;
</pre>
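
<p>Once declared, the transformer is used like any other transformer in a
pipeline. The following is a minimal sketch rather than a fragment from a real
application: the <tt>create-index</tt> pattern and the
<tt>index-source.xml</tt> file are hypothetical names, and the generator is
assumed to produce a <tt>lucene:index</tt> document of the kind described in
the next section.</p>

<pre>&lt;map:pipeline&gt;
   &lt;!-- "create-index" is a hypothetical URI pattern --&gt;
   &lt;map:match pattern="create-index"&gt;
      &lt;!-- hypothetical source that yields a lucene:index document --&gt;
      &lt;map:generate src="index-source.xml"/&gt;
      &lt;!-- type="index" is the name given in the declaration above --&gt;
      &lt;map:transform type="index"/&gt;
      &lt;!-- serialize the transformer's output as plain XML --&gt;
      &lt;map:serialize type="xml"/&gt;
   &lt;/map:match&gt;
&lt;/map:pipeline&gt;
</pre>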

<h3 id="head-cea5eb78d3cf27bf4fdf96d1049365b4fa984307">Input document for the
LuceneIndexTransformer</h3>

<p>This is a sample of the kind of document that the transformer expects. Note
that this example uses a couple of simple XHTML documents as the content to be
indexed, only because everyone knows XHTML. In practice you should typically
generate the index from an early stage in the pipeline, indexing DocBook, TEI,
etc., rather than a presentation format like HTML.</p>

<pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
   directory="index" 
   create="false" 
   merge-factor="20"&gt;

   &lt;lucene:document url="http://localhost/sample.html"&gt;
      &lt;!-- here is some sample content --&gt;
      &lt;html&gt;
         &lt;head&gt;
            &lt;title lucene:store="true"&gt;Sample&lt;/title&gt;
         &lt;/head&gt;
         &lt;body&gt;
            &lt;h1&gt;Blah&lt;/h1&gt;
            &lt;a href="blah.jpg" title="download blah image"
               lucene:text-attr="title"&gt;
               &lt;img src="blah-small.jpg" alt="Blah"
                  lucene:text-attr="alt"/&gt;
            &lt;/a&gt;
         &lt;/body&gt;
      &lt;/html&gt;
   &lt;/lucene:document&gt;

   &lt;lucene:document url="http://localhost/sample-2.html"&gt;
      &lt;!-- Another sample doc --&gt;
      &lt;html&gt;
         &lt;head&gt;
            &lt;title lucene:store="true"&gt;Second Sample&lt;/title&gt;
         &lt;/head&gt;
         &lt;body&gt;
            &lt;h1&gt;Foo&lt;/h1&gt;
            &lt;p&gt;Lorem ipsum dolor sit amet, 
            consectetuer adipiscing elit. &lt;/p&gt;
         &lt;/body&gt;
      &lt;/html&gt;
   &lt;/lucene:document&gt;

&lt;/lucene:index&gt;
</pre>
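
<p>The content of each <tt>lucene:document</tt> does not have to be written
inline. As a sketch of the inclusion approach mentioned earlier, the document
below pulls in its content with the CInclude transformer. It assumes that a
CInclude transformer step runs before the index transformer in the pipeline,
and that the two <tt>cocoon:/content/...</tt> pipelines (hypothetical names)
return the XML to be indexed. The attributes of <tt>lucene:index</tt> are
omitted here, so the defaults apply.</p>

<pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   xmlns:cinclude="http://apache.org/cocoon/include/1.0"&gt;

   &lt;lucene:document url="http://localhost/sample.html"&gt;
      &lt;!-- hypothetical internal pipeline returning the page content --&gt;
      &lt;cinclude:include src="cocoon:/content/sample"/&gt;
   &lt;/lucene:document&gt;

   &lt;lucene:document url="http://localhost/sample-2.html"&gt;
      &lt;cinclude:include src="cocoon:/content/sample-2"/&gt;
   &lt;/lucene:document&gt;

&lt;/lucene:index&gt;
</pre>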

<h3 id="head-97d27647f366081a18adc8469538e908e6354ed4">What the lucene:index
document means</h3>

<h4 id="head-9e412039c4f6090a2aaac081c56f522ac97b8985">The lucene:index element
</h4>

<p>The root element is <tt>lucene:index</tt>. The attributes of the
<tt>lucene:index</tt> element in the sample above are shown with their default
values, so the effect is the same as if they were not specified at all.</p>

<h4 id="head-40afef17a5a56ab2e729d18163f1bc960a8ce2cc">The merge-factor and
analyzer attributes</h4>

<p>See
<a href="http://jakarta.apache.org/lucene/docs/index.html">the Lucene
documentation</a> for explanations of what these attributes mean.</p>

<h4 id="head-84967edae247fc0739e57bc3af497f832b880582">The optimize-frequency
attribute (since version 2.2)</h4>

<p>Determines how often the Lucene index will be optimized. When you have
thousands of documents, optimizing the index can become quite slow (e.g. 7
seconds for 9000 small documents on a P4).</p>

<ul>
<li>
<p>1: always optimize (default)</p>
</li>
<li>
<p>0: never optimize</p>
</li>
<li>
<p>x: optimize roughly once every x updates. You can use any number; a random
number generator decides on each update whether to optimize.</p>
</li>
</ul>

<p>For example, you can create one pipeline without optimization, which is used
to index your document every time it is modified, and another pipeline which
optimizes the index and is called manually. For more information, see "What is
index optimization and when should I use it?" in the Lucene FAQ:</p>

<p>
<a href="http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8">http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8</a>
</p>
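
<p>The two-pipeline setup described above could be sketched as two separate
input documents. This is only an illustration using the values listed earlier:
the comments stand in for real <tt>lucene:document</tt> elements, and all other
attributes are left at their defaults.</p>

<pre>&lt;!-- Input for the pipeline that runs on every modification: never optimize --&gt;
&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   optimize-frequency="0"&gt;
   &lt;!-- lucene:document elements for the modified pages --&gt;
&lt;/lucene:index&gt;

&lt;!-- Input for the manually called pipeline: always optimize --&gt;
&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   optimize-frequency="1"&gt;
   &lt;!-- lucene:document elements, as above --&gt;
&lt;/lucene:index&gt;
</pre>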

<h4 id="head-51123b488fc39c0a36b69c0e24608052fd45a86d">The directory attribute
</h4>

<p>This attribute controls where the index files are stored. The path is
relative to the Cocoon <tt>work</tt> directory.</p>

<h4 id="head-9b03e7cb891515af05d6a3bde919087262b146aa">The create attribute</h4>

<p>This attribute controls whether the index is recreated.</p>

<ul>
<li>
<p>If <tt>create = "false"</tt> and the index already exists, then the index
will be updated. Documents which are already indexed will be removed from the
index and reinserted.</p>
</li>
<li>
<p>If the index does not exist then it will be created even if
<tt>create = "false"</tt>.</p>
</li>
<li>
<p>If <tt>create = "true"</tt> then any existing index will be destroyed and a
new index created. If you are rebuilding your entire index, you should use
<tt>create = "true"</tt>: the indexer then doesn't need to remove old
documents from the index, so it will be faster.</p>
</li>
</ul>
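
<p>For a full rebuild, the root element of the input document might look like
the following sketch. Only <tt>create</tt> differs from the earlier sample; the
<tt>directory</tt> value is the one used there, and the comment stands in for
the complete list of documents.</p>

<pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   directory="index"
   create="true"&gt;
   &lt;!-- one lucene:document element for every page of the site --&gt;
&lt;/lucene:index&gt;
</pre>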

<h4 id="head-9585e2ebba0108dc71917a21a4d9ed1edca00732">The lucene:document
element</h4>

<p>Lucene will index the content of each <tt>lucene:document</tt>, which may
contain any XML content. The indexed content is associated with the URL
specified by the <tt>url</tt> attribute, and this URL is what a search returns
as its result.</p>

<h4 id="head-5f2ae3b3aceb65a1fd0cb0942a0385fa7c4a4e2e">The lucene:text-attr
attribute</h4>

<p>Normally Lucene will only index the text content of elements, not attribute
values. To index the attributes of an element as well, give it an attribute
called <tt>lucene:text-attr</tt>, containing a list of the names of the
attributes you want indexed. For example, to index the value of the <tt>alt</tt>
attribute of an <tt>img</tt> element in HTML:</p>

<pre>&lt;img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/&gt;
</pre>

<p>This would index the text "Blah".</p>

<h4 id="head-b85bebbca6ee9807e0a7165b5208677c1616aca7">The lucene:store
attribute</h4>

<p>Normally Lucene will only index the text of an element, not store it. To
store the text of an element in Lucene's index, add a
<tt>lucene:store="true"</tt> attribute to the element. It's a good idea to store
the title of a document in Lucene, so that your search results can show a
document title as well as a URL.</p>

<h3 id="head-c55afa96d19d0ca7161da59bedf6409cbbfd78c2">The transformation</h3>

<p>The transformer copies the source document to the output, except for the
content of the <tt>lucene:document</tt> elements.</p>

<p>The transformer also adds an <tt>elapsed-time</tt> attribute to the output
<tt>lucene:document</tt> elements, showing the time (in milliseconds) taken to
index that document. You can use XSLT to transform the results into a report on
the indexing operation.</p>

<h4 id="head-c9a731f4df69c482e3c1d40fcc39e94b3fb16307">Sample output</h4>

<pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
        merge-factor="20" 
        create="false" 
        directory="index" 
        analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"&gt;
        &lt;lucene:document url="JCB-001/full.html" elapsed-time="3846"/&gt;
        &lt;lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/&gt;
        &lt;lucene:document url="JCB-002/full.html" elapsed-time="361"/&gt;
        &lt;lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/&gt;
        &lt;lucene:document url="JCB-003/full.html" elapsed-time="300"/&gt;
        &lt;lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/&gt;
&lt;/lucene:index&gt;
</pre>
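
<p>The report mentioned above could be produced by a stylesheet along these
lines. This is a minimal XSLT 1.0 sketch, not part of Cocoon: it assumes output
shaped like the sample, and the HTML layout is an arbitrary choice. It lists
each document with its indexing time and adds the document count and total
time.</p>

<pre>&lt;xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   exclude-result-prefixes="lucene"&gt;

   &lt;xsl:template match="/lucene:index"&gt;
      &lt;html&gt;
         &lt;head&gt;
            &lt;title&gt;Indexing report&lt;/title&gt;
         &lt;/head&gt;
         &lt;body&gt;
            &lt;!-- summary: number of documents and total time --&gt;
            &lt;p&gt;Documents indexed:
               &lt;xsl:value-of select="count(lucene:document)"/&gt;&lt;/p&gt;
            &lt;p&gt;Total time (ms):
               &lt;xsl:value-of select="sum(lucene:document/@elapsed-time)"/&gt;&lt;/p&gt;
            &lt;!-- one table row per indexed document --&gt;
            &lt;table&gt;
               &lt;tr&gt;
                  &lt;th&gt;URL&lt;/th&gt;
                  &lt;th&gt;Elapsed time (ms)&lt;/th&gt;
               &lt;/tr&gt;
               &lt;xsl:for-each select="lucene:document"&gt;
                  &lt;tr&gt;
                     &lt;td&gt;&lt;xsl:value-of select="@url"/&gt;&lt;/td&gt;
                     &lt;td&gt;&lt;xsl:value-of select="@elapsed-time"/&gt;&lt;/td&gt;
                  &lt;/tr&gt;
               &lt;/xsl:for-each&gt;
            &lt;/table&gt;
         &lt;/body&gt;
      &lt;/html&gt;
   &lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;
</pre>

<p>In a sitemap, such a stylesheet would typically be applied with an XSLT
transformer step placed after the index transformer.</p>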

<h5 id="head-24e83f0c8063ca175d6e8a1a80e51e1ed9fbc20b">Note to users of Mac OS X
</h5>

<p>By default, Java cannot open more than 256 files at a time on Mac OS X, so
you may get an error like the following:</p>

<pre>Description: org.apache.cocoon.ProcessingException: 
Failed to execute pipeline.: java.lang.RuntimeException: 
java.io.FileNotFoundException:  
/usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
(Too many open files)
</pre>

<p>To avoid this error, you should set your ulimit in the shell script that
starts Tomcat. My line reads as follows:</p>

<pre>ulimit -S -n 1000
</pre>

<p>Read more about this here:
<a href="http://www.amug.org/%7Eglguerin/howto/More-open-files.html">http://www.amug.org/~glguerin/howto/More-open-files.html</a></p>

<h5 id="head-f9fcf2cc3f693a586067cd49d3cbe85a6297d60e">Note to users of Redhat
Linux</h5>

<p>If you get an "Empty StackException" error while creating the index with the
LuceneIndexTransformer, try lowering your merge-factor (Lucene's own default is
10). See the
<a href="http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor">Lucene
documentation</a> for more information.</p>

</body>
</html>

