jena-commits mailing list archives

From build...@apache.org
Subject svn commit: r1031933 - in /websites/staging/jena/trunk/content: ./ documentation/query/text-query.html
Date Sat, 30 Jun 2018 16:11:05 GMT
Author: buildbot
Date: Sat Jun 30 16:11:05 2018
New Revision: 1031933

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/query/text-query.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Jun 30 16:11:05 2018
@@ -1 +1 @@
-1834697
+1834748

Modified: websites/staging/jena/trunk/content/documentation/query/text-query.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/query/text-query.html (original)
+++ websites/staging/jena/trunk/content/documentation/query/text-query.html Sat Jun 30 16:11:05
2018
@@ -249,8 +249,21 @@ illustrates creating an in-memory datase
 <li><a href="#configuring-an-analyzer">Configuring an analyzer</a></li>
 <li><a href="#configuration-by-code">Configuration by Code</a></li>
 <li><a href="#graph-specific-indexing">Graph-specific Indexing</a></li>
-<li><a href="#linguistic-support-with-lucene-index">Linguistic Support with Lucene
Index</a></li>
-<li><a href="#generic-and-defined-analyzer-support">Generic and Defined Analyzer
Support</a></li>
+<li><a href="#linguistic-support-with-lucene-index">Linguistic Support with Lucene
Index</a><ul>
+<li><a href="#explicit-language-field-in-the-index">Explicit Language Field in
the Index</a></li>
+<li><a href="#sparql-linguistic-clause-forms">SPARQL Linguistic Clause Forms</a></li>
+<li><a href="#localizedanalyzer">LocalizedAnalyzer</a></li>
+<li><a href="#multilingual-support">Multilingual Support</a></li>
+</ul>
+</li>
+<li><a href="#generic-and-defined-analyzer-support">Generic and Defined Analyzer
Support</a><ul>
+<li><a href="#generic-analyzers-tokenizers-and-filters">Generic Analyzers, Tokenizers
and Filters</a></li>
+<li><a href="#defined-analyzers">Defined Analyzers</a></li>
+<li><a href="#extending-multilingual-support">Extending multilingual support</a></li>
+<li><a href="#multilingual-enhancements-for-multi-encoding-searches">Multilingual
enhancements for multi-encoding searches</a></li>
+<li><a href="#naming-analyzers-for-later-use">Naming analyzers for later use</a></li>
+</ul>
+</li>
 <li><a href="#storing-literal-values">Storing Literal Values</a></li>
 </ul>
 </li>
@@ -1487,7 +1500,7 @@ where some special characters and diacri
 </pre></div>
 
 
-<h5 id="extending-multilingual-support">Extending multilingual support<a class="headerlink"
href="#extending-multilingual-support" title="Permanent link">&para;</a></h5>
+<h4 id="extending-multilingual-support">Extending multilingual support<a class="headerlink"
href="#extending-multilingual-support" title="Permanent link">&para;</a></h4>
 <p>The <a href="#multilingual-support">Multilingual Support</a> described
above allows for a limited set of 
 ISO 2-letter codes to be used to select from among built-in analyzers using the nullary constructor

 associated with each analyzer. So if one is wanting to use:</p>
@@ -1510,7 +1523,98 @@ implicitly added if not already specifie
 
 <p>this adds an analyzer to be used when the <code>text:langField</code>
has the value <code>sa-x-iast</code> during 
 indexing and search.</p>
-<h5 id="naming-analyzers-for-later-use">Naming analyzers for later use<a class="headerlink"
href="#naming-analyzers-for-later-use" title="Permanent link">&para;</a></h5>
+<h4 id="multilingual-enhancements-for-multi-encoding-searches">Multilingual enhancements
for multi-encoding searches<a class="headerlink" href="#multilingual-enhancements-for-multi-encoding-searches"
title="Permanent link">&para;</a></h4>
+<p>There are two multilingual search situations that are supported as of 3.8.0:</p>
+<ul>
+<li>Search in one encoding and retrieve results that may have been entered in other
encodings. For example, searching via Simplified Chinese (Hans) and retrieving results that
may have been entered in Traditional Chinese (Hant) or Pinyin. This will simplify applications
by permitting encoding independent retrieval without additional layers of transcoding and
so on. It's all done under the covers in Lucene.</li>
+<li>Search with queries entered in a lossy, e.g., phonetic, encoding and retrieve results
entered with accurate encoding. For example, searching via Pinyin without diacritics and retrieving
all possible Hans and Hant triples.</li>
+</ul>
+<p>The first situation arises when entering triples that include languages with multiple
encodings that for various reasons are not normalized to a single encoding. In this situation
it is helpful to be able to retrieve appropriate result sets without regard for the encodings
used at the time that the triples were inserted into the dataset.</p>
+<p>There are several such languages of interest: Chinese, Tibetan, Sanskrit, Japanese
and Korean. There are various Romanizations and ideographic variants.</p>
+<p>Encodings may not be normalized when inserting triples for a variety of reasons. A
principal one is that the <code>rdf:langString</code> object often must be entered
in the same encoding that it occurs in some physical text that is being catalogued. Another
is that metadata may be imported from sources that use different encoding conventions and
it is desirable to preserve the original form.</p>
+<p>The second situation arises to provide simple support for phonetic or other forms
of lossy search at the time that triples are indexed directly in the Lucene system.</p>
+<p>To handle the first situation, a <code>text</code> assembler predicate,
<code>text:searchFor</code>, is introduced. It specifies the list of language tags
(language variants) that should be searched whenever a query string
with a given language tag (encoding) is used. For example, the following <code>text:TextIndexLucene/text:defineAnalyzers</code>
fragment:</p>
+<div class="codehilite"><pre>    <span class="p">[</span> <span
class="n">text</span><span class="p">:</span><span class="n">addLang</span>
&quot;<span class="n">bo</span>&quot; <span class="p">;</span>

+      <span class="n">text</span><span class="p">:</span><span
class="n">searchFor</span> <span class="p">(</span> &quot;<span
class="n">bo</span>&quot; &quot;<span class="n">bo</span><span
class="o">-</span><span class="n">x</span><span class="o">-</span><span
class="n">ewts</span>&quot; &quot;<span class="n">bo</span><span
class="o">-</span><span class="n">alalc97</span>&quot; <span class="p">)</span>
<span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">analyzer</span> <span class="p">[</span> 
+        <span class="n">a</span> <span class="n">text</span><span
class="p">:</span><span class="n">GenericAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">class</span> &quot;<span class="n">io</span><span class="p">.</span><span
class="n">bdrc</span><span class="p">.</span><span class="n">lucene</span><span
class="p">.</span><span class="n">bo</span><span class="p">.</span><span
class="n">TibetanAnalyzer</span>&quot; <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">params</span> <span class="p">(</span>
+            <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">paramName</span> &quot;<span
class="n">segmentInWords</span>&quot; <span class="p">;</span>
+              <span class="n">text</span><span class="p">:</span><span
class="n">paramValue</span> <span class="n">false</span> <span class="p">]</span>
+            <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">paramName</span> &quot;<span
class="n">lemmatize</span>&quot; <span class="p">;</span>
+              <span class="n">text</span><span class="p">:</span><span
class="n">paramValue</span> <span class="n">true</span> <span class="p">]</span>
+            <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">paramName</span> &quot;<span
class="n">filterChars</span>&quot; <span class="p">;</span>
+              <span class="n">text</span><span class="p">:</span><span
class="n">paramValue</span> <span class="n">false</span> <span class="p">]</span>
+            <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">paramName</span> &quot;<span
class="n">inputMode</span>&quot; <span class="p">;</span>
+              <span class="n">text</span><span class="p">:</span><span
class="n">paramValue</span> &quot;<span class="n">unicode</span>&quot;
<span class="p">]</span>
+            <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">paramName</span> &quot;<span
class="n">stopFilename</span>&quot; <span class="p">;</span>
+              <span class="n">text</span><span class="p">:</span><span
class="n">paramValue</span> &quot;&quot; <span class="p">]</span>
+            <span class="p">)</span>
+        <span class="p">]</span> <span class="p">;</span> 
+      <span class="p">]</span>
+</pre></div>
+
+
+<p>indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo
the Lucene index should also be searched for matches tagged as <code>bo-x-ewts</code>
and <code>bo-alalc97</code>.</p>
+<p>This is made possible by a Tibetan <code>Analyzer</code> that tokenizes
strings in all three encodings into Tibetan Unicode. This is feasible since the <code>bo-x-ewts</code>
and <code>bo-alalc97</code> encodings are one-to-one with Unicode Tibetan.
Since all fields with these language tags will have a common set of indexed terms, i.e., Tibetan
Unicode, it suffices to arrange for the query analyzer to have access to the language tag
for the query string along with the various fields that need to be considered.</p>
+<p>Supposing that the query is:</p>
+<div class="codehilite"><pre><span class="p">(</span>?<span class="n">s</span>
?<span class="n">sc</span> ?<span class="n">lit</span><span class="p">)</span>
<span class="n">text</span><span class="p">:</span><span class="n">query</span>
<span class="p">(</span>&quot;<span class="n">rje</span>&quot;<span
class="p">@</span><span class="n">bo</span><span class="o">-</span><span
class="n">x</span><span class="o">-</span><span class="n">ewts</span><span
class="p">)</span>
+</pre></div>
+
+
+<p>Then the query formed in <code>TextIndexLucene</code> will be:</p>
+<div class="codehilite"><pre><span class="n">label_bo</span><span
class="o">:</span><span class="n">rje</span> <span class="n">label_bo</span><span
class="o">-</span><span class="n">x</span><span class="o">-</span><span
class="n">ewts</span><span class="o">:</span><span class="n">rje</span>
<span class="n">label_bo</span><span class="o">-</span><span class="n">alalc97</span><span
class="o">:</span><span class="n">rje</span>
+</pre></div>
+
+
+<p>which is translated using a suitable <code>Analyzer</code>, <code>QueryMultilingualAnalyzer</code>,
via Lucene's <code>QueryParser</code> to:</p>
+<div class="codehilite"><pre><span class="o">+</span><span class="p">(</span><span
class="n">label_bo</span><span class="p">:</span>རྗེ <span
class="n">label_bo</span><span class="o">-</span><span class="n">x</span><span
class="o">-</span><span class="n">ewts</span><span class="p">:</span>རྗེ
<span class="n">label_bo</span><span class="o">-</span><span class="n">alalc97</span><span
class="p">:</span>རྗེ<span class="p">)</span>
+</pre></div>
+
+
+<p>which reflects the underlying Tibetan Unicode term encoding. During <code>IndexSearcher.search</code>,
every document in which one of the three fields contains the term "རྗེ" is
returned, even though the value in the fields <code>label_bo-x-ewts</code> and
<code>label_bo-alalc97</code> of the returned documents is the original
value "rje".</p>
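The field expansion described above can be sketched in a few lines of Python. This is an illustration only, not Jena's implementation: SEARCH_FOR and expand_query are hypothetical names, the field name "label" is taken from the example query, and the bo-x-ewts entry is assumed to carry the same text:searchFor list as the bo entry shown in the fragment.

```python
# Hypothetical sketch of text:searchFor expansion; not part of Jena.
# Maps a query language tag to the tags whose fields should be searched.
SEARCH_FOR = {
    "bo": ["bo", "bo-x-ewts", "bo-alalc97"],
    # Assumption: the other Tibetan tags declare the same list.
    "bo-x-ewts": ["bo", "bo-x-ewts", "bo-alalc97"],
    "bo-alalc97": ["bo", "bo-x-ewts", "bo-alalc97"],
}


def expand_query(field, term, lang, search_for=SEARCH_FOR):
    """Return a query string with one field:term clause per searchFor tag.

    A tag without a searchFor entry falls back to its own field only.
    """
    tags = search_for.get(lang, [lang])
    return " ".join(f"{field}_{tag}:{term}" for tag in tags)


print(expand_query("label", "rje", "bo-x-ewts"))
# prints: label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
```

The sketch stops at the unanalyzed query string. Per the text above, turning "rje" into the Tibetan Unicode term is the job of the analyzer applied when Lucene parses the query.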
+<p>This support simplifies applications by permitting encoding independent retrieval
without additional layers of transcoding and so on. It's all done under the covers in Lucene.</p>
+<p>The second situation is handled by adding appropriate fields
and indexing through configuration in the <code>text:TextIndexLucene/text:defineAnalyzers</code>, which again simplifies applications.
For example, the following fragment</p>
+<div class="codehilite"><pre>    <span class="p">[</span> <span
class="n">text</span><span class="p">:</span><span class="n">addLang</span>
&quot;<span class="n">zh</span><span class="o">-</span><span
class="n">hans</span>&quot; <span class="p">;</span> 
+      <span class="n">text</span><span class="p">:</span><span
class="n">searchFor</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">hans</span>&quot;
&quot;<span class="n">zh</span><span class="o">-</span><span
class="n">hant</span>&quot; <span class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">auxIndex</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">aux</span><span
class="o">-</span><span class="n">han2pinyin</span>&quot; <span
class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">analyzer</span> <span class="p">[</span>
+        <span class="n">a</span> <span class="n">text</span><span
class="p">:</span><span class="n">DefinedAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">useAnalyzer</span> <span class="p">:</span><span class="n">hanzAnalyzer</span>
<span class="p">]</span> <span class="p">;</span> 
+      <span class="p">]</span>
+    <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">addLang</span> &quot;<span class="n">zh</span><span
class="o">-</span><span class="n">hant</span>&quot; <span class="p">;</span>

+      <span class="n">text</span><span class="p">:</span><span
class="n">searchFor</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">hans</span>&quot;
&quot;<span class="n">zh</span><span class="o">-</span><span
class="n">hant</span>&quot; <span class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">auxIndex</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">aux</span><span
class="o">-</span><span class="n">han2pinyin</span>&quot; <span
class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">analyzer</span> <span class="p">[</span>
+        <span class="n">a</span> <span class="n">text</span><span
class="p">:</span><span class="n">DefinedAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">useAnalyzer</span> <span class="p">:</span><span class="n">hanzAnalyzer</span>
<span class="p">]</span> <span class="p">;</span> 
+      <span class="p">]</span>
+    <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">addLang</span> &quot;<span class="n">zh</span><span
class="o">-</span><span class="n">latn</span><span class="o">-</span><span
class="n">pinyin</span>&quot; <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">searchFor</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">latn</span><span
class="o">-</span><span class="n">pinyin</span>&quot; &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">aux</span><span
class="o">-</span><span class="n">han2pinyin</span>&quot; <span
class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">analyzer</span> <span class="p">[</span>
+        <span class="n">a</span> <span class="n">text</span><span
class="p">:</span><span class="n">DefinedAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">useAnalyzer</span> <span class="p">:</span><span class="n">pinyin</span>
<span class="p">]</span> <span class="p">;</span> 
+      <span class="p">]</span>        
+    <span class="p">[</span> <span class="n">text</span><span
class="p">:</span><span class="n">addLang</span> &quot;<span class="n">zh</span><span
class="o">-</span><span class="n">aux</span><span class="o">-</span><span
class="n">han2pinyin</span>&quot; <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">searchFor</span> <span class="p">(</span> &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">latn</span><span
class="o">-</span><span class="n">pinyin</span>&quot; &quot;<span
class="n">zh</span><span class="o">-</span><span class="n">aux</span><span
class="o">-</span><span class="n">han2pinyin</span>&quot; <span
class="p">)</span> <span class="p">;</span>
+      <span class="n">text</span><span class="p">:</span><span
class="n">analyzer</span> <span class="p">[</span>
+        <span class="n">a</span> <span class="n">text</span><span
class="p">:</span><span class="n">DefinedAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span
class="n">useAnalyzer</span> <span class="p">:</span><span class="n">pinyin</span>
<span class="p">]</span> <span class="p">;</span> 
+      <span class="n">text</span><span class="p">:</span><span
class="n">indexAnalyzer</span> <span class="p">:</span><span class="n">han2pinyin</span>
<span class="p">;</span> 
+      <span class="p">]</span>
+</pre></div>
+
+
+<p>defines language tags for Traditional, Simplified, Pinyin and an <em>auxiliary</em>
tag <code>zh-aux-han2pinyin</code> associated with an <code>Analyzer</code>,
<code>:han2pinyin</code>. The purpose of the auxiliary tag is to define an <code>Analyzer</code>
that will be used during indexing and to specify a list of tags that should be searched when
the auxiliary tag is used with a query string. </p>
+<p>Searching is then done via the multi-encoding support discussed above. In this example
the <code>Analyzer</code>, <code>:han2pinyin</code>, tokenizes strings
in <code>zh-hans</code> and <code>zh-hant</code> as the corresponding
pinyin so that at search time a pinyin query will retrieve appropriate triples inserted in
Traditional or Simplified Chinese. Such a query would appear as:</p>
+<div class="codehilite"><pre><span class="p">(</span>?<span class="n">s</span>
?<span class="n">sc</span> ?<span class="n">lit</span> ?<span class="n">g</span><span
class="p">)</span> <span class="n">text</span><span class="p">:</span><span
class="n">query</span> <span class="p">(</span>&quot;<span class="nb">j</span>ī<span
class="n">ng</span>&quot;<span class="p">@</span><span class="n">zh</span><span
class="o">-</span><span class="n">aux</span><span class="o">-</span><span
class="n">han2pinyin</span><span class="p">)</span>
+</pre></div>
+
+
+<p>The auxiliary field support is needed to accommodate situations such as pinyin or
Soundex, which are not exact, i.e., one-to-many rather than one-to-one as in the case of Simplified
and Traditional.</p>
+<p><code>TextIndexLucene</code> adds a field for each of the auxiliary
tags associated with the tag of the triple object being indexed. These fields are in addition
to the un-tagged field and the field tagged with the language of the triple object literal.</p>
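As a rough illustration of the paragraph above, the following Python sketch lists the field names that would be written for one literal. It is not Jena code: AUX_INDEX and index_fields are hypothetical names, the field name "label" is an assumption, and the field_tag naming pattern is inferred from the label_bo fields in the Tibetan example.

```python
# Hypothetical sketch of index-time field selection; not part of Jena.
# Maps a language tag to the auxiliary tags declared with text:auxIndex.
AUX_INDEX = {
    "zh-hans": ["zh-aux-han2pinyin"],
    "zh-hant": ["zh-aux-han2pinyin"],
}


def index_fields(field, lang, aux_index=AUX_INDEX):
    """Return the field names written for a literal with the given tag."""
    names = [field]                      # the un-tagged field
    if lang:
        names.append(f"{field}_{lang}")  # field tagged with the literal's language
        # one extra field per auxiliary tag associated with that language
        names.extend(f"{field}_{aux}" for aux in aux_index.get(lang, []))
    return names


print(index_fields("label", "zh-hant"))
```

Under these assumptions a zh-hant literal is indexed into three fields, and the auxiliary one is the field that a pinyin query tagged zh-aux-han2pinyin would reach through its text:searchFor list.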
+<h4 id="naming-analyzers-for-later-use">Naming analyzers for later use<a class="headerlink"
href="#naming-analyzers-for-later-use" title="Permanent link">&para;</a></h4>
 <p>Repeating a <code>text:GenericAnalyzer</code> specification for use
with multiple fields in an entity map
 may be cumbersome. The <code>text:defineAnalyzer</code> is used in an element
of a <code>text:defineAnalyzers</code> 
 list to associate a resource with an analyzer so that it may be referred to later in a 


