From buildbot@apache.org Sat Jun 30 18:11:23 2018
Subject: svn commit: r1031933 - in /websites/staging/jena/trunk/content: ./ documentation/query/text-query.html
Date: Sat, 30 Jun 2018 16:11:05 -0000
To: commits@jena.apache.org
Reply-To: dev@jena.apache.org
Author: buildbot
New Revision: 1031933

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/ (props changed)
    websites/staging/jena/trunk/content/documentation/query/text-query.html
  • Configuring an analyzer
  • Configuration by Code
  • Graph-specific Indexing
  • Linguistic Support with Lucene Index
  • Generic and Defined Analyzer Support
  • Storing Literal Values
    Extending multilingual support

    The Multilingual Support described above allows a limited set of ISO 2-letter codes to be used to select from among the built-in analyzers via the nullary constructor associated with each analyzer. A language tag outside this set, such as sa-x-iast, requires an analyzer to be registered for it explicitly.

    Registering an analyzer for such a tag, via a text:addLang entry in text:defineAnalyzers, adds an analyzer to be used when the text:langField has the value sa-x-iast during indexing and search.
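
    For example, a registration for sa-x-iast might look like the following sketch; the analyzer class shown here is a placeholder for illustration, not a specific recommendation:

        [ text:addLang "sa-x-iast" ;
          text:analyzer [
            a text:GenericAnalyzer ;
            # placeholder class: substitute a Lucene analyzer suited to IAST text
            text:class "org.example.lucene.IastAnalyzer"
          ] ;
        ]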

    Multilingual enhancements for multi-encoding searches
    There are two multilingual search situations that are supported as of 3.8.0:

    • Search in one encoding and retrieve results that may have been entered in other encodings. For example, searching via Simplified Chinese (Hans) and retrieving results that may have been entered in Traditional Chinese (Hant) or Pinyin.
    • Search with queries entered in a lossy encoding, e.g., a phonetic one, and retrieve results entered with an accurate encoding. For example, searching via Pinyin without diacritics and retrieving all matching Hans and Hant triples.

    The first situation arises when entering triples that include languages with multiple encodings that for various reasons are not normalized to a single encoding. In this situation it is helpful to be able to retrieve appropriate result sets without regard for the encodings used at the time that the triples were inserted into the dataset.


    There are several such languages of interest, including Chinese, Tibetan, Sanskrit, Japanese and Korean, each with various romanizations and ideographic variants.

    Encodings may not be normalized when inserting triples for a variety of reasons. A principal one is that the rdf:langString object often must be entered in the same encoding in which it occurs in some physical text being catalogued. Another is that metadata may be imported from sources that use different encoding conventions, and it is desirable to preserve the original form.

    The second situation arises when providing simple support for phonetic or other lossy forms of search at the time that triples are indexed directly in the Lucene system.

    To handle the first situation, a text assembler predicate, text:searchFor, is introduced that specifies the list of language variants that should be searched whenever a query string with a given language tag (encoding) is used. For example, the following text:TextIndexLucene/text:defineAnalyzers fragment:

        [ text:addLang "bo" ;
          text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
            text:params (
                [ text:paramName "segmentInWords" ;
                  text:paramValue false ]
                [ text:paramName "lemmatize" ;
                  text:paramValue true ]
                [ text:paramName "filterChars" ;
                  text:paramValue false ]
                [ text:paramName "inputMode" ;
                  text:paramValue "unicode" ]
                [ text:paramName "stopFilename" ;
                  text:paramValue "" ]
                )
            ] ;
          ]

    indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the Lucene index should also be searched for matches tagged as bo-x-ewts and bo-alalc97.


    This is made possible by a Tibetan Analyzer that tokenizes strings in all three encodings into Tibetan Unicode. This is feasible since the bo-x-ewts and bo-alalc97 encodings are one-to-one with Unicode Tibetan. Since all fields with these language tags will have a common set of indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query analyzer to have access to the language tag for the query string along with the various fields that need to be considered.


    Supposing that the query is:

        (?s ?sc ?lit) text:query ("rje"@bo-x-ewts)

    Then the query formed in TextIndexLucene will be:

        label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje

    which is translated using a suitable Analyzer, QueryMultilingualAnalyzer, via Lucene's QueryParser to:

        +(label_bo:རྗེ label_bo-x-ewts:རྗེ label_bo-alalc97:རྗེ)

    which reflects the underlying Tibetan Unicode term encoding. During IndexSearcher.search, all documents that have one of the three fields indexed with the term "རྗེ" will be returned, even though the value of the fields label_bo-x-ewts and label_bo-alalc97 in the returned documents will be the original value "rje".

    This support simplifies applications by permitting encoding independent retrieval without additional layers of transcoding and so on. It's all done under the covers in Lucene.


    The second situation is handled by adding appropriate fields and indexing via configuration in the text:TextIndexLucene/text:defineAnalyzers. For example, the following fragment

        [ text:addLang "zh-hans" ;
          text:searchFor ( "zh-hans" "zh-hant" ) ;
          text:auxIndex ( "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :hanzAnalyzer ] ;
          ]
        [ text:addLang "zh-hant" ;
          text:searchFor ( "zh-hans" "zh-hant" ) ;
          text:auxIndex ( "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :hanzAnalyzer ] ;
          ]
        [ text:addLang "zh-latn-pinyin" ;
          text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :pinyin ] ;
          ]
        [ text:addLang "zh-aux-han2pinyin" ;
          text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :pinyin ] ;
          text:indexAnalyzer :han2pinyin ;
          ]

    defines language tags for Traditional, Simplified, Pinyin and an auxiliary tag zh-aux-han2pinyin associated with an Analyzer, :han2pinyin. The purpose of the auxiliary tag is to define an Analyzer that will be used during indexing and to specify a list of tags that should be searched when the auxiliary tag is used with a query string.


    Searching is then done via the multi-encoding support discussed above. In this example the Analyzer, :han2pinyin, tokenizes strings in zh-hans and zh-hant as the corresponding pinyin so that at search time a pinyin query will retrieve appropriate triples inserted in Traditional or Simplified Chinese. Such a query would appear as:

        (?s ?sc ?lit ?g) text:query ("jīng"@zh-aux-han2pinyin)

    The auxiliary field support is needed to accommodate situations such as pinyin or sound-ex which are not exact, i.e., one-to-many rather than one-to-one as in the case of Simplified and Traditional.


    TextIndexLucene adds a field for each of the auxiliary tags associated with the tag of the triple object being indexed. These fields are in addition to the un-tagged field and the field tagged with the language of the triple object literal.
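
    As an illustration (assuming an entity map field named label, as in the earlier Tibetan examples), indexing a hypothetical literal "金剛經"@zh-hant under the configuration above would produce roughly the following fields; the exact pinyin tokens depend on the :han2pinyin analyzer:

        label                      金剛經           (un-tagged field)
        label_zh-hant              金剛經           (field tagged with the literal's language)
        label_zh-aux-han2pinyin    jīn gāng jīng    (auxiliary field, analyzed by :han2pinyin)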


    Naming analyzers for later use

    Repeating a text:GenericAnalyzer specification for use with multiple fields in an entity map may be cumbersome. The text:defineAnalyzer is used in an element of a text:defineAnalyzers list to associate a resource with an analyzer so that it may be referred to later in a