jena-users mailing list archives

From Chris Tomlinson <chris.j.tomlin...@gmail.com>
Subject Re: Text Index build with empty fields
Date Tue, 12 Mar 2019 14:39:00 GMT
Hi Sorin,

I have focused on the jena-text integration with Lucene, local to jena/fuseki. The Solr integration was
dropped over a year ago due to lack of support/interest, and with your information about ES
7.x it is likely going to take someone who is a user of ES to help keep the ES integration up-to-date.
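
For whoever takes that on: as Sorin describes below, ES 7.x deprecates the TransportClient that
TextIndexES.java is built on, in favour of the Java High Level REST Client. A minimal sketch of
the client-initialization side of that migration, assuming the stock ES 6.x/7.x client APIs (the
host, ports and cluster name here are illustrative; the real TextIndexES wiring would need more
than this):

    import java.net.InetAddress;
    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.TransportAddress;
    import org.elasticsearch.transport.client.PreBuiltTransportClient;

    public class EsClientMigrationSketch {

        // Old style (ES <= 6.x), as TextIndexES.java does today;
        // deprecated in ES 7.x and removed in 8.0.
        static TransportClient oldClient() throws Exception {
            Settings settings =
                Settings.builder().put("cluster.name", "elasticsearch").build();
            return new PreBuiltTransportClient(settings)
                .addTransportAddress(
                    new TransportAddress(InetAddress.getByName("localhost"), 9300));
        }

        // New style: the Java High Level REST Client speaks HTTP on 9200
        // instead of the internal transport protocol on 9300.
        static RestHighLevelClient newClient() {
            return new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        }
    }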


Anuj Kumar <akumar1@isightpartners.com> did the ES integration about a year ago, for
jena 3.9.0, and, as I mentioned, I made the obvious changes to the ES integration to update it to
Lucene 7.4.0 for jena 3.10.0.

The upgrade to Lucene 7.4.0 (see https://issues.apache.org/jira/browse/JENA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673657#comment-16673657)
was prompted by a user, jeanmarc.vanel@gmail.com, who was interested in Lucene 7.5; but the
released version of ES was built against Lucene 7.4, so we upgraded to that version.

I’ve opened JENA-1681 <https://issues.apache.org/jira/browse/JENA-1681> for the issue
you’ve reported. You can report your findings there and hopefully we can get to the bottom
of the problem.

Regards,
Chris



> On Mar 12, 2019, at 6:40 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
> 
> Hi Chris,
> 
> Thank you for your detailed answer. I will still try to find the root cause of this issue.
> But I have a question for you: do you know whether Jena will support Elasticsearch in future versions?
> 
> I am asking because Elasticsearch 7.0 introduces breaking changes which affect the transport client [1]:
> The TransportClient is deprecated in favour of the Java High Level REST Client and will be removed in Elasticsearch 8.0.
> This requires changes in the client's initialization code; the Migration Guide [2] explains how to do it.
> 
> [1] https://www.elastic.co/guide/en/elasticsearch/client/java-api/master/transport-client.html
> [2] https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-level-migration.html
> 
> Best regards,
> Sorin
> 
> On 11.03.2019 at 18:38, Chris Tomlinson wrote:
>> Hi Sorin,
>> 
>> I haven't had the time to delve further into your issue. Your pcap seems to clearly indicate that there is no data populating any field/property other than the first one in the entity map.
>> 
>> I've included the configuration file that we use. It has many fields defined, all of which are populated. We load jena/fuseki from a collection of git repos via a git-to-dbs tool <https://github.com/buda-base/git-to-dbs>, and we don't see the sort of issue you're reporting, where only a single field out of all the defined fields is populated in the dataset and Lucene index - we don't use ElasticSearch.
>> 
>> The point is that whatever is going wrong is apparently not in the parsing of the configuration or in the setting up of the internal tables that record which predicates are indexed, via Lucene (or Elasticsearch), into which fields.
>> 
>> So it appears to me that the issue is something happening in the connection between the standalone textindexer.java and Elasticsearch via TextIndexES.java. The textindexer.java doesn't have any post-3.8.0 changes that I can see, and the only change in TextIndexES.java is the rename of org.elasticsearch.common.transport.InetSocketTransportAddress to org.elasticsearch.common.transport.TransportAddress as part of the upgrade.
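>> 
>> For reference, the shape of that one change (a sketch only; ES 6.0 folded InetSocketTransportAddress into TransportAddress, so just the class name and import change; client, host and port stand in for the surrounding TextIndexES code):
>> 
>>     // before, against the ES 5.x client:
>>     import org.elasticsearch.common.transport.InetSocketTransportAddress;
>>     client.addTransportAddress(
>>         new InetSocketTransportAddress(InetAddress.getByName(host), port));
>> 
>>     // after, against the ES 6.x client:
>>     import org.elasticsearch.common.transport.TransportAddress;
>>     client.addTransportAddress(
>>         new TransportAddress(InetAddress.getByName(host), port));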
>> 
>> I’m really not able to go further at this time.
>> 
>> I’m sorry,
>> Chris
>> 
>> 
>>> # Fuseki configuration for BDRC, configures two endpoints:
>>> #   - /bdrc is read-only
>>> #   - /bdrcrw is read-write
>>> #
>>> # This was painful to come up with but the web interface basically allows no option
>>> # and there is no subclass inference by default so such a configuration file is necessary.
>>> #
>>> # The main doc sources are:
>>> #  - https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
>>> #  - https://jena.apache.org/documentation/assembler/assembler-howto.html
>>> #  - https://jena.apache.org/documentation/assembler/assembler.ttl
>>> #
>>> # See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html for the destination of this file.
>>> 
>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>> @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>> @prefix tdb2:    <http://jena.apache.org/2016/tdb#> .
>>> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>> @prefix :        <http://base/#> .
>>> @prefix text:    <http://jena.apache.org/text#> .
>>> @prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
>>> @prefix adm:     <http://purl.bdrc.io/ontology/admin/> .
>>> @prefix bdd:     <http://purl.bdrc.io/data/> .
>>> @prefix bdo:     <http://purl.bdrc.io/ontology/core/> .
>>> @prefix bdr:     <http://purl.bdrc.io/resource/> .
>>> @prefix f:       <java:io.bdrc.ldspdi.sparql.functions.> .
>>> 
>>> # [] ja:loadClass "org.seaborne.tdb2.TDB2" .
>>> # tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .
>>> # tdb2:GraphTDB2    rdfs:subClassOf  ja:Model .
>>> 
>>> [] rdf:type fuseki:Server ;
>>>    fuseki:services (
>>>      :bdrcrw
>>>    ) .
>>> 
>>> :bdrcrw rdf:type fuseki:Service ;
>>>     fuseki:name                       "bdrcrw" ;     # name of the dataset in the url
>>>     fuseki:serviceQuery               "query" ;    # SPARQL query service
>>>     fuseki:serviceUpdate              "update" ;   # SPARQL update service
>>>     fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
>>>     fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
>>>     fuseki:dataset                    :bdrc_text_dataset ;
>>>     .
>>> 
>>> # using TDB
>>> :dataset_bdrc rdf:type      tdb:DatasetTDB ;
>>>      tdb:location "/usr/local/fuseki/base/databases/bdrc" ;
>>>      tdb:unionDefaultGraph true ;
>>>      .
>>> 
>>> # using TDB2
>>> # :dataset_bdrc rdf:type      tdb2:DatasetTDB2 ;
>>> #      tdb2:location "/usr/local/fuseki/base/databases/bdrc" ;
>>> #      tdb2:unionDefaultGraph true ;
>>> #   .
>>> 
>>> :bdrc_text_dataset rdf:type     text:TextDataset ;
>>>     text:dataset   :dataset_bdrc ;
>>>     text:index     :bdrc_lucene_index ;
>>>     .
>>> 
>>> # Text index description
>>> :bdrc_lucene_index a text:TextIndexLucene ;
>>>     text:directory <file:/usr/local/fuseki/base/lucene-bdrc> ;
>>>     text:storeValues true ;
>>>     text:multilingualSupport true ;
>>>     text:entityMap :bdrc_entmap ;
>>>     text:defineAnalyzers (
>>>         [ text:defineAnalyzer :romanWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :devaWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "deva" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :slpWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "SLP" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :romanLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :devaLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "deva" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :slpLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "SLP" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :romanLenientQueryAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "query" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :hanzAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "TC2SC" ]
>>>                 [ text:paramName "stopwords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue 0 ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :han2pinyin ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "TC2PYstrict" ]
>>>                 [ text:paramName "stopwords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue 0 ]
>>>                 )
>>>             ] ; 
>>>           ]
>>>         [ text:defineAnalyzer :pinyin ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "PYstrict" ]
>>>                 )
>>>             ] ; 
>>>           ]
>>>         [ text:addLang "bo" ; 
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "unicode" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "bo-x-ewts" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "ewts" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "bo-alalc97" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "alalc" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-hans" ;
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :hanzAnalyzer ] ;
>>>           ]
>>>         [ text:addLang "zh-hant" ; 
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :hanzAnalyzer
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-latn-pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :pinyin
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-aux-han2pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :pinyin
>>>             ] ;
>>>           text:indexAnalyzer :han2pinyin ;
>>>           ]
>>>         [ text:addLang "sa-x-ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           ]
>>>         [ text:addLang "sa-aux-deva2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           text:indexAnalyzer :devaLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-aux-roman2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer 
>>>             ] ; 
>>>           text:indexAnalyzer :romanLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-aux-slp2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           text:indexAnalyzer :slpLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-deva" ;
>>>           text:searchFor ( "sa-deva" "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-deva2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :devaWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-iso" ;
>>>           text:searchFor ( "sa-x-iso" "sa-x-iast" "sa-x-slp1" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-slp1" ;
>>>           text:searchFor ( "sa-x-slp1" "sa-x-iast" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :slpWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-iast" ;
>>>           text:searchFor ( "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-alalc97" ;
>>>           text:searchFor ( "sa-alalc97" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-iast" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>       ) ;
>>>     .
>>> 
>>> # Index mappings
>>> :bdrc_entmap a text:EntityMap ;
>>>     text:entityField      "uri" ;
>>>     text:uidField         "uid" ;
>>>     text:defaultField     "label" ;
>>>     text:langField        "lang" ;
>>>     text:graphField       "graph" ; ## enable graph-specific indexing
>>>     text:map (
>>>          [ text:field "label" ; 
>>>            text:predicate skos:prefLabel ]
>>>          [ text:field "altLabel" ; 
>>>            text:predicate skos:altLabel ; ]
>>>          [ text:field "rdfsLabel" ;
>>>            text:predicate rdfs:label ; ]
>>>          [ text:field "chunkContents" ;
>>>            text:predicate bdo:chunkContents ; ]
>>>          [ text:field "eTextTitle" ;
>>>            text:predicate bdo:eTextTitle ; ]
>>>          [ text:field "logMessage" ;
>>>            text:predicate adm:logMessage ; ]
>>>          [ text:field "noteText" ;
>>>            text:predicate bdo:noteText ; ]
>>>          [ text:field "workAuthorshipStatement" ;
>>>            text:predicate bdo:workAuthorshipStatement ; ]
>>>          [ text:field "workColophon" ; 
>>>            text:predicate bdo:workColophon ; ]
>>>          [ text:field "workEditionStatement" ;
>>>            text:predicate bdo:workEditionStatement ; ]
>>>          [ text:field "workPublisherLocation" ;
>>>            text:predicate bdo:workPublisherLocation ; ]
>>>          [ text:field "workPublisherName" ;
>>>            text:predicate bdo:workPublisherName ; ]
>>>          [ text:field "workSeriesName" ;
>>>            text:predicate bdo:workSeriesName ; ]
>>>          ) ;
>>>     .
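>> 
>> For context on how those map entries are used at query time: a field defined in the entity map is reached through its mapped predicate in a jena-text query, roughly like this (a sketch against the config above; the search string is only an example):
>> 
>>     PREFIX text: <http://jena.apache.org/text#>
>>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>> 
>>     SELECT ?s ?lit WHERE {
>>       # (?s ?score ?lit) binds the subject, the Lucene score and the matched
>>       # literal; skos:prefLabel selects the "label" field; 10 caps the hits
>>       (?s ?score ?lit) text:query ( skos:prefLabel "grags pa" 10 ) .
>>     }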
>> 
>> 
>>> On Mar 11, 2019, at 11:42 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>> 
>>> Hi Chris,
>>> 
>>> have you had time to look at my results, by chance? Would this help to isolate the issue?
>>> Please let me know if you need me to collect any other data.
>>> Best regards,
>>> Sorin
>>> 
>>> -------- Forwarded Message --------
>>> Subject:	Re: Text Index build with empty fields
>>> Date:	Mon, 4 Mar 2019 17:35:56 +0100
>>> From:	Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de>
>>> To:	users@jena.apache.org
>>> CC:	Chris Tomlinson <chris.j.tomlinson@gmail.com>
>>> 
>>> Hi Chris,
>>> 
>>> when I reduce the entity map to 3 fields:
>>> 
>>>          [ text:field "oldgndid";
>>>            text:predicate gndo:oldAuthorityNumber
>>>          ]
>>>          [ text:field "prefName";
>>>            text:predicate gndo:preferredNameForThePerson
>>>          ]
>>>          [ text:field "varName";
>>>            text:predicate gndo:variantNameForThePerson
>>>          ]
>>> then only the oldgndid field contains data (see textindexer_3params_040319.pcap attached):
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/4000002-3........
>>> ES...B..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/4000023-0......painless..if((ctx._source == null) || (ctx._source.oldgndid == null) || (ctx._source.oldgndid.empty == true)) {ctx._source.oldgndid=[params.fieldValue] } else {ctx._source.oldgndid.add(params.fieldValue)}..fieldValue..(DE-588c)4000023-0...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/4000023-0..>{"varName":[],"prefName":[],"oldgndid":["(DE-588c)4000023-0"]}.............
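>>> 
>>> Formatted for readability, the painless update script embedded in that packet is:
>>> 
>>>     if ((ctx._source == null)
>>>         || (ctx._source.oldgndid == null)
>>>         || (ctx._source.oldgndid.empty == true)) {
>>>       ctx._source.oldgndid = [params.fieldValue]
>>>     } else {
>>>       ctx._source.oldgndid.add(params.fieldValue)
>>>     }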
>>> Moreover, with 2 fields:
>>> 
>>>          [ text:field "prefName";
>>>            text:predicate gndo:preferredNameForThePerson
>>>          ]
>>>          [ text:field "varName";
>>>            text:predicate gndo:variantNameForThePerson
>>>          ]
>>> then only the prefName field contains data (see textindexer_2params_040319.pcap attached):
>>> 
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/134316541........
>>> ES...$..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/1153446294......painless..if((ctx._source == null) || (ctx._source.prefName == null) || (ctx._source.prefName.empty == true)) {ctx._source.prefName=[params.fieldValue] } else {ctx._source.prefName.add(params.fieldValue)}..fieldValue.     Pharmakon...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/1153446294..'{"varName":[],"prefName":["Pharmakon"]}.................
>>> 
>>> Regards,
>>> Sorin
>>> 
>>> On 01.03.2019 at 18:06, Chris Tomlinson wrote:
>>>> Hi Sorin,
>>>> 
>>>> tcpdump -A -r works fine to view the pcap file; however, I don't have the time to delve into the data. I'll take your word for it that the whole setup worked in 3.8.0, and I encourage you to try simplifying the entity map, perhaps by having a unique field per property, to see whether the problem is related to the prefName and varName fields mapping to multiple properties.
>>>> 
>>>> I do notice that the field oldgndid maps to only a single property, but, not knowing the data, I have no idea whether any of that data occurs in your tests.
>>>> 
>>>> You indicate that only the gndtype field has data (per the pcap file). If there is oldgndid data (i.e., occurrences of gndo:oldAuthorityNumber), then that suggests some rather generic issue with textindexer; however, if there is no oldgndid data, then a problem may have crept in since 3.8.0 with data for multiple properties assigned to a single field, which I would guess might be related to the com.google.common.collect.Multimap that holds the results of parsing the entity map.
>>>> 
>>>> I have no idea how to enable debug logging when running the standalone textindexer; perhaps someone else can answer that.
>>>> 
>>>> Regards,
>>>> Chris
>>>> 
>>>> 
>>>>> On Mar 1, 2019, at 2:57 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>>>> 
>>>>> Hi Chris,
>>>>> 
>>>>> 1) As I said before, this entity map worked in 3.8.0. 
>>>>> The pcap file I sent you is the proof that Jena delivers inconsistent data. You may open it with Wireshark
>>>>> 
>>>>> <jndbgnifbhkopbdd.png>
>>>>> 
>>>>> or read it with tcpick:
>>>>> # tcpick -C -yP -r textindexer_280219.pcap | more
>>>>> 
>>>>> ES...}..........\*.......gnd_fts_es_131018_index.cp-dFuCVTg-dUwvfyREG2w..GndSubjectheadings.http://d-nb.info/gnd/102968225.........
>>>>> ES..............\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/102968438......painless..if((ctx._source == null) || (ctx._source.gndtype == null) || (ctx._source.gndtype.empty == true)) {ctx._source.gndtype=[params.fieldValue] } else {ctx._source.gndtype.add(params.fieldValue)}
>>>>> ..fieldValue..Person...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/102968438....{"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"oldgndid":[],"gndtype":["Person"]}..................................
>>>>> As a remark, Jena sends the whole text index data for one Elasticsearch document within one TCP packet.
>>>>> 
>>>>> 3) fuseki.log collects logs when the Fuseki server is running, but for the text indexer we have to run the java command line, i.e.
>>>>> 
>>>>> 	java -cp ./fuseki-server.jar:<other_jars> jena.textindexer --desc=run/config.ttl
>>>>> The question is: how do we activate debug logging when running the text indexer?
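>>>>> 
>>>>> Presumably the standalone run has to be pointed at a logging config explicitly, something like this (a guess based on the stock log4j 1.2 mechanism that Jena 3.x bundles; the path is illustrative):
>>>>> 
>>>>> 	java -cp ./fuseki-server.jar:<other_jars> -Dlog4j.configuration=file:run/log4j.properties jena.textindexer --desc=run/config.ttl
>>>>> 
>>>>> with run/log4j.properties containing at least a root logger and an appender, e.g.:
>>>>> 
>>>>> 	log4j.rootLogger=INFO, stdout
>>>>> 	log4j.appender.stdout=org.apache.log4j.ConsoleAppender
>>>>> 	log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
>>>>> 	log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c{1} - %m%n
>>>>> 	log4j.logger.org.apache.jena.query.text=DEBUG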
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Sorin
>>>>> 
>>>>> On 28.02.2019 at 21:41, Chris Tomlinson wrote:
>>>>>> Hi Sorin,
>>>>>> 
>>>>>> 1) I suggest trying to simplify the entity map. I assume there's data for each of the properties other than skos:altLabel in the entity map:
>>>>>> 
>>>>>>>          [ text:field "gndtype";
>>>>>>>            text:predicate skos:altLabel
>>>>>>>          ]
>>>>>>>          [ text:field "oldgndid";
>>>>>>>            text:predicate gndo:oldAuthorityNumber
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheSubjectHeading
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheSubjectHeading
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForThePlaceOrGeographicName
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForThePlaceOrGeographicName
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheWork
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheWork
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheConferenceOrEvent
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheConferenceOrEvent
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheCorporateBody
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheCorporateBody
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForThePerson
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForThePerson
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheFamily
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheFamily
>>>>>>>          ]
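>>>>>> 
>>>>>> By "simplify" I mean giving each predicate its own field, e.g. (a sketch; the field names are illustrative):
>>>>>> 
>>>>>>          [ text:field "prefNamePerson" ;
>>>>>>            text:predicate gndo:preferredNameForThePerson
>>>>>>          ]
>>>>>>          [ text:field "varNamePerson" ;
>>>>>>            text:predicate gndo:variantNameForThePerson
>>>>>>          ]
>>>>>> 
>>>>>> That way, if all fields are populated, the problem points at the handling of multiple properties per field.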
>>>>>> 2) You might try a TextIndexLucene.
>>>>>> 
>>>>>> 3) Adding the line log4j.logger.org.apache.jena.query.text.es=DEBUG should work. I see no problem with it.
>>>>>> 
>>>>>> Sorry to be of little help,
>>>>>> Chris
>>>>>> 
>>>>>> 
>>>>>>> On Feb 28, 2019, at 8:53 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>>>>>> 
>>>>>>> Hi Chris,
>>>>>>> Thank you for answering. I am replying to you directly because users@jena doesn't accept messages larger than 1 MB.
>>>>>>> 
>>>>>>> The previous successful text index attempt we made was with 3.8.0, not 3.9.0; sorry for the misinformation.
>>>>>>> Attached is the assembler file for 3.10.0, as requested, as well as the packet capture file showing that only the 'gndtype' field has data.
>>>>>>> I tried to enable debug logging in log4j.properties with log4j.logger.org.apache.jena.query.text.es=DEBUG, but there was no output in the log file.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Sorin
>>>>>>> 
>>>>>>> On 27.02.2019 at 20:01, Chris Tomlinson wrote:
>>>>>>>> Hi Sorin,
>>>>>>>> 
>>>>>>>> Please provide the assembler file for Elasticsearch that has the problematic entity map definitions.
>>>>>>>> 
>>>>>>>> There haven't been any changes to textindexer in over a year, since well before 3.9. I don't see any relevant changes to the handling of entity maps either, so I can't begin to pursue the issue further without seeing your current assembler file.
>>>>>>>> 
>>>>>>>> I don't have any experience with Elasticsearch or with using jena-text-es beyond a simple change to TextIndexES.java to change org.elasticsearch.common.transport.InetSocketTransportAddress to org.elasticsearch.common.transport.TransportAddress as part of the upgrade to Lucene 7.4.0 and Elasticsearch 6.4.2.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 25, 2019, at 2:37 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>>>>>>>> 
>>>>>>>>> Correction: only the *last field* from the /text:map/ list contains a value.
>>>>>>>>> 
>>>>>>>>> To reformulate:
>>>>>>>>> 
>>>>>>>>> * if there are 3 fields in /text:map/, then during indexing the first two are empty (let's name them 'text1' and 'text2') and the last field contains data (let's name it 'text3')
>>>>>>>>> * if on the next attempt the field 'text3' is commented out, then 'text1' is empty and 'text2' contains data
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 22.02.2019 at 15:01, Sorin Gheorghiu wrote:
>>>>>>>>>> In addition:
>>>>>>>>>> 
>>>>>>>>>>  * if there are 3 fields in /text:map/, then during indexing one contains data (let's name it 'text1'), the others are empty (let's name them 'text2' and 'text3'),
>>>>>>>>>>  * if on the next attempt the field 'text1' is commented out, then 'text2' contains data and 'text3' is empty
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -------- Forwarded Message --------
>>>>>>>>>> Subject: 	Text Index build with empty fields
>>>>>>>>>> Date: 	Fri, 22 Feb 2019 14:01:18 +0100
>>>>>>>>>> From: 	Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de>
>>>>>>>>>> Reply-To: 	users@jena.apache.org
>>>>>>>>>> To: 	users@jena.apache.org
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> When building the text index with the /jena.textindexer/ tool in Jena 3.10 for an external full-text search engine (Elasticsearch, of course), with multiple fields with different names in /text:map/, just *one field is indexed* (more precisely, one field contains data and the others are empty). It doesn't look to be an issue with Elasticsearch: in the logs generated during indexing, all fields but one are already missing their values. The same setup worked in Jena 3.9. Changing the Java version from 8 to 9 or 11 didn't change anything.
>>>>>>>>>> 
>>>>>>>>>> Could it be that changes in the new release have affected this tool and we are dealing with a bug?
>>>>>>>>>> 
>>>>>>> -- 
>>>>>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>>>>>> Universität Konstanz        Raum: B705
>>>>>>> 78464 Konstanz              sorin.gheorghiu@uni-konstanz.de
>>>>>>> 
>>>>>>> - KIM: Abteilung Contentdienste -
>>>>> -- 
>>>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>>>> Universität Konstanz        Raum: B705
>>>>> 78464 Konstanz              sorin.gheorghiu@uni-konstanz.de
>>>>> 
>>>>> - KIM: Abteilung Contentdienste -
>>> -- 
>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>> Universität Konstanz        Raum: B705
>>> 78464 Konstanz              sorin.gheorghiu@uni-konstanz.de
>>> 
>>> - KIM: Abteilung Contentdienste -
>>> <textindexer_2params_040319.pcap><textindexer_3params_040319.pcap>
>> 
> -- 
> Sorin Gheorghiu             Tel: +49 7531 88-3198
> Universität Konstanz        Raum: B705
> 78464 Konstanz              sorin.gheorghiu@uni-konstanz.de
> 
> - KIM: Abteilung Contentdienste -

