jena-users mailing list archives

From Sorin Gheorghiu <sorin.gheorg...@uni-konstanz.de>
Subject Fwd: Re: Text Index build with empty fields
Date Mon, 04 Mar 2019 16:37:51 GMT
Hi Chris,

When I reduce the entity map to three fields:

          [ text:field "oldgndid";
            text:predicate gndo:oldAuthorityNumber
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForThePerson
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForThePerson
          ]

then only the *oldgndid* field contains data:

ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/4000002-3........
ES...B..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/4000023-0......painless..if((ctx._source == null) || (ctx._source.oldgndid == null) || (ctx._source.oldgndid.empty == true)) {ctx._source.oldgndid=[params.fieldValue] } else {ctx._source.oldgndid.add(params.fieldValue)}..fieldValue..(DE-588c)4000023-0...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/4000023-0..>{"varName":[],"prefName":[],"oldgndid":["(DE-588c)4000023-0"]}.............

Moreover, with two fields:

          [ text:field "prefName";
            text:predicate gndo:preferredNameForThePerson
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForThePerson
          ]

then only the *prefName* field contains data:

ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/134316541........
ES...$..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/1153446294......painless..if((ctx._source == null) || (ctx._source.prefName == null) || (ctx._source.prefName.empty == true)) {ctx._source.prefName=[params.fieldValue] } else {ctx._source.prefName.add(params.fieldValue)}..fieldValue. Pharmakon...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/1153446294..'{"varName":[],"prefName":["Pharmakon"]}.................
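
For anyone reading the captures: the painless script in the update requests
appears to do a simple create-or-append per field. Below is a minimal Python
sketch of that reading. It is not Jena's or Elasticsearch's actual code. The
names apply_update and empty_fields are hypothetical, and treating a missing
_source as an empty document is an assumption of the sketch.

```python
import json


def apply_update(source, field, value):
    """Mimic the painless script: create the field's list or append to it."""
    if source is None:
        # Assumption of this sketch: a missing _source becomes an empty document.
        source = {}
    current = source.get(field)
    if current is None or len(current) == 0:
        # Field missing or empty: start a new single-element list.
        source[field] = [value]
    else:
        # Field already has values: append the new one.
        current.append(value)
    return source


def empty_fields(source):
    """Return the names of fields whose value list is empty."""
    return [name for name, values in source.items() if not values]


# Document shape taken from the two-field capture, before the update is applied.
doc = json.loads('{"varName": [], "prefName": []}')
doc = apply_update(doc, "prefName", "Pharmakon")
print(doc)
print(empty_fields(doc))
```

Under this reading, a field stays empty only if no update request carrying
that field name is ever sent for the document, which matches what the
captures show for varName.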

Regards,
Sorin

On 01.03.2019 at 18:06, Chris Tomlinson wrote:
> Hi Sorin,
>
> tcpdump -A -r works fine to view the pcap file; however, I don’t have the
> time to delve into the data. I’ll take your word for it that the whole setup
> worked in 3.8.0, and I encourage you to try simplifying the entity map,
> perhaps by having a unique field per property, to see if the problem appears
> related to the prefName and varName fields mapping to multiple properties.
>
> I do notice that the field oldgndid maps to only a single property, but not
> knowing the data, I have no idea whether there’s any of that data in your tests.
>
> Since you indicate that only the field gndtype has data (per the pcap file),
> there are two cases. If there is oldgndid data (i.e., occurrences of
> gndo:oldAuthorityNumber), that suggests some rather generic issue with
> textindexer. If there is no oldgndid data, then a problem may have crept in
> since 3.8.0 that affects data for multiple properties assigned to a single
> field, which I would guess might be related to the Guava Multimap
> (com.google.common.collect.Multimap) that holds the results of parsing the
> entity map.
>
> I have no idea how to enable debug logging when running the standalone
> textindexer; perhaps someone else can answer that.
>
> Regards,
> Chris
>
>
>> On Mar 1, 2019, at 2:57 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>
>> Hi Chris,
>>
>> 1) As I said before, this entity map worked in 3.8.0.
>> The pcap file I sent you is proof that Jena delivers inconsistent data. You
>> may open it with Wireshark
>>
>> <jndbgnifbhkopbdd.png>
>>
>> or read it with tcpick:
>> # tcpick -C -yP -r textindexer_280219.pcap | more
>>
>> ES...}..........\*.......gnd_fts_es_131018_index.cp-dFuCVTg-dUwvfyREG2w..GndSubjectheadings.http://d-nb.info/gnd/102968225.........
>> ES..............\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/102968438......painless..if((ctx._source == null) || (ctx._source.gndtype == null) || (ctx._source.gndtype.empty == true)) {ctx._source.gndtype=[params.fieldValue] } else {ctx._source.gndtype.add(params.fieldValue)}..fieldValue..Person...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/102968438....{"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"oldgndid":[],"gndtype":["Person"]}..................................
>> As a remark, Jena sends the whole text index data for one Elasticsearch
>> document within a single TCP packet.
>>
>> 3) fuseki.log collects logs while the Fuseki server is running, but for the
>> text indexer we have to run a java command line, i.e.
>>
>> 	java -cp ./fuseki-server.jar:<other_jars> jena.textindexer --desc=run/config.ttl
>> The question is how to activate the debug logs when running the text indexer.
>>
>>
>> Regards,
>> Sorin
>>
>> On 28.02.2019 at 21:41, Chris Tomlinson wrote:
>>> Hi Sorin,
>>>
>>> 1) I suggest trying to simplify the entity map. I assume there’s data for each
>>> of the properties other than skos:altLabel in the entity map:
>>>
>>>>           [ text:field "gndtype";
>>>>             text:predicate skos:altLabel
>>>>           ]
>>>>           [ text:field "oldgndid";
>>>>             text:predicate gndo:oldAuthorityNumber
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForTheSubjectHeading
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForTheSubjectHeading
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForThePlaceOrGeographicName
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForThePlaceOrGeographicName
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForTheWork
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForTheWork
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForTheConferenceOrEvent
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForTheConferenceOrEvent
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForTheCorporateBody
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForTheCorporateBody
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForThePerson
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForThePerson
>>>>           ]
>>>>           [ text:field "prefName";
>>>>             text:predicate gndo:preferredNameForTheFamily
>>>>           ]
>>>>           [ text:field "varName";
>>>>             text:predicate gndo:variantNameForTheFamily
>>>>           ]
>>> 2) You might try a TextIndexLucene
>>>
>>> 3) Adding the line log4j.logger.org.apache.jena.query.text.es=DEBUG should work.
>>> I see no problem with it.
>>>
>>> Sorry to be of little help,
>>> Chris
>>>
>>>
>>>> On Feb 28, 2019, at 8:53 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>>>
>>>> Hi Chris,
>>>> Thank you for answering. I am replying to you directly because users@jena
>>>> doesn't accept messages larger than 1 MB.
>>>>
>>>> The previous successful text index attempt we made was with 3.8.0, not
>>>> 3.9.0; sorry for the misinformation.
>>>> Attached is the assembler file for 3.10.0 as requested, as well as the
>>>> packet capture file showing that only the 'gndtype' field has data.
>>>> I tried to enable the debug logs in log4j.properties with
>>>> log4j.logger.org.apache.jena.query.text.es=DEBUG, but there is no output in
>>>> the log file.
>>>>
>>>> Regards,
>>>> Sorin
>>>>
>>>> On 27.02.2019 at 20:01, Chris Tomlinson wrote:
>>>>> Hi Sorin,
>>>>>
>>>>> Please provide the assembler file for Elasticsearch that has the problematic
>>>>> entity map definitions.
>>>>>
>>>>> There haven’t been any changes to textindexer in over a year, since well
>>>>> before 3.9. I don’t see any relevant changes to the handling of entity maps
>>>>> either, so I can’t begin to pursue the issue further without perhaps seeing
>>>>> your current assembler file.
>>>>>
>>>>> I don't have any experience with Elasticsearch or with using jena-text-es
>>>>> beyond a simple change to TextIndexES.java, replacing
>>>>> org.elasticsearch.common.transport.InetSocketTransportAddress with
>>>>> org.elasticsearch.common.transport.TransportAddress as part of the upgrade
>>>>> to Lucene 7.4.0 and Elasticsearch 6.4.2.
>>>>>
>>>>> Regards,
>>>>> Chris
>>>>>
>>>>>
>>>>>> On Feb 25, 2019, at 2:37 AM, Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de> wrote:
>>>>>>
>>>>>> Correction: only the *last field* from the /text:map/ list contains a value.
>>>>>>
>>>>>> To reformulate:
>>>>>>
>>>>>> * if there are 3 fields in /text:map/, then during indexing the first
>>>>>>    two are empty (let's name them 'text1' and 'text2') and the last
>>>>>>    field contains data (let's name it 'text3')
>>>>>> * if on the next attempt the field 'text3' is commented out, then
>>>>>>    'text1' is empty and 'text2' contains data
>>>>>>
>>>>>>
>>>>>> On 22.02.2019 at 15:01, Sorin Gheorghiu wrote:
>>>>>>> In addition:
>>>>>>>
>>>>>>>   * if there are 3 fields in /text:map/, then during indexing one
>>>>>>>     contains data (let's name it 'text1'), the others are empty (let's
>>>>>>>     name them 'text2' and 'text3'),
>>>>>>>   * if on the next attempt the field 'text1' is commented out, then
>>>>>>>     'text2' contains data and 'text3' is empty
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -------- Forwarded Message --------
>>>>>>> Subject:  Text Index build with empty fields
>>>>>>> Date:     Fri, 22 Feb 2019 14:01:18 +0100
>>>>>>> From:     Sorin Gheorghiu <sorin.gheorghiu@uni-konstanz.de>
>>>>>>> Reply-To: users@jena.apache.org
>>>>>>> To:       users@jena.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> When building the text index with the /jena.textindexer/ tool in Jena 3.10
>>>>>>> for an external full-text search engine (Elasticsearch, of course) and
>>>>>>> having multiple fields with different names in /text:map/, just *one field
>>>>>>> is indexed* (more precisely, one field contains data and the others are
>>>>>>> empty). It doesn't look like an issue with Elasticsearch: in the logs
>>>>>>> generated during indexing, all fields but one are already missing their
>>>>>>> values. The same setup worked in Jena 3.9. Changing the Java version from
>>>>>>> 8 to 9 or 11 didn't change anything.
>>>>>>>
>>>>>>> Could it be that changes in the new release have affected this tool and
>>>>>>> we are dealing with a bug?
>>>>>>>
>>>> -- 
>>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>>> Universität Konstanz        Raum: B705
>>>> 78464 Konstanz
>>>> sorin.gheorghiu@uni-konstanz.de
>>>>
>>>> - KIM: Abteilung Contentdienste -

-- 
Sorin Gheorghiu             Tel: +49 7531 88-3198
Universität Konstanz        Raum: B705
78464 Konstanz
sorin.gheorghiu@uni-konstanz.de

- KIM: Abteilung Contentdienste -

