lucene-general mailing list archives

From "Pachzelt, Adrian"
Subject Manipulate stored string in Lucene
Date Wed, 09 May 2018 05:57:40 GMT
Dear all,

Currently I am reading text fields that contain XML text, so the Solr input may look like this:

<field name="tagged_text">&lt;sec sec-type="Introduction" id="SECID0E4F"&gt;

All "<" and ">" characters are escaped.
I wrote a tokenizer that indexes the tag attributes (e.g. sec-type="Introduction") at
the position of the tagged word ("Introduction" in this case), so I need the HTML
tags at index time. However, I want to strip the HTML from the stored string that is shown to
the user in query results. So far, I have figured out that the index and the stored string are separate.
Thus, I thought it should be possible to manipulate the stored string after indexing.

Is there a way to do so? I would prefer to manipulate the stored string and not introduce
a second field with the plain text in the input file.
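
To make the desired transformation concrete, here is a minimal sketch in plain Python. It does not use any Solr or Lucene API; the function name `strip_stored_markup` and the sample text after the opening tag are hypothetical, and the simple regular expression assumes that attribute values never contain a ">" character.

```python
import html
import re

# Matches one XML/HTML tag such as <sec ...> or </sec>.
# Assumption: attribute values do not contain ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_stored_markup(escaped: str) -> str:
    """Return the plain text of an entity-escaped XML string."""
    unescaped = html.unescape(escaped)  # &lt;sec ...&gt; becomes <sec ...>
    text = TAG_RE.sub(" ", unescaped)   # replace every tag with a space
    return " ".join(text.split())       # collapse runs of whitespace


# Hypothetical field value: the opening tag from the example above,
# followed by made-up content and a closing tag.
example = (
    '&lt;sec sec-type="Introduction" id="SECID0E4F"&gt;'
    "Some text&lt;/sec&gt;"
)
print(strip_stored_markup(example))  # prints "Some text"
```

The indexed tokens would still come from the tagged input; only the value returned to the user would go through a step like this.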

I would be grateful for any help!

Best Regards,


Adrian Pachzelt
- Fachinformationsdienst Biodiversitaetsforschung -
- Hosting von Open Access-Zeitschriften -
Universitaetsbibliothek Johann Christian Senckenberg
Bockenheimer Landstr. 134-138
60325 Frankfurt am Main
Tel. 069/798-39382
