lucene-general mailing list archives

From "Pachzelt, Adrian" <A.Pachz...@ub.uni-frankfurt.de>
Subject Manipulate stored string in Lucene
Date Wed, 09 May 2018 05:57:40 GMT
Dear all,

I am currently reading text fields that contain XML text, so the Solr input may look like
this:

<field name="tagged_text">&lt;sec sec-type="Introduction" id="SECID0E4F"&gt;
&lt;title&gt;Introduction&lt;/title&gt;
&lt;/sec&gt;
</field>

All "<" and ">" characters inside the field are escaped.
I wrote a tokenizer that indexes the tag attributes (e.g. sec-type="Introduction") at
the position of the tagged word ("Introduction" in this case), so I need the tags to be
present when indexing. However, I want to strip the tags from the stored string that is
shown to the user in query results. So far, I have figured out that the index and the stored
string are separate. Thus, I thought it should be possible to manipulate the stored string
independently of what is indexed, for example after indexing.

Is there a way to do so? I would prefer to manipulate the stored string and not introduce
a second field with the plain text in the input file.
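
To make the goal concrete, the transformation wanted for the stored value could be sketched in plain Java roughly as follows. The names StoredTextCleaner and stripEscapedTags are hypothetical and not part of Lucene or Solr, and the sketch deliberately leaves open where such a step would hook into the indexing chain. It only assumes the escaping shown in the example above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Lucene or Solr class.
final class StoredTextCleaner {
    // Matches a single tag such as <sec sec-type="Introduction"> or </title>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private StoredTextCleaner() {}

    // Turns the escaped field value into the plain text shown to the user.
    static String stripEscapedTags(String escaped) {
        // Undo the escaping first. "&amp;" is decoded last so that a
        // double-escaped "&amp;lt;" becomes the literal text "&lt;", not "<".
        String xml = escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Replace each tag with a space so neighbouring words do not merge,
        // then collapse the leftover whitespace.
        String text = TAG.matcher(xml).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Applied to the field value from the example, this would leave only the word Introduction as the stored text, while the tokenizer would still see the full tagged input.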

I would be grateful for any help!

Best Regards,

Adrian

-------------------------------------------------------
Adrian Pachzelt
- Fachinformationsdienst Biodiversitaetsforschung -
- Hosting von Open Access-Zeitschriften -
Universitaetsbibliothek Johann Christian Senckenberg
Bockenheimer Landstr. 134-138
60325 Frankfurt am Main
Tel. 069/798-39382
a.pachzelt@ub.uni-frankfurt.de
-------------------------------------------------------
