lucene-solr-user mailing list archives

From Jörn Franke
Subject Re: Help with a DIH config file
Date Thu, 14 Mar 2019 21:00:11 GMT
Sorry for my late reply, and thanks for sharing.

Yes, this is possible.

Maybe my last mail was confusing. I hope the examples below help.

Alternative 1 - Use only DIH, without an update processor
In tika-data-config-2.xml, add the transformer to the entity and the
transformation to the field (done here for id and for fulltext).
Additionally, set the TikaEntityProcessor format to "text":
    <dataSource type="BinFileDataSource" />
    <entity name="files" dataSource="null" rootEntity="false"
            recursive="true" transformer="RegexTransformer">
        <field column="fileAbsolutePath" name="id" regex="[^\w|\.]" replaceWith="/"/>
        <field column="fileSize" name="size" />
        <field column="fileLastModified" name="lastModified" />
        <entity processor="TikaEntityProcessor" ...
                format="text" transformer="RegexTransformer">
            <field column="file" name="fileName"/>
            <field column="description" name="description" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="mime_type" name="type" meta="true"/>
            <field column="text" name="fulltext" regex="\n|\r" replaceWith=" "/>
            <field column="keywords" name="keywords" meta="true"/>
            <field column="count" name="page_count" meta="true"/>
            <field column="dc:terms" name="keywords_alt" meta="true"/>
            <field column="Content-Type" name="content_type" meta="true"/>
            <field column="xmpTPg:NPages" name="page_count_alt" meta="true"/>
        </entity>
    </entity>

Alternative 2 - Regex processor in solrconfig.xml - you need to put
everything into ONE chain

<updateRequestProcessorChain name="my-chain">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">_text_</str>
    <str name="fieldName">fulltext</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">_text_</str>
    <str name="fieldName">fulltext</str>
    <str name="pattern">\n|\r</str>
    <str name="replacement"/>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">id</str>
    <str name="fieldName">url</str>
    <str name="pattern">[^\w|\.]</str>
    <str name="replacement">/</str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config-2.xml</str>
    <str name="update.chain">my-chain</str>
  </lst>
</requestHandler>
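
To check what the two patterns do before running an import, they can be tried on sample values outside Solr. The following is only a rough Python sketch, not part of the Solr configuration: the helper names clean_id and clean_text and the sample inputs are made up for illustration, and Python's regex engine is assumed to behave like Java's for these simple patterns (re.ASCII is set so that \w stays ASCII-only, as it is by default in Java).

```python
import re

# Pattern used for id/url: any character that is not a word character, "|" or "."
# re.ASCII keeps \w ASCII-only, which is Java's default behaviour.
ID_PATTERN = re.compile(r"[^\w|\.]", re.ASCII)

# Pattern used for fulltext/_text_: a single line feed or carriage return
NEWLINE_PATTERN = re.compile(r"\n|\r")


def clean_id(value, replacement="/"):
    """Replace every character matched by ID_PATTERN with a literal replacement."""
    # A function is passed to sub() so the replacement is taken literally,
    # similar in spirit to literalReplacement=true.
    return ID_PATTERN.sub(lambda match: replacement, value)


def clean_text(value, replacement=" "):
    """Replace every line break character with a literal replacement."""
    return NEWLINE_PATTERN.sub(lambda match: replacement, value)


if __name__ == "__main__":
    # Hypothetical sample values, only for illustration
    print(clean_id("C:\\docs\\my file.pdf"))
    print(repr(clean_text("line one\r\nline two\n")))
    print(repr(clean_text("line one\r\nline two\n", replacement="")))
```

Note that "\r\n" consists of two characters, so each Windows-style line break is replaced twice. With a space as replacement (as in Alternative 1) this leaves two spaces; with an empty replacement (as in Alternative 2) the neighbouring words are joined without any separator.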

On Thu, Mar 14, 2019 at 6:41 AM wclarke wrote:

> Got each one working individually, but not multiples.  Is it possible?
> Please see attached files.
> Thanks!!!
> tika-data-config-2.xml
> solrconfig.xml
