lucene-solr-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Help with a DIH config file
Date Thu, 14 Mar 2019 21:00:11 GMT
Sorry for my late reply, and thanks for sharing.

Yes, this is possible.

Maybe my last mail was confusing. I hope the examples below help.

Alternative 1 - Use only DIH without an update processor
tika-data-config-2.xml - add the transformer to the entity and the
transformation to the field (done here for id and for fulltext) -
additionally set the TikaEntityProcessor format to "text":
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
            <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="d:\normalized\webcontent\bibleforchildren.org"
            fileName=".*\.(DOC|PDF|pdf|doc|docx|ppt|pptx|xls|xlsx|txt|htm|html)"
            onError="skip"
            recursive="true" transformer="RegexTransformer">
                <field column="fileAbsolutePath" name="id" regex="[^\w|\.]"
                       replaceWith=""/>
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastModified" />
                <entity
                    name="documentImport"
                    processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}"
                    format="text" transformer="RegexTransformer">
                    <field column="file" name="fileName"/>
                    <field column="description" name="description" meta="true"/>
                    <field column="title" name="title" meta="true"/>
                    <field column="mime_type" name="type" meta="true"/>
                    <field column="text" name="fulltext" regex="\n|\r" replaceWith=" "/>
                    <field column="keywords" name="keywords" meta="true"/>
                    <field column="count" name="page_count" meta="true"/>
                    <field column="dc:terms" name="keywords_alt" meta="true"/>
                    <field column="Content-Type" name="content_type" meta="true"/>
                    <field column="xmpTPg:NPages" name="page_count_alt" meta="true"/>
                </entity>
        </entity>
    </document>
</dataConfig>
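
To see what the two field transformations in Alternative 1 do to a value, here is a rough sketch in Python (not Solr code). It assumes the id regex is meant to be the character class [^\w|\.], as in the solrconfig.xml pattern further down, and that every match is replaced. Solr uses Java regular expressions, so re.ASCII is set here to keep \w limited to ASCII as in Java's default; the file name "story 1.pdf" is made up for illustration.

```python
import re

# Assumed character class: anything that is not a word character, "|" or "."
# is removed from the id (replaceWith="").
ID_PATTERN = re.compile(r"[^\w|\.]", re.ASCII)

# Line breaks in the extracted text become a single space (replaceWith=" ").
NEWLINE_PATTERN = re.compile(r"\n|\r")


def clean_id(path: str) -> str:
    """Mimic the replacement on fileAbsolutePath -> id."""
    return ID_PATTERN.sub("", path)


def flatten_text(text: str) -> str:
    """Mimic the replacement on text -> fulltext."""
    return NEWLINE_PATTERN.sub(" ", text)


if __name__ == "__main__":
    # Hypothetical file below the baseDir from the config.
    print(clean_id(r"d:\normalized\webcontent\story 1.pdf"))
    # A Windows line ending is two characters, so it becomes two spaces.
    print(repr(flatten_text("line one\r\nline two")))
```

Note that the colon, the backslashes and the space all disappear from the id, so two different paths can in principle end up with the same id.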

Alternative 2 - Regex processor in solrconfig.xml - you need to put
everything into ONE chain

<updateRequestProcessorChain name="my-chain">
<processor class="solr.HTMLStripFieldUpdateProcessorFactory">
<str name="fieldName">_text_</str>
<str name="fieldName">fulltext</str>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">_text_</str>
<str name="fieldName">fulltext</str>
<str name="pattern">\n|\r</str>
<str name="replacement"/>
<bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">id</str>
<str name="fieldName">url</str>
<str name="pattern">[^\w|\.]</str>
<str name="replacement">/</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

[..]
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">tika-data-config-2.xml</str>
<str name="update.chain">my-chain</str>
</lst>
</requestHandler>
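
As a rough illustration of why everything has to sit in one chain, the sketch below models the two regex processors as plain Python functions that run one after the other on the same document. This is not Solr code: regex_replace and run_chain are hypothetical helpers, the HTML strip, log, distributed and run processors are left out, and re.ASCII is only there to approximate Java's default \w. The replacement is applied literally, as with literalReplacement=true.

```python
import re


def regex_replace(fields, pattern, replacement):
    """Build a processor that replaces every match in the given fields."""
    compiled = re.compile(pattern, re.ASCII)

    def process(doc):
        for name in fields:
            value = doc.get(name)
            if isinstance(value, str):
                # A callable replacement is never parsed for group
                # references, so the replacement text is used literally.
                doc[name] = compiled.sub(lambda _match: replacement, value)
        return doc

    return process


# Same order, field names, patterns and replacements as in my-chain.
CHAIN = [
    regex_replace(["_text_", "fulltext"], r"\n|\r", ""),
    regex_replace(["id", "url"], r"[^\w|\.]", "/"),
]


def run_chain(doc):
    """Pass the document through each processor in turn."""
    for processor in CHAIN:
        doc = processor(doc)
    return doc


if __name__ == "__main__":
    # Hypothetical document; the field values are made up.
    print(run_chain({"id": r"d:\docs\a.pdf", "fulltext": "first line\nsecond line"}))
```

With the empty replacement the line break is simply dropped, so "first line" and "second line" are joined without a space. Alternative 1 uses a space instead; pick whichever you prefer.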

On Thu, Mar 14, 2019 at 6:41 AM wclarke <wclarke@widernet.org> wrote:

> Got each one working individually, but not multiples.  Is it possible?
> Please see attached files.
>
> Thanks!!! tika-data-config-2.xml
> <http://lucene.472066.n3.nabble.com/file/t494707/tika-data-config-2.xml>
> solrconfig.xml
> <http://lucene.472066.n3.nabble.com/file/t494707/solrconfig.xml>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
