lucene-solr-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Help with a DIH config file
Date Tue, 12 Mar 2019 21:37:14 GMT
One addition: you can also strip HTML in DIH using the
HTMLStripTransformer:
https://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer

That way you can probably do without an UpdateRequestProcessorChain for the
HTML stripping.
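
A minimal sketch of that transformer in data-config.xml. The DIH config was
not shared, so the entity name tika, the outer file-listing entity f, the
data source bin, and the Solr field content are all placeholders to adapt:

```xml
<!-- Assumed names: entity "tika", outer entity "f", data source "bin", field "content" -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${f.fileAbsolutePath}" dataSource="bin"
        transformer="HTMLStripTransformer">
  <!-- stripHTML="true" asks the transformer to remove markup from this column -->
  <field column="text" name="content" stripHTML="true" />
</entity>
```

The transformer only touches fields that carry stripHTML="true", so other
columns of the entity pass through unchanged.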

On Tue, Mar 12, 2019 at 10:24 PM Jörn Franke <jornfranke@gmail.com> wrote:

> Would it be possible to share the DIH config file?
>
> I am not sure if I get all your points correctly.
>
> Ad 1) Is this about a value in a field? Then use the RegexTransformer:
> https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
> Alternatively, use a RegexReplaceProcessorFactory in solrconfig.xml or a
> ScriptTransformer in DIH. E.g., a RegexReplaceProcessorFactory (
> https://lucene.apache.org/solr/7_3_0//solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
> in a custom processing chain in solrconfig.xml:
> <updateRequestProcessorChain name="regex_replace">
>   <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">\n|\t|\r</str>
>     <str name="replacement"></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> and attach it to your DIH in solrconfig.xml:
> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>     <str name="update.chain">regex_replace</str>
>   </lst>
> </requestHandler>
>
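> For the RegexTransformer route mentioned under 1), a rough sketch of a
> data-config.xml that turns the D:\foo\resource\ prefix into http://. The
> real config was not shared, so the entity names (f, tika), the data source
> name (bin), and the columns rawPath, url and content are assumptions:
>
> ```xml
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin" />
>   <document>
>     <!-- Outer entity lists the files; it does not create documents itself -->
>     <entity name="f" processor="FileListEntityProcessor"
>             baseDir="D:/foo/resource" fileName=".*" recursive="true"
>             rootEntity="false" dataSource="null">
>       <!-- Transformers run in the order listed -->
>       <entity name="tika" processor="TikaEntityProcessor"
>               url="${f.fileAbsolutePath}" dataSource="bin"
>               transformer="TemplateTransformer,RegexTransformer">
>         <!-- Copy the absolute path into a column of this entity -->
>         <field column="rawPath" template="${f.fileAbsolutePath}" />
>         <!-- Replace the leading D:\foo\resource\ with http:// -->
>         <field column="url" sourceColName="rawPath"
>                regex="^D:\\foo\\resource\\" replaceWith="http://" />
>         <field column="text" name="content" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> ```
>
> The backslashes further down the path are left alone by this pattern, so
> they would need a second replace if the result should be a usable URL.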
>
>
>
> Ad 2) Was this HTML part of the original document, or is it "HTML"
> generated by Tika? In the first case you can use an
> HTMLStripFieldUpdateProcessorFactory, which is configured in
> solrconfig.xml:
> https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
> You need to create an update processor chain:
> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
>
>
> <updateRequestProcessorChain name="remove_html">
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="fieldName">myfield</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> and attach it to your DIH in solrconfig.xml:
> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>     <str name="update.chain">remove_html</str>
>   </lst>
> </requestHandler>
>
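> In the second case (Tika attaches XML elements) specify format="text" on
> the TikaEntityProcessor entity in DIH (extractFormat=text is the
> corresponding request parameter when posting to Solr Cell):
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html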
>
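> A rough sketch of such an entity. As far as I know the attribute on
> TikaEntityProcessor is called format (extractFormat being the Solr Cell
> request parameter); the entity name tika, the outer entity f, the data
> source bin and the field content are placeholders:
>
> ```xml
> <!-- format="text" asks Tika for plain text instead of XML/HTML markup -->
> <entity name="tika" processor="TikaEntityProcessor"
>         url="${f.fileAbsolutePath}" dataSource="bin" format="text">
>   <field column="text" name="content" />
> </entity>
> ```
>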
> Ad 3) See 1).
>
> Note: You can attach only one chain per DIH request handler, so you need
> to put all the processors that you want to apply into one chain. The
> transformers are independent of the processors and are configured in the
> DIH config file.
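>
> A sketch of such a combined chain, reusing the two processors from above.
> The chain name clean_content and the field name content are assumptions:
>
> ```xml
> <updateRequestProcessorChain name="clean_content">
>   <!-- First strip HTML markup from the field -->
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="fieldName">content</str>
>   </processor>
>   <!-- Then remove newlines, tabs and carriage returns -->
>   <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">\n|\t|\r</str>
>     <str name="replacement"></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> ```
>
> The /dataimport handler would then reference it with
> <str name="update.chain">clean_content</str> in its defaults.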
>
>
>
> On Tue, Mar 12, 2019 at 7:47 PM wclarke <wclarke@widernet.org> wrote:
>
>> I have a previous post that looks like this:
>>
>> I am pulling a large amount of data from a local source, D:\foo\resource\.
>> I am using Tika through a DIH to index the multiple file formats with text
>> and metadata.  I have almost all the information being pulled that I want;
>> however, I am having a couple of issues:
>>
>> 1. I need to run a regex replace of the D:\foo\resource\ to be http://,
>> which is part of what I want to use XPath for.  I have the regex written,
>> but not the replacement, and I am not sure where it needs to be located in
>> my data-config.xml file.
>>
>> 2. I want to strip HTML where necessary, also using XPath.
>>
>> 3. I need to remove \n, \t, \r, and any other extra crap I am getting in
>> the text field to just get to the text content of the document, whatever
>> MIME type that might be, so that it can be searchable.
>>
>> I am running it through the Solr admin data import as opposed to the
>> post.jar (I have tried both).  This is running on Windows and cannot be
>> run on Linux, as we have no one who can support it.  I am posting my
>> tika-data-config.xml (not tikaconfig); I named it this way so that it is
>> not confused with our db-config for our catalog pull.
>>
>> Thanks in advance for any help.  I will upload any additional files that
>> might be helpful upon request; I don't want to overload the post.
>>
>> We are a small non-profit without a great deal of money; however, if there
>> is someone who could finish writing it, we would be willing to pay a little
>> something for their time.  We really need this done ASAP!
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>
