lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: DIH transformer problems
Date Tue, 04 Nov 2014 15:07:22 GMT
What are you actually trying to do on a business level? Maybe that's
something that can be handled better by sticking an
UpdateRequestProcessor chain _after_ DIH?

As to your configuration, you define the xxCONTENT column more than
once. That may happen to work, but I think the outcome is
non-deterministic. For ilang, you don't seem to have an xpath
attribute, so I suspect the field is just being skipped altogether.
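
To rule out the patterns themselves, here is a minimal sketch in plain
Java that runs the two regexes from the quoted config outside of Solr.
It assumes the transformer ends up applying ordinary java.util.regex
replaceAll semantics, which is an assumption on my side. The class
name, the method names and the sample file name wiki_en.xml are made
up for illustration; only the patterns, the replacements and the
directory come from the config.

```java
import java.util.regex.Pattern;

public class DihRegexCheck {
    // Pattern copied from the ilang field in the quoted config.
    static final Pattern LANG = Pattern.compile(".*?(..)\\.xml");

    // Pattern copied from the xxCONTENT heading rule in the quoted config.
    static final Pattern H4 = Pattern.compile("(?m)^=====(.+?)=====$");

    // Mirrors replaceWith="$1": keep only the two characters before ".xml".
    static String extractLang(String path) {
        return LANG.matcher(path).replaceAll("$1");
    }

    // Mirrors the replaceWith on xxCONTENT after XML unescaping: <h4>$1</h4>.
    static String markHeadings(String text) {
        return H4.matcher(text).replaceAll("<h4>$1</h4>");
    }

    public static void main(String[] args) {
        // Hypothetical file name under the baseDir from the quoted config.
        String path = "d:\\inetpub\\webapps\\searchserver\\solr\\importdaten"
                + "\\import_wiki\\wiki_en.xml";
        System.out.println(extractLang(path));
        System.out.println(markHeadings("intro\n=====Details=====\nbody"));
    }
}
```

For that sample path the first call yields "en", so the pattern behaves
as intended when it is actually handed the path. That points at how and
when the field gets its value inside DIH rather than at the regex.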

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 4 November 2014 09:05, Lemke, Michael  ST/HZA-ZSW
<lemkemch@schaeffler.com> wrote:
> I am having a little fight with the DataImportHandler and the
> application of RegexTransformer and TemplateTransformer.
> A stripped-down version of what I am trying in data-config.xml, which
> is taken pretty much from the various Solr wikis:
>
> <dataConfig>
>     <dataSource type="FileDataSource" encoding="UTF-8" />
>     <document>
>          <entity name="wf" rootEntity="false" dataSource="null"
>              processor="FileListEntityProcessor"
>              baseDir="d:\inetpub\webapps\searchserver\solr\importdaten\import_wiki"
>              fileName="wiki_..\.xml">
>             <entity name="doc"
>                  processor="XPathEntityProcessor"
>                  forEach="/mediawiki/page"
>                  stream="true"
>                  url="${wf.fileAbsolutePath}"
>                  transformer="RegexTransformer,HTMLStripTransformer,TemplateTransformer"
>                  >
>               <field column="ilang" template="${wf.fileAbsolutePath}" regex=".*?(..)\.xml"
>                       replaceWith="$1"/>
>               <field column="HEADER" xpath="/mediawiki/page/title" required="true"
>                       stripHTML="true"/>
>
>               <field column="xxCONTENT" xpath="/mediawiki/page/revision/text"/>
>               <field column="xxCONTENT" regex="(?m)^=====(.+?)=====$"
>                       replaceWith="&lt;h4&gt;$1&lt;/h4&gt;"/>
>
>               <!-- more regex transforms here -->
>               <field column="xxCONTENT" stripHTML="true"/>
>
>               <field column="NGLANG"             template="${doc.ilang}" />
>               <field column="CONTENTPREVIEW" template="${doc.xxCONTENT}"/>
>             </entity>
>          </entity>
>     </document>
> </dataConfig>
>
> The problem is with ilang.  The regex is not applied, no matter what I try.  Even
> a straightforward  <... regex=".*" replaceWith="en" ...> doesn't work.  I always
> end up with the full pathname.
>
> The regexes on xxCONTENT work fine, however.  So it's not that my regex is wrong or
> that regexes don't work at all.
>
> I tried all sorts of things like intermediate columns, sourceColumn or different
> sequences in the transformer attribute.  It all led to different errors.  Nothing
> worked or led to any clues.
>
> What am I doing wrong here?  This is with Solr 1.4.1.
>
>
> Thanks,
> Michael
>
