lucene-solr-user mailing list archives

From meghana <meghana.rav...@amultek.com>
Subject PlainTextEntityProcessor and RegexTransformer in DataImport Handler
Date Fri, 23 Dec 2011 09:43:55 GMT
Hi all,

I need to import data from a text file that contains HTML and apply some
formatting to it. I want the text inside each <p> tag, preceded by the value
of that tag's myvar attribute, as shown below.

Original Text
------------------------------------------------------------------------------------------
<div><p  myvar="12" myvar1="xyz">Hello World!!</p><p  myvar="14"
myvar1="abc">Welcome to Solr.</p><p  myvar="15" myvar1="def">Enjoy</p></div>


Needed Text After Formatting
------------------------------------------------------------------------------------------
12 : Hello World!!
14 : Welcome to Solr.
15 : Enjoy
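
To make the intended transformation concrete, here is a minimal sketch of it
outside of Solr. It is written in Python, so it uses Python's re module rather
than the Java regex engine Solr runs on, and the names P_TAG and
format_paragraphs are hypothetical, chosen only for this illustration. It
assumes every <p> tag carries a double-quoted myvar attribute and contains no
nested markup.

```python
import re

# Hypothetical pattern for this illustration:
#   group 1 = value of the myvar attribute, group 2 = text inside the <p> element.
# \bmyvar=" does not match myvar1=" because of the digit before the equals sign.
P_TAG = re.compile(r'<p\b[^>]*?\bmyvar="([^"]*)"[^>]*>([^<]*)</p>')


def format_paragraphs(html: str) -> str:
    """Return a single string with one 'myvar : text' line per <p> element."""
    return "\n".join(
        f"{m.group(1)} : {m.group(2)}" for m in P_TAG.finditer(html)
    )


# The original text from this message, including its line break.
sample = (
    '<div><p  myvar="12" myvar1="xyz">Hello World!!</p><p  myvar="14"\n'
    'myvar1="abc">Welcome to Solr.</p><p  myvar="15" myvar1="def">Enjoy</p></div>'
)

print(format_paragraphs(sample))
```

The result is one string with embedded newlines, which is the single-valued
shape I am after, as opposed to one value per <p> tag.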

I tried combining PlainTextEntityProcessor with RegexTransformer and
TemplateTransformer, as shown below, but I get a ConfigurationError when I
use this configuration.

<entity name="xx" onError="continue" processor="PlainTextEntityProcessor"
        transformer="TemplateTransformer,RegexTransformer" url="${URL.MyTxtFile}"
        dataSource="MDataSource">
    <field column="plainText" name="FullText" />
    <field column="FullText"
           template="${xx.FullText}"
           regex='<p (?:\s+[^>]+)? myvar="([^<"]*)" (?:\s+[^>]+)?>([^<]*)</p>'
           replaceWith="$2 : $4"/>
</entity>

I should add that I can do this using TemplateTransformer and a multivalued
field by setting foreach on the entity, but I need the above format in a
single-valued field, and I have not been able to achieve that.

Can anybody help me get the desired result, or tell me what I am doing wrong
in the transformer above?
Thanks
Meghana
