lucene-solr-user mailing list archives

From Andreas Owen ...@conx.ch>
Subject Re: dataimporter tika doesn't extract certain div
Date Wed, 04 Sep 2013 07:08:01 GMT
So could I just nest it inside an XPathEntityProcessor to filter the HTML, or is there
something like XPath for Tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}"
        forEach="/div[@id='content']" dataSource="main">
    <entity name="tika" processor="TikaEntityProcessor" url="${htm}"
            dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
        <field column="text" />
    </entity>
</entity>

But now I don't know how to pass the text to Tika. What do I put in url and dataSource?
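
One arrangement that might work, as an untested sketch rather than a confirmed fix, is to reverse the nesting: let Tika fetch the page first, then have an inner XPathEntityProcessor read Tika's XHTML output from the parent row through a FieldReaderDataSource, using dataField instead of url. The data source name "fld", the entity name "content", and the use of flatten="true" are assumptions for illustration. It also assumes that Tika's output is well-formed XML and that the limited XPath subset in the data importer accepts this expression.

```xml
<dataConfig>
  <!-- binary stream for Tika and URL source for the document list, as in the original config -->
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <!-- assumption: reads a column of the parent row instead of fetching a URL -->
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="http://127.0.0.1/tkb/internet/docImportUrl.xml"
            forEach="/docs/doc" dataSource="main">
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />

      <!-- outer step: Tika fetches the page and emits it as XHTML in its "text" column -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl"
              onError="skip" htmlMapper="identity" format="html">

        <!-- inner step (hypothetical names): parse the XHTML held in tika.text -->
        <entity name="content" processor="XPathEntityProcessor"
                dataSource="fld" dataField="tika.text"
                forEach="/html" onError="skip">
          <!-- flatten="true" is meant to collect the text of nested child elements too -->
          <field column="text" xpath="//div[@id='content']" flatten="true" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this layout the url attribute stays on the Tika entity only, and the inner entity gets its input from dataField. The outer "text" column produced by Tika may still be mapped to the schema field of the same name, so it might need to be handled separately if only the div content should be indexed.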


On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:

> I don't know much about Tika but in the example data-config.xml that
> you posted, the "xpath" attribute on the field "text" won't work
> because the xpath attribute is used only by a XPathEntityProcessor.
> 
> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <ao@conx.ch> wrote:
>> I want tika to only index the content in <div id="content">...</div>
>> for the field "text". Unfortunately it's indexing the whole page. Can't xpath do this?
>> 
>> data-config.xml:
>> 
>> <dataConfig>
>>        <dataSource type="BinFileDataSource" name="data"/>
>>        <dataSource type="BinURLDataSource" name="dataUrl"/>
>>        <dataSource type="URLDataSource" name="main"/>
>> <document>
>>        <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml"
>>                forEach="/docs/doc" dataSource="main"> <!--transformer="script:GenerateId"-->
>>                <field column="title" xpath="//title" />
>>                <field column="id" xpath="//id" />
>>                <field column="file" xpath="//file" />
>>                <field column="path" xpath="//path" />
>>                <field column="url" xpath="//url" />
>>                <field column="Author" xpath="//author" />
>> 
>>                <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}"
>>                        dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
>>                        <field column="text" xpath="//div[@id='content']" />
>> 
>>                </entity>
>>        </entity>
>> </document>
>> </dataConfig>
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.

