lucene-solr-user mailing list archives

From Andreas Owen ...@conx.ch>
Subject Re: dataimporter tika fields empty
Date Fri, 23 Aug 2013 12:33:22 GMT
ok but i'm not doing any path extraction, at least i don't think so.

htmlMapper="identity" isn't preserving the html.

it's reading the content of the pages, but the content only ends up in "text_test" and not in "text",
so the copyField isn't working.

data-config.xml:

<dataConfig>
	<dataSource type="BinFileDataSource" name="data"/>
	<dataSource type="BinURLDataSource" name="dataUrl"/>
	<dataSource type="URLDataSource" name="main"/>
<document>
	<entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml"
			forEach="/docs/doc" dataSource="main">
		<field column="title" xpath="//title" />
		<field column="id" xpath="//id" />
		<field column="file" xpath="//file" />
		<field column="path" xpath="//path" />
		<field column="url" xpath="//url" />
		<field column="Author" xpath="//author" />
		
		<entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl"
				onError="skip" htmlMapper="identity">
			<field column="text" name="text_test" />
			<copyField source="text_test" dest="text" />
			<!-- <field column="text_test" xpath="//div[@id='content']" /> 	-->
		</entity>
	</entity>
</document>
</dataConfig>
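
One possible reason the copy never happens, offered as a sketch rather than a confirmed diagnosis: copyField is a schema.xml directive, and as far as I know the DataImportHandler does not act on a <copyField> element placed inside an entity in data-config.xml. Assuming both fields are declared in schema.xml (the text_general type and the indexed/stored/multiValued flags below are assumptions for illustration, not taken from this thread), the copy would be declared there instead:

```xml
<!-- schema.xml (sketch): field types and flags are assumed for illustration -->
<field name="text_test" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- copyField is declared at the schema level, not inside data-config.xml -->
<copyField source="text_test" dest="text"/>
```

With that in the schema, the tika entity would only need the <field column="text" name="text_test" /> mapping.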


On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because Tika processor does not support path extraction. You
> need to nest one more level.
> 
> Regards,
>      Alex
> On 22 Aug 2013 13:34, "Andreas Owen" <ao@conx.ch> wrote:
> 
>> i can do it like this but then the content isn't copied to text. it's just
>> in text_test
>> 
>> <entity name="tika" processor="TikaEntityProcessor"
>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>        <field column="text" name="text_test" />
>>        <copyField source="text_test" dest="text" />
>> </entity>
>> 
>> 
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>> 
>>> i put it in the tika-entity as attribute, but it doesn't change anything.
>>> my bigger concern is why text_test isn't populated at all
>>> 
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>> 
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>> 
>>>> Specifically, setting htmlMapper="identity" on the entity definition. This
>>>> will tell Tika to send full HTML rather than a seriously stripped one.
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at
>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>> 
>>>> 
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <ao@conx.ch> wrote:
>>>> 
>>>>> i'm trying to index a html page and only use the div with the
>>>>> id="content". unfortunately nothing is working within the tika-entity, only
>>>>> the standard text (content) is populated.
>>>>> 
>>>>>      do i have to use copyField for text_test to get the data?
>>>>>      or is there a problem with the entity hierarchy?
>>>>>      or is the xpath wrong, even though i've tried it without and just
>>>>> using text?
>>>>>      or should i use the updateextractor?
>>>>> data-config.xml:
>>>>> 
>>>>> <dataConfig>
>>>>>      <dataSource type="BinFileDataSource" name="data"/>
>>>>>      <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>>>      <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>>> <document>
>>>>>      <entity name="rec" processor="XPathEntityProcessor"
>>>>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>>              <field column="title" xpath="//title" />
>>>>>              <field column="id" xpath="//id" />
>>>>>              <field column="file" xpath="//file" />
>>>>>              <field column="path" xpath="//path" />
>>>>>              <field column="url" xpath="//url" />
>>>>>              <field column="Author" xpath="//author" />
>>>>> 
>>>>>              <entity name="tika" processor="TikaEntityProcessor"
>>>>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>>                      <!-- <copyField source="text" dest="text_test" /> -->
>>>>>                      <field column="text_test" xpath="//div[@id='content']" />
>>>>>              </entity>
>>>>>      </entity>
>>>>> </document>
>>>>> </dataConfig>
>>>>> 
>>>>> docImportUrl.xml:
>>>>> 
>>>>> <?xml version="1.0" encoding="utf-8"?>
>>>>> <docs>
>>>>>      <doc>
>>>>>              <id>5</id>
>>>>>              <author>tkb</author>
>>>>>              <title>Startseite</title>
>>>>>              <description>blabla ...</description>
>>>>>              <file>http://localhost/tkb/internet/index.cfm</file>
>>>>>              <url>http://localhost/tkb/internet/index.cfm/url</url>
>>>>>              <path2>http\specialConf</path2>
>>>>>      </doc>
>>>>>      <doc>
>>>>>              <id>6</id>
>>>>>              <author>tkb</author>
>>>>>              <title>Eigenheim</title>
>>>>>              <description>Machen Sie sich erste Gedanken über den
>>>>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
>>>>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>>>>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
>>>>> Hinsicht gelingt.</description>
>>>>>              <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
>>>>>              <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
>>>>>      </doc>
>>>>> </docs>
>> 
>> 

