lucene-solr-user mailing list archives

From Jon Baer <jonb...@gmail.com>
Subject Re: DIH Http input bug - problem with two-level RSS walker
Date Sun, 02 Nov 2008 02:38:40 GMT
Another idea is to create the logic you need, dump the results into a temp
MySQL table, and then have DIH fetch the feeds from there. That has worked
pretty nicely for me, and it removes the need for the outer feed to do the
work. At first I could not figure out whether this was a bug or a feature ...
Something like ...

	<entity dataSource="db" name="db" query="SELECT id FROM table"  
processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
			<entity dataSource="feeds" url="http://{$db.id}.somedomain.com/ 
feed.xml" name="feeds" pk="link"  
processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"  
forEach="/rss/channel/item"  
transformer="org.apache.solr.handler.dataimport.TemplateTransformer,  
org.apache.solr.handler.dataimport.DateFormatTransformer">
				<field column="title" xpath="/rss/channel/item/title"/>
				<field column="link" xpath="/rss/channel/item/link"/>
				<field column="docid" template="DOC-${feeds.link}"/>
				<field column="doctype" template="video"/>
				<field column="description" xpath="/rss/channel/item/description"/>
				<field column="thumbnail" xpath="/rss/channel/item/enclosure/@url"/>
				<field column="pubdate" xpath="/rss/channel/item/pubDate"  
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
			</entity>
		</entity>
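
For completeness, the two dataSource definitions that config refers to would
look something like this (a sketch only; the JDBC driver, database URL, and
credentials below are placeholders, not from my actual setup):

	<dataSource name="db" type="JdbcDataSource"
	            driver="com.mysql.jdbc.Driver"
	            url="jdbc:mysql://localhost/somedb"
	            user="user" password="password"/>
	<dataSource name="feeds" type="HttpDataSource"/>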

- Jon

On Nov 1, 2008, at 3:26 PM, Norskog, Lance wrote:

> The inner entity drills down and gets more detail about each item in
> the outer loop. It creates one document.
>
> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com]
> Sent: Friday, October 31, 2008 10:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: DIH Http input bug - problem with two-level RSS walker
>
> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <goksron@gmail.com>
> wrote:
>
>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an
>> rss feed which contains N links to other rss feeds. The nested loop
>> then reads each one of those to create documents. (Yes, this is an
>> obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
>> Both feeds use the same
>> structure: /rss/channel with a <title> node and then N <item> nodes
>> inside the channel. This should create two separate XML streams with
>> two separate Xpath iterators, right?
>>
>> <entity name="outer" http stuff>
>>   <field column="name" xpath="/rss/channel/title" />
>>   <field column="url" xpath="/rss/channel/item/link"/>
>>
>>   <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>       <field column="title" xpath="/rss/channel/item/title" />
>>   </entity>
>> </entity>
>>
>> This does indeed walk each url from the outer feed and then fetch the
>> inner rss feed. Bravo!
>>
>> However, I found two separate problems in xpath iteration. They may be
>> related. The first problem is that it only stores the first document
>> from each "inner" feed. Each feed has several documents with different
>> title fields but it only grabs the first.
>>
>
> The idea behind nested entities is to join them together so that one
> Solr document is created for each root entity and the child entities
> provide more fields which are added to the parent document.
>
> I guess you want to create separate Solr documents from the root entity
> as well as the child entities. I don't think that is possible with
> nested entities. Essentially, you are trying to crawl feeds, not join
> them.
>
> An integration with Apache Droids might be worth considering:
> http://incubator.apache.org/projects/droids.html
> http://people.apache.org/~thorsten/droids/
>
> If you are going to crawl only one level, there may be a workaround.
> However, it may be easier to implement all this with your own Java
> program and just post results to Solr as usual.
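
[If you do go the roll-your-own route Shalin mentions, here is a minimal
SolrJ sketch of the idea, assuming Solr 1.3's solrj jars are on the
classpath. The Solr URL, outer feed URL, and field name below are
placeholders, not from Lance's setup:]

	import java.net.URL;
	import javax.xml.parsers.DocumentBuilderFactory;
	import javax.xml.xpath.XPath;
	import javax.xml.xpath.XPathConstants;
	import javax.xml.xpath.XPathFactory;
	import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
	import org.apache.solr.common.SolrInputDocument;
	import org.w3c.dom.Document;
	import org.w3c.dom.NodeList;

	public class FeedCrawler {
	    public static void main(String[] args) throws Exception {
	        CommonsHttpSolrServer solr =
	            new CommonsHttpSolrServer("http://localhost:8983/solr");
	        XPath xpath = XPathFactory.newInstance().newXPath();

	        // Outer feed: collect the <link> of each <item>.
	        NodeList links = (NodeList) xpath.evaluate(
	            "/rss/channel/item/link/text()",
	            parse("http://somedomain.com/outer.rss"),  // placeholder URL
	            XPathConstants.NODESET);

	        // Inner feeds: one Solr document per <item>, which is the
	        // part nested DIH entities won't do.
	        for (int i = 0; i < links.getLength(); i++) {
	            Document inner = parse(links.item(i).getNodeValue());
	            NodeList titles = (NodeList) xpath.evaluate(
	                "/rss/channel/item/title/text()", inner,
	                XPathConstants.NODESET);
	            for (int j = 0; j < titles.getLength(); j++) {
	                SolrInputDocument doc = new SolrInputDocument();
	                doc.addField("title", titles.item(j).getNodeValue());
	                solr.add(doc);
	            }
	        }
	        solr.commit();
	    }

	    private static Document parse(String url) throws Exception {
	        return DocumentBuilderFactory.newInstance()
	            .newDocumentBuilder().parse(new URL(url).openStream());
	    }
	}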
>
>
>
>> The other is an off-by-one bug. The outer loop iterates through the 10
>> items and then tries to pull an 11th. It then gives this exception
>> trace:
>>
>> INFO: Created URL to:  [inner url]
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>> SEVERE: Exception thrown while getting data
>> java.net.MalformedURLException: no protocol: null/account.rss
>>        at java.net.URL.<init>(URL.java:567)
>>        at java.net.URL.<init>(URL.java:464)
>>        at java.net.URL.<init>(URL.java:413)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>> ...
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> SEVERE: Exception while processing: album document :
>> SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>
>
> --
> Regards,
> Shalin Shekhar Mangar.

