lucene-solr-user mailing list archives

From "Noble Paul നോബിള്‍ नोब्ळ्" <noble.p...@gmail.com>
Subject Re: DIH Http input bug - problem with two-level RSS walker
Date Mon, 03 Nov 2008 04:11:49 GMT
Hi Jon,
Using a CachedSqlEntityProcessor as the root entity is of no use; there it
is only as good as a plain SqlEntityProcessor. Also, for classes
belonging to the package 'org.apache.solr.handler.dataimport', the
package name can be omitted (for better readability).
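
A rough sketch of what that could look like, based on the config quoted
below. The data source names, query, and URL are taken from Jon's mail; the
"${db.id}" variable syntax and the assumption that "db" is a JDBC data
source (so that the default SqlEntityProcessor applies when no processor is
given) are mine, and only a few of the fields are shown:

```xml
<!-- Sketch only. Root entity: no processor attribute, assuming "db" is a
     JDBC data source so the default SqlEntityProcessor is used. -->
<entity dataSource="db" name="db" query="SELECT id FROM table">
  <!-- Child entity: short class names, package prefix omitted. -->
  <entity dataSource="feeds" name="feeds" pk="link"
          url="http://${db.id}.somedomain.com/feed.xml"
          processor="XPathEntityProcessor"
          forEach="/rss/channel/item"
          transformer="TemplateTransformer,DateFormatTransformer">
    <field column="title" xpath="/rss/channel/item/title"/>
    <field column="link" xpath="/rss/channel/item/link"/>
    <field column="docid" template="DOC-${feeds.link}"/>
  </entity>
</entity>
```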


On Sun, Nov 2, 2008 at 8:08 AM, Jon Baer <jonbaer@gmail.com> wrote:
> Another idea is to create the logic you need, dump it to a temp MySQL
> table, and then fetch the feeds. That has worked pretty nicely for me; it
> removes the need for the outer feed to do the work. At first I could not
> figure out if this was a bug or a feature ... Something like ...
>
> <entity dataSource="db" name="db" query="SELECT id FROM table"
>         processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
>   <entity dataSource="feeds" name="feeds" pk="link"
>           url="http://${db.id}.somedomain.com/feed.xml"
>           processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"
>           forEach="/rss/channel/item"
>           transformer="org.apache.solr.handler.dataimport.TemplateTransformer,
>                        org.apache.solr.handler.dataimport.DateFormatTransformer">
>     <field column="title" xpath="/rss/channel/item/title"/>
>     <field column="link" xpath="/rss/channel/item/link"/>
>     <field column="docid" template="DOC-${feeds.link}"/>
>     <field column="doctype" template="video"/>
>     <field column="description" xpath="/rss/channel/item/description"/>
>     <field column="thumbnail" xpath="/rss/channel/item/enclosure/@url"/>
>     <field column="pubdate" xpath="/rss/channel/item/pubDate"
>            dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
>   </entity>
> </entity>
>
> - Jon
>
> On Nov 1, 2008, at 3:26 PM, Norskog, Lance wrote:
>
>> The inner entity drills down and gets more detail about each item in the
>> outer loop. It creates one document.
>>
>> -----Original Message-----
>> From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com]
>> Sent: Friday, October 31, 2008 10:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: DIH Http input bug - problem with two-level RSS walker
>>
>> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <goksron@gmail.com>
>> wrote:
>>
>>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an
>>> rss feed which contains N links to other rss feeds. The nested loop
>>> then reads each one of those to create documents. (Yes, this is an
>>> obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
>>> Both feeds use the same
>>> structure: /rss/channel with a <title> node and then N <item> nodes
>>> inside the channel. This should create two separate XML streams with
>>> two separate Xpath iterators, right?
>>>
>>> <entity name="outer" http stuff>
>>>  <field column="name" xpath="/rss/channel/title" />
>>>  <field column="url" xpath="/rss/channel/item/link"/>
>>>
>>>  <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>>      <field column="title" xpath="/rss/channel/item/title" />
>>>  </entity>
>>> </entity>
>>>
>>> This does indeed walk each url from the outer feed and then fetch the
>>> inner rss feed. Bravo!
>>>
>>> However, I found two separate problems in xpath iteration. They may be
>>> related. The first problem is that it only stores the first document
>>> from each "inner" feed. Each feed has several documents with different
>>> title fields but it only grabs the first.
>>>
>>
>> The idea behind nested entities is to join them together so that one
>> Solr document is created for each root entity and the child entities
>> provide more fields which are added to the parent document.
>>
>> I guess you want to create separate Solr documents from the root entity
>> as well as the child entities. I don't think that is possible with
>> nested entities. Essentially, you are trying to crawl feeds, not join
>> them.
>>
>> An integration with Apache Droids might be worth considering.
>> http://incubator.apache.org/projects/droids.html
>> http://people.apache.org/~thorsten/droids/
>>
>> If you are going to crawl only one level, there may be a workaround.
>> However, it may be easier to implement all this with your own Java
>> program and just post results to Solr as usual.
>>
>>
>>
>>> The other is an off-by-one bug. The outer loop iterates through the 10
>>> items and then tries to pull an 11th.  It then gives this exception
>>> trace:
>>>
>>> INFO: Created URL to:  [inner url]
>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>>> SEVERE: Exception thrown while getting data
>>> java.net.MalformedURLException: no protocol: null/account.rss
>>>      at java.net.URL.<init>(URL.java:567)
>>>      at java.net.URL.<init>(URL.java:464)
>>>      at java.net.URL.<init>(URL.java:413)
>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>      at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>>      at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>>      at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>> ...
>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>>> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>>      at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>
>



-- 
--Noble Paul
