lucene-solr-user mailing list archives

From "Noble Paul നോബിള്‍ नोब्ळ्" <noble.p...@gmail.com>
Subject Re: DIH Http input bug - problem with two-level RSS walker
Date Mon, 03 Nov 2008 04:14:29 GMT
Hi Lance,
Do a full import without debug and let us know whether my suggestion
(rootEntity="false") worked. If it didn't, I can suggest something else
(writing a Transformer).
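
For reference, here is a rough sketch of what the suggested data-config might look like. It is not a tested configuration: the outer url (http://example.com/feeds.rss) is a made-up placeholder for the elided "http stuff", and the forEach values are assumptions based on the /rss/channel/item structure described in the original message. The outer "name" field is left out to keep the sketch short.

```xml
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- rootEntity="false" on "outer": its rows are not turned into
         documents; each row of the nested entity becomes a document. -->
    <entity name="outer"
            processor="XPathEntityProcessor"
            url="http://example.com/feeds.rss"
            forEach="/rss/channel/item"
            rootEntity="false">
      <!-- hypothetical outer feed: one link per item, pointing at an inner feed -->
      <field column="url" xpath="/rss/channel/item/link" />

      <entity name="inner"
              processor="XPathEntityProcessor"
              url="${outer.url}"
              forEach="/rss/channel/item"
              pk="title">
        <!-- one row (and so one document) per item in the inner feed -->
        <field column="title" xpath="/rss/channel/item/title" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

If only the first item of each inner feed still shows up with this layout, the forEach on the inner entity is the first thing worth checking.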


On Sun, Nov 2, 2008 at 8:13 AM, Noble Paul നോബിള്‍ नोब्ळ्
<noble.paul@gmail.com> wrote:
> If you wish to create 1 doc per inner entity, then set
> rootEntity="false" for the entity outer.
> The exception is because the url is wrong.
>
> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <goksron@gmail.com> wrote:
>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss feed
>> which contains N links to other rss feeds. The nested loop then reads each
>> one of those to create documents. (Yes, this is an obnoxious thing to do.)
>> Let's say the outer RSS feed gives 10 items. Both feeds use the same
>> structure: /rss/channel with a <title> node and then N <item> nodes inside
>> the channel. This should create two separate XML streams with two separate
>> Xpath iterators, right?
>>
>> <entity name="outer" http stuff>
>>    <field column="name" xpath="/rss/channel/title" />
>>    <field column="url" xpath="/rss/channel/item/link"/>
>>
>>    <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>        <field column="title" xpath="/rss/channel/item/title" />
>>    </entity>
>> </entity>
>>
>> This does indeed walk each url from the outer feed and then fetch the inner
>> rss feed. Bravo!
>>
>> However, I found two separate problems in xpath iteration. They may be
>> related. The first problem is that it only stores the first document from
>> each "inner" feed. Each feed has several documents with different title
>> fields but it only grabs the first.
>>
>> The other is an off-by-one bug. The outer loop iterates through the 10 items
>> and then tries to pull an 11th.  It then gives this exception trace:
>>
>> INFO: Created URL to:  [inner url]
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>> SEVERE: Exception thrown while getting data
>> java.net.MalformedURLException: no protocol: null/account.rss
>>        at java.net.URL.<init>(URL.java:567)
>>        at java.net.URL.<init>(URL.java:464)
>>        at java.net.URL.<init>(URL.java:413)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>  ...
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>
>>
>>
>>
>>
>>
>
>
>
> --
> --Noble Paul
>



-- 
--Noble Paul