lucene-solr-user mailing list archives

From "Noble Paul നോബിള്‍ नोब्ळ्" <noble.p...@gmail.com>
Subject Re: DIH Http input bug - problem with two-level RSS walker
Date Tue, 04 Nov 2008 06:57:49 GMT
On Tue, Nov 4, 2008 at 1:31 AM, Lance Norskog <goksron@gmail.com> wrote:
> Thank you for the "rootEntity" tip. Does this mean that the inner loop only walks the
> first item and breaks out of the loop? This is very good because it allows me to drill down
> a few levels without downloading 10,000 feeds. (Public API sites tend to dislike this behavior
> :)
>

Nope. It goes through each item in the inner loop and creates one
document for each item.
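
As a rough illustration of that behavior, here is a minimal sketch of a two-level config with rootEntity="false" on the outer entity. The feed URL, the dataSource declaration, and the forEach values are assumptions filled in for illustration (the config in this thread elides them as "http stuff"), the outer "name" field is left out for brevity, and the sketch has not been run against the feeds discussed here.

```xml
<!-- Sketch only: the feed URL and forEach values are assumptions, not taken from this thread -->
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- rootEntity="false": rows from "outer" do not become documents themselves -->
    <entity name="outer"
            processor="XPathEntityProcessor"
            url="http://example.com/feeds.rss"
            forEach="/rss/channel/item"
            rootEntity="false">
      <field column="url" xpath="/rss/channel/item/link" />

      <!-- With the parent marked rootEntity="false", each row of "inner" yields one document -->
      <entity name="inner"
              processor="XPathEntityProcessor"
              url="${outer.url}"
              forEach="/rss/channel/item"
              pk="title">
        <field column="title" xpath="/rss/channel/item/title" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

With this shape, every <item> in every inner feed should produce its own document, so the inner feeds are still all downloaded.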

> The URL is wrong because the streaming parser is iterating past the end of the element
> entries. It is an off-by-one bug of some sort in the DIH code.
>
> Thanks,
>
> Lance
>
> -----Original Message-----
> From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com]
> Sent: Saturday, November 01, 2008 7:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: DIH Http input bug - problem with two-level RSS walker
>
> If you wish to create 1 doc per inner entity, then set rootEntity="false" for the entity
> "outer".
> The exception is because the url is wrong.
>
> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <goksron@gmail.com> wrote:
>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an
>> rss feed which contains N links to other rss feeds. The nested loop
>> then reads each one of those to create documents. (Yes, this is an
>> obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
>> Both feeds use the same
>> structure: /rss/channel with a <title> node and then N <item> nodes
>> inside the channel. This should create two separate XML streams with
>> two separate Xpath iterators, right?
>>
>> <entity name="outer" http stuff>
>>    <field column="name" xpath="/rss/channel/title" />
>>    <field column="url" xpath="/rss/channel/item/link"/>
>>
>>    <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>        <field column="title" xpath="/rss/channel/item/title" />
>>    </entity>
>> </entity>
>>
>> This does indeed walk each url from the outer feed and then fetch the
>> inner rss feed. Bravo!
>>
>> However, I found two separate problems in xpath iteration. They may be
>> related. The first problem is that it only stores the first document
>> from each "inner" feed. Each feed has several documents with different
>> title fields but it only grabs the first.
>>
>> The other is an off-by-one bug. The outer loop iterates through the 10
>> items and then tries to pull an 11th.  It then gives this exception trace:
>>
>> INFO: Created URL to:  [inner url]
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>> SEVERE: Exception thrown while getting data
>> java.net.MalformedURLException: no protocol: null/account.rss
>>        at java.net.URL.<init>(URL.java:567)
>>        at java.net.URL.<init>(URL.java:464)
>>        at java.net.URL.<init>(URL.java:413)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>  ...
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>
>
>
>
> --
> --Noble Paul
>
>



-- 
--Noble Paul