lucene-solr-user mailing list archives

From "Lance Norskog" <goks...@gmail.com>
Subject DIH Http input bug - problem with two-level RSS walker
Date Sat, 01 Nov 2008 05:00:58 GMT
I wrote a nested HttpDataSource RSS poller. The outer loop reads an RSS feed
that contains N links to other RSS feeds. The nested loop then reads each
of those to create documents. (Yes, this is an obnoxious thing to do.)
Let's say the outer RSS feed gives 10 items. Both feeds use the same
structure: /rss/channel with a <title> node and then N <item> nodes inside
the channel. This should create two separate XML streams with two separate
XPath iterators, right?

<entity name="outer" http stuff>
    <field column="name" xpath="/rss/channel/title" />
    <field column="url" xpath="/rss/channel/item/link"/>

    <entity name="inner" http stuff url="${outer.url}" pk="title" >
        <field column="title" xpath="/rss/channel/item/title" />
    </entity>
</entity>

This does indeed walk each URL from the outer feed and then fetch the inner
RSS feed. Bravo!

However, I found two separate problems in XPath iteration. They may be
related. The first problem is that it only stores the first document from
each "inner" feed. Each feed has several documents with different title
fields, but it only grabs the first.
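
For comparison, here is a minimal Python sketch of the iteration I expect:
one parser per stream, with the inner loop yielding every <item> title
rather than just the first. It is not DIH code. The feed contents, the
example.com URLs, and the fetch() and walk() helpers are all made up for
illustration, and the feeds are in-memory strings instead of HTTP responses.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory feeds standing in for the HTTP responses.
OUTER_FEED = """<rss><channel>
  <title>Groups of stuff</title>
  <item><link>http://example.com/a.rss</link></item>
  <item><link>http://example.com/b.rss</link></item>
</channel></rss>"""

INNER_FEEDS = {
    "http://example.com/a.rss": """<rss><channel><title>A</title>
      <item><title>a1</title></item>
      <item><title>a2</title></item>
    </channel></rss>""",
    "http://example.com/b.rss": """<rss><channel><title>B</title>
      <item><title>b1</title></item>
      <item><title>b2</title></item>
      <item><title>b3</title></item>
    </channel></rss>""",
}


def fetch(url):
    """Stand-in for an HTTP GET; returns the XML text for a feed URL."""
    return INNER_FEEDS[url]


def walk(outer_xml, fetch):
    """Two-level walk: each stream is parsed and iterated independently."""
    outer_root = ET.fromstring(outer_xml)  # root element is <rss>
    name = outer_root.findtext("channel/title")
    docs = []
    for link in outer_root.iterfind("channel/item/link"):
        inner_root = ET.fromstring(fetch(link.text))  # separate parse per feed
        for title in inner_root.iterfind("channel/item/title"):
            docs.append({"name": name, "url": link.text, "title": title.text})
    return docs
```

With these sample feeds, walk(OUTER_FEED, fetch) produces five entries, one
per inner <item>, and the outer loop stops after its second link.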

The other is an off-by-one bug. The outer loop iterates through the 10 items
and then tries to pull an 11th. It then gives this exception trace:

INFO: Created URL to:  [inner url]
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
 ...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
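
The "no protocol: null/account.rss" message means the string handed to the
URL constructor had no scheme. As a rough illustration of the check I would
expect before the inner fetch, here is a small Python sketch that skips rows
whose url is missing or has no protocol. It is not how HttpDataSource works;
is_fetchable() and fetchable_urls() are hypothetical helpers, and the row
dicts are an assumed shape.

```python
from urllib.parse import urlparse


def is_fetchable(url):
    """True only for a non-empty absolute http(s) URL."""
    if not url:
        return False
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetchable_urls(rows):
    """Keep only URLs that are safe to fetch; rows are dicts with an optional 'url'."""
    return [row["url"] for row in rows if is_fetchable(row.get("url"))]
```

A row past the end of the outer feed, whose url never resolved, would be
dropped here instead of reaching the HTTP layer.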





