lucene-solr-user mailing list archives

From Jon Baer <jonb...@gmail.com>
Subject Re: DIH Http input bug - problem with two-level RSS walker
Date Mon, 03 Nov 2008 06:11:36 GMT
On a side note ... it would be nice if your data source could also be  
the result of a script (instead of trying to hack around it w/  
JdbcDataSource) ...

Something similar to what ScriptTransformer does ...
(http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9)

An example would be:

<dataSource type="ScriptDataSource" name="outerloop"  
script="outerloop.js" />

(The script would basically contain just a callback - getData(String  
query) that returns a row set, or sets values on its  
children, etc.)
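
A rough sketch of what such an outerloop.js callback could look like. This is hypothetical: ScriptDataSource does not exist in DIH, and the row layout (an array of column-name-to-value maps) is an assumption for illustration only.

```javascript
// Hypothetical callback for the proposed ScriptDataSource.
// DIH would evaluate this file and call getData(query), then iterate
// the returned rows; each row is assumed to be a column->value map.
function getData(query) {
  // A real script might fetch or compute rows from the query;
  // here we simply fabricate two rows for illustration.
  return [
    { name: "feed one", url: "http://example.com/one.rss" },
    { name: "feed two", url: "http://example.com/two.rss" }
  ];
}
```

The outer entity would then iterate these rows, and a sub-entity could pick up ${outerloop.url} the same way ${outer.url} is used in the config quoted further down.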

- Jon

On Nov 3, 2008, at 12:40 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:

> Hi Lance,
> I guess I got your problem
> So you wish to create docs for both entities (as suggested by Jon
> Baer). The best solution would be to create two root entities. The
> first one should be the outer; write a transformer to store all the
> urls into the db. The JdbcDataSource can do inserts/updates too (the
> method is the same, getData()). The second entity can read from the db
> and create docs (see Jon Baer's suggestion) using the
> XPathEntityProcessor as a sub-entity.
> --Noble
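
A sketch of the two-pass layout Noble describes (not quoted from his mail; the entity names, the feed_urls table, and StoreUrlTransformer are placeholders, not a tested config):

```xml
<!-- Pass 1: walk the outer feed; a custom transformer (placeholder
     name) stores each link into a db table via JdbcDataSource. -->
<entity name="outer" rootEntity="false"
        processor="XPathEntityProcessor"
        url="http://example.com/outer.rss"
        forEach="/rss/channel/item"
        transformer="StoreUrlTransformer">
  <field column="url" xpath="/rss/channel/item/link"/>
</entity>

<!-- Pass 2: read the stored urls back and build one doc per inner item. -->
<entity name="urls" rootEntity="false" query="select url from feed_urls">
  <entity name="inner" processor="XPathEntityProcessor"
          url="${urls.url}" forEach="/rss/channel/item">
    <field column="title" xpath="/rss/channel/item/title"/>
  </entity>
</entity>
```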
>
> On Mon, Nov 3, 2008 at 9:44 AM, Noble Paul നോബിള്‍  
> नोब्ळ्
> <noble.paul@gmail.com> wrote:
>> Hi Lance,
>> Do a full import w/o debug and let us know if my suggestion worked
>> (rootEntity="false"). If it didn't, I can suggest you something else
>> (writing a Transformer)
>>
>>
>> On Sun, Nov 2, 2008 at 8:13 AM, Noble Paul നോബിള്‍  
>> नोब्ळ्
>> <noble.paul@gmail.com> wrote:
>>> If you wish to create one doc per inner entity, then set
>>> rootEntity="false" on the outer entity.
>>> The exception is because the url is wrong
>>>
>>> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <goksron@gmail.com>  
>>> wrote:
>>>> I wrote a nested HttpDataSource RSS poller. The outer loop reads  
>>>> an rss feed
>>>> which contains N links to other rss feeds. The nested loop then  
>>>> reads each
>>>> one of those to create documents. (Yes, this is an obnoxious  
>>>> thing to do.)
>>>> Let's say the outer RSS feed gives 10 items. Both feeds use the  
>>>> same
>>>> structure: /rss/channel with a <title> node and then N <item>
>>>> nodes inside
>>>> the channel. This should create two separate XML streams with two  
>>>> separate
>>>> Xpath iterators, right?
>>>>
>>>> <entity name="outer" http stuff>
>>>>   <field column="name" xpath="/rss/channel/title" />
>>>>   <field column="url" xpath="/rss/channel/item/link"/>
>>>>
>>>>   <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>>>       <field column="title" xpath="/rss/channel/item/title" />
>>>>   </entity>
>>>> </entity>
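
(One note on the config above: in DIH, XPathEntityProcessor normally also needs a forEach attribute naming the repeating node that yields a row, which the elided "http stuff" may be hiding. A version of the inner entity with it spelled out, still a sketch with the other attributes elided:)

```xml
<entity name="inner" processor="XPathEntityProcessor"
        url="${outer.url}" forEach="/rss/channel/item" pk="title">
  <field column="title" xpath="/rss/channel/item/title"/>
</entity>
```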
>>>>
>>>> This does indeed walk each url from the outer feed and then fetch  
>>>> the inner
>>>> rss feed. Bravo!
>>>>
>>>> However, I found two separate problems in xpath iteration. They  
>>>> may be
>>>> related. The first problem is that it only stores the first  
>>>> document from
>>>> each "inner" feed. Each feed has several documents with different  
>>>> title
>>>> fields but it only grabs the first.
>>>>
>>>> The other is an off-by-one bug. The outer loop iterates through  
>>>> the 10 items
>>>> and then tries to pull an 11th.  It then gives this exception  
>>>> trace:
>>>>
>>>> INFO: Created URL to:  [inner url]
>>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>>>> SEVERE: Exception thrown while getting data
>>>> java.net.MalformedURLException: no protocol: null/account.rss
>>>>       at java.net.URL.<init>(URL.java:567)
>>>>       at java.net.URL.<init>(URL.java:464)
>>>>       at java.net.URL.<init>(URL.java:413)
>>>>       at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>>>       at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>>       at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>>>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>>>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>>>       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>>>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>>> ...
>>>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>>>> SEVERE: Exception while processing: album document :
>>>> SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>>>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>>>       at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>>>       at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> --Noble Paul
>>>
>>
>>
>>
>> --
>> --Noble Paul
>>
>
>
>
> -- 
> --Noble Paul

