lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [DataImportHandler] Changes in child entity and delta-import
Date Fri, 03 Oct 2008 17:14:09 GMT

On Oct 3, 2008, at 12:34 PM, Shalin Shekhar Mangar wrote:

> On Fri, Oct 3, 2008 at 9:20 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>>
>> Now, my question.  Let's say I have an initial set of ratings for a feed.
>> I then do a full import of the articles on that feed.  Everything is
>> peachy so far.  Then, I get a new rating for an existing article that
>> I've already indexed, thus the child entity (named "rating") has a
>> delta.  However, when I run the delta-import, it doesn't pick up any
>> changes, since, I believe, the parent hasn't changed.  Either that, or
>> I am doing something wrong.  It seems like it is akin to the
>> parentDeltaQuery problem, but, of course, there is no parent query
>> since there is no parent table, in the DB sense, at least not how I
>> see it.  The relevant logs are in [3].
>>
>> Is this case handled?  If not, any suggestions for alternatives?  Any
>> help would be appreciated.
>
>
> XPathEntityProcessor does not support delta imports. It might be possible
> to enhance it to accept an XPath condition for joining child to parent,
> but it seems pointless because we'd need to parse the whole XML anyway
> for each changed child row (Joel Spolsky's words echo in my mind!). If
> the XML data is small, we can also have a cached implementation like the
> CachedSqlEntityProcessor.

What about somehow using the fact that the variable resolver needs to  
resolve solrFeed.link and then go get all entries from Solr to get  
those values, such that the child entity can then be tested?

>
>
> The easiest workaround here is to reverse the parent-child. Make the DB
> the parent and join on the child, which will let you do delta imports;
> however, full imports may be expensive. Depending on the size of the XML,
> you may be better off always doing a full import.

I thought of reversing the parent-child, but I don't see how it works,  
since there isn't necessarily a DB entry for every article.  How would  
you associate the two to make sure you get all articles?

Also, the current approach seems more intuitive, since the RSS feed is  
the authoritative content.

Essentially, what I am interested in is a join across data sources.  I  
realize that is non-trivial, but boy would it be powerful.
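
To make the concern about reversing the parent-child concrete, here is a minimal sketch in plain Python (not DataImportHandler code). The second link and both in-memory structures are made up for illustration; only the first link and its rating come from this thread. It shows that driving the import from the DB yields documents only for rated articles, while driving it from the feed keeps every article.

```python
# Hypothetical data: two feed entries, only one of which has a row in the ratings DB.
feed_links = [
    "http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/",
    "http://example.com/unrated-article/",  # made-up link with no DB row
]
db_ratings = {
    "http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/": 4.9,
}

# DB as parent: only links that have a rating row ever produce a document.
db_driven = [{"link": link, "rating": rating} for link, rating in db_ratings.items()]

# Feed as parent: every article produces a document; rating is None when absent.
feed_driven = [{"link": link, "rating": db_ratings.get(link)} for link in feed_links]

print(len(db_driven), len(feed_driven))
```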

>
>
> Another thing I noticed from your logs: the ModifiedRowKey count is 0.
> Are you sure the timestamp column is getting updated? IIRC, you need a
> stored proc to do this for Postgres.
>
> INFO: Completed ModifiedRowKey for Entity: rating rows obtained : 0

Yeah, that bothers me, too.  My dataimport.properties contains:
grantingersoll@molly[1041]$ cat dataimport.properties
#Fri Oct 03 12:09:29 EDT 2008
last_index_time=2008-10-03 12\:09\:28

And, when I query by hand, my DB shows:

  select * from feeds where last_modified > '10/03/2008';

                                   feed                                    | rating |    last_modified
---------------------------------------------------------------------------+--------+---------------------
 http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/ |    4.9 | 2008-10-04 11:04:00
(1 row)

So, I am reasonably certain there is a change.
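
The same check can be sketched end to end in Python. This is only an illustration under assumptions: SQLite stands in for the real database, and the query shape (`last_modified > last_index_time`) is a guess at what the delta query looks like once `${dataimporter.last_index_time}` has been substituted, since the actual data-config.xml is not shown here. The values are the ones from dataimport.properties and the query result above.

```python
import sqlite3

# Line as it appears in dataimport.properties; Java properties files escape ':' as '\:'.
raw = r"last_index_time=2008-10-03 12\:09\:28"
last_index_time = raw.split("=", 1)[1].replace("\\:", ":")

# SQLite is a stand-in for the real database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feeds (feed TEXT, rating REAL, last_modified TEXT)")
conn.execute(
    "INSERT INTO feeds VALUES (?, ?, ?)",
    ("http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/",
     4.9, "2008-10-04 11:04:00"),
)

# Assumed shape of the delta query after the last index time is substituted.
# Timestamps in 'YYYY-MM-DD HH:MM:SS' form compare correctly as strings.
delta_rows = conn.execute(
    "SELECT feed FROM feeds WHERE last_modified > ?", (last_index_time,)
).fetchall()

print(last_index_time, len(delta_rows))
```

With these values the row is newer than the last index time, so a timestamp-only delta query would be expected to return it.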

I think the reason, if you notice further down in the log, is that it
processes the entities separately.  In other words, is the DB entity even
getting resolved in the context of the parent entity?  Or is it not
resolving the ${solrFeed.link} clause of the delta query?
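
A toy illustration of that second hypothesis, in Python rather than DataImportHandler code. Everything here is an assumption made for the sketch: `resolve` is a hypothetical stand-in for the variable resolver that substitutes an empty string for unknown names, the delta query text is invented (the real one is not shown in this thread), and SQLite stands in for the database. It shows how a delta query that references `${solrFeed.link}` could return zero rows when no parent row is in scope, even though the rating row has changed.

```python
import re
import sqlite3

def resolve(template, scope):
    """Toy resolver: replace ${name} with scope[name], or '' when name is unknown."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(scope.get(m.group(1), "")), template)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feeds (feed TEXT, rating REAL, last_modified TEXT)")
link = "http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/"
conn.execute("INSERT INTO feeds VALUES (?, 4.9, '2008-10-04 11:04:00')", (link,))

# Hypothetical delta query; the real one in data-config.xml is not shown in this thread.
delta_query = ("SELECT feed FROM feeds WHERE feed = '${solrFeed.link}' "
               "AND last_modified > '${dataimporter.last_index_time}'")

# Scope when the child is evaluated under a parent row vs. on its own.
with_parent = {"solrFeed.link": link,
               "dataimporter.last_index_time": "2008-10-03 12:09:28"}
without_parent = {"dataimporter.last_index_time": "2008-10-03 12:09:28"}

rows_with = conn.execute(resolve(delta_query, with_parent)).fetchall()
rows_without = conn.execute(resolve(delta_query, without_parent)).fetchall()

print(len(rows_with), len(rows_without))
```

If the real resolver behaves like this toy one, the unresolved clause alone would be enough to explain a ModifiedRowKey count of 0.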

-Grant


