lucene-solr-user mailing list archives

From Bob Sandiford <bob.sandif...@sirsidynix.com>
Subject RE: updating existing data in index vs inserting new data in index
Date Thu, 07 Jul 2011 14:14:19 GMT
Hi, Mark.

I haven't used DIH myself, so I'll leave comments on your setup to others who
have.

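Another question - after your initial index create (and after each delta), do you run a 'commit'?
Do you run an 'optimize'?  (A commit is what makes adds and deletes visible to queries; until an
optimize or segment merge, 'deleted' records still occupy space in the index, although they no
longer show up in query results.)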
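
For reference, the wiki text quoted further down says delta-import accepts the
same clean, commit, optimize and debug parameters as full-import. A minimal
Python sketch that builds such a request URL is below. The host, port and the
helper name delta_import_url are assumptions for illustration only; the
/solr/dataimport path is the one given in the wiki quote.

```python
from urllib.parse import urlencode


def delta_import_url(host="localhost", port=8983, clean=False,
                     commit=True, optimize=False):
    """Build a DataImportHandler delta-import URL (host/port are assumed)."""
    params = {
        "command": "delta-import",
        # Passed explicitly so the request does not rely on handler defaults.
        "clean": str(clean).lower(),
        "commit": str(commit).lower(),
        "optimize": str(optimize).lower(),
    }
    return f"http://{host}:{port}/solr/dataimport?{urlencode(params)}"


print(delta_import_url())
```

Requesting the printed URL (for example from a browser or curl) would start a
delta import that commits when it finishes.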

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com


> -----Original Message-----
> From: Mark juszczec [mailto:mark.juszczec@gmail.com]
> Sent: Thursday, July 07, 2011 10:04 AM
> To: solr-user@lucene.apache.org
> Subject: Re: updating existing data in index vs inserting new data in
> index
> 
> Bob
> 
> Thanks very much for the reply!
> 
> I am using a unique integer called order_id as the Solr index key.
> 
> My query, deltaQuery and deltaImportQuery are below:
> 
> <entity name="item1"
>   pk="ORDER_ID"
>   query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind,
>          orders.order_dt, orders.cancel_dt, orders.account_manager_id,
>          orders.of_header_id, orders.order_status_lov_id,
>          orders.order_type_id, orders.approved_discount_pct,
>          orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id,
>          orders.agency_id
>          from orders"
> 
>   deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
>          orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
>          orders.account_manager_id, orders.of_header_id,
>          orders.order_status_lov_id, orders.order_type_id,
>          orders.approved_discount_pct, orders.campaign_nm,
>          orders.approved_by_cd, orders.advertiser_id, orders.agency_id
>          from orders
>          where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
> 
>   deltaQuery="select orders.order_id from orders
>          where orders.change_dt >
>          to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
> </entity>
> 
> The test I am running has two parts:
> 
> 1.  After I do a full import of the index, I insert a brand new record (with
> a never-before-used order_id) in the database.  The delta import picks this
> up just fine.
> 
> 2.  After the full import, I modify a record with an order_id that already
> shows up in the index.  I have verified there is only one record with this
> order_id in both the index and the db before I do the delta update.
> 
> I guess the question is, am I screwing myself up by defining my own Solr
> index key?  I want to, ultimately, be able to search on ORDER_ID in the Solr
> index.  However, the docs say (I think) a field does not have to be the Solr
> primary key in order to be searchable.  Would I be better off letting Solr
> manage the keys?
> 
> Mark
> 
> On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
> <bob.sandiford@sirsidynix.com> wrote:
> 
> > What are you using as the unique id in your Solr index?  It sounds like you
> > may have one value as your Solr index unique id, which bears no resemblance
> > to a unique[1] id derived from your data...
> >
> > Or - another way to put it - what is it that makes these two records in
> > your Solr index 'the same', and what are the unique ids for those two
> > entries in the Solr index?  How are those ids related to your original
> > data?
> >
> > [1] not only unique, but immutable.  I.e. if you update a row in your
> > database, the unique id derived from that row has to be the same as it would
> > have been before the update.  Otherwise, there's nothing for Solr to
> > recognize as a duplicate entry, and do a 'delete' and 'insert' instead of
> > just an 'insert'.
> >
> > Bob Sandiford | Lead Software Engineer | SirsiDynix
> > P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
> > www.sirsidynix.com
> >
> >
> > > -----Original Message-----
> > > From: Mark juszczec [mailto:mark.juszczec@gmail.com]
> > > Sent: Thursday, July 07, 2011 9:15 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: updating existing data in index vs inserting new data in index
> > >
> > > Hello all
> > >
> > > I'm using Solr 3.2 and am confused about updating existing data in an
> > > index.
> > >
> > > According to the DataImportHandler Wiki:
> > >
> > > "*delta-import*: For incremental imports and change detection run the
> > > command `http://<host>:<port>/solr/dataimport?command=delta-import`. It
> > > supports the same clean, commit, optimize and debug parameters as
> > > full-import command."
> > >
> > > I know delta-import will find new data in the database and insert it into
> > > the index.  My problem is how it handles updates where I've got a record
> > > that exists in the index and the database, the database record is changed
> > > and I want to incorporate those changes in the existing record in the
> > > index.  IOW I don't want to insert it again.
> > >
> > > I've tried this and wound up with 2 records with the same key in the
> > > index.  The first contains the original db values found when the index was
> > > created, the 2nd contains the db values after the record was changed.
> > >
> > > I've also found this:
> > >
> > > http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html
> > >
> > > The subject is 'Delta-import with solrj client':
> > >
> > > "Greetings. I have a solrj client for fetching data from database. I am
> > > using delta-import for fetching data. If a column is changed in database
> > > using timestamp with delta-import i get the latest column indexed but
> > > there are duplicate values in the index similar to the column but the data
> > > is older. This works with cleaning the index but i want to update the
> > > index without cleaning it. Is there a way to just update the index with
> > > the updated column without having duplicate values. Appreciate for any
> > > feedback.
> > >
> > > Hando"
> > >
> > > There are 2 responses:
> > >
> > > "Short answer is no, there isn't a way. Solr doesn't have the concept of
> > > 'Update' to an indexed document. You need to add the full document (all
> > > 'columns') each time any one field changes. If doing that in your
> > > DataImportHandler logic is difficult you may need to write a separate
> > > Update Service that does:
> > >
> > > 1) Read UniqueID, UpdatedColumn(s) from database
> > > 2) Using UniqueID, retrieve document from Solr
> > > 3) Add/Update field(s) with updated column(s)
> > > 4) Add document back to Solr
> > >
> > > Although, if you use DIH to do a full import, using the same query in
> > > your Delta-Import to get the whole document shouldn't be that
> > > difficult."
> > >
> > > and
> > >
> > > "Hi,
> > >
> > > Make sure you use a proper "ID" field, which does *not* change even if the
> > > content in the database changes. In this way, when your delta-import
> > > fetches changed rows to index, they will update the existing rows in your
> > > index."
> > >
> > > I have an ID field that doesn't change.  It is the primary key field from
> > > the database table I am trying to index and I have verified it is unique.
> > >
> > > So, does Solr allow updates (not inserts) of existing records?  Is anyone
> > > able to do this?
> > > Mark
> >
> >
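
To illustrate the point in the quoted footnote about an immutable unique id,
here is a toy Python model of an index keyed by that id. It is only a sketch of
the "same id means delete and insert" idea, not Solr's implementation; the
ToyIndex class, the field values and the trailing-space example are all
hypothetical.

```python
class ToyIndex:
    """Toy model of an index keyed by a unique id (not Solr's actual code)."""

    def __init__(self, key_field="order_id"):
        self.key_field = key_field
        self.docs = {}

    def add(self, doc):
        # Adding a document whose key already exists replaces the old one,
        # which models "delete the old document, then insert the new one".
        self.docs[doc[self.key_field]] = doc

    def count(self):
        return len(self.docs)


index = ToyIndex()
index.add({"order_id": "1001", "campaign_nm": "spring"})
index.add({"order_id": "1001", "campaign_nm": "summer"})  # same key: replaced
count_same_key = index.count()

# A key that differs in any way (here, a hypothetical trailing space) is
# treated as a different document, so the old entry is left in place.
index.add({"order_id": "1001 ", "campaign_nm": "autumn"})
count_changed_key = index.count()

print(count_same_key, count_changed_key)
```

In this model a second record only appears when the key value that reaches the
index differs from the one already stored.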


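The four-step update service described in the quoted reply (read the changed
columns, retrieve the document, update the fields, add the document back) can
be sketched in Python as follows. A plain dict stands in for the index and a
list stands in for the database read; apply_update, changed_rows and the sample
values are hypothetical names and data, not part of any Solr API.

```python
def apply_update(index, unique_id, updated_columns):
    """Merge changed columns into the stored document and re-add it whole."""
    # Step 2: retrieve the existing document by its unique id.
    existing = index.get(unique_id, {"order_id": unique_id})
    # Step 3: overlay the changed column(s) on the existing fields.
    merged = {**existing, **updated_columns}
    # Step 4: add the full document back under the same unique id.
    index[unique_id] = merged
    return merged


# Stand-in for the index contents after a full import.
index = {
    "1001": {"order_id": "1001", "campaign_nm": "spring", "agency_id": "7"},
}

# Step 1: stand-in for reading (UniqueID, UpdatedColumn(s)) from the database.
changed_rows = [("1001", {"campaign_nm": "summer"})]

for unique_id, columns in changed_rows:
    apply_update(index, unique_id, columns)

print(index)
```

Because the whole document is written back under the same id, the unchanged
fields survive and the number of entries stays the same.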