lucene-solr-user mailing list archives

From "Dyer, James" <James.D...@ingramcontent.com>
Subject RE: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Date Thu, 20 Jun 2013 16:51:03 GMT
Instead of specifying CachedSqlEntityProcessor, you can specify SqlEntityProcessor with "cacheImpl='SortedMapBackedCache'".
If you parameterize this, so that it is "SortedMapBackedCache" for full imports but blank for
deltas, I think it will cache only on the full import.
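
A rough, untested sketch of that idea, applied to the "attributes" entity from the original message below. The request parameter name "cacheImpl" is my own choice for illustration, and I am assuming the cacheImpl attribute is resolved against request parameters the same way queries are:

```xml
<!-- Sketch only: the cache implementation comes from a request parameter.
     The parameter name "cacheImpl" is hypothetical. -->
<entity
	name="attributes"
	dataSource="ds1"
	query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes"
	processor="SqlEntityProcessor"
	cacheImpl="${dataimporter.request.cacheImpl}"
	cacheKey="ARTICLE_ID"
	cacheLookup="article.myownid">
</entity>
```

The full import would then be called with command=full-import&cacheImpl=SortedMapBackedCache, and the delta import with command=delta-import and no cacheImpl parameter. With the cache off, the unfiltered child query would presumably run once per parent row, so the delta case probably also needs a "where" clause on the child query.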

Another option is to parameterize the child queries with a "where" clause, so if it is creating
a new cache with every row, the cache will only contain the data needed for that child row.
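
As an untested sketch, the "attributes" child query from the original message could be restricted to the current parent row like this. The unquoted ${article.myownid} follows the style of the deltaImportQuery in the original config and assumes a numeric key:

```xml
<!-- Sketch only: the child query is filtered by the parent row's id,
     so each lookup loads just the rows for that one article. -->
<entity
	name="attributes"
	dataSource="ds1"
	query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes WHERE ARTICLE_ID=${article.myownid}"
	processor="CachedSqlEntityProcessor">
</entity>
```

I have not checked how this behaves on a full import, where one query per article is likely to be much slower than loading the whole table once.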

A third option is to do your delta imports as described here:  http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
My experience is that this generally performs better than using the delta import feature anyhow.
The trick is in handling deletes, which will require their own entity and the $deleteDocById
command.  See http://wiki.apache.org/solr/DataImportHandler#Special_Commands
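
A sketch of that approach for the "article" entity from the original message, following the pattern on the wiki page. It is untested. The table "articleDeleted" and its column "deleted_date" are made up for illustration, and the way the $deleteDocById column alias has to be quoted depends on your database:

```xml
<!-- Sketch only: run with command=full-import&clean=false to get delta behavior.
     Without clean=false the first condition is true and all rows are imported. -->
<entity
	name="article"
	dataSource="ds1"
	pk="myownid"
	query="SELECT * FROM article
	       WHERE '${dataimporter.request.clean}' != 'false'
	          OR myownid IN (SELECT myownid FROM articleHistory
	                         WHERE modified_date &gt; '${dataimporter.last_index_time}')">
	<field column="myownid" name="id"/>
	<!-- child entities (supplier, attributes) stay as they are -->
</entity>

<!-- Hypothetical second root entity that only issues deletes.
     "articleDeleted" and "deleted_date" are assumed names. -->
<entity
	name="articleDeletes"
	dataSource="ds1"
	query="SELECT myownid AS '$deleteDocById' FROM articleDeleted
	       WHERE deleted_date &gt; '${dataimporter.last_index_time}'">
</entity>
```

Because this always runs as a full-import, the child entities would build their caches once per run rather than per document.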

But these are all workarounds.  This sounds like a bug or some subtle configuration problem.
 I looked through the JIRA issues and did not see anything like this reported yet, but if
you're pretty sure you are doing everything correctly you may want to open a bug ticket. 
Be sure to flag it as "contrib - DataImportHandler".

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Constantin Wolber [mailto:constantin.wolber@medicalcolumbus.de] 
Sent: Thursday, June 20, 2013 3:21 AM
To: solr-user@lucene.apache.org
Subject: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor

Hi,

I searched for a solution for quite some time but did not manage to find any real hints on
how to fix it.


I'm using Solr 4.3.0 (1477023 - simonw - 2013-04-29 15:10:12) running in a Tomcat 6 container.

My data import setup is basically the following:

Data-config.xml:

<entity
	name="article"
	dataSource="ds1"
	query="SELECT * FROM article"
	deltaQuery="SELECT myownid FROM articleHistory WHERE modified_date &gt; '${dih.last_index_time}'"
	deltaImportQuery="SELECT * FROM article WHERE myownid=${dih.delta.myownid}"
	pk="myownid">
	<field column="myownid" name="id"/>

	<entity
		name="supplier"
		dataSource="ds2"
		query="SELECT * FROM supplier WHERE status=1"
		processor="CachedSqlEntityProcessor"
		cacheKey="SUPPLIER_ID"
		cacheLookup="article.ARTICLE_SUPPLIER_ID">
	</entity>

	<entity
		name="attributes"
		dataSource="ds1"
		query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes"
		cacheKey="ARTICLE_ID"
		cacheLookup="article.myownid"
		processor="CachedSqlEntityProcessor">
	</entity>		
</entity>


OK, now for the problem:

At first I tried everything without the cache, but the full import took a very long time
because the attributes query is pretty slow compared to the rest. As a result I got a processing
speed of around 150 documents/s.
When switching everything to the CachedSqlEntityProcessor, the full import ran at a
speed of 4000 documents/s.

So the full import is running quite fine. Now I wanted to use the delta import. When running the
delta import I was expecting the ramp-up time to be about the same as in the full import, since
I need to load the whole supplier and attributes tables into the cache in the first step. But
when looking into the log file, the weird thing is that Solr seems to refresh the cache for every
single document that is processed. So currently my delta import is a lot slower than the full import.
I even tried to add the deltaImportQuery parameter to the entity, but it doesn't change the
behavior at all (of course I know it is not supposed to change anything in the setup I run).

The following solutions would be possible in my opinion: 

1. Is there any way to tell the config to ignore the cache when running a delta import? That
would help already, because we are talking about a maximum of 500 documents changed in 15
minutes compared to over 5 million documents in total.
2. Get Solr to not refresh the cache for every document.

Best Regards

Constantin Wolber



