lucene-dev mailing list archives

From "Dyer, James" <James.D...@ingrambook.com>
Subject Need some DIH Entity Processor development advice...
Date Mon, 25 Oct 2010 16:54:36 GMT
We have data coming from several long-running queries that hit multiple relational
databases.  Other data comes in as fixed-width text file feeds, etc.  All of this
has to be joined, denormalized and made into nice Solr documents.  I've been wanting to
use DIH, as it seems to provide 90% of what we need already.  The rest can come in the form
of custom Transformers & Entity Processors that I can write...

One big need is disk-backed caches.  For instance, a child entity that pulls back
millions of rows will beat up the db using a regular SqlEntityProcessor, whereas CachedSqlEntityProcessor
puts everything in memory in a HashMap, so it will only scale to a point.  For fixed-width
text files, there don't seem to be any cached implementations at all.

So I've written a custom Entity Processor that creates a temporary Lucene index to use as
a disk cache.  Initial tests are promising, but there is one little problem: I need a place to
close the Lucene index reader and then delete the temporary index.  It seemed easy enough
to override the "destroy()" method from EntityProcessorBase.  But to my surprise, it seems
that both destroy() and init() get called every time a new Primary Key is looked up in the
cache (see DocBuilder.buildDocument()).  Just to be sure I wasn't crazy, I added a "destroy()"
method to CachedSqlEntityProcessor and found that it does indeed get called every time a new Primary
Key is looked up in the cache.  In fact, the first couple of lines of cacheInit() in EntityProcessorBase
seem to be there to cope with the fact that both destroy() and init() get called over and
over again during the lifecycle of the object.
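
To make the lifecycle problem concrete, here is a rough standalone sketch of the workaround I would otherwise be forced into.  Everything in it is hypothetical: TempDiskCache is a stand-in for the temporary Lucene index, and none of these names are DIH or Lucene classes.  It assumes init() and destroy() are each invoked once per parent row, as described above, and it leaves open the real question of who calls close().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical stand-in for a disk-backed cache; not a DIH or Lucene class. */
class TempDiskCache {
    private Path dir;          // null until the first init(), and again after close()
    private int initCalls;     // counts how often the framework re-initialised us
    private int destroyCalls;  // counts how often the framework "destroyed" us

    /** Idempotent: only the first call actually creates the temporary directory. */
    void init() {
        initCalls++;
        if (dir != null) {
            return; // already open, so a repeated init() must not rebuild the cache
        }
        try {
            dir = Files.createTempDirectory("disk-cache-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Invoked once per parent row in this scenario, so it must NOT release anything. */
    void destroy() {
        destroyCalls++;
    }

    /** The real clean-up: delete the temporary directory. Safe to call more than once. */
    void close() {
        if (dir == null) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        dir = null;
    }

    boolean isOpen() {
        return dir != null && Files.isDirectory(dir);
    }

    int initCalls() {
        return initCalls;
    }

    int destroyCalls() {
        return destroyCalls;
    }
}
```

With this shape, the cache survives any number of init()/destroy() pairs, but something still has to call close() exactly once when the import finishes, and that hook is the piece I cannot find.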

I've also noticed that destroy() isn't actually implemented by any of the prepackaged Entity
Processors.  This makes me wonder whether the current behavior is a mistake.  Should DocBuilder be changed to call
destroy() only once per lifecycle for each EntityProcessor object?  If so, I think I can have
a patch in JIRA in short order.

Otherwise, how do I best accomplish my clean-up tasks?  Any advice is greatly appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

