lucene-dev mailing list archives

From "James Dyer (Commented) (JIRA)" <>
Subject [jira] [Commented] (SOLR-2382) DIH Cache Improvements
Date Thu, 20 Oct 2011 14:56:10 GMT


James Dyer commented on SOLR-2382:

> The entities are reused. But it has always been like that. Why do you need that initialized
> flag? What is initialized?

Perhaps "initialized" is the wrong name for this flag, but let me explain how it's used.  In
"Implementation Details" #6 in this issue's description, I mentioned the need to change the
semantics of "entity.destroy()".  Prior to this patch, for child entities, both "entity.destroy()"
and "entity.init()" get called once per parent row.  So throughout the course of a DIH import,
child entities get their "init" and "destroy" methods called over and over again.
But what if we have "init" and "destroy" operations that are meant to be executed only once?
"init" copes with this by setting a "firstInit" flag on each entity and using that flag to guard
any init steps that should run only once.
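
A minimal sketch of the "firstInit" guard described above. The class, field, and log strings are hypothetical stand-ins for illustration and are not the actual DIH entity processor code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an entity processor; not the real DIH class.
class SketchEntity {
    private boolean firstInit = true;            // true until init() has run once
    final List<String> log = new ArrayList<>();  // records what each call did

    // For a child entity, this is called once per parent row.
    void init() {
        if (firstInit) {
            log.add("one-time setup");           // steps that must run only once
            firstInit = false;
        }
        log.add("per-row setup");                // steps that run on every call
    }
}
```

Calling init() three times on one SketchEntity records the one-time step once and the per-row step three times.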

But there was no such coping mechanism built into "destroy".  There was never a need, because
only one of our prepackaged entities actually implements "destroy()". But entities that
use persistent caching require a way to clean up any unneeded caches at the
end.  Because "destroy()" was largely unused, I decided to change its semantics to handle
this end-of-lifecycle cleanup operation.  (The one entity that already implements "destroy"
is LineEntityProcessor, but prior to this patch we could not use LineEntityProcessor as a child
entity and do joins, so the semantic change here doesn't matter.)

Thus the "entityWrapper.initalized" flag gets set (DocBuilder lines 637-640) the first time
a particular entity is encountered.  The flag ensures that the entity gets added to the "Destroy-List"
only once.  When any entity is done being used (its parent is finished), the appropriate "Destroy-List"
is looped through, the children are destroyed, and their initialized flags get set back to
"false" (DocBuilder lines 617-621).  "resetEntity()" sets the flag back; it lives in its
own method so that the reset can be applied recursively.
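
The flow above can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: SketchWrapper, SketchBuilder, visitChildren, and finish are hypothetical names, and only the "initalized" flag, the destroy list, and "resetEntity()" correspond to things mentioned in this comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around an entity; not the real EntityProcessorWrapper.
class SketchWrapper {
    final List<SketchWrapper> children = new ArrayList<>();
    boolean initalized = false;  // spelled as the flag is named in this comment
    int destroyCount = 0;        // counts destroy() calls, for demonstration

    void destroy() { destroyCount++; }  // one-time cleanup, e.g. closing a cache
}

class SketchBuilder {
    // Called once per parent row: register each child for destruction only once.
    static void visitChildren(SketchWrapper parent, List<SketchWrapper> destroyList) {
        for (SketchWrapper child : parent.children) {
            if (!child.initalized) {
                destroyList.add(child);
                child.initalized = true;
            }
            // per-row work for the child would happen here
        }
    }

    // Called when the parent is finished: destroy each registered child once.
    static void finish(List<SketchWrapper> destroyList) {
        for (SketchWrapper child : destroyList) {
            child.destroy();
            resetEntity(child);
        }
        destroyList.clear();
    }

    // Clear the flag on an entity and, recursively, on its descendants.
    static void resetEntity(SketchWrapper w) {
        w.initalized = false;
        for (SketchWrapper c : w.children) {
            resetEntity(c);
        }
    }
}
```

However many parent rows are processed, each child lands on the destroy list once and has destroy() called once when the parent finishes.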

I apologize for this very long explanation, but I hope it is helpful.  Obviously I've made
design decisions here that you may (or may not) disagree with.  Basically I need to have an
"entity.destroy()" that is guaranteed to get called only once, at the time the entity is done
executing.  If you would like this done differently, let me know what you have in mind and
I can try to change it.

Do you now understand why I am using an "initalized" flag?  Is this ok as-is, or if not, how
would you like the design changed?
> DIH Cache Improvements
> ----------------------
>                 Key: SOLR-2382
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch,
SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
SOLR-2382-properties.patch, SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch,
SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch,
SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation
that best suits their data and application.
>  2. Provide a means to temporarily cache a child Entity's data without needing to create
a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).
>  3. Provide a means to write the final (root entity) DIH output to a cache rather than
to Solr.  Then provide a way for a subsequent DIH call to use the cache as an Entity input.
 Also provide the ability to do delta updates on such persistent caches.
>  4. Provide the ability to partition data across multiple caches that can then be fed
back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity data prior
to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism
and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  2. We needed the ability to gather data from long-running entities by a process that
runs separate from our main indexing process.
>  3. We wanted the ability to do a delta import of only the entities that changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields change.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 entity's
data changed.
>   - Persistent DIH caches solve this problem.
>  4. We want the ability to index several documents in parallel (using 1.4.1, which did
not have the "threads" parameter).
>  5. In the future, we may need to use Shards, creating a need to easily partition our
source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
>     - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor
(now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  I believe
this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar, so to use
or evaluate this patch, download bdb-je from

>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the entity data
to be cached (see EntityProcessorBase & DIHCacheProperties).
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache
as DIH Entity Input.
>  5. Support a "partition" parameter with both DIHCacheWriter and DIHCacheProcessor to
allow for easy partitioning of source entity data.
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache)
once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented destroy() was
LineEntityProcessor, so this is not a very invasive change.
> General Notes:
> We are near completion in converting our search functionality from a legacy search engine
to Solr.  However, I found that DIH did not support caching to the level of our prior product's
data import utility.  In order to get our data into Solr, I created these caching enhancements.
 Because I believe this has broad application, and because we would like this feature to be
supported by the Community, I have front-ported this, enhanced, to Trunk.  I have also added
unit tests and verified that all existing test cases pass.  I believe this patch maintains
backwards-compatibility and would be a welcome addition to a future version of Solr.
