lucene-dev mailing list archives

From "James Dyer (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2943) DIHCacheWriter & DIHCacheProcessor (entity processor)
Date Mon, 05 Dec 2011 19:02:40 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2943:
-----------------------------

    Description: 
This is a spin-off of SOLR-2382.

Currently DIH requires users to retrieve, join and index all data for a full or delta update
in one big step.  This issue is to allow us to break this into individual steps.  The idea
is to have multiple "data-config.xml" files, some of which retrieve and cache data while others
join and index data.  

This is useful when Solr Records are a conglomeration of several data sources.  With this
feature, each data source can be retrieved and cached separately.  Once all data sources have
been retrieved, they can be joined and indexed in a final step.  When doing a delta update,
only the data sources that change need to have their caches updated (alternatively, frequently-changing
data can remain un-cached while the more static data is cached).  This is particularly useful
because Lucene/Solr cannot do a true "update" operation.  DIH Caches also
provide a handy way to archive source data for which there is no stable system-of-record.
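
To make the staged workflow concrete, here is a minimal conceptual sketch in Python.  It is
not DIH code: plain dictionaries stand in for persistent DIH caches, and every name in it
(cache_source, apply_delta, join_and_index, the sample rows) is hypothetical and chosen only
for illustration.

```python
# Conceptual sketch only: plain dicts stand in for persistent DIH caches.
# All function names and sample rows are hypothetical, not part of DIH.

def cache_source(rows, key):
    """Step 1: retrieve one data source and cache its rows by primary key."""
    return {row[key]: row for row in rows}

def apply_delta(cache, changed_rows, key):
    """Delta update: overwrite only the rows that changed in one source."""
    for row in changed_rows:
        cache[row[key]] = row
    return cache

def join_and_index(parent_cache, child_cache):
    """Final step: join the cached sources into complete documents."""
    docs = []
    for pk, parent in sorted(parent_cache.items()):
        doc = dict(parent)
        doc.update(child_cache.get(pk, {}))
        docs.append(doc)
    return docs

# Two sources retrieved and cached independently.
products = cache_source([{"id": 1, "name": "lamp"}, {"id": 2, "name": "desk"}], "id")
prices = cache_source([{"id": 1, "price": 20}, {"id": 2, "price": 150}], "id")

# Only the price source changed, so only its cache is refreshed.
apply_delta(prices, [{"id": 2, "price": 120}], "id")

# Whole documents are rebuilt from the caches in the final join step.
docs = join_and_index(products, prices)
```

In this sketch the product source is never re-read during the delta; only the price cache
is touched before the join rebuilds the full documents.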

Implementation Details:

- The DIHCacheWriter allows us to write the final (root entity) DIH output to a DIHCache rather
than to Solr.  Caches can be created from scratch ("full-update") or existing caches can be
modified ("delta-update").

- The DIHCacheProcessor is an Entity Processor that reads a DIHCache.  This Entity Processor
can be used for both Root Entities and Child Entities.  Cached data can be read back, joined
to other Entities and indexed.

- Both DIHCacheWriter and DIHCacheProcessor support partitioning.  DIHCacheWriter can write
to a partitioned cache while DIHCacheProcessor can read back a particular partition.  This
can be handy when indexing to multiple shards.

- This patch is 100% stand-alone from the rest of DIH, so while users can patch and rebuild
the DIH .jar file to include these classes, doing so is unnecessary.  To use this functionality,
simply include the code here in the classpath (e.g., in SOLR_HOME/lib).

- In addition to this patch, a persistent cache implementation is required.
  - See SOLR-2948 for a DIH Cache Implementation built on Lucene (no additional dependencies).
  - See SOLR-2613 for a DIH Cache Implementation backed with BDB-JE (we use this in Production).
  - Other Cache Implementations (hopefully) will be developed in the future and become available
    for general use.

- This patch includes extensive unit tests.  A MockDIHCache that supports persistence and
delta updates facilitates the tests.  Do not attempt to use MockDIHCache for anything other
than testing or as a reference for developing your own DIHCache implementations.
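
The partitioning idea from the list above can be sketched as follows.  This is a conceptual
Python illustration, not the DIHCacheWriter or DIHCacheProcessor API; all names are
hypothetical.  The description does not say how rows are assigned to partitions, so hashing
the key modulo the partition count is purely an assumption.

```python
# Conceptual sketch only: hypothetical names, not the DIHCacheWriter API.
# The patch description does not say how rows are assigned to partitions;
# hashing the key modulo the partition count is an assumption.
import zlib

def partition_for(key, num_partitions):
    """Pick a partition with a stable hash (CRC32) of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def write_partitioned(rows, key, num_partitions):
    """Writer side: spread rows across num_partitions separate caches."""
    partitions = [dict() for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[key], num_partitions)][row[key]] = row
    return partitions

def read_partition(partitions, number):
    """Reader side: return only the rows of one partition, ordered by key."""
    return [partitions[number][k] for k in sorted(partitions[number])]

rows = [{"id": i, "title": "doc %d" % i} for i in range(10)]
partitions = write_partitioned(rows, "id", 2)

# Each shard's indexing run would read back just its own partition.
shard0_rows = read_partition(partitions, 0)
shard1_rows = read_partition(partitions, 1)
```

Under these assumptions every row lands in exactly one partition, so a run that indexes to
one shard only needs to read the partition assigned to it.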


  was:
This is a spin-off of SOLR-2382.

Currently DIH requires users to retrieve, join and index all data for a full or delta update
in one big step.  This issue is to allow us to break this into individual steps.  The idea
is to have multiple "data-config.xml" files, some of which retrieve and cache data while others
join and index data.  

This is useful when Solr Records are a conglomeration of several data sources.  With this
feature, each data source can be retrieved and cached separately.  Once all data sources have
been retrieved, they can be joined and indexed in a final step.  When doing a delta update,
only the data sources that change need to have their caches updated (or frequently-changing
data can remain un-cached while caching the more static data).  This is particularly useful
in light of the fact that Lucene/Solr cannot do a true "update" operation.  DIH Caches also
provide a handy way to archive source data for which there is no stable system-of-record.

Implementation Details:

- The DIHCacheWriter allows us to write the final (root entity) DIH output to a DIHCache rather
than to Solr.  Caches can be created from scratch ("full-update") or existing caches can be
modified ("delta-update").

- The DIHCacheProcessor is an Entity Processor that reads a DIHCache.  This Entity Processor
can be used for both Root Entities and Child Entities.  Cached data can be read back, joined
to other Entities and indexed.

- Both DIHCacheWriter and DIHCacheProcessor support partitioning.  DIHCacheWriter can write
to a partitioned cache while DIHCacheProcessor can read back a particular partition.  This
can be handy when indexing to multiple shards.

- This patch is 100% stand-alone from the rest of DIH, so while users can patch and rebuild
the DIH .jar file to include these classes, it is unnecessary.  To use this functionality,
simply include the code here in the classpath. (ex: in SOLR_HOME/lib)

- In addition to this patch, a persistent cache implementation is required.  See SOLR-2613
for a DIH Cache Implementation backed with BDB-JE.  Other Cache Implementations (hopefully)
will be developed in the future and become available for general use.

- This patch includes extensive unit tests.  A MockDIHCache that supports persistence and
delta updates facilitates the tests.  Do not attempt to use MockDIHCache for anything other
than testing or as a reference for developing your own DIHCache implementations.


    
> DIHCacheWriter & DIHCacheProcessor (entity processor)
> -----------------------------------------------------
>
>                 Key: SOLR-2943
>                 URL: https://issues.apache.org/jira/browse/SOLR-2943
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 4.0
>            Reporter: James Dyer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: SOLR-2943.patch, SOLR-2943.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

