lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandler" by NoblePaul
Date Mon, 31 Mar 2008 05:17:12 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

------------------------------------------------------------------------------
  = Overview =
  
  == Motivation ==
- Most applications store data in relational databases and searching over such data is a common
use-case. However, there is no standard way to import this data into SOLR index requiring
custom tools external to SOLR.
+ Most applications store data in relational databases, and searching over such data is a common
use case. However, there is no standard way to import this data into the SOLR index, so custom
tools external to SOLR are required. Another common use case is indexing data available from REST
data sources (e.g. RSS), XML files, etc.
  
  == Goals ==
-  * Read data residing in relational databases
+  * Read data residing in relational databases 
   * Build SOLR documents by aggregating data from multiple columns and tables according to
configuration
   * Update SOLR with such documents
   * Provide ability to do full imports according to configuration
   * Detect inserts/update deltas (changes) and do delta imports (we assume a last-modified
timestamp column for this to work)
   * Schedule full imports and delta imports
+  * Read and index data from XML (over HTTP or from files) according to configuration
+  * Make it possible to plug in any kind of datasource (FTP, SCP, etc.) and any other format
of the user's choice (JSON, CSV, etc.)
  
  = Design Overview =
  As the name suggests, this is implemented as a SolrRequestHandler. The configuration is
provided in two places:
-  * solrconfig.xml (data source information is read from here e.g. JDBC Driver, JDBC URL,
Username, Password etc.)
+  * solrconfig.xml: data source information is read from here (for a JDBC datasource: JDBC driver,
JDBC URL, user name, password, etc.)
-  * data-config.xml (DB Table/column to SOLR document mapping comes here)
- 
- 
+  * data-config.xml 
+    * How to fetch data (queries, URLs, etc.)
+    * What to read (result set columns, XML fields, etc.)
+    * How to process it (modify/add/remove fields)
  = Usage with databases =
  In order to use this handler, the following steps are required.
   * Define a data-config.xml and specify the location of this file in solrconfig.xml under the DataImportHandler
section
@@ -52, +55 @@

      </lst>
    </requestHandler>
  }}}
- note: It is possible to have more than one datasources for a configuration. To configure
another datasource , just keep an another `<lst name="datasource">` entry . There is
an implicit attribute "name" for a datasource. If there are more than one, each extra datasource
must be identified by a unique name like this `<str name="name">datasource-2/str>`
+ note: It is possible to have more than one datasource in a configuration. To configure
another datasource, add another `<lst name="datasource">` entry. A datasource has
an implicit attribute "name". If there is more than one datasource, each extra datasource
must be identified by a unique name, e.g. `<str name="name">datasource-2</str>`
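  
  A minimal sketch of two datasource entries placed side by side. Only the `<lst name="datasource">` entries and the `name` key come from the note; the `driver`, `url`, `user` and `password` key names, the driver classes, the JDBC URLs and the credentials are assumptions made for illustration and may not match what the handler actually expects.
  {{{
      <!-- First datasource: relies on the implicit name -->
      <lst name="datasource">
        <!-- key names and values below are illustrative assumptions -->
        <str name="driver">org.hsqldb.jdbcDriver</str>
        <str name="url">jdbc:hsqldb:/path/to/first/db</str>
        <str name="user">sa</str>
        <str name="password">secret</str>
      </lst>
      <!-- Extra datasource: must carry a unique name -->
      <lst name="datasource">
        <str name="name">datasource-2</str>
        <str name="driver">com.mysql.jdbc.Driver</str>
        <str name="url">jdbc:mysql://localhost/seconddb</str>
        <str name="user">reader</str>
        <str name="password">secret</str>
      </lst>
  }}}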
  
  == Configuration in data-config.xml ==
  A SOLR document can be considered as a de-normalized schema having fields whose values come
from multiple tables.
@@ -62, +65 @@

  In order to get data from the database, our design philosophy revolves around 'templatized
SQL' entered by the user for each entity. This gives the user the entire power of SQL when
needed. The root entity is the central table whose columns can be used to join this table
with other child entities.
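  
  As a rough sketch of the idea, a root entity and a child entity joined on a root column might look like the fragment below. The `item` and `feature` tables, their columns, the `<dataConfig>`/`<document>` wrapper elements, the `query` attribute and the `${item.ID}` placeholder syntax are all assumptions for illustration; this part of the page does not spell them out.
  {{{
  <dataConfig>
    <document name="products">
      <!-- root entity: the central table (hypothetical table "item") -->
      <entity name="item" query="select * from item">
        <field column="ID" name="id" />
        <!-- child entity: its templatized SQL refers to a column of the root row -->
        <entity name="feature"
                query="select description from feature where item_id='${item.ID}'">
          <field column="description" name="features" />
        </entity>
      </entity>
    </document>
  </dataConfig>
  }}}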
  
  === Schema for the data config ===
-   The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary
and depends on the `processor` and `transformer`. For !JdbcdataSource the entity attributes
are 
+   The dataconfig does not have a rigid schema. The attributes of an entity/field are arbitrary
and depend on the `processor` and `transformer`. 
- The default attributes for an entity
+ The default attributes for an entity are:
   * '''`name`''' (required) : A unique name used to identify an entity
   * '''`processor`''' : Required only if the datasource is not an RDBMS. (The default value
is `SqlEntityProcessor`)
   * '''`transformer`'''  : Transformers to be applied on this entity. (See the transformer
section)
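  
  The attributes listed above could be combined on a single entity as in this small sketch. The entity name and the transformer class `com.example.ItemTransformer` are hypothetical. `processor` is written out only to show where it goes, since `SqlEntityProcessor` is the default.
  {{{
  <!-- name is required; processor is shown explicitly although it is the default -->
  <entity name="item"
          processor="SqlEntityProcessor"
          transformer="com.example.ItemTransformer">
  </entity>
  }}}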
