lucene-solr-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandler" by TatsuyaOiwa
Date Thu, 01 Mar 2012 11:24:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DataImportHandler" page has been changed by TatsuyaOiwa:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=309&rev2=310

  = Overview =
  == Goals ==
   * Read data residing in relational databases
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
   * Build Solr doc
  
+ ---- /!\ '''Edit conflict - your version:''' ----
+  * Build Solr documents by aggregating data from multiple columns and tables according to
configuration
+  * Update Solr with such documents
+  * Provide ability to do full imports according to configuration
+  * Detect inserts/update deltas (changes) and do delta imports (we assume a last-modified
timestamp column for this to work)
+  * Schedule full imports and delta imports
+  * Read and Index data from xml/(http/file) based on configuration
+  * Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format
of user choice (JSON,csv etc)
+ 
+ = Design Overview =
+ The Handler has to be registered in the solrconfig.xml as follows.
+ 
+ {{{
+   <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
+     <lst name="defaults">
+       <str name="config">/home/username/data-config.xml</str>
+     </lst>
+   </requestHandler>
+ }}}
+ As the name suggests, this is implemented as a SolrRequestHandler. The configuration is
provided in two places:
+ 
+  * solrconfig.xml . The data config file location is added here
+  * The datasource also can be added here. Or it can be put directly into the data-config.xml
+  * data-config.xml
+   * How to fetch data (queries,url etc)
+   * What to read ( resultset columns, xml fields etc)
+   * How to process (modify/add/remove fields)
+ 
+ = Usage with RDBMS =
+ In order to use this handler, the following steps are required.
+ 
+  * Define a data-config.xml and specify the location this file in solrconfig.xml under DataImportHandler
section
+  * Give connection information (if you choose to put the datasource information in solrconfig)
+  * Open the DataImportHandler page to verify if everything is in order http://localhost:8983/solr/dataimport
+  * Use full-import command to do a full import from the database and add to Solr index
+  * Use delta-import command to do a delta import (get new inserts/updates) and add to Solr
index
+ 
+ <<Anchor(dsconfig)>>
+ 
+ == Configuring DataSources ==
+ Add the tag 'dataSource' directly under the 'dataConfig' tag.
+ 
+ {{{
+ <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname"
user="db_username" password="db_password"/>
+ }}}
+  * The datasource configuration can also be done in solr config xml [[#solrconfigdatasource]]
+  * The attribute 'type' specifies the implementation class. It is optional. The default
value is `'JdbcDataSource'`
+  * The attribute 'name' can be used if there are [[#multipleds|multiple datasources]] used
by multiple entities
+  * All other attributes in the <dataSource> tag are specific to the particular dataSource
implementation being configured.
+  * [[#datasource|See here]] for plugging in your own
+ 
+ === Oracle Example ===
+ You might need to download and install the [[http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html|Oracle
JDBC Driver]] in the /lib directory of your Solr installation.
+ 
+ {{{
+ <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//hostname:port/SID"
user="db_username" password="db_password"/>
+ }}}
+ <<Anchor(multipleds)>>
+ 
+ === Multiple DataSources ===
+ It is possible to have more than one datasources for a configuration. To configure an extra
datasource , just keep an another 'dataSource'  tag . There is an implicit attribute "name"
for a datasource. If there are more than one, each extra datasource must be identified by
a unique name  `'name="datasource-2"'` .
+ 
+ eg:
+ 
+ {{{
+ <dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname"
user="db_username" password="db_password"/>
+ <dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db2-host/dbname"
user="db_username" password="db_password"/>
+ }}}
+ in your entities:
+ 
+ {{{
+ ..
+ <entity name="one" dataSource="ds-1" ...>
+    ..
+ </entity>
+ <entity name="two" dataSource="ds-2" ...>
+    ..
+ </entity>
+ ..
+ }}}
+ <<Anchor(jdbcdatasource)>>
+ 
+ == Configuring JdbcDataSource ==
+ The attributes accepted by !JdbcDataSource are ,
+ 
+  * '''`driver`''' (required): The jdbc driver classname
+  * '''`url`''' (required) : The jdbc connection url
+  * '''`user`''' : User name
+  * '''`password`''' : The password
+  * '''`batchSize`''' : The batchsize used in jdbc connection. Use a value of '-1' in case
of   `setFetchSize() ` exception.
+  * '''`convertType`''' :(true/false)Default is 'false' Automatically reads the data in the
target Solr data-type
+  * '''`autoCommit`''' : If set to 'false' it sets  `setAutoCommit(false)` <!> [[Solr1.4]]
+  * '''`readOnly`''' : If this is set to 'true' , it sets `setReadOnly(true)`, `setAutoCommit(true)`,
`setTransactionIsolation(TRANSACTION_READ_UNCOMMITTED)`,`setHoldability(CLOSE_CURSORS_AT_COMMIT)`
on the connection <!> [[Solr1.4]]
+  * '''`transactionIsolation`''' : The possible values are [TRANSACTION_READ_UNCOMMITTED,
TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ,TRANSACTION_SERIALIZABLE,TRANSACTION_NONE]
<!> [[Solr1.4]]
+ 
+ Any extra attributes put into the tag are directly passed on to the jdbc driver.
+ 
+ == Configuration in data-config.xml ==
+ A Solr document can be considered as a de-normalized schema having fields whose values come
from multiple tables.
+ 
+ The data-config.xml starts by defining a `document` element. A `document` represents one
kind of document.  A document contains one or more root entities. A root entity can contain
multiple sub-entities which in turn can  contain other entities. An entity is a table/view
in a relational database. Each entity can contain multiple fields. Each field corresponds
to a column in the resultset returned by the ''query'' in the entity. For each field, mention
the column name in the resultset. If the column name is different from the solr field name,
then another attribute ''name'' should be given. Rest of the required attributes such as ''type''
will be inferred directly from the Solr schema.xml. (Can be overridden)
+ 
+ In order to get data from the database, our design philosophy revolves around 'templatized
sql' entered by the user for each entity. This gives the user the entire power of SQL if he
needs it. The root entity is the central table whose columns can be used to join this table
with other child entities.
+ 
+ === Schema for the data config ===
+  . The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary
and depends on the `processor` and `transformer`.
+ 
+ The default attributes for an entity are:
+ 
+  * '''`name`''' (required) : A unique name used to identify an entity
+  * '''`processor`''' : Required only if the datasource is not RDBMS . (The default value
is `SqlEntityProcessor`)
+  * '''`transformer`'''  : Transformers to be applied on this entity. (See the transformer
section)
+  * '''`dataSource`''' : The name of a datasource as put in the the datasource .(Used if
there are multiple datasources)
+  * '''`threads`''' :  The no:of of threads to use to run this entity. This must be placed
on or above a 'rootEntity'. [[Solr3.1]]
+  * '''`pk`''' : The primary key for the entity. It is '''optional''' and only needed when
using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they both
can be the same.
+  * '''`rootEntity`''' : By default the entities falling under the document are root entities.
If it is set to false , the entity directly falling under that entity will be treated as the
root entity (so on and so forth). For every row returned by the root entity a document is
created in Solr
+  * '''`onError`''' : (abort|skip|continue) . The default value is 'abort' . 'skip' skips
the current document. 'continue' continues as if the error did not happen . <!> [[Solr1.4]]
+  * '''`preImportDeleteQuery`''' : before full-import this will be used to cleanup the index
instead of using '*:*' .This is honored only on an entity that is an immediate sub-child of
<document> <!> [[Solr1.4]].
+  * '''`postImportDeleteQuery`''' : after full-import this will be used to cleanup the index
<!>. This is honored only on an entity that is an immediate sub-child of <document>
[[Solr1.4]].
+ 
+ For !SqlEntityProcessor the entity attributes are :
+ 
+  * '''`query`''' (required) : The sql string using which to query the db
+  * '''`deltaQuery`''' : Only used in delta-import
+  * '''`parentDeltaQuery`''' : Only used in delta-import
+  * '''`deletedPkQuery`''' : Only used in delta-import
+  * '''`deltaImportQuery`''' : (Only used in delta-import) . If this is not present , DIH
tries to construct the import query by(after identifying the delta) modifying the '`query`'
(this is error prone). There is a namespace `${dataimporter.delta.<column-name>}` which
can be used in this query.  e.g: `select * from tbl where id=${dataimporter.delta.id}`  <!>
[[Solr1.4]].
+ 
+ == Commands ==
+ <<Anchor(commands)>> The handler exposes all its API as http requests . The
following are the possible operations
+ 
+  * '''full-import''' : Full Import operation can be started by hitting the URL `http://<host>:<port>/solr/dataimport?command=full-import`
+ 
+   * This operation will be started in a new thread and the ''status'' attribute in the response
should be shown ''busy'' now.
+   * The operation may take some time depending on size of dataset.
+   * When full-import command is executed, it stores the start time of the operation in a
file located at ''conf/dataimport.properties''
+   * This stored timestamp is used when a delta-import operation is executed.
+   * Queries to Solr are not blocked during full-imports.
+   * It takes in extra parameters:
+    * '''entity''' : Name of an entity directly under the <document> tag. Use this
to execute one or more entities selectively. Multiple 'entity' parameters can be passed on
to run multiple entities at once. If nothing is passed, all entities are executed.
+    * '''clean''' : (default 'true'). Tells whether to clean up the index before the indexing
is started.
+    * '''commit''' : (default 'true'). Tells whether to commit after the operation.
+    * '''optimize''' : (default 'true' up to Solr 3.6, 'false' afterwards). Tells whether
to optimize after the operation. Please note: this can be a very expensive operation and usually
does not make sense for delta-imports. 
+    * '''debug''' : (default 'false'). Runs in debug mode. It is used by the interactive
development mode ([[#interactive|see here]]).
+ 
+     * Please note that in debug mode, documents are never committed automatically. If you
want to run debug mode and commit the results too, add 'commit=true' as a request parameter.
+  * '''delta-import''' : For incremental imports and change detection run the command `http://<host>:<port>/solr/dataimport?command=delta-import`
. It supports the same clean, commit, optimize and debug parameters as full-import command.
+  * '''status''' : To know the status of the current command, hit the URL `http://<host>:<port>/solr/dataimport`
. It gives an elaborate statistics on no. of docs created, deleted, queries run, rows fetched,
status etc.
+  * '''reload-config''' : If the data-config is changed and you wish to reload the file without
restarting Solr. Run the command `http://<host>:<port>/solr/dataimport?command=reload-config`
.
+  * '''abort''' : Abort an ongoing operation by hitting the URL `http://<host>:<port>/solr/dataimport?command=abort`
.
+ 
+ == Full Import Example ==
+ Let us consider an example. Suppose we have the following schema in our database
+ 
+ {{attachment:example-schema.png}}
+ 
+ This is a relational model of the same schema that Solr currently ships with. We will use
this as an example to build a data-config.xml for DataImportHandler. We've created a sample
database with this schema using [[http://hsqldb.org/|HSQLDB]].  To run it, do the following
steps:
+ 
+  1. Look at the example/example-DIH directory in the solr download. It contains a complete
solr home with all the configuration you need to execute this as well as the RSS example (given
later in this page).
+  1. Use the ''example-DIH/solr'' directory as your solr home.  Start Solr by running from
the root {{{/examples}}} directory: {{{java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar}}}
+  1. Hit http://localhost:8983/solr/db/dataimport with a browser to verify the configuration.
+  1. Hit http://localhost:8983/solr/db/dataimport?command=full-import to do a full import.
+ 
+ The ''solr'' directory is a MultiCore Solr home. It has two cores, one for the DB example
(this one) and one for an RSS example (new feature).
+ 
+  * The data-config.xml used for this example is:
+ 
+ {{{
+ <dataConfig>
+ <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa"
/>
+     <document name="products">
+         <entity name="item" query="select * from item">
+             <field column="ID" name="id" />
+             <field column="NAME" name="name" />
+             <field column="MANU" name="manu" />
+             <field column="WEIGHT" name="weight" />
+             <field column="PRICE" name="price" />
+             <field column="POPULARITY" name="popularity" />
+             <field column="INSTOCK" name="inStock" />
+             <field column="INCLUDES" name="includes" />
+ 
+             <entity name="feature" query="select description from feature where item_id='${item.ID}'">
+                 <field name="features" column="description" />
+             </entity>
+             <entity name="item_category" query="select CATEGORY_ID from item_category
where item_id='${item.ID}'">
+                 <entity name="category" query="select description from category where
id = '${item_category.CATEGORY_ID}'">
+                     <field column="description" name="cat" />
+                 </entity>
+             </entity>
+         </entity>
+     </document>
+ </dataConfig>
+ }}}
+ Here, the root entity is a table called "item" whose primary key is a column "id". Data
can be read from this table with the query "select * from item". Each item can have multiple
"features" which are in the table ''feature'' inside the column ''description''. Note the
query in ''feature'' entity:
+ 
+ {{{
+    <entity name="feature" query="select description from feature where item_id='${item.id}'">
+        <field name="feature" column="description" />
+    </entity>
+ }}}
+ The ''item_id'' foreign key in feature table is joined together with ''id'' primary key
in ''item'' to retrieve rows for each row in ''item''. In a similar fashion, we join ''item''
and 'category' (which is a many-to-many relationship). Notice how we join these two tables
using the intermediate table ''item_category'' again using templated SQL.
+ 
+ {{{
+  <entity name="item_category" query="select category_id from item_category where item_id='${item.id}'">
+                 <entity name="category" query="select description from category where
id = '${item_category.category_id}'">
+                     <field column="description" name="cat" />
+                 </entity>
+             </entity>
+ }}}
+ <<Anchor(shortconfig)>>
+ 
+ === A shorter data-config ===
+ In the above example, there are mappings of fields to Solr fields. It is p
+ 

Mime
View raw message