Mailing-List: contact solr-commits-help@lucene.apache.org; run by ezmlm
Reply-To: solr-dev@lucene.apache.org
From: Apache Wiki
To: solr-commits@lucene.apache.org
Date: Mon, 31 Mar 2008 05:17:12 -0000
Message-ID: <20080331051712.25039.70002@eos.apache.org>
Subject: [Solr Wiki] Update of "DataImportHandler" by NoblePaul

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

------------------------------------------------------------------------------
  = Overview =
  == Motivation ==
- Most applications store data in relational databases and searching over such data is a common use-case. However, there is no standard way to import this data into SOLR index requiring custom tools external to SOLR.
+ Most applications store data in relational databases and searching over such data is a common use-case. However, there is no standard way to import this data into SOLR index requiring custom tools external to SOLR. Another common use case is data available in REST datasources (eg: RSS) , xml files etc
  
  == Goals ==
-  * Read data residing in relational databases
+  * Read data residing in relational databases
   * Build SOLR documents by aggregating data from multiple columns and tables according to configuration
   * Update SOLR with such documents
   * Provide ability to do full imports according to configuration
   * Detect inserts/update deltas (changes) and do delta imports (we assume a last-modified timestamp column for this to work)
   * Schedule full imports and delta imports
+  * Read and Index data from xml/(http/file) based on configuration
+  * Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format of user choice (JSON,csv etc)
  
  = Design Overview =
  As the name suggests, this is implemented as a SolrRequestHandler. The configuration is provided in two places:
  
-  * solrconfig.xml (data source information is read from here e.g. JDBC Driver, JDBC URL, Username, Password etc.)
+  * solrconfig.xml . data source information is read from here. (For a Jdbc datasource JDBC Driver, JDBC URL, User name, Password etc.)
-  * data-config.xml (DB Table/column to SOLR document mapping comes here)
- 
- 
+  * data-config.xml
+    * How to fetch data (queries,url etc)
+    * What to read ( resultset columns, xml fields etc)
+    * How to process (modify/add/remove fields)
  = Usage with databases =
  In order to use this handler, the following steps are required.
   * Define a data-config.xml and specify the location this file in solrconfig.xml under DataImportHandler section
@@ -52, +55 @@

  }}}
  
- note: It is possible to have more than one datasources for a configuration. To configure another datasource , just keep an another `` entry . There is an implicit attribute "name" for a datasource. If there are more than one, each extra datasource must be identified by a unique name like this `datasource-2/str>`
+ note: It is possible to have more than one datasources for a configuration. To configure another datasource , just keep an another `` entry . There is an implicit attribute "name" for a datasource. If there are more than one, each extra datasource must be identified by a unique name . eg: `datasource-2/str>`
  
  == Configuration in data-config.xml ==
  A SOLR document can be considered as a de-normalized schema having fields whose values come from multiple tables.
@@ -62, +65 @@

  In order to get data from the database, our design philosophy revolves around 'templatized sql' entered by the user for each entity. This gives the user the entire power of SQL if he needs it. The root entity is the central table whose columns can be used to join this table with other child entities.
  
  === Schema for the data config ===
- The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary and depends on the `processor` and `transformer`. For !JdbcdataSource the entity attributes are
+ The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary and depends on the `processor` and `transformer`.
- The default attributes for an entity
+ The default attributes for an entity are:
   * '''`name`''' (required) : A unique name used to identify an entity
   * '''`processor`''' : Required only if the datasource is not RDBMS . (The default value is `SqlEntityProcessor`)
   * '''`transformer`''' : Transformers to be applied on this entity. (See the transformer section)