lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandler" by ShalinMangar
Date Mon, 28 Sep 2009 20:06:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DataImportHandler" page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=210&rev2=211

  === XPathEntityProcessor ===
  Used when indexing XML type data. The !DataSource must be of type `DataSource<Reader>`.
URL!DataSource <!> [[Solr1.4]] or !FileDataSource is commonly used with X!PathEntityProcessor.
  
+ === urce, URL!DataSourewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EventListener.java?view=markup|EventListener]].
- === FileListEntityProcessor ===
- A simple entity processor which can be used to enumerate the list of files from a file system
based on some criteria. It does not use a !DataSource. The entity attributes are:
-  * '''`fileName`''' : (required) A regex pattern to identify files
-  * '''`baseDir`''' : (required) The base directory (absolute path)
-  * '''`recursive`''' : Whether to list files recursively. Default is 'false'
-  * '''`excludes`''' : A regex pattern of file names to exclude
-  * '''`newerThan`''' : A date parameter. Use the format (`yyyy-MM-dd HH:mm:ss`). It can also
be a datemath string, e.g. ('NOW-3DAYS'); the single quotes are necessary. Or it can be a valid
variableresolver format like (${var.name})
-  * '''`olderThan`''' : A date parameter. Same rules as above
-  * '''`rootEntity`''' : It must be false for this entity (unless you wish to index just the filenames).
An entity directly under the <document> is a root entity, which means that for each row
emitted by the root entity one document is created in Solr/Lucene. In this case we
do not wish to make one document per file; we wish to make one document per row emitted by
the following entity 'x'. Because the entity 'f' has rootEntity=false, the entity directly
under it becomes a root entity automatically and each row emitted by it becomes a document.
-  * '''`dataSource`''' : If you use Solr1.3 it must be set to "null" because this processor does not
use any DataSource. There is no need to specify it in Solr1.4. It just means that we won't create
a DataSource instance. (In most cases there is only one !DataSource (a !JdbcDataSource)
and all entities just use it. In the case of !FileListEntityProcessor a !DataSource is not necessary.)
- 
- example:
- {{{
- <dataConfig>
-     <dataSource type="FileDataSource" />
-     <document>
-         <entity name="f" processor="FileListEntityProcessor" baseDir="/some/path/to/files"
fileName=".*xml" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="null">
-             <entity name="x" processor="XPathEntityProcessor" forEach="/the/record/xpath"
url="${f.fileAbsolutePath}">
-                 <field column="full_name" xpath="/field/xpath"/>
-             </entity>
-         </entity>
-     </document>
- </dataConfig>
- }}}
- Do not miss the `rootEntity` attribute. The implicit fields generated by the !FileListEntityProcessor
are `fileAbsolutePath, fileSize, fileLastModified, fileName` and these are available for use
within the entity 'x' as shown above. Note that !FileListEntityProcessor returns
a list of pathnames and that the subsequent entity must use the !FileDataSource to fetch the
content of each file.
- 
- === CachedSqlEntityProcessor ===
- <<Anchor(cached)>>
- 
- This is an extension of the !SqlEntityProcessor. This !EntityProcessor helps reduce the
number of DB queries executed by caching the rows. It does not help to use it in the root-most
entity because only one SQL query is run for that entity.
- 
- Example 1.
- {{{
- <entity name="x" query="select * from x">
-     <entity name="y" query="select * from y where xid=${x.id}" processor="CachedSqlEntityProcessor">
-     </entity>
- </entity>
- }}}
- 
- The usage is exactly the same as that of the !SqlEntityProcessor. When a query is run the results are stored, and
if the same query is run again the rows are fetched from the cache and returned.
- 
- Example 2:
- {{{
- <entity name="x" query="select * from x">
-     <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor"  where="xid=x.id">
-     </entity>
- </entity>
- }}}
- 
- The difference from the previous example is the 'where' attribute. In this case the query fetches
all the rows from the table and stores them in the cache. The magic is in the 'where'
value. The cache stores the rows with the 'xid' value in 'y' as the key. The value for 'x.id'
is evaluated every time the entity has to be run, the value is looked up in the cache and
the matching rows are returned.
- 
- In the 'where' value, the lhs (the part before '=') is the column in y and the rhs (the part after
'=') is the value to be computed for looking up the cache.
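
The lookup described above can be pictured with a small, self-contained Java sketch. It is an illustration only, not the !CachedSqlEntityProcessor source: the class name `WhereCacheSketch`, its methods and the sample rows are hypothetical, and it assumes the cache is simply a map from the key column value to the list of matching rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names and structure are assumptions, not DIH internals.
class WhereCacheSketch {
    // Rows of 'y' grouped by the value of the key column (the lhs of 'where').
    private final Map<Object, List<Map<String, Object>>> cache = new HashMap<>();

    // Build the cache once from every row returned by "select * from y".
    WhereCacheSketch(List<Map<String, Object>> allRows, String keyColumn) {
        for (Map<String, Object> row : allRows) {
            cache.computeIfAbsent(row.get(keyColumn), k -> new ArrayList<>()).add(row);
        }
    }

    // Called once per parent row with the resolved rhs of 'where' (the value of x.id).
    List<Map<String, Object>> rowsFor(Object lookupValue) {
        List<Map<String, Object>> hit = cache.get(lookupValue);
        return hit != null ? hit : Collections.<Map<String, Object>>emptyList();
    }

    // Small helper to build a sample row of 'y'.
    static Map<String, Object> row(int xid, String name) {
        Map<String, Object> r = new HashMap<>();
        r.put("xid", xid);
        r.put("name", name);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> y = new ArrayList<>();
        y.add(row(1, "a"));
        y.add(row(1, "b"));
        y.add(row(2, "c"));
        WhereCacheSketch sketch = new WhereCacheSketch(y, "xid");
        // No further query against 'y' is needed for each parent row.
        System.out.println(sketch.rowsFor(1).size());
        System.out.println(sketch.rowsFor(3).size());
    }
}
```

The point of the design is that 'y' is read once, after which every parent row costs a map lookup rather than a database round trip.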
- 
- === PlainTextEntityProcessor ===
- <<Anchor(plaintext)>>
- <!> [[Solr1.4]]
- 
- This !EntityProcessor reads all content from the data source into a single implicit field
called 'plainText'. The content is not parsed in any way, however you may add transformers
to manipulate the data within 'plainText' as needed or to create other additional fields.
- 
- example:
- {{{
- <entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt" dataSource="data-source-name">
-    <!-- copies the text to a field called 'text' in Solr-->
-   <field column="plainText" name="text"/>
- </entity>
- }}}
- 
- Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, URL!DataSource)
- 
- === LineEntityProcessor ===
- <<Anchor(LineEntityProcessor)>>
- <!> [[Solr1.4]]
- 
- This !EntityProcessor reads all content from the data source on a line by line basis; a
field called 'rawLine' is returned for each line read. The content is not parsed in any way,
however you may add transformers to manipulate the data within 'rawLine' or to create other
additional fields.
- 
- The lines read can be filtered by two regular expressions, '''acceptLineRegex''' and '''omitLineRegex'''.
- This entity's additional attributes are:
-  * '''`url`''' : a required attribute that specifies the location of the input file in a
way that is compatible with the configured datasource. If this value is relative and you are
using !FileDataSource or URL!DataSource, it is assumed to be relative to '''baseLoc'''.
-  * '''`acceptLineRegex`''' : an optional attribute that, if present, discards any line which
does not match the regExp.
-  * '''`omitLineRegex`''' : an optional attribute that is applied after any acceptLineRegex
and discards any line which matches this regExp.
- example:
- {{{
- <entity name="jc"
-         processor="LineEntityProcessor"
-         acceptLineRegex="^.*\.xml$"
-         omitLineRegex="/obsolete"
-         url="file:///Volumes/ts/files.lis"
-         rootEntity="false"
-         dataSource="myURIreader1"
-         transformer="RegexTransformer,DateFormatTransformer"
-         >
-    ...
- }}}
- While there are use cases where you might need to create a Solr document per line read from
a file, it is expected that in most cases the lines read will consist of a pathname which
is in turn consumed by another !EntityProcessor
- such as X!PathEntityProcessor.
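
The order in which the two filters apply can be sketched in plain Java. This is a hypothetical illustration, not the !LineEntityProcessor implementation: the class `LineFilterSketch` and its `filter` method are made up, and it assumes that a regex "matches" when it is found anywhere in the line (which is what the `/obsolete` pattern in the example above suggests).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of accept-then-omit line filtering.
class LineFilterSketch {
    // Either regex may be null, meaning that filter is not configured.
    static List<String> filter(List<String> lines, String acceptRegex, String omitRegex) {
        Pattern accept = acceptRegex == null ? null : Pattern.compile(acceptRegex);
        Pattern omit = omitRegex == null ? null : Pattern.compile(omitRegex);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Step 1: drop lines that do not match acceptLineRegex.
            if (accept != null && !accept.matcher(line).find()) {
                continue;
            }
            // Step 2: of the remaining lines, drop those that match omitLineRegex.
            if (omit != null && omit.matcher(line).find()) {
                continue;
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "/data/a.xml", "/data/obsolete/b.xml", "/data/readme.txt");
        // Same patterns as in the entity example above.
        System.out.println(filter(lines, "^.*\\.xml$", "/obsolete"));
    }
}
```

Each surviving line would then be emitted as one row with a single 'rawLine' field.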
- 
- == DataSource ==
- <<Anchor(datasource)>>
- A class can extend `org.apache.solr.handler.dataimport.DataSource` . [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See
source]]
- 
- and can be used as a !DataSource. It must be configured in the dataSource definition
- {{{
- <dataSource type="com.foo.FooDataSource" prop1="hello"/>
- }}}
- and it can be used in the entities like a standard one
- 
- === JdbcDataSource ===
- This is the default. See the [[#jdbcdatasource|example]]. The signature is as follows
- {{{
- public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
- }}}
- It is designed to iterate rows in DB one by one. A row is represented as a Map.
- 
- === URLDataSource ===
- <!> [[Solr1.4]]
- This datasource is often used with X!PathEntityProcessor to fetch content from an underlying
file:// or http:// location. See the documentation [[#httpds|here]] . The signature is as
follows
- {{{
- public class URLDataSource extends DataSource<Reader>
- }}}
- 
- === HttpDataSource ===
- <!> Http!DataSource is being deprecated in favour of URL!DataSource in [[Solr1.4]].
There is no change in functionality between URL!DataSource and !Http!DataSource, only a name
change.
- 
- === FileDataSource ===
- This can be used like a URL!DataSource but is used to fetch content from files on disk. The
only difference from URL!DataSource, when accessing disk files, is how a pathname is specified.
The signature is as follows
- {{{
- public class FileDataSource extends DataSource<Reader>
- }}}
- 
- The attributes are:
-  * '''`basePath`''': (optional) The base path relative to which the value is evaluated if
it is not absolute
-  * '''`encoding`''': (optional) The encoding to use if the files are to be read in an encoding that is not the same
as the platform encoding
- 
- === FieldReaderDataSource ===
- <!> [[Solr1.4]]
- 
- This can be used like a URL!DataSource. The signature is as follows
- {{{
- public class FieldReaderDataSource extends DataSource<Reader>
- }}}
- This can be useful for users who have a DB field containing XML and wish to use a nested
X!PathEntityProcessor to process the field's contents.
- The datasource may be configured as follows
- {{{
-   <dataSource name="f" type="FieldReaderDataSource" />
- }}}
- 
- The entity which uses this datasource must give the variable name in place of a url value, as dataField="field-name".
For instance, if the parent entity 'dbEntity' has a field called 'xmlData', then the child
entity would look like,
- {{{
- <entity dataSource="f" processor="XPathEntityProcessor" dataField="dbEntity.xmlData"/>
- }}}
- 
- === ContentStreamDataSource ===
- <!> [[Solr1.4]]
- 
- Use this to use the POST data as the DataSource. This can be used with any !EntityProcessor
that uses a !DataSource<Reader>
- 
- == EventListeners ==
- An !EventListener can be registered for "onImportStart" and "onImportEnd". It must implement
the interface  [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EventListener.java?view=markup|EventListener]].
  
  {{{
  <dataConfig>
@@ -924, +767 @@

   * '''`$docBoost`''' : Boost the current doc. The value can be a number or the toString
of a number
   * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value has to be the
uniqueKey value of the document <!> [[Solr1.4]]
   * '''`$deleteDocByQuery`''' : Delete docs from Solr by this query. The value must be a Solr
Query <!> [[Solr1.4]]
+  * '''`$stopTransform`''' : Prevents other transformers from executing and affecting the
current row. <!> [[Solr1.4]]
  
+ == Adding datasource in solrconfig.xml ==
+ <<Anchor(solrconfigdatasource)>>
+ 
+ It is possible to configure datasource in solrconfig.xml as well as the data-config.xml,
however the datasource attributes are expressed differently.
+ {{{
+   <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
+     <lst name="defaults">
+       <str name="config">/home/username/data-config.xml</str>
+       <lst name="datasource">
+          <str name="driver">com.mysql.jdbc.Driver</str>
+          <str name="url">jdbc:mysql://localhost/dbname</str>
+          <str name="user">db_username</str>
+          <str name="password">db_password</str>
+       </lst>
+     </lst>
+   </requestHandler>
+ }}}
+ <<Anchor(arch)>>
+ = Architecture =
+ The following diagram describes the logical flow for a sample configuration.
+ 
+ {{attachment:DataImportHandlerOverview.png}}
+ 
+ The use case is as follows:
+ There are 3 datasources: two RDBMS (jdbc1, jdbc2) and one xml/http (B)
+ 
+  * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are configured inor<Map<String,
Object>>>
+ }}}
+ It is designed to iterate rows in DB one by one. A row is represented as a Map.
+ 
+ === URLDataSource ===
+ <!> [[Solr1.4]]
+ This datasource is often used with X!PathEntityProcessor to fetch content from an underlying
file:// or http:// location. See the documentation [[#httpds|here]] . The signature is as
follows
+ {{{
+ public class URLDataSource extends DataSource<Reader>
+ }}}
+ 
+ === HttpDataSource ===
+ <!> Http!DataSource is being deprecated in favour of URL!DataSource in [[Solr1.4]].
There is no change in functionality between URL!DataSource and !Http!DataSource, only a name
change.
+ 
+ === FileDataSource ===
+ This can be used like a URL!DataSource but is used to fetch content from files on disk. The
only difference from URL!DataSource, when accessing disk files, is how a pathname is specified.
The signature is as follows
+ {{{
+ public class FileDataSource extends DataSource<Reader>
+ }}}
+ 
+ The attributes are:
+  * '''`basePath`''': (optional) The base path relative to which the value is evaluated if
it is not absolute
+  * '''`encoding`''': (optional) The encoding to use if the files are to be read in an encoding that is not the same
as the platform encoding
+ 
+ === FieldReaderDataSource ===
+ <!> [[Solr1.4]]
+ 
+ This can be used like a URL!DataSource. The signature is as follows
+ {{{
+ public class FieldReaderDataSource extends DataSource<Reader>
+ }}}
+ This can be useful for users who have a DB field containing XML and wish to use a nested
X!PathEntityProcessor to process the field's contents.
+ The datasource may be configured as follows
+ {{{
+   <dataSource name="f" type="FieldReaderDataSource" />
+ }}}
+ 
+ The entity which uses this datasource must give the variable name in place of a url value, as dataField="field-name".
For instance, if the parent entity 'dbEntity' has a field called 'xmlData', then the child
entity would look like,
+ {{{
+ <entity dataSource="f" processor="XPathEntityProcessor" dataField="dbEntity.xmlData"/>
+ }}}
+ 
+ === ContentStreamDataSource ===
+ <!> [[Solr1.4]]
+ 
+ Use this to use the POST data as the DataSource. This can be used with any !EntityProcessor
that uses a !DataSource<Reader>
+ 
+ == EventListeners ==
+ An !EventListener can be registered for "onImportStart" and "onImportEnd". It must implement
the interface  [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EventListener.java?view=markup|EventListener]].
+ 
+ {{{
+ <dataConfig>
+ <document onImportStart="com.FooStart" onImportEnd="com.FooEnd">
+ ....
+ </document>
+ </dataConfig>
+ }}}
+ 
+ == Special Commands ==
+ Special commands can be given to DIH by adding certain variables to the row returned by
any of the components.
+  * '''`$skipDoc`''' : Skip the current document. Do not add it to Solr. The value can be
the String true/false
+  * '''`$skipRow`''' : Skip the current row. The document will be added with rows from other
entities. The value can be the String true/false
+  * '''`$docBoost`''' : Boost the current doc. The value can be a number or the toString
of a number
+  * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value has to be the
uniqueKey value of the document <!> [[Solr1.4]]
+  * '''`$deleteDocByQuery`''' : Delete docs from Solr by this query. The value must be a Solr
Query <!> [[Solr1.4]]
+  * '''`$stopTransform`''' : Prevents other transformers from executing and affecting the
current row. <!> [[Solr1.4]]
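
As a rough sketch of how a component might issue one of these commands, the following plain Java class adds `$skipDoc` to a row. The class name `SkipInactiveTransformer` and the `status` column are hypothetical, and the sketch assumes the DIH convention that a custom transformer may be a plain class exposing a `transformRow(Map)` method; check that against your Solr version before relying on it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transformer; the class name and the 'status' column are made up.
// When deployed to Solr it would be declared public in its own source file.
class SkipInactiveTransformer {
    // Receives one row, may add special command variables, and returns the row.
    public Object transformRow(Map<String, Object> row) {
        if ("inactive".equals(row.get("status"))) {
            // Ask DIH not to add this document to Solr.
            row.put("$skipDoc", "true");
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", "42");
        row.put("status", "inactive");
        new SkipInactiveTransformer().transformRow(row);
        System.out.println(row.containsKey("$skipDoc"));
    }
}
```

The command travels as an ordinary entry in the row map, so any component that can edit the row can set it.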
+ 
  
  == Adding datasource in solrconfig.xml ==
  <<Anchor(solrconfigdatasource)>>
