lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandler" by ShalinMangar
Date Sun, 30 Mar 2008 16:47:09 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Example for indexing Slashdot RSS feed

------------------------------------------------------------------------------
  If an API supports chunking (because the dataset is too large for one response), multiple calls
need to be made to complete the process.
  X!PathEntityProcessor supports this with a transformer. If the transformer returns a row that
contains a field '''`$hasMore`''' with the value `"true"`, the processor makes another request
with the same url template (the actual value is recomputed before the request is made). A transformer
can also pass a completely new url for the next call by returning a row that contains a field
'''`$nextUrl`''', whose value must be the complete url for the next call.
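
The row contract described above can be sketched roughly as follows. This is only an illustration, not the handler's own code: the class name `PagingTransformer` and the columns `page`, `totalPages` and `nextLink` are hypothetical, and the class is written as a plain Java class with a `transformRow(Map)` method so that it compiles without Solr on the classpath. How such a class is registered with the handler, and whether `$nextUrl` must be accompanied by `$hasMore`, are assumptions here.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical paging transformer: a sketch of the $hasMore / $nextUrl
 * row contract, not part of Solr itself.
 */
public class PagingTransformer {
    // Hypothetical column names, assumed to be filled from the response by xpath fields.
    static final String PAGE = "page";
    static final String TOTAL_PAGES = "totalPages";
    static final String NEXT_LINK = "nextLink";

    public Object transformRow(Map<String, Object> row) {
        int page = Integer.parseInt(String.valueOf(row.get(PAGE)));
        int totalPages = Integer.parseInt(String.valueOf(row.get(TOTAL_PAGES)));

        if (page < totalPages) {
            // Ask the processor to make another request.
            row.put("$hasMore", "true");

            // If the response carried an explicit link to the next chunk, pass it
            // on as the complete url for the next call. Setting it together with
            // $hasMore is an assumption of this sketch.
            Object next = row.get(NEXT_LINK);
            if (next != null) {
                row.put("$nextUrl", next.toString());
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put(PAGE, "1");
        row.put(TOTAL_PAGES, "3");
        row.put(NEXT_LINK, "http://example.com/items?page=2");
        System.out.println(new PagingTransformer().transformRow(row));
    }
}
```

On the last page neither field is set, so the processor has no reason to issue a further request.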
  
- The X!PathEntityProcessor implements a streaming parser which supports a subset of xpath
syntax. Complete xpath syntax is not supported but most of the common use cases are covered
+ The X!PathEntityProcessor implements a streaming parser which supports a subset of xpath
syntax. Complete xpath syntax is not supported but most of the common use cases are covered.
+ 
+ == HttpDataSource Example ==
+ 
+ Download the full import example given in the DB section to try this out. We'll try indexing
the [http://rss.slashdot.org/Slashdot/slashdot Slashdot RSS feed] for this example.
+ 
+ The dataimport section in solrconfig.xml looks like this:
+ {{{
+    <requestHandler name="/dataimport"
+    	class="org.apache.solr.handler.dataimport.DataImportHandler">
+    	<lst name="defaults">
+    		<str name="config">rss-data-config.xml</str>
+    		<lst name="datasource">
+    			<str name="type">HttpDataSource</str>
+    		</lst>
+    	</lst>
+    </requestHandler>
+ }}}
+ 
+ The data-config for this example looks like this:
+ {{{
+ <dataConfig>
+ 
+ 	<document>
+ 		<entity name="slashdot"
+ 				pk="link"
+ 				url="http://rss.slashdot.org/Slashdot/slashdot"
+ 				processor="XPathEntityProcessor"
+ 				forEach="/RDF/channel | /RDF/item"
+ 				transformer="DateFormatTransformer">
+ 				
+ 			<field column="source" xpath="/RDF/channel/title" commonField="true" />
+ 			<field column="source-link" xpath="/RDF/channel/link" commonField="true" />
+ 			<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
+ 			
+ 			<field column="title" xpath="/RDF/item/title" />
+ 			<field column="link" xpath="/RDF/item/link" />
+ 			<field column="description" xpath="/RDF/item/description" />
+ 			<field column="creator" xpath="/RDF/item/creator" />
+ 			<field column="item-subject" xpath="/RDF/item/subject" />
+ 			<field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
+ 			<field column="slash-department" xpath="/RDF/item/department" />
+ 			<field column="slash-section" xpath="/RDF/item/section" />
+ 			<field column="slash-comments" xpath="/RDF/item/comments" />
+ 		</entity>
+ 	</document>
+ </dataConfig>
+ }}}
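
The `dateTimeFormat` attribute takes a `java.text.SimpleDateFormat` pattern. A small standalone sketch of how such a pattern behaves on a timestamp of the shape used in the feed is given below; the class name `DateTimeFormatDemo` and the sample timestamps are made up for illustration, and the time zone is pinned to UTC only to keep the result reproducible. It also shows why the hour letter matters: `HH` is the hour of day (0-23), while `hh` is the 12-hour clock hour and reads `12` as midnight when no AM/PM marker is present.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical demo class; not part of the DataImportHandler. */
public class DateTimeFormatDemo {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Parses input with the given pattern and prints it back using a 24-hour pattern. */
    static String reformat(String pattern, String input) {
        SimpleDateFormat parser = new SimpleDateFormat(pattern, Locale.ROOT);
        parser.setTimeZone(UTC);
        SimpleDateFormat printer = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        printer.setTimeZone(UTC);
        try {
            Date parsed = parser.parse(input);
            return printer.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + input, e);
        }
    }

    public static void main(String[] args) {
        String sample = "2008-03-30T12:47:09";
        // 24-hour pattern: the value round-trips unchanged.
        System.out.println(reformat("yyyy-MM-dd'T'HH:mm:ss", sample)); // 2008-03-30T12:47:09
        // 12-hour pattern without an AM/PM marker: 12 is read as midnight.
        System.out.println(reformat("yyyy-MM-dd'T'hh:mm:ss", sample)); // 2008-03-30T00:47:09
    }
}
```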
+ 
  = Extending the tool with APIs =
  The examples we explored are admittedly trivial. It is not possible to meet all user needs
with an XML configuration alone, so we expose a few interfaces that users can implement
to extend the functionality.
  
