lucene-dev mailing list archives

From "Amit Nithian (JIRA)"
Subject [jira] Created: (SOLR-2096) DIH should be able to read data directly from HDFS for indexing
Date Tue, 31 Aug 2010 06:53:54 GMT
DIH should be able to read data directly from HDFS for indexing

                 Key: SOLR-2096
             Project: Solr
          Issue Type: New Feature
          Components: contrib - DataImportHandler
    Affects Versions: 1.4.1
            Reporter: Amit Nithian
             Fix For: 1.4.2
         Attachments: hdfs_reader.tar

DIH does not support reading from hdfs:// URLs, which makes it hard to index data generated
by a MapReduce (M/R) job. The attached tarball contains a subclass of URLDataSource along with
an HDFSReader that adds this support. The data is assumed to be plain text that the
LineEntityProcessor can process line by line.
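
To make the "text format" assumption concrete, here is a minimal sketch of line-oriented processing in the style of LineEntityProcessor, which exposes each input line under the field name rawLine. It is not the code from the attached tarball: the class name LineRowsSketch and the method readRows are hypothetical, and a StringReader stands in for whatever Reader the HDFS data source would return.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Solr or of hdfs_reader.tar.
public class LineRowsSketch {
    // LineEntityProcessor puts each input line into a field with this name.
    static final String RAW_LINE = "rawLine";

    // Turns every line read from the source into a single-field row.
    static List<Map<String, Object>> readRows(Reader source) throws IOException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(Collections.<String, Object>singletonMap(RAW_LINE, line));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of one M/R part file (assumed sample data).
        Reader partFile = new StringReader("solr\t42\nlucene\t17\n");
        for (Map<String, Object> row : readRows(partFile)) {
            System.out.println(row.get(RAW_LINE));
        }
    }
}
```

Any further splitting of a line into fields (for example on tabs) would be left to the field mappings or transformers configured on the entity.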

Here is an example DIH-Config snippet:
  <dataSource name="queryData" type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
              baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  <document name="autoSuggester">
    <entity name="jc" processor="LineEntityProcessor"
            url="<YOUR FOLDER>/part*" dataSource="queryData">
      <!-- Field mappings here if necessary -->
    </entity>
  </document>
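
As a rough illustration of how the baseUrl and url values in this snippet might be interpreted, the sketch below joins the two strings and applies a part* glob to a list of file names. This is an assumption about the intended behaviour, not the logic in the attached HDFSReader: the class PartFileGlobSketch, its methods, the host name, the folder name, and the file names are all hypothetical, and the matching uses the JDK's own glob support rather than any Hadoop API.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Solr or of hdfs_reader.tar.
public class PartFileGlobSketch {
    // Assumes a relative entity url is simply appended to the data source's baseUrl.
    static String resolve(String baseUrl, String url) {
        return baseUrl + url;
    }

    // Keeps only the file names that match the glob, e.g. "part*".
    static List<String> matching(String glob, List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Placeholder server and folder, standing in for <YOURSERVER> and <YOUR FOLDER>.
        System.out.println(resolve("hdfs://namenode.example:9000/", "output/part*"));

        // Assumed contents of an M/R output folder.
        List<String> folder = Arrays.asList("_SUCCESS", "part-00000", "part-00001", "_logs");
        System.out.println(matching("part*", folder));
    }
}
```

Under these assumptions only the part files would be handed to the LineEntityProcessor, while marker files and log folders written by the job would be skipped.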

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
