lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing files from HDFS
Date Wed, 11 Oct 2017 14:53:46 GMT
You'll probably get much better-informed responses from
the Cloudera folks, especially about Hue.

Best,
Erick

On Wed, Oct 11, 2017 at 6:05 AM, István <leccine@gmail.com> wrote:
> Hi,
>
> I have Solr 4.10.3 as part of a CDH5 installation and I would like to index
> a huge number of CSV files on HDFS. I was wondering what the best way of
> doing that is.
>
> Here is the current approach:
>
> data.csv:
>
> id, fruit
> 10, apple
> 20, orange
>
> Indexing with the following command, using search-mr-1.0.0-cdh5.11.1-job.jar:
>
> hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
>   /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
>   org.apache.solr.hadoop.MapReduceIndexerTool \
>   -D 'mapred.child.java.opts=-Xmx500m' \
>   --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
>   --morphline-file /home/user/readCSV.conf \
>   --output-dir hdfs://name-node.server.com:8020/user/solr/output \
>   --verbose \
>   --go-live \
>   --zk-host name-node.server.com:2181/solr \
>   --collection collection0 \
>   hdfs://name-node.server.com:8020/user/solr/input
>
> This leads to the following exception:
>
> 2219 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 1 reducers
> Error: java.io.IOException: Batch Write Failure
>         at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
> ..
> Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
>         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
>         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>
> It appears to me that the schema does not have file_path. The collection was
> created through Hue, which properly identified the two fields id and fruit.
> I found out that the search-mr tool has the following code that references
> file_path:
>
> https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
>
> I am not sure what to do in order to be able to index files on HDFS. I have
> two guesses:
>
> - add the fields defined in the search tool to the schema when I create it
> (not sure how that works through Hue)
> - disable the HDFS metadata insertion when inserting data
>
> Has anybody seen this before?
>
> Thanks,
> Istvan
>
>
>
>
> --
> the sun shines for all
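
A possible way to act on the second guess in the quoted message, sketched under
assumptions: Kite Morphlines provides a sanitizeUnknownSolrFields command that
removes record fields not defined in the collection's schema, which should drop
file_path and the other HDFS metadata fields before the document reaches Solr.
The contents of readCSV.conf are not shown in the message, so the morphline
below is a hypothetical reconstruction. The column names, collection name, and
ZooKeeper address come from the message; the readCSV options and the overall
layout are assumptions to verify against the CDH 5 documentation.

```
# Hypothetical readCSV.conf; not the poster's actual file.
SOLR_LOCATOR : {
  # Collection and ZooKeeper ensemble as given on the command line in the message
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    id : readCSV
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        # Parse each CSV line into a record with the fields id and fruit
        readCSV {
          separator : ","
          columns : [id, fruit]
          ignoreFirstLine : true   # assumes the first line is the "id, fruit" header
          trim : true              # strips the space after each comma
          charset : UTF-8
        }
      }
      {
        # Drop any field the Solr schema does not know, e.g. file_path
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        # Hand the cleaned record to Solr
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
```

The alternative, matching the first guess, would be to declare file_path and
the related metadata fields in the collection's schema so that they are indexed
rather than discarded.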
