lucene-dev mailing list archives

From "wolfgang hoschek (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
Date Mon, 09 Dec 2013 19:32:12 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843443#comment-13843443 ]

wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:
-----------------------------------------------------------------

I'm not aware of anything needing Jersey, except that perhaps Hadoop pulls it in.

The combined dependencies of all morphline modules are listed here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are listed here: http://cloudera.github.io/cdk/docs/current/dependencies.html

The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue: it seems to me that the ivy dependencies for
solr-morphlines-core, solr-morphlines-cell, and solr-map-reduce are a bit backwards upstream.
Currently solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps
should instead be pulled in by solr-map-reduce (which is essentially an out-of-the-box app that
bundles user-level deps). Correspondingly, it would be good to organize ivy and mvn upstream
in such a way that (a rough sketch follows the list):

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all minus cdk-morphlines-solr-cell
(now upstream) minus cdk-morphlines-solr-core (now upstream) plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 
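
To make that layering concrete, here's a minimal sketch of what the solr-map-reduce ivy.xml
could look like under this scheme. The org/name/rev coordinates are illustrative placeholders,
not the actual upstream values:

    <!-- Hypothetical ivy.xml fragment for solr-map-reduce; coordinates and
         revision properties are placeholders, not the real upstream values. -->
    <dependencies>
      <!-- brings in solr-morphlines-core (and via it cdk-morphlines-core) -->
      <dependency org="org.apache.solr" name="solr-morphlines-cell" rev="${solr.version}"/>
      <!-- everything else comes from cdk-morphlines-all, minus the two
           modules that have moved upstream into Solr itself -->
      <dependency org="com.cloudera.cdk" name="cdk-morphlines-all" rev="${cdk.version}">
        <exclude org="com.cloudera.cdk" module="cdk-morphlines-solr-cell"/>
        <exclude org="com.cloudera.cdk" module="cdk-morphlines-solr-core"/>
      </dependency>
    </dependencies>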

More concretely, FWIW, to see how the deps look in production releases downstream, review
the following POMs:

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml



> Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Andrzej Bialecki 
>            Assignee: Mark Miller
>             Fix For: 5.0, 4.7
>
>         Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
> hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch,
> log4j-1.2.15.jar
>
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop)
> to Solr's EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat
> consumes data produced by reduce tasks directly, without storing it in intermediate files.
> Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts
> as there are reducers, and the data to be indexed is not sent over the network.
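
As a hedged illustration of the first point, a driver might wire the job up like this against
the old org.apache.hadoop.mapred API from hadoop-core-0.19.1. SolrOutputFormat comes from this
patch, so its exact package and configuration hooks are assumptions here, not the patch's API:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SolrIndexDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SolrIndexDriver.class);
        conf.setJobName("solr-index-build");
        // plug the patch's OutputFormat into the standard Hadoop job wiring;
        // mapper/reducer class setup is omitted for brevity
        conf.setOutputFormat(SolrOutputFormat.class);
        // one index shard is built per reducer, so four reducers => four shards
        conf.setNumReduceTasks(4);
        // output lands under this directory on the default filesystem (e.g. HDFS)
        FileOutputFormat.setOutputPath(conf, new Path("/user/solr/outdir"));
        JobClient.runJob(conf);
      }
    }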
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn
> uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer,
> and it also instantiates an implementation of SolrDocumentConverter, which is responsible
> for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to
> a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task
> completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on
> the EmbeddedSolrServer.
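
To illustrate the converter step, here's a minimal sketch of turning a (key, value) pair into
a SolrInputDocument. The actual SolrDocumentConverter contract is defined by the patch, so the
class shape below is an assumption, as is the "id,title" CSV layout:

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.common.SolrInputDocument;

    // Hypothetical converter shape; the patch's SolrDocumentConverter API may differ.
    public class CsvDocumentConverter {
      // Turn one Hadoop (key, value) pair into Solr documents, assuming each
      // value holds an "id,title" CSV line.
      public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
        String[] fields = value.toString().split(",", 2);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fields[0]);
        doc.addField("title", fields.length > 1 ? fields[1] : "");
        return Collections.singletonList(doc); // caller adds these to the current batch
      }
    }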
> The API provides facilities to specify an arbitrary existing solr.home directory, from
> which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there
> were reduce tasks. The output shards are placed in the output directory on the default
> filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers.
> Additionally, users can specify the number of reduce tasks, in particular 1 reduce task,
> in which case the output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API.
> It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar; I attached the jar to this issue, and you
> should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved
> for release under the Apache License.


