lucene-dev mailing list archives

From "Alexander Kanarsky (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-1301) Solr + Hadoop
Date Thu, 23 Feb 2012 01:07:48 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214152#comment-13214152 ]

Alexander Kanarsky commented on SOLR-1301:
------------------------------------------

OK, so I changed the patch to work with the 3.5 ant build and re-tested it with Solr 3.5 and
Cloudera's CDH3u3 (both the build and the CSV test run in pseudo-distributed mode). Still no
unit tests, but I am working on this :)

No changes compared to the previous version, except that I had to comment out the code that
sets the debug level dynamically in SolrRecordWriter, because of conflicts with the slf4j parts
in current Solr; I think it is minor, but if not, please feel free to resolve this and update
the patch. With this done, there is no need to put the log4j and commons-logging jars into
hadoop/lib at compile time anymore, only the hadoop jar. I provided the hadoop-core-0.20.2-cdh3u3.jar
used for testing as part of the patch, but you can use other versions of 0.20.x if you'd like;
it should also work with Hadoop 0.21.x. Note that you still need to make the other related
jars (solr, solrj, lucene, commons, etc.) available while running your job; one way to do this
is to put all the needed jars into the lib subfolder of the apache-solr-hadoop jar (see the
sketch below), other ways are described here: http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/.
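
As a rough sketch of the lib-subfolder approach (the jar names and paths below are examples
only, not part of the patch; use whatever versions your build actually produces):

  # bundle the runtime dependencies inside the job jar; Hadoop adds jars found
  # in the job jar's lib/ directory to the task classpath
  mkdir lib
  cp /path/to/apache-solr-core-3.5.0.jar /path/to/apache-solr-solrj-3.5.0.jar \
     /path/to/lucene-core-3.5.0.jar /path/to/commons-*.jar lib/
  jar uf apache-solr-hadoop-3.5-SNAPSHOT.jar lib/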


Finally, the quick steps to get the patch compiled (on Linux):
1.  get the solr source tarball (apache-solr-3.5.0-src.tgz in this example), put it into some
folder, cd there
2.  tar -xzf apache-solr-3.5.0-src.tgz
3.  cd apache-solr-3.5.0/solr
4.  wget https://issues.apache.org/jira/secure/attachment/12515662/SOLR-1301.patch
5.  patch -p0 -i SOLR-1301.patch
6.  mkdir contrib/hadoop/lib
7.  cd contrib/hadoop/lib
8.  wget https://issues.apache.org/jira/secure/attachment/12515663/hadoop-core-0.20.2-cdh3u3.jar
9.  cd ../../..
10. ant dist

and you should have the apache-solr-hadoop-3.5-SNAPSHOT.jar in the solr/dist folder.
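
From there you can submit the indexing job with the usual hadoop jar invocation. A minimal,
hypothetical example (the driver class name and the input/output arguments are placeholders,
not something the patch defines; -libjars is the stock Hadoop alternative to bundling the
dependency jars inside the job jar, and it only takes effect if the driver goes through
ToolRunner/GenericOptionsParser):

  hadoop jar dist/apache-solr-hadoop-3.5-SNAPSHOT.jar com.example.MyCsvIndexer \
      -libjars /path/to/apache-solr-core-3.5.0.jar,/path/to/apache-solr-solrj-3.5.0.jar,/path/to/lucene-core-3.5.0.jar \
      /input/on/hdfs /output/on/hdfs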
                
> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.6, 4.0
>
>         Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch,
SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java,
commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar,
hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar
>
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop)
to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat
consumes data produced by reduce tasks directly, without storing it in intermediate files.
Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts
as there are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn
uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer,
and it also instantiates an implementation of SolrDocumentConverter, which is responsible
for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a
batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes
and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from
which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there
were reduce tasks. The output shards are placed in the output directory on the default filesystem
(e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally,
users can specify the number of reduce tasks, in particular 1 reduce task, in which case the
output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API.
It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should
put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved
for release under Apache License.
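
As a quick illustration of the single-shard case described in the quoted text above (again a
sketch: the property syntax is standard Hadoop 0.20, the driver class and paths are placeholders):

  # force a single reducer so the job produces exactly one output shard
  hadoop jar dist/apache-solr-hadoop-3.5-SNAPSHOT.jar com.example.MyCsvIndexer \
      -D mapred.reduce.tasks=1 \
      /input/on/hdfs /output/on/hdfs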


        
