lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1301) Solr + Hadoop
Date Fri, 15 Jan 2010 16:40:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800748#action_12800748
] 

Grant Ingersoll commented on SOLR-1301:
---------------------------------------

bq. Hmm, I don't think this would make sense - the whole point of this patch is to distribute
the load by indexing into multiple Solr instances that use the same config - and this can
be an existing user's config including the components from ${solr.home}/lib .

Obviously, you would need to have a configuration of Solr indexing servers.  This could easily
be obtained from ZK per the other work being done.  Then, the reduce steps just create their
SolrServer based on those values and can index directly.  (I realize the ZK stuff didn't exist
when you put up this patch.)

bq. HDFS doesn't support enough POSIX to support writing Lucene indexes directly to HDFS -
for this reason indexes are always created on local storage of each node, and then after closing
they are copied to HDFS.

Right, and then copied down from HDFS and installed in Solr, correct?  You still have the
issue of knowing which Solr instances get which shards off of HDFS, right?  Just seems like
a little more configuration knowledge could alleviate all that extra copying/installing, etc.


> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 1.5
>
>         Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar,
hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains  a contrib module that provides distributed indexing (using Hadoop)
to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat
consumes data produced by reduce tasks directly, without storing it in intermediate files.
Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts
as there are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn
uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer,
and it also instantiates an implementation of SolrDocumentConverter, which is responsible
for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch,
which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the
OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from
which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there
were reduce tasks. The output shards are placed in the output directory on the default filesystem
(e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally,
users can specify the number of reduce tasks, in particular 1 reduce task, in which case the
output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API.
It uses a custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should
put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved
for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message