lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud
Date Thu, 25 Jul 2013 14:31:48 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719654#comment-13719654
] 

Andrzej Bialecki  commented on SOLR-5069:
-----------------------------------------

An alternative solution for minimizing the amount of data in memory during reduce phase is
to use "re-reduce", or a reduce-side combiner, using Hadoop terminology.

This is an additional function that runs on the reducer and periodically performs intermediate
reductions of already accumulated values for a key, and preserves the intermediate results
(and discards the accumulated values). This function does not emit anything to the final output.
Then the final reduction function operates on a mix of values that arrived since the last
intermediate reduction, plus all results of previous intermediate reductions.

This works well for simple aggregations (where the additional function may be in fact a copy
of the reduce function) but may not be suitable to all classes of problems.
                
> MapReduce for SolrCloud
> -----------------------
>
>                 Key: SOLR-5069
>                 URL: https://issues.apache.org/jira/browse/SOLR-5069
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>
> Solr currently does not have a way to run long running computational tasks across the
cluster. We can piggyback on the mapreduce paradigm so that users have smooth learning curve.
>  * The mapreduce component will be written as a RequestHandler in Solr
>  * Works only in SolrCloud mode. (No support for standalone mode) 
>  * Users can write MapReduce programs in Javascript or Java. First cut would be JS (
? )
> h1. sample word count program
> h2.how to invoke?
> http://host:port/solr/collection-x/mapreduce?map=<map-script>&reduce=<reduce-script>&sink=collectionX
> h3. params 
> * map :  A javascript implementation of the map program
> * reduce : a Javascript implementation of the reduce program
> * sink : The collection to which the output is written. If this is not passed , the request
will wait till completion and respond with the output of the reduce program and will be emitted
as a standard solr response. . If no sink is passed the request will be redirected to the
"reduce node" where it will wait till the process is complete. If the sink param is passed
,the rsponse will contain an id of the run which can be used to query the status in another
command.
> * reduceNode : Node name where the reduce is run . If not passed an arbitrary node is
chosen
> The node which received the command would first identify one replica from each slice
where the map program is executed . It will also identify one another node from the same collection
where the reduce program is run. Each run is given an id and the details of the nodes participating
in the run will be written to ZK (as an ephemeral node). 
> h4. map script 
> {code:JavaScript}
> var res = $.streamQuery(*:*);//this is not run across the cluster. //Only on this index
> while(res.hasMore()){
>   var doc = res.next();
>   var txt = doc.get(“txt”);//the field on which word count is performed
>   var words = txt.split(" ");
>    for(i = 0; i < words.length; i++){
> 	$.map(words[i],{‘count’:1});// this will send the map over to //the reduce host
>     }
> }
> {code}
> Essentially two threads are created in the 'map' hosts . One for running the program
and the other for co-ordinating with the 'reduce' host . The maps emitted are streamed live
over an http connection to the reduce program
> h4. reduce script
> This script is run in one node . This node accepts http connections from map nodes and
the 'maps' that are sent are collected in a queue which will be polled and fed into the reduce
program. This also keeps the 'reduced' data in memory till the whole run is complete. It expects
a "done" message from all 'map' nodes before it declares the tasks are complete. After  reduce
program is executed for all the input it proceeds to write out the result to the 'sink' collection
or it is written straight out to the response.
> {code:JavaScript}
> var pair = $.nextMap();
> var reduced = $.getCtx().getReducedMap();// a hashmap
> var count = reduced.get(pair.key());
> if(count === null) {
>   count = {“count”:0};
>   reduced.put(pair.key(), count);
> }
> count.count += pair.val().count ;
> {code}
> h4.example output
> {code:JavaScript}
> {
> “result”:[
> “wordx”:{ 
>          “count”:15876765
>          },
> “wordy” : {
>            “count”:24657654
>           }
>  
>   ]
> }
> {code}
> TBD
> * The format in which the output is written to the target collection, I assume the reducedMap
will have values mapping to the schema of the collection
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message