lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9240) Support parallel ETL with the topic expression
Date Mon, 18 Jul 2016 01:55:21 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381624#comment-15381624
] 

ASF subversion and git services commented on SOLR-9240:
-------------------------------------------------------

Commit 75d3243647923c462a345205d08e0fbb6dbe73f3 in lucene-solr's branch refs/heads/branch_6x
from jbernste
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=75d3243 ]

SOLR-9240: Update CHANGES.txt

Conflicts:
	solr/CHANGES.txt


> Support parallel ETL with the topic expression
> ----------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.2
>
>         Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> It would be useful for SolrCloud to support large-scale *Extract, Transform and Load*
> (ETL) workloads with streaming expressions. Instead of using MapReduce for ETL, the topic
> expression can be used, which allows SolrCloud to be treated like a distributed message
> queue filled with data to be processed. The topic expression works in batches and supports
> retrieval of stored fields, so it is well suited to large-scale *text ETL*.
> This ticket makes two small changes to the topic() expression that make this possible:
> 1) Changes the topic expression so it can operate in parallel.
> 2) Adds the initialCheckpoint parameter to the topic expression so a topic can start
> pulling records from anywhere in the queue.
> Daemons can be sent to worker nodes, each of which processes one partition of the data
> from the same topic. The daemon() function already runs its inner expression repeatedly,
> which suits calling a topic iteratively until all records in the topic have been processed.
> The sample code below pulls all records from one collection and indexes them into another
> collection. A transform function could be wrapped around the topic() to transform the
> records before loading. Custom functions can also be built to load the data in parallel
> into any outside system.
> {code}
> parallel(
>     workerCollection,
>     workers="2",
>     sort="_version_ desc",
>     daemon(
>         update(
>             updateCollection,
>             batchSize=200,
>             topic(
>                 checkpointCollection,
>                 topicCollection,
>                 q=*:*,
>                 id="topic1",
>                 fl="id, to, from, body",
>                 partitionKeys="id",
>                 initialCheckpoint="0")),
>         runInterval="1000",
>         id="daemon1"))
> {code}
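
For anyone who wants to try the expression above, here is a minimal sketch of how it might be submitted to a /stream request handler over HTTP. The host, port, and the choice of workerCollection as the receiving collection are assumptions for illustration, not something the ticket specifies; the helper name build_stream_request is hypothetical. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The expression from the issue description, unchanged apart from whitespace.
EXPR = """parallel(
    workerCollection,
    workers="2",
    sort="_version_ desc",
    daemon(
        update(
            updateCollection,
            batchSize=200,
            topic(
                checkpointCollection,
                topicCollection,
                q=*:*,
                id="topic1",
                fl="id, to, from, body",
                partitionKeys="id",
                initialCheckpoint="0")),
        runInterval="1000",
        id="daemon1"))"""


def build_stream_request(base_url, collection, expr):
    """Build a POST request carrying the expression in the `expr` form parameter.

    base_url and collection are assumptions (for example a local SolrCloud node);
    adjust them to match the actual cluster.
    """
    url = f"{base_url.rstrip('/')}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical local node; nothing is sent until urlopen() is called on the request.
req = build_stream_request("http://localhost:8983/solr", "workerCollection", EXPR)
```

Passing req to urllib.request.urlopen() would actually send it, which requires a running SolrCloud cluster with the four collections named in the expression.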

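The partitioning and checkpoint behavior described in the issue can be pictured with a toy in-memory model. This is only a conceptual sketch, not Solr's implementation: it assumes records are split across workers by hashing the partition key (crc32 stands in for whatever hash Solr uses), and that a worker only receives records whose _version_ is greater than its checkpoint. The function names are hypothetical.

```python
import zlib


def partition_of(key, num_workers):
    """Map a partition key to a worker index with a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def run_worker(records, worker_id, num_workers, checkpoint=0, batch_size=200):
    """Toy daemon loop: pull batches for one partition until the topic is drained.

    records is a list of dicts with "id" and a unique "_version_".
    Returns the processed records and the final checkpoint.
    """
    # Records belonging to this worker's partition, oldest version first.
    mine = sorted(
        (r for r in records if partition_of(r["id"], num_workers) == worker_id),
        key=lambda r: r["_version_"],
    )
    processed = []
    while True:
        # One "run" of the topic: the next batch past the current checkpoint.
        batch = [r for r in mine if r["_version_"] > checkpoint][:batch_size]
        if not batch:
            break  # nothing new, so the topic is drained for this partition
        processed.extend(batch)
        checkpoint = batch[-1]["_version_"]  # remember how far we have read
    return processed, checkpoint


# 1000 synthetic records split across two workers, starting from checkpoint 0.
records = [{"id": f"doc{i}", "_version_": i + 1} for i in range(1000)]
results = [run_worker(records, w, 2)[0] for w in range(2)]
```

In this model every record is handled by exactly one worker, and starting a worker with a higher checkpoint value skips the older records, which mirrors the role the issue gives to initialCheckpoint.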


