Date: Tue, 28 May 2019 04:13:00 +0000 (UTC)
From: "Russell Jurney (JIRA)"
To: dev@datafu.apache.org
Subject: [jira] [Comment Edited] (DATAFU-148) Setup Spark sub-project

    [ https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849318#comment-16849318 ]

Russell Jurney edited comment on DATAFU-148 at 5/28/19 4:12 AM:
----------------------------------------------------------------

Why not have an `activate()`
or `initialize()` method and then add these methods to the `DataFrame` class? [`pymongo_spark`](https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/pymongo_spark.py) (part of [mongo-hadoop](https://github.com/mongodb/mongo-hadoop)) does this to add methods like `pyspark.sql.DataFrame.saveToMongoDB`, which makes the API consistent with PySpark's. See: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python#usage

You use it like this:

{code:python}
import pymongo_spark

pymongo_spark.activate()
...
some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
{code}

And internally it looks like this:

{code:python}
def activate():
    """Activate integration between PyMongo and PySpark.

    This function only needs to be called once.
    """
    # Patch methods in rather than extending these classes. Many RDD methods
    # result in the creation of a new RDD, whose exact type is beyond our
    # control. However, we would still like to be able to call any of our
    # methods on the resulting RDDs.
    pyspark.rdd.RDD.saveToMongoDB = saveToMongoDB
{code}


> Setup Spark sub-project
> -----------------------
>
>                 Key: DATAFU-148
>                 URL: https://issues.apache.org/jira/browse/DATAFU-148
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>         Attachments: patch.diff, patch.diff
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Create a skeleton Spark sub-project for Spark code to be contributed to DataFu

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
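Applied to this proposal, a minimal sketch of the same activation pattern is below. Note that `DummyFrame` and `dedup_top_n` are hypothetical stand-ins, not actual datafu-spark or PySpark API; a real implementation would assign onto `pyspark.sql.DataFrame` instead, exactly as `pymongo_spark` does for `pyspark.rdd.RDD`.

```python
def dedup_top_n(self, n):
    """Hypothetical DataFu-style method to be patched in; placeholder logic."""
    return self.rows[:n]


class DummyFrame:
    """Stand-in for pyspark.sql.DataFrame, to keep the sketch self-contained."""

    def __init__(self, rows):
        self.rows = rows


def activate():
    """Patch DataFu methods onto the target class; only needs to run once.

    Because the patch is applied to the class itself rather than to
    individual objects, any instance created later by library code
    automatically carries the patched methods.
    """
    DummyFrame.dedup_top_n = dedup_top_n


activate()
df = DummyFrame([3, 1, 2])
print(df.dedup_top_n(2))  # prints [3, 1]
```

This mirrors the `pymongo_spark.activate()` design: users opt in with a single call, and the rest of their code reads as if the methods were native to the class.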