Date: Fri, 14 Nov 2014 23:28:34 +0000 (UTC)
From: "Patrick Wendell (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

     [ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-4395:
-----------------------------------
    Component/s: PySpark

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>        Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command, it hangs for one to many hours and then finally returns with successful results:
>
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
>
> Note, the lab environment below is still active, so let me know if you'd like to just access it directly.
>
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
>
> Ran:
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
>
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
>
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, change this to skip the line instead of raising,
> ...         # if each line of data is not critical.
> ...         raise ValueError("Invalid ratings line: %s" % line)
> ...     return Row(
> ...         UserID    = int(match.group(1)),
> ...         MovieID   = int(match.group(2)),
> ...         Rating    = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...                     # Call the parse_ratings_line function on each line.
> ...                     .map(parse_ratings_line)
> ...                     # Cache the objects in memory since they will be queried multiple times.
> ...                     .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
>
> (Now the Python shell hangs...)
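For anyone trying to isolate the problem, the parsing step can be verified standalone, without a Spark cluster. This is a minimal sketch using plain Python (no PySpark); the sample line is constructed to match the first Row the reporter shows, following the MovieLens ratings.dat format (UserID::MovieID::Rating::Timestamp):

```python
import re

# Same pattern as in the report: four integer fields separated by "::".
RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'

def parse_ratings_line(line):
    match = re.search(RATINGS_PATTERN, line)
    if match is None:
        raise ValueError("Invalid ratings line: %s" % line)
    # A plain dict stands in for pyspark.sql.Row here.
    return {
        'UserID':    int(match.group(1)),
        'MovieID':   int(match.group(2)),
        'Rating':    int(match.group(3)),
        'Timestamp': int(match.group(4)),
    }

# Sample line matching the first Row shown above.
print(parse_ratings_line("1::1193::5::978300760"))
```

If this parses cleanly, the hang is more likely in the SQL/collect path than in the line parsing.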
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org