spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: About a problem running a spark job in a cdh-5.7.0 vmware image.
Date Sat, 04 Jun 2016 16:23:14 GMT
Hi,

Spark works in local, standalone and yarn-client modes. Start with master =
local; that is the simplest model. You do NOT need to start
$SPARK_HOME/sbin/start-master.sh and $SPARK_HOME/sbin/start-slaves.sh.
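
For instance, the simplest local-mode submit looks something like this
(MyApp and my-app.jar are hypothetical placeholders for your own class and
jar, not names from your project):

```shell
# Minimal local-mode submission; no master or workers need to be started.
# MyApp and my-app.jar are hypothetical placeholders.
${SPARK_HOME}/bin/spark-submit \
    --master "local[*]" \
    --class MyApp \
    my-app.jar
```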


You also do not need to pass all of those settings on the spark-submit
command line. In the Scala code you can do:

val sparkConf = new SparkConf().
             setAppName("CEP_streaming_with_JDBC").
             set("spark.driver.allowMultipleContexts", "true").
             set("spark.hadoop.validateOutputSpecs", "false")
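
A conf like the one above is then simply passed to the context. A minimal
sketch for the streaming case (the 2-second batch interval here is an
assumption, not taken from your code; pick whatever suits your job):

```scala
// Hypothetical sketch: build the streaming context from the conf above.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(2))
// ... define your Kafka stream processing here ...
ssc.start()
ssc.awaitTermination()
```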

And specify the rest in spark-submit itself, with minimum resources:

${SPARK_HOME}/bin/spark-submit \
                --packages com.databricks:spark-csv_2.11:1.3.0 \
                --driver-memory 2G \
                --num-executors 1 \
                --executor-memory 2G \
                --master local \
                --executor-cores 2 \
                --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
                --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
                --class ${FILE_NAME} \
                --conf "spark.ui.port=4040" \
                ${JAR_FILE}

The Spark web UI port is 4040 (the default). Use it to track the progress
of the job. You can specify your own port by replacing 4040 with an unused
port value.
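
For example, assuming port 4041 happens to be free on your machine, you
would just change that one flag:

```shell
# Hypothetical example: move the web UI off the default port 4040.
--conf "spark.ui.port=4041"
```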

Try it anyway.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 June 2016 at 11:39, Alonso <alonsoir@gmail.com> wrote:

> Hi, i am developing a project that needs to use kafka, spark-streaming and
> spark-mllib, this is the github project
> <https://github.com/alonsoir/awesome-recommendation-engine/tree/develop>.
>
> I am using a vmware cdh-5.7-0 image, with 4 cores and 8 GB of ram. The
> file that I want to use is only 16 MB, but I am finding problems related
> to resources, because the process outputs this message:
>
>
>  .set("spark.driver.allowMultipleContexts", "true")
>
> 16/06/03 11:58:09 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
>
>
> when i go to spark-master page, i can see this:
>
>
> Spark Master at spark://192.168.30.137:7077
>
>     URL: spark://192.168.30.137:7077
>     REST URL: spark://192.168.30.137:6066 (cluster mode)
>     Alive Workers: 0
>     Cores in use: 0 Total, 0 Used
>     Memory in use: 0.0 B Total, 0.0 B Used
>     Applications: 2 Running, 0 Completed
>     Drivers: 0 Running, 0 Completed
>     Status: ALIVE
>
> Workers
> Worker Id | Address | State | Cores | Memory
>
> Running Applications
> Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration
> app-20160603115752-0001 (kill) | AmazonKafkaConnector | 0 | 1024.0 MB | 2016/06/03 11:57:52 | cloudera | WAITING | 2.0 min
> app-20160603115751-0000 (kill) | AmazonKafkaConnector | 0 | 1024.0 MB | 2016/06/03 11:57:51 | cloudera | WAITING | 2.0 min
>
>
> And this is the spark-worker output:
>
> Spark Worker at 192.168.30.137:7078
>
>     ID: worker-20160603115937-192.168.30.137-7078
>     Master URL:
>     Cores: 4 (0 Used)
>     Memory: 6.7 GB (0.0 B Used)
>
> Back to Master
>
> Running Executors (0)
> ExecutorID | Cores | State | Memory | Job Details | Logs
>
> It is weird, isn't it? The master URL is not set, and there is not any
> ExecutorID, Cores, and so on and so forth...
>
> If I do a ps xa | grep spark, this is the output:
>
> [cloudera@quickstart bin]$ ps xa | grep spark
>  6330 ?        Sl     0:11 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
> /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
> -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m
> org.apache.spark.deploy.master.Master
>
>  6674 ?        Sl     0:12 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
> /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
> -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory
> -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m
> org.apache.spark.deploy.history.HistoryServer
>
>  8153 pts/1    Sl+    0:14 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
> /home/cloudera/awesome-recommendation-engine/target/pack/lib/*
> -Dprog.home=/home/cloudera/awesome-recommendation-engine/target/pack
> -Dprog.version=1.0-SNAPSHOT example.spark.AmazonKafkaConnector
> 192.168.1.35:9092 amazonRatingsTopic
>
>  8413 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
> /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
> spark://quickstart.cloudera:7077
>
>  8619 pts/3    S+     0:00 grep spark
>
> The master is set up with four cores and 1 GB, and the worker has no
> dedicated core and is using 1 GB; that is weird, isn't it? I have
> configured the vmware image with 4 cores (of eight) and 8 GB (of 16).
>
> This is how my build.sbt looks:
>
> libraryDependencies ++= Seq(
>   "org.apache.kafka" % "kafka_2.10" % "0.8.1"
>       exclude("javax.jms", "jms")
>       exclude("com.sun.jdmk", "jmxtools")
>       exclude("com.sun.jmx", "jmxri"),
>    //not working play module!! check
>    //jdbc,
>    //anorm,
>    //cache,
>    // HTTP client
>    "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
>    // HTML parser
>    "org.jodd" % "jodd-lagarto" % "3.5.2",
>    "com.typesafe" % "config" % "1.2.1",
>    "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
>    "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
>    "org.twitter4j" % "twitter4j-core" % "4.0.2",
>    "org.twitter4j" % "twitter4j-stream" % "4.0.2",
>    "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
>    "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
>    "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",
>    "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",
>    "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",
>    "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",
>    "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",
>    "com.google.code.gson" % "gson" % "2.6.2",
>    "commons-cli" % "commons-cli" % "1.3.1",
>    "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
>    // Akka
>    "com.typesafe.akka" %% "akka-actor" % akkaVersion,
>    "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
>    // MongoDB
>    "org.reactivemongo" %% "reactivemongo" % "0.10.0"
> )
>
> packAutoSettings
>
> As you can see, I am using the exact versions of the spark modules for
> the pseudo cluster, and I want to use sbt-pack in order to create
> a unix command. This is how I am declaring the spark context
> programmatically:
>
>
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>                                //.setMaster("local[4]")
>                                .setMaster("spark://192.168.30.137:7077")
>                                .set("spark.cores.max", "2")
>
> ...
>
> val ratingFile= "hdfs://192.168.30.137:8020/user/cloudera/ratings.csv"
>
>
> println("Using this ratingFile: " + ratingFile)
>   // first create an RDD out of the rating file
>   val rawTrainingRatings = sc.textFile(ratingFile).map {
>     line =>
>       val Array(userId, productId, scoreStr) = line.split(",")
>       AmazonRating(userId, productId, scoreStr.toDouble)
>   }
>
>   // only keep users that have rated between MinRecommendationsPerUser and
> MaxRecommendationsPerUser products
>
>
> // THIS IS THE LINE THAT PROVOKES the WARN TaskSchedulerImpl!
> val trainingRatings = rawTrainingRatings.groupBy(_.userId)
>                                         .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
>                                         .flatMap(_._2)
>                                         .repartition(NumPartitions)
>                                         .cache()
>
>   println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings
> out of ${rawTrainingRatings.count()}")
>
> My question is: do you see anything wrong with the code? Is there
> anything terribly wrong that I have to change? And what can I do to get
> this up and running with my resources?
>
> What most annoys me is that the above code works perfectly in the spark
> shell of the virtual image, but when I try to run it programmatically,
> creating the unix command with sbt-pack, it does not work.
>
> If the dedicated resources are too few to develop this project, what else
> can I do? I mean, do I need to hire a tiny cluster from AWS
> or another provider? If that is the correct answer, what are your
> recommendations?
>
> Thank you very much for reading until here.
>
> Regards,
>
> Alonso
>
>
>
>
> ------------------------------
> View this message in context: About a problem running a spark job in a
> cdh-5.7.0 vmware image.
> <http://apache-spark-user-list.1001560.n3.nabble.com/About-a-problem-running-a-spark-job-in-a-cdh-5-7-0-vmware-image-tp27082.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
