spark-user mailing list archives

From swastik mittal <smitt...@ncsu.edu>
Subject How does Spark operate internally for an individual task?
Date Thu, 14 Mar 2019 16:53:57 GMT
I am running a grep application on Spark 2.3.4 with Scala 2.11. The input is
an 813 MB text file stored in HDFS on a remote server that is not part of the
Spark infrastructure. The application reads the file line by line from the
HDFS server, filters each line for a given keyword, and outputs the matching
lines, like grep in Linux. HDFS divides the file into 128 MB chunks, so the
application runs as 7 tasks in 1 stage (stage 0).

I want to analyze the time Spark spends on a task in the compute function of
HadoopRDD. To do that, I log every call to HadoopRDD's compute, read,
updaterecords, and updatebytesread. I also log every call to compute on the
filter RDD (MapPartitionsRDD) and to Spark's built-in filter function.

What I observe is that the MapPartitionsRDD, which is the child RDD, has its
compute and filter function called first, and once HadoopRDD is called,
nothing further is logged for the compute or filter operation of the
MapPartitionsRDD. But Spark cannot filter data before reading it, so the
filtering has to happen after a read operation. Does the filter run on each
record as it is read, or only once the whole chunk of the text file has been
read? Also, how can I separate the timing information for the two, or find
out exactly when the first MapPartitionsRDD operation was done?
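
The log order described above can be reproduced with a small model in plain
Python. This is not Spark code, only a sketch under one assumption: each
compute call returns a lazy iterator that wraps its parent's iterator. The
functions hadoop_compute and filter_compute and the log list are hypothetical
stand-ins for HadoopRDD.compute, MapPartitionsRDD.compute, and the logging
added to them.

```python
# Plain-Python model of the observed call order; this is NOT Spark code.
# Assumption: each compute() returns a lazy iterator over its parent's iterator.

log = []  # records the order in which calls happen


def hadoop_compute(lines):
    """Hypothetical stand-in for HadoopRDD.compute: yields one record at a time."""
    log.append("HadoopRDD.compute")

    def records():
        for line in lines:
            log.append(f"read {line}")
            yield line

    return records()


def filter_compute(lines, keyword):
    """Hypothetical stand-in for MapPartitionsRDD.compute with a filter."""
    log.append("MapPartitionsRDD.compute")
    parent = hadoop_compute(lines)  # lazy: no record has been read yet

    def matches():
        for record in parent:
            log.append(f"filter {record}")
            if keyword in record:
                yield record

    return matches()


# Draining the child iterator drives the reads, one record at a time.
result = list(filter_compute(["spark a", "b", "spark c"], "spark"))

for entry in log:
    print(entry)
```

In this model each compute is logged exactly once per task, child first, and
every later entry is a read followed immediately by the filter check on that
same record. Whether Spark 2.3.4 behaves this way would need to be confirmed
against its source.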
Any help is appreciated.

Thanks


