spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Is there a limit on the number of tasks in one job?
Date Mon, 13 Jun 2016 23:04:06 GMT
Hi,

You can control the initial number of partitions (tasks) in v2.0.
https://www.mail-archive.com/user@spark.apache.org/msg51603.html
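
A rough sketch of what that control looks like, assuming the linked message
refers to the Spark 2.0 file-source settings spark.sql.files.maxPartitionBytes
(default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB). The
pure-Python estimator below mimics how, to my understanding, Spark 2.0 packs
input files into scan partitions. The function names, the 1 MB file size, and
the default parallelism of 1600 (200 nodes x 8 cores from the original post)
are illustrative assumptions, not measured values.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB):
    """Approximate the target split size Spark 2.0 derives for a file scan."""
    # Every file is charged its length plus a fixed "open cost".
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def estimate_partitions(file_sizes, default_parallelism,
                        max_partition_bytes=128 * MB,
                        open_cost_in_bytes=4 * MB):
    """Estimate the number of scan partitions (tasks) for the given files."""
    split = max_split_bytes(file_sizes, default_parallelism,
                            max_partition_bytes, open_cost_in_bytes)
    # Cut files larger than the split size into split-sized chunks.
    chunks = []
    for size in file_sizes:
        for offset in range(0, size, split):
            chunks.append(min(split, size - offset))
    # Pack chunks, largest first, closing a partition when the next chunk
    # would push it past the split size.
    chunks.sort(reverse=True)
    partitions = 0
    current_size = 0
    current_files = 0
    for chunk in chunks:
        if current_files > 0 and current_size + chunk > split:
            partitions += 1
            current_size = 0
            current_files = 0
        current_size += chunk + open_cost_in_bytes
        current_files += 1
    if current_files > 0:
        partitions += 1
    return partitions


if __name__ == "__main__":
    # Hypothetical workload: 70k files of 1 MB each (sizes are not given
    # in the thread), default parallelism assumed to be 200 * 8 = 1600.
    sizes = [1 * MB] * 70000
    print(estimate_partitions(sizes, default_parallelism=1600))
```

Under these assumptions the scan needs far fewer tasks than one per file,
and raising either setting should reduce the count further. The real numbers
depend on the actual file sizes and on the cluster's default parallelism.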

// maropu


On Tue, Jun 14, 2016 at 7:24 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Have you looked at the Spark GUI to see what it is waiting for? Is it
> available memory? What resource manager are you using?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 13 June 2016 at 20:45, Khaled Hammouda <khaled.hammouda@kik.com> wrote:
>
>> Hi Michael,
>>
>> Thanks for the suggestion to use Spark 2.0 preview. I just downloaded the
>> preview and tried using it, but I’m running into the exact same issue.
>>
>> Khaled
>>
>> On Jun 13, 2016, at 2:58 PM, Michael Armbrust <michael@databricks.com>
>> wrote:
>>
>> You might try with the Spark 2.0 preview.  We spent a bunch of time
>> improving the handling of many small files.
>>
>> On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda <
>> khaled.hammouda@kik.com> wrote:
>>
>>> I'm trying to use Spark SQL to load JSON data that is split across about
>>> 70k files in 24 directories in HDFS, using
>>> sqlContext.read.json("hdfs:///user/hadoop/data/*/*").
>>>
>>> This doesn't seem to work for some reason; I get timeout errors like the
>>> following:
>>>
>>> -------
>>> 16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
>>> 16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
>>> ...
>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>> ...
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>>> ------
>>>
>>> I don't want to start tinkering with increasing timeouts yet. I tried to
>>> load just one sub-directory, which contains around 4k files, and this
>>> seems to work fine. So I thought of writing a loop where I load the JSON
>>> files from each sub-dir and then unionAll the current dataframe with the
>>> previous dataframe. However, this also fails because apparently the JSON
>>> files don't have the exact same schema, causing this error:
>>>
>>> ---
>>> Traceback (most recent call last):
>>>   File "/home/hadoop/load_json.py", line 65, in <module>
>>>     df = df.unionAll(hrdf)
>>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 998, in unionAll
>>>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
>>> pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
>>> ---
>>>
>>> I'd like to know what's preventing Spark from loading 70k files the same
>>> way it's loading 4k files?
>>>
>>> To give you some idea about my setup and data:
>>> - ~70k files across 24 directories in HDFS
>>> - Each directory contains 3k files on average
>>> - Cluster: 200-node EMR cluster; each node has 53 GB memory and 8 cores
>>>   available to YARN
>>> - Spark 1.6.1
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>
>>
>


-- 
---
Takeshi Yamamuro
