spark-user mailing list archives

From Reza Zadeh <r...@databricks.com>
Subject Re: k-means can only run on one executor with one thread?
Date Sat, 28 Mar 2015 08:06:02 GMT
How many dimensions does your data have? The size of the k-means model is k
* d, where d is the dimension of the data.

Since you're using k=1000, if your data has dimension higher than,
say, 10,000, you will run into trouble, because k * d doubles have to
fit in the driver's memory.
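
For example, with k = 1000 and d = 100,000, the model alone is
1000 * 100,000 * 8 bytes = 800 MB of doubles that must fit on the
driver.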

Reza

On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen <davidshen84@gmail.com> wrote:

> I have put more detail of my problem at
> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>
> I would really appreciate it if you could take a look at this
> problem. I have tried various settings and ways to load/partition my
> data, but I just cannot get rid of that long pause.
>
>
> Thanks,
> David
>
>
>
>
>
> Xi Shen
> http://about.me/davidshen
>
> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshen84@gmail.com> wrote:
>
>> Yes, I have already repartitioned.
>>
>> I tried repartitioning to the number of cores in my cluster. Not
>> helping...
>> I tried repartitioning to the number of centroids (the k value). Not
>> helping...
>>
>>
>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <joseph@databricks.com>
>> wrote:
>>
>>> Can you try specifying the number of partitions when you load the data
>>> to equal the number of executors?  If your ETL changes the number of
>>> partitions, you can also repartition before calling KMeans.
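>>>
>>> For example (an untested sketch; the path, the space-separated
>>> parsing, and the iteration count are placeholders for your setup):
>>>
>>>   import org.apache.spark.mllib.clustering.KMeans
>>>   import org.apache.spark.mllib.linalg.Vectors
>>>
>>>   val numPartitions = 4  // = number of executors
>>>   val data = sc.textFile("hdfs:///path/to/data", numPartitions)
>>>     .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
>>>     .repartition(numPartitions)  // only if your ETL changed it
>>>     .cache()
>>>   val model = KMeans.train(data, 5000, 20)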
>>>
>>>
>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshen84@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a large data set, and I expect to get 5000 clusters.
>>>>
>>>> I load the raw data and convert it into DenseVectors; then I
>>>> repartition and cache; finally I pass the RDD[Vector] to
>>>> KMeans.train().
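>>>>
>>>> Roughly, in code (a simplified sketch; the real path, parsing, and
>>>> partition count are different):
>>>>
>>>>   import org.apache.spark.mllib.clustering.KMeans
>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>
>>>>   val vectors = sc.textFile("hdfs:///path/to/raw/data")
>>>>     .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
>>>>     .repartition(32)  // partition count is a placeholder
>>>>     .cache()
>>>>   val model = KMeans.train(vectors, 5000, 20)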
>>>>
>>>> Now the job is running and the data is loaded. But according to
>>>> the Spark UI, all the data is on one executor. I checked that
>>>> executor: its CPU load is very low, and I think it is using only 1
>>>> of its 8 cores. The other 3 executors are idle.
>>>>
>>>> Did I miss something? Is it possible to distribute the workload
>>>> across all 4 executors?
>>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>
>
