spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: skip lines in spark
Date Wed, 23 Apr 2014 16:51:08 GMT
If the first partition has fewer than n records, that approach will not
drop enough lines. Try

rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)

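Note that zipWithIndex may trigger a Spark job: when the RDD has more than
one partition, it must first count the records in each partition.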
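
For reference, here is a minimal, self-contained sketch of the zipWithIndex
approach run in local mode. The object name SkipLines, the helper skipFirstN,
the app name, and the sample data are all made up for illustration; only
zipWithIndex, filter, map, parallelize, and collect are Spark API calls. The
sample RDD is built with 4 partitions on purpose, so that the first partition
holds fewer records than the number being skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SkipLines {
  // Hypothetical helper (not part of Spark): drops the first n records of an
  // RDD, counting across partition boundaries.
  def skipFirstN(rdd: RDD[String], n: Long): RDD[String] =
    rdd.zipWithIndex()                        // pair each record with its global index
      .filter { case (_, idx) => idx >= n }   // keep records at index n and beyond
      .map { case (line, _) => line }         // discard the index again

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, just for trying this out.
    val conf = new SparkConf().setAppName("skip-lines-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 12 records over 4 partitions, so the first partition holds only 3.
      val rddData = sc.parallelize((1 to 12).map(i => s"line $i"), 4)
      // Skipping 5 records therefore has to reach into the second partition.
      val remaining = skipFirstN(rddData, 5L).collect()
      println(remaining.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Because the index is global, the filter works no matter how the records are
spread over partitions, at the cost of the extra job mentioned above.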

Best,
Xiangrui

On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai <dbtsai@stanford.edu> wrote:
> Hi Chengi,
>
> If you just want to skip the first n lines in an RDD, you can do
>
> rddData.mapPartitionsWithIndex(
>   (partitionIdx: Int, lines: Iterator[String]) =>
>     // Iterator.drop returns a new iterator; use its result, not `lines`.
>     if (partitionIdx == 0) lines.drop(n) else lines
> )
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu <chengi.liu.86@gmail.com> wrote:
>>
>> Hi,
>>   What is the easiest way to skip the first n lines in an RDD?
>> I am not able to figure this one out.
>> Thanks
>
>
