spark-user mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: skip lines in spark
Date Wed, 23 Apr 2014 17:02:58 GMT
What I suggested will not work if the number of records you want to drop
exceeds the number of records in the first partition. In my use case, I only
drop the first couple of lines, so I don't have this issue.
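
To make the failure case concrete, here is a minimal sketch in plain Scala (no Spark dependency). It assumes each inner Seq stands in for one partition; the object and method names are hypothetical and only mirror the two approaches discussed in this thread.

```scala
object SkipLinesSketch {
  // Mirrors the mapPartitionsWithIndex approach: drop n lines from partition 0 only.
  def dropFromFirstPartition(partitions: Seq[Seq[String]], n: Int): Seq[String] =
    partitions.zipWithIndex.flatMap { case (lines, partitionIdx) =>
      if (partitionIdx == 0) lines.drop(n) else lines
    }

  // Mirrors zipWithIndex().filter(_._2 >= n).map(_._1): index globally, then filter.
  def dropByGlobalIndex(partitions: Seq[Seq[String]], n: Int): Seq[String] =
    partitions.flatten.zipWithIndex.filter(_._2 >= n).map(_._1)

  def main(args: Array[String]): Unit = {
    // Two simulated partitions holding 2 and 3 lines.
    val partitions = Seq(Seq("a", "b"), Seq("c", "d", "e"))

    // n = 1 fits inside partition 0, so both approaches agree.
    println(dropFromFirstPartition(partitions, 1))
    println(dropByGlobalIndex(partitions, 1))

    // n = 3 exceeds partition 0: the first approach skips only 2 lines,
    // while the global-index approach skips all 3.
    println(dropFromFirstPartition(partitions, 3))
    println(dropByGlobalIndex(partitions, 3))
  }
}
```

With n = 3 the per-partition version still returns "c", because it can remove at most the two lines that live in partition 0.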


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 9:55 AM, Chengi Liu <chengi.liu.86@gmail.com> wrote:

> Xiangrui,
>   So, is the full code suggestion:
> val trigger = rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
>
> and then what DB Tsai recommended
> trigger.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => {
>   // Return the iterator produced by drop; the original must not be reused after drop
>   if (partitionIdx == 0) lines.drop(n) else lines
> })
>
> Is that the full operation?
>
> What happens if I have to drop more records than partition 0 contains?
> How do I handle that case?
>
>
>
>
> On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>
>> If the first partition doesn't have enough records, then it may not
>> drop enough lines. Try
>>
>> rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
>>
>> Note that zipWithIndex triggers a Spark job when the RDD has more than
>> one partition.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>> > Hi Chengi,
>> >
>> > If you just want to skip the first n lines in an RDD, you can do
>> >
>> > rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => {
>> >   // Return the iterator produced by drop; the original must not be reused after drop
>> >   if (partitionIdx == 0) lines.drop(n) else lines
>> > })
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu <chengi.liu.86@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>   What is the easiest way to skip the first n lines in an RDD?
>> >> I am not able to figure this one out.
>> >> Thanks
>> >
>> >
>>
>
>
