spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: New Feature Request
Date Wed, 05 Aug 2015 09:39:58 GMT
I don't think countApprox is appropriate here unless approximation is OK.
But more generally, counting everything matching a filter requires applying
the filter to the whole data set, which seems like the thing to be avoided
here.

The take approach is better since it would stop after finding n matching
elements (it might do a little extra work given partitioning and
buffering). It would not filter the whole data set.

The only downside there is that it would copy n elements to the driver.
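
For illustration, the take-based check discussed above could be wrapped in a small helper. This is only a minimal sketch: `RddExists`, `existsAtLeast` and `existsAny` are hypothetical names, not part of the Spark API, and it assumes `spark-core` is on the classpath.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helpers for illustration only; not part of the Spark API.
object RddExists {

  // True if at least n elements of rdd satisfy f.
  // take(n) returns an Array of at most n matching elements, so the
  // check is on the array length. take scans partitions incrementally
  // and can stop once n matches are found, instead of filtering the
  // whole data set.
  def existsAtLeast[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean =
    n <= 0 || rdd.filter(f).take(n).length >= n

  // Special case n = 1: true if any element satisfies f.
  def existsAny[T](rdd: RDD[T])(f: T => Boolean): Boolean =
    !rdd.filter(f).isEmpty()
}
```

Usage would look like `RddExists.existsAtLeast(data)(qualifying_function, n)`. If copying the n matching elements to the driver is a concern, one option (again, just a suggestion) is to map each match to a small placeholder such as `()` before calling take.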

On Wed, Aug 5, 2015 at 10:34 AM, Sandeep Giri <sandeep@knowbigdata.com>
wrote:

> Hi Jonathan,
>
> Does that guarantee a result? I do not see that it is really optimized.
>
> Hi Carsten,
>
>
> How does the following code work:
>
> data.filter(qualifying_function).take(n).length >= n
>
>
> Also, as per my understanding, in both of the approaches you mentioned the
> qualifying function will be executed on the whole dataset even if the value
> was already found in the first element of the RDD:
>
>
>    - data.filter(qualifying_function).take(n).length >= n
>    - val contains1MatchingElement =
>      !(data.filter(qualifying_function).isEmpty())
>
> Is that right, or am I missing something?
>
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. <http://KnowBigData.com.>
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Jul 31, 2015 at 3:37 PM, Jonathan Winandy <
> jonathan.winandy@gmail.com> wrote:
>
>> Hello !
>>
>> You could try something like this:
>>
>> def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Int):Boolean = {
>>   rdd.filter(f).countApprox(timeout = 10000).getFinalValue().low >= n
>> }
>>
>> It would work for large datasets and large values of n.
>>
>> Have a nice day,
>>
>> Jonathan
>>
>>
>>
>> On 31 July 2015 at 11:29, Carsten Schnober <
>> schnober@ukp.informatik.tu-darmstadt.de> wrote:
>>
>>> Hi,
>>> the RDD class does not have an exists() method (in the Scala API), but
>>> the functionality you need seems easy to reproduce with the existing
>>> methods:
>>>
>>> val containsNMatchingElements =
>>> data.filter(qualifying_function).take(n).length >= n
>>>
>>> Note: I am not sure whether the intermediate take(n) really increases
>>> performance, but the idea is to arbitrarily reduce the number of
>>> elements in the RDD before counting because we are not interested in the
>>> full count.
>>>
>>> If you need to check specifically whether there is at least one matching
>>> occurrence, it is probably preferable to use isEmpty() instead of
>>> count() and check whether the result is false:
>>>
>>> val contains1MatchingElement =
>>> !(data.filter(qualifying_function).isEmpty())
>>>
>>> Best,
>>> Carsten
>>>
>>>
>>>
>>> Am 31.07.2015 um 11:11 schrieb Sandeep Giri:
>>> > Dear Spark Dev Community,
>>> >
>>> > I am wondering if there is already a function to solve my problem. If
>>> > not, then should I work on this?
>>> >
>>> > Say you just want to check if a word exists in a huge text file. I
>>> > could not find better ways than those mentioned here
>>> > <http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6>.
>>> >
>>> > So, I propose adding a function called /exists/ to RDD with
>>> > the following signature:
>>> >
>>> > # Returns true if at least n elements satisfy our criteria.
>>> > # The qualifying function receives the element and its index and
>>> > # returns true or false.
>>> > def /exists/(qualifying_function, n):
>>> >      ....
>>> >
>>> >
>>> > Regards,
>>> > Sandeep Giri,
>>> > +1 347 781 4573 (US)
>>> > +91-953-899-8962 (IN)
>>> >
>>> > www.KnowBigData.com. <http://KnowBigData.com.>
>>> > Phone: +1-253-397-1945 (Office)
>>> >
>>> >
>>>
>>> --
>>> Carsten Schnober
>>> Doctoral Researcher
>>> Ubiquitous Knowledge Processing (UKP) Lab
>>> FB 20 / Computer Science Department
>>> Technische Universität Darmstadt
>>> Hochschulstr. 10, D-64289 Darmstadt, Germany
>>> phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
>>> schnober@ukp.informatik.tu-darmstadt.de
>>> www.ukp.tu-darmstadt.de
>>>
>>> Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
>>> GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
>>> (AIPHES): www.aiphes.tu-darmstadt.de
>>> PhD program: Knowledge Discovery in Scientific Literature (KDSL)
>>> www.kdsl.tu-darmstadt.de
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>
>
