flink-dev mailing list archives

From Paris Carbone <par...@kth.se>
Subject Re: sampling function
Date Tue, 12 Jul 2016 10:11:54 GMT
Hey Do,

I think that more sophisticated samplers would be a better fit for the ML library than for
the core API, but I am not very familiar with the milestones there.
Maybe the maintainers of the batch ML library could check whether sampling techniques would be
useful there.

Paris

> On 11 Jul 2016, at 16:15, Le Quoc Do <lequocdo@gmail.com> wrote:
> 
> Hi all,
> 
> Thank you all for your answers.
> By the way, I also noticed that Flink doesn't support a "stratified
> sampling" function for DataSet (only simple random sampling).
> It would be nice if someone could create a Jira for it and assign the task
> to me so that I can work on it.
> 
> Thank you,
> Do
> 
> On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <vasilikikalavri@gmail.com> wrote:
> 
>> Hi Do,
>> 
>> Paris and Martha worked on sampling techniques for data streams on Flink
>> last year. If you want to implement your own samplers, you might find
>> Martha's master thesis helpful [1].
>> 
>> -Vasia.
>> 
>> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>> 
>> On 11 July 2016 at 11:31, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:
>> 
>>> Hi Do,
>>> 
>>> In DataStream you can always implement your own
>>> sampling function, hopefully without too much effort.
>>> 
>>> Adding such functionality to the API could be a good idea.
>>> But given that in sampling there is no “one-size-fits-all”
>>> solution (as not every use case needs random sampling and not
>>> all random samplers fit all workloads), I am not sure if we
>>> should start adding different sampling operators.
>>> 
>>> Thanks,
>>> Kostas
>>> 
>>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <code@greghogan.com> wrote:
>>>> 
>>>> Hi Do,
>>>> 
>>>> DataSet provides a stable @Public interface. DataSetUtils is marked
>>>> @PublicEvolving which is intended for public use, has stable behavior, but
>>>> method signatures may change. It's also good to limit DataSet to common
>>>> methods whereas the utility methods tend to be used for specific
>>>> applications.
>>>> 
>>>> I don't have the pulse of streaming but this sounds like a useful feature
>>>> that could be added.
>>>> 
>>>> Greg
>>>> 
>>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequocdo@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm working on approximate computing using sampling techniques. I
>>>>> noticed that Flink supports the sample function for DataSet
>>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
>>>>> you didn't merge the function into org/apache/flink/api/java/DataSet.java,
>>>>> since the sample function works as a transformation operator?
>>>>> 
>>>>> My second question: are you planning to support the sample
>>>>> function for DataStream (within windows)? I did not see it in the
>>>>> DataStream code.
>>>>> 
>>>>> Thank you,
>>>>> Do
>>>>> 
>>> 
>>> 
>> 
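
Editorial sketch (not part of the original thread): the messages above discuss
stratified sampling and hand-written samplers without showing one. The
following is a minimal, hedged illustration in plain Java of one possible
approach, keeping an independent fixed-size reservoir (Algorithm R) for each
stratum. The class name StratifiedReservoirSampler and its methods are
hypothetical and are not part of Flink's API; wiring the logic into a DataSet
or DataStream operator is left out and would be a separate design question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Hypothetical helper (not a Flink class): keeps a uniform random sample of
 * at most `perStratum` items for every stratum key it has seen.
 */
final class StratifiedReservoirSampler<K, T> {
    private final int perStratum;
    private final Random rng;
    private final Map<K, List<T>> reservoirs = new HashMap<>();
    private final Map<K, Long> seen = new HashMap<>();

    StratifiedReservoirSampler(int perStratum, long seed) {
        if (perStratum <= 0) {
            throw new IllegalArgumentException("perStratum must be positive");
        }
        this.perStratum = perStratum;
        this.rng = new Random(seed);
    }

    /** Offers one item belonging to the given stratum. */
    void add(K stratum, T item) {
        // n = number of items seen so far in this stratum, including this one.
        long n = seen.merge(stratum, 1L, Long::sum);
        List<T> reservoir =
                reservoirs.computeIfAbsent(stratum, k -> new ArrayList<>(perStratum));
        if (reservoir.size() < perStratum) {
            // The first `perStratum` items of a stratum are always kept.
            reservoir.add(item);
        } else {
            // Algorithm R: draw j in [0, n); the new item replaces slot j
            // only if j falls inside the reservoir, i.e. with probability
            // perStratum / n. The double-based draw is an approximation
            // that is adequate for a sketch.
            long j = (long) (rng.nextDouble() * n);
            if (j < perStratum) {
                reservoir.set((int) j, item);
            }
        }
    }

    /** Returns a copy of the current sample, grouped by stratum. */
    Map<K, List<T>> sample() {
        Map<K, List<T>> copy = new HashMap<>();
        for (Map.Entry<K, List<T>> e : reservoirs.entrySet()) {
            copy.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return copy;
    }
}
```

Because each stratum has its own reservoir, a rare stratum keeps all of its
items until it exceeds the per-stratum capacity, which simple random sampling
over the whole data set does not guarantee. A fixed per-stratum size is only
one policy; proportional allocation would need the stratum sizes in advance or
a second pass.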
