flink-user mailing list archives

From Sebastian <...@apache.org>
Subject Re: Random Shuffling
Date Wed, 24 Jun 2015 09:08:27 GMT
A very simple way to achieve this is to generate a random variate on the 
driver that describes a mapping of datapoints to samples. Then you 
simply join the dataset with this mapping to generate the samples.

This approach requires you to know the size of the dataset in advance, 
but has the advantage that you can guarantee the sizes of the samples 
and can easily support more involved techniques such as sampling with 
replacement.
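
To make the idea concrete, here is a minimal plain-Java sketch of the mapping step only. It is not Flink code: the class name SplitMappingSketch and the method assignSplits are made up for illustration, and it assumes the datapoints can be addressed by a dense index 0..n-1. It builds a list with exactly as many labels per split as requested, shuffles it on the driver, and uses the result as the mapping from datapoint to split. In a Flink job the mapping would then be joined with the dataset on the element id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, for illustration only (not part of the Flink API).
class SplitMappingSketch {

    // Returns labels[i] = index of the split that datapoint i is assigned to.
    // Exactly sizes[s] datapoints receive label s; the sizes must sum to n.
    static int[] assignSplits(int n, int[] sizes, long seed) {
        List<Integer> labels = new ArrayList<>(n);
        for (int s = 0; s < sizes.length; s++) {
            for (int k = 0; k < sizes[s]; k++) {
                labels.add(s);
            }
        }
        if (labels.size() != n) {
            throw new IllegalArgumentException("split sizes must sum to n");
        }
        // A uniform random permutation of the labels gives a random
        // assignment while keeping the split sizes exact.
        Collections.shuffle(labels, new Random(seed));
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = labels.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example: 10 datapoints split 6 / 2 / 2
        // (training / validation / test).
        int[] mapping = assignSplits(10, new int[] {6, 2, 2}, 42L);
        // The "join": datapoint i goes to split mapping[i].
        for (int i = 0; i < mapping.length; i++) {
            System.out.println("datapoint " + i + " -> split " + mapping[i]);
        }
    }
}
```

Sampling with replacement would only change how the mapping is generated (draw datapoint indices independently for each sample slot); the join stays the same.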

--sebastian


On 24.06.2015 10:38, Maximilian Alber wrote:
> That's not the point. In Machine Learning one often divides a data set X
> into, e.g., three sets: one for training, one for validation, and one
> for the final testing. The sets are usually created randomly according
> to some ratio. Thus it would be important to keep the ratio and to do
> the whole process randomly.
>
> Cheers,
> Max
>
> On Wed, Jun 24, 2015 at 9:51 AM, Stephan Ewen <sewen@apache.org> wrote:
>
>     If you do "rebalance()", it will redistribute elements in a round-robin
>     fashion, which should give you very even partition sizes.
>
>
>     On Tue, Jun 23, 2015 at 11:51 AM, Maximilian Alber
>     <alber.maximilian@gmail.com> wrote:
>
>         Thank you!
>
>         Still I cannot guarantee the size of each partition, or can I?
>         Something like randomSplit in Spark.
>
>         Cheers,
>         Max
>
>         On Mon, Jun 15, 2015 at 5:46 PM, Matthias J. Sax
>         <mjsax@informatik.hu-berlin.de> wrote:
>
>             Hi,
>
>             using partitionCustom, the data distribution depends only on
>             your probability distribution. If it is uniform, you should
>             be fine, i.e., choosing the channel like
>
>              > private final Random random = new Random(System.currentTimeMillis());
>              > public int partition(K key, int numPartitions) {
>              >   return random.nextInt(numPartitions);
>              > }
>
>             should do the trick.
>
>             -Matthias
>
>             On 06/15/2015 05:41 PM, Maximilian Alber wrote:
>             > Thanks!
>             >
>             > Ok, so for a random shuffle I need partitionCustom. But in that case
>             > the data might be out of balance then?
>             >
>             > For the splitting. Is there no way to have exact sizes?
>             >
>             > Cheers,
>             > Max
>             >
>             > On Mon, Jun 15, 2015 at 2:26 PM, Till Rohrmann <trohrmann@apache.org> wrote:
>             >
>             >     Hi Max,
>             >
>             >     you can always shuffle your elements using the |rebalance| method.
>             >     What Flink does here is distribute the elements of each partition
>             >     among all available TaskManagers. This happens in a round-robin
>             >     fashion and is thus not completely random.
>             >
>             >     Another option is the |partitionCustom| method, which allows you
>             >     to specify for each element the partition it is sent to. You
>             >     would have to specify a |Partitioner| to do this.
>             >
>             >     For the splitting there is at the moment no syntactic sugar. What you
>             >     can do, though, is to assign each item a split ID and then use a
>             >     |filter| operation to filter the individual splits. Depending on
>             >     your split ID distribution you will have differently sized splits.
>             >
>             >     Cheers,
>             >     Till
>             >
>             >     On Mon, Jun 15, 2015 at 1:50 PM Maximilian Alber
>             >     <alber.maximilian@gmail.com> wrote:
>              >
>              >         Hi Flinksters,
>              >
>              >         I would like to shuffle the elements in my data set and then
>              >         split it in two according to some ratio. Each element in the
>              >         data set has a unique id. Is there a nice way to do it with the
>              >         Flink API?
>              >         (It would be nice to have guaranteed random shuffling.)
>              >         Thanks!
>              >
>              >         Cheers,
>              >         Max
>              >
>              >
>              >
>
>
>
>
