systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Randomly Selecting rows from a dataframe
Date Sun, 30 Apr 2017 20:15:51 GMT
Well, you can draw a sample with replacement by constructing the
permutation (selection) matrix slightly differently (optionally, you could
also sort the sample if required):

P = table(seq(1,N), sample(nrow(X),N,TRUE), N, nrow(X));
Xsample = P %*% X;
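
For readers who want to see what the two lines above compute, here is a minimal plain-Python sketch (not SystemML code). It assumes X is a list of rows; the helper name sample_with_replacement and the seed parameter are hypothetical and only there for illustration.

```python
import random

def sample_with_replacement(X, n, seed=None):
    """Emulate P = table(seq(1,N), sample(nrow(X),N,TRUE), N, nrow(X)); Xsample = P %*% X."""
    rng = random.Random(seed)
    m = len(X)
    # 1-based row ids drawn with replacement, like sample(nrow(X), N, TRUE)
    ids = [rng.randint(1, m) for _ in range(n)]
    # n x m selection matrix: P[i][j] = 1 iff output row i copies input row j+1
    P = [[1 if ids[i] == j + 1 else 0 for j in range(m)] for i in range(n)]
    # plain matrix product P %*% X
    cols = len(X[0])
    Xs = [[sum(P[i][k] * X[k][c] for k in range(m)) for c in range(cols)]
          for i in range(n)]
    return ids, P, Xs

X = [[10 * r] for r in range(1, 6)]           # 5 x 1 toy matrix
ids, P, Xs = sample_with_replacement(X, 8, seed=0)
print(ids)   # sampled row ids; repeats are possible
print(Xs)    # the corresponding rows of X, in sampled order
```

Each row of P contains exactly one 1, so a row id that is drawn twice simply yields two identical rows in the product. Sorting ids before building P would give a sorted sample.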

Btw, your script didn't work because removeEmpty with a selection vector
expects a non-zero indicator by position, not by value (e.g., a non-zero in
the 7th cell indicates that you want to select the 7th row, regardless of
the actual value you feed in).
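
The by-position behaviour can be illustrated with a small plain-Python sketch. It is an emulation of the semantics described above, not SystemML's implementation, and the function name remove_empty_rows is hypothetical.

```python
def remove_empty_rows(X, select):
    """Keep row i iff select[i] is non-zero; the value stored there is ignored."""
    return [row for row, s in zip(X, select) if s != 0]

X = [[v] for v in range(1, 11)]        # 10 x 1 matrix holding 1..10
# Sampled row ids (3, 3, 7) padded with zeros, as in the script under discussion
by_value = [3, 3, 7] + [0] * 7
print(remove_empty_rows(X, by_value))  # [[1], [2], [3]]
```

The non-zero entries sit in positions 1 to 3, so the first three rows are returned, not rows 3, 3 and 7. A position can only be switched on once, which is why this approach cannot produce repeated rows.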

Regards,
Matthias


On Sun, Apr 30, 2017 at 1:47 AM, arijit chakraborty <akc14@hotmail.com>
wrote:

> Hi,
>
>
> The solution Matthias gave works perfectly for random sampling of the
> dataframe without replacement, but it does not work with replacement.
> E.g., suppose the original dataframe has the form
> matrix(seq(1,100), 100, 1) and I want to randomly select 20 rows. With
> Matthias' example, we can sample that, and the new matrix might look like this
>
>
> matrix("1 2 3 21 29 36 37 40 45 53 55 56 71 72 79  82 90 96 97 99", 20,1).
>
>
> But if I want a matrix of this form, (which can be possible with random
> sampling with replacement)
>
>
> matrix("1 2 3 21 21 21 37 40 45 53 53 56 71 79 79  82 90 96 97 99", 20,1).
>
>
> I'm not getting it.
>
>
> I tried the following code:
>
>
> data_ind = matrix(seq(1,nrow(actual_data), 1), nrow(bdframe_bt_subset_1),
> 1)
>
> data_sample = sample(nrow(data_ind), 100, TRUE)
>
> data_sample_matrix= matrix(data_sample, 100, 1)
>
> a = matrix(0, (nrow(data_ind)- nrow(data_sample_matrix)), 1)
>
> data_sample1 = rbind(data_sample, a)
>
> b = removeEmpty(target=actual_data, margin="rows", select = data_sample1);
>
> But this does not give me the repeated rows, even though I can see
> repeated positions in "data_sample_matrix".
>
> I also tried following "sample.dml" in the "utils" folder, but that also
> does not give me the answer I'm looking for.
>
> We could use a for-loop over the "data_sample_matrix" matrix in this case,
> but I want to avoid looping.
>
> Can anyone please help?
>
> Thank you!
> Arijit
>
>
>
>
> ________________________________
> From: arijit chakraborty <akc14@hotmail.com>
> Sent: Saturday, April 22, 2017 12:45 PM
> To: dev@systemml.incubator.apache.org
> Subject: Re: Randomly Selecting rows from a dataframe
>
> Thank you Matthias! You are most helpful!
>
>
> Thanks again!
>
> Arijit
>
> ________________________________
> From: Matthias Boehm <mboehm7@googlemail.com>
> Sent: Saturday, April 22, 2017 2:20:48 AM
> To: dev@systemml.incubator.apache.org
> Subject: Re: Randomly Selecting rows from a dataframe
>
> You can take, for example, a 1% sample of rows via a permutation matrix
> (specifically, a selection matrix) as follows:
>
> I = (rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.01);
> P = removeEmpty(target=diag(I), margin="rows");
> Xsample = P %*% X;
>
> or via removeEmpty and a selection vector:
>
> I = (rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.01);
> Xsample = removeEmpty(target=X, margin="rows", select=I);
>
> Both should be compiled internally to very similar plans.
>
> Regards,
> Matthias
>
> On Fri, Apr 21, 2017 at 1:42 PM, arijit chakraborty <akc14@hotmail.com>
> wrote:
>
> > Hi,
> >
> >
> > Suppose I have a dataframe with 10 variables (X1-X10) and 1000 rows. Now
> > I want to randomly select rows so that I have a subset of the dataset.
> >
> >
> > Can anyone please help me to solve this problem?
> >
> >
> > I tried the following code:
> >
> >
> > randSample = sample(nrow(dataframe), 200);
> >
> >
> > This gives me a column matrix with the positions of the randomly selected
> > rows. But I could not figure out how to use this matrix to subset the
> > original dataframe.
> >
> >
> > Thank you!
> >
> >
> > Arijit
> >
>
