accumulo-dev mailing list archives

From cprigano <chris.p.rig...@gmail.com>
Subject Re: Accumulo iterator to return a random sample of a percentile of a table
Date Wed, 05 Feb 2014 17:52:13 GMT
You rock, Chris! I was hoping I would not have to reinvent the "code".

This is exactly what I am looking for. Now all I need is a working
single-node Accumulo VM to try it out on ... Cloudera CDH4.3 doesn't work
with Accumulo 1.4.3 (thanks Bill!), so I am looking for something simple
that runs out of the box :-)


On Wed, Feb 5, 2014 at 5:44 AM, Chris Bennight [via Apache Accumulo] <
ml-node+s1065345n7403h46@n5.nabble.com> wrote:

> If it's for the input to some algorithm (machine learning, etc.) I'm
> assuming it *is* important to have that 25% be representative of the
> entire population.
>
> HBase implements a simple strategy with a RandomRowFilter [1] that could
> trivially be adapted to an Accumulo filter (Iterator). The caveat is that
> it's going to be essentially a full table scan each time: set a
> percentage, and then randomly choose whether each key is accepted or not.
> Note that if each of your "values" (i.e. the granularity you want to
> accept or reject groups on) is more than one key/value pair, you will want
> to use something like the WholeRowIterator first to aggregate them, then
> test for accept/reject. You probably don't want to use the
> WholeRowIterator as is, as you would want to test/reject on the full key
> and only aggregate if it passes, but you can use it as a pattern.
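
The accept/reject idea described above can be sketched outside Accumulo in a
few lines of Python. This is only an illustration of the sampling logic, not
Accumulo or HBase code: the function name sample_rows, the (row, column,
value) tuple layout and the seed parameter are all made up for the example.
It assumes entries arrive sorted by row, makes one random draw per row, and
only emits a row's entries if that draw passes.

```python
import random
from itertools import groupby


def sample_rows(entries, fraction, seed=None):
    """Yield all (row, column, value) entries of each randomly accepted row.

    Hypothetical helper for illustration only. `entries` must be sorted by
    row, as a table scan would return them. Each row is accepted
    independently with probability `fraction`.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for _row, group in groupby(entries, key=lambda entry: entry[0]):
        # One draw per row, made before the row's entries are consumed,
        # so rejected rows are skipped without being aggregated.
        if rng.random() < fraction:
            yield from group
```

Because each row is an independent draw, a fraction of 0.25 returns roughly,
not exactly, a quarter of the rows, and every call still walks the whole
input, which matches the full-table-scan caveat above.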
>
> If you want something faster, then I think you are going to need to
> generate and keep some population statistics / summaries on ingest, and
> query those. This will add more sampling error based on the granularity of
> your summaries, but you should be able to quantify that with standard
> error propagation.
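
As a rough illustration of the error propagation mentioned above: if each
summary yields an estimate with its own standard error, and the estimates
are assumed to be independent, the uncertainty of their sum is the square
root of the sum of the squared errors. The function name propagate_sum and
the (value, standard_error) pair layout are assumptions for this sketch;
what the summaries actually contain is not specified in the thread.

```python
import math


def propagate_sum(estimates):
    """Combine (value, standard_error) pairs for a sum of independent terms.

    Hypothetical helper: assumes the per-summary errors are uncorrelated,
    so their variances add.
    """
    total = sum(value for value, _ in estimates)
    error = math.sqrt(sum(se * se for _, se in estimates))
    return total, error
```

For example, two summaries estimating 10 with error 3 and 20 with error 4
combine to a total of 30 with an error of 5. Correlated summaries would need
covariance terms that this sketch leaves out.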
>
>
> [1] https://github.com/apache/hbase/blob/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/RandomRowFilter.java
>
>
>
> On Tue, Feb 4, 2014 at 10:39 PM, cprigano <[hidden email]> wrote:
>
> > Good questions all! I am going to start by trying to just take a
> > percentile of rows in a table, similar to the split used to construct
> > training, cross-validation and testing sets. I am a machine learning
> > person and want to be able to do, say, a 25% random sample of rows in a
> > table (I may not know the size, and the percentile should be settable).
> > Starting with the easiest assumption, that all rows are the same "type",
> > will get things started. I can then move to more exotic scenarios.
> > Accumulo is a new nut for me to crack and I would very much like your
> > thoughts. Thanks mate!
> >
> >
> > On Tue, Feb 4, 2014 at 7:27 PM, Chris Bennight [via Apache Accumulo] <[hidden email]> wrote:
> >
> > > I'm assuming you want a random selection of entries in Accumulo, so
> > > say a random selection of keys/values?
> > >
> > > How are your keys formatted (conceptually is fine); is there some sort
> > > of regularity to them? (I.e. can you calculate ahead of time a random
> > > distribution of keys without validating which keys are present?)
> > >
> > > If you can't calculate the key distribution ahead of time, are you
> > > keeping any statistics (or could you) on ingest (cardinality,
> > > distribution, etc.)? And finally, how rigorous and performant do you
> > > need this random sampling to be? Do you just want representative data,
> > > or are you trying to do something like BlinkDB [1] (allow people to
> > > specify confidence intervals on queries, and only sample enough data to
> > > meet the requisite uncertainty requirements)?
> > >
> > > [1] http://blinkdb.org/
> > >
> > > Chris
> > >
> > >
> > >
> > >
> > > On Sat, Feb 1, 2014 at 3:58 PM, cprigano <[hidden email]> wrote:
> > >
> > > > I am looking at writing an Accumulo iterator to return a random
> > > > sample of a percentile of a table.
> > > >
> > > > I would appreciate any suggestions.
> > > >
> > > > Thanks,
> > > >
> > > > Chris
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354.html
> > > > Sent from the Developers mailing list archive at Nabble.com.
> > >
> > >
> > --
> > View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354p7400.html
> > Sent from the Developers mailing list archive at Nabble.com.
> >
>
>
>




--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354p7417.html
Sent from the Developers mailing list archive at Nabble.com.