mahout-dev mailing list archives

From SriSatish Ambati <srisat...@0xdata.com>
Subject Re: 0xdata interested in contributing
Date Fri, 14 Mar 2014 03:10:05 GMT
Mahout is the "hello world" of machine learning. It's still the first place
many new users get exposed to algorithms on big data. Making that
experience beautiful, accessible and value-driven will make machine
learning ubiquitous, and Mahout a movement to rival the success & utility
of, say, Lucene and Hadoop. Our vision and motivation are to re-ignite the
community & double down on the shared founding visions of Mahout and
H2O. Under one umbrella, Mahout can power intelligent applications for
enterprises and users.

Creating great software is hard; creating passionate communities is harder.
Our belief is that a product is not complete without its community. This
convergence will make Mahout the principal platform for integrating
multiple ways of mining insights from data.

the whole is greater than the sum of the parts,
Sri



On Thu, Mar 13, 2014 at 6:13 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> PS And of course it all sounds like a well-rounded project that exceeds
> current Mahout capabilities (in the mapreduce world, anyway). So not the
> least of the questions is why you are seeking integration with Mahout.
> Clearly that would involve significant effort to do some things the
> Mahout way. So what's the motivation?
> On Mar 13, 2014 6:08 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>
> > Thank you, Cliff.
> >
> > Those things are pretty much clear. Most of the questions were more
> > along the lines of which of those wonderful things you intend to port to
> > Mahout, and how you see them stitching in with the existing Mahout
> > architecture.
> >
> > At least one of your users here reported that it does not make sense to
> > run Mahout on all this, and at least two of us have trouble seeing how
> > such disassembly and reassembly might take place. What are your thoughts
> > on this? How clearly do you see the reintegration roadmap?
> >
> > Do you intend to keep the H2O platform around as a standalone project?
> >
> > Do you intend to contribute top-level algorithms as well? What are your
> > thoughts on interoperability of top level algorithms with other
> > memory-based backends?
> >
> > Thank you.
> > On Mar 13, 2014 3:50 PM, "Cliff Click" <cclick0@gmail.com> wrote:
> >
> >> There have been a lot of questions on the H2O architecture, I hope to
> >> answer the top-level ones here.
> >>
> >>
> >> H2O is a fast & flexible engine.  We talk about the MapReduce execution
> >> flavor because it's easy to explain, because it covers a lot of ground,
> >> and because we've implemented a bunch of dense linear-algebra style
> >> algorithms with it - but that's not the only thing we can do with H2O,
> >> nor is it the only coding "style".
> >>
> >>
> >> H2O is based on a number of layers, and is coded at different layers to
> >> best approach different tasks and objectives.
> >>
> >>  * *In-memory K/V store layer*: H2O sports an in-memory
> >>    (non-persistent) K/V store, with **exact** (not lazy) consistency
> >>    semantics and transactions.  Both reads and writes are fully
> >>    (locally) cachable.  Typical cache-hit latencies for both are
> >>    around 150ns (that's **nanoseconds**) from a NonBlockingHashMap.
> >>    Let me repeat that: reads and writes go through a non-blocking hash
> >>    table - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
> >>    Cache-misses obviously require a network hop, and the execution
> >>    times are totally driven by the size of data moved divided by
> >>    available bandwidth... and of course the results are cached.  The
> >>    K/V store is currently used to hold control state, all results, and
> >>    of course the Big Data itself.  You could certainly build a dandy
> >>    graph-based algorithm directly over the K/V store; that's been on
> >>    our long-term roadmap for awhile.
> >>
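> >>    A minimal single-JVM sketch of the lock-free read/write idea (using
> >>    java.util.concurrent.ConcurrentHashMap as a stand-in for the
> >>    NonBlockingHashMap and distributed store described above; the class
> >>    and key names here are made up for illustration):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class KVSketch {
    // Stand-in for the K/V store: lock-free reads and writes, so many
    // threads can hammer the same key without blocking each other.
    static final ConcurrentMap<String, double[]> STORE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        STORE.put("chunk-0", new double[]{1.0, 2.0, 3.0});  // write: no lock held
        double[] cached = STORE.get("chunk-0");             // read: local "cache hit"
        double sum = 0;
        for (double d : cached) sum += d;
        System.out.println(sum);
    }
}
```

> >>    A real deployment adds the network hop on a cache-miss and the exact
> >>    consistency semantics; this sketch only shows the non-blocking local
> >>    access path.
> >>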
> >>  * *A columnar-compressed distributed Big Data store layer*: Big Data
> >>    is heavily (and losslessly) compressed - typically 2x to 4x better
> >>    than GZIP on disk (YMMV) - and can be accessed like a Java array: a
> >>    giant, greater-than-4-billion-element distributed Java array.  H2O
> >>    guarantees that if the data is accessed linearly then the access
> >>    time will match what you can get out of C or Fortran - i.e., be
> >>    memory-bandwidth bound, not CPU bound.  You can access the array
> >>    (for both reads and writes) in any order, of course, but you get
> >>    strong speed guarantees for accessing in-order.  You can do pretty
> >>    much anything to an H2O array that you can do with a Java array,
> >>    although due to size/scale you'll probably want to access the array
> >>    in a blatantly parallel style.
> >>      o */A note on compression/*: The data is decompressed just-in-time,
> >>        strictly in CPU registers, in the hot inner loops - and THIS IS
> >>        FASTER than decompressing beforehand, because most algorithms are
> >>        memory-bandwidth bound.  Moving a 32-byte cache line of
> >>        compressed data into CPU registers gets more data per cache-miss
> >>        than moving 4 8-byte doubles. Decompression typically takes 2-4
> >>        instructions of shift/scale/add per element, and is well covered
> >>        by the cache-miss costs.
> >>      o */A note on Big Data and GC/*: H2O keeps all our data **in
> >>        heap**, but in large arrays of Java primitives. Our experience
> >>        shows that we run well, without GC issues, even *on very large
> >>        heaps with the default collector*. We routinely test with, e.g.,
> >>        heaps from 2G to 200G - and never see FullGC costs exceed a few
> >>        seconds every now and then (it depends on the rate of Big Data
> >>        writing going on). The normal Java object allocation used to
> >>        drive the system internally has a negligible GC load.  We keep
> >>        our data in-heap because it's as fast as possible (memory
> >>        bandwidth limited), easy to code (pure Java), and has no
> >>        interesting GC costs.  Our GC tuning policy is: "only use the
> >>        -Xmx flag, set to the largest you can allow given the machine
> >>        resources".  Take all the other GC defaults; they will work fine.
> >>      o */A note on Bigger Data (and GC)/*: We do a user-mode
> >>        swap-to-disk when the Java heap gets too full, i.e., you're
> >>        using more Big Data than physical DRAM.  We won't die with a GC
> >>        death-spiral, but we will degrade to out-of-core speeds. We'll
> >>        go as fast as the disk will allow.
> >>      o */A note on data ingest/*: We read data fully parallelized
> >>        from S3, HDFS, NFS, URIs, browser uploads, etc.  We can
> >>        typically drive HDFS disk spindles to an interesting fraction of
> >>        what you can get from, e.g., an HDFS file-copy.  We parse &
> >>        compress (in parallel) a very generous notion of a CSV file (for
> >>        instance, Hive files are directly ingestable), as well as
> >>        SVMLight files.  We are planning on an RDD ingester -
> >>        interoperability with other frameworks is in everybody's
> >>        interest.
> >>      o */A note on sparse data/*: H2O sports about 15 different
> >>        compression schemes under the hood, including ones designed to
> >>        compress sparse data.  We happily import SVMLight without ever
> >>        having the data "blow up", while still fully supporting the
> >>        array-access API, including the speed guarantees.
> >>      o */A note on missing data/*: Most datasets have *missing*
> >>        elements, and most math algorithms deal with missing data
> >>        specially.  H2O fully supports a notion of "NA" for all data,
> >>        including setting, testing, selecting in (or out), etc., and
> >>        this notion is woven through the data-presentation layer.
> >>      o */A note on streaming data/*: H2O vectors can have data inserted
> >>        & removed (anywhere, in any order) continuously.  In particular,
> >>        it's easy to add new data at the end and remove it from the
> >>        start - i.e., build a large rolling dataset holding all the
> >>        elements that fit given a memory budget and a data flow-rate.
> >>        This has been on our roadmap for awhile, and needs only a little
> >>        more work to be fully functional.
> >>
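> >>    As a hedged illustration of the shift/scale/add decompression noted
> >>    above (the one-byte encoding, bias, and scale below are invented for
> >>    the sketch, not one of H2O's actual compression schemes):

```java
public class ColDecompress {
    // Hypothetical 1-byte-per-element compressed column: value = bias + b*scale.
    // Decompression is a scale and an add per element, done in registers
    // inside the hot loop - no decompressed double[] is ever materialized.
    static double sum(byte[] col, double bias, double scale) {
        double s = 0;
        for (byte b : col)
            s += bias + (b & 0xFF) * scale;   // just-in-time decompress + accumulate
        return s;
    }

    public static void main(String[] args) {
        byte[] col = {0, 1, 2, 3};            // 1 byte/element vs. 8 for a double
        System.out.println(sum(col, 10.0, 0.5));
    }
}
```

> >>    The point of the sketch: the loop stays memory-bandwidth bound, so
> >>    the extra arithmetic hides under the cache-miss costs while 8x fewer
> >>    bytes cross the memory bus.
> >>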
> >>  * */Light-weight Map/Reduce layer/*: Map/Reduce is a nice way to write
> >>    blatantly parallel code (although not the only way), and we support
> >>    a particularly fast and efficient flavor.  A Map maps Type A to Type
> >>    B, and a Reduce combines two Type B's into one Type B.  Both Types A
> >>    & B can be a combination of small-data (described as a Plain Old
> >>    Java Object, a POJO) and big-data (described as another giant H2O
> >>    distributed array). Here's an example map from a Type A (a pair of
> >>    columns) to a Type B (a POJO of class MyMR holding various sums):
> >>
> >>    class MyMR extends MRTask<MyMR> {
> >>      double sum0, sum1, sq_sum0;                  // Most things are allowed here
> >>      @Override public void map( double d0, double d1 ) {
> >>        sum0 += d0;  sum1 += d1;  sq_sum0 += d0*d0;  // Again, most any Java code here
> >>      }
> >>      @Override public void reduce( MyMR my ) {    // Combine two MyMRs together
> >>        sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
> >>      }
> >>    }
> >>    new MyMR().doAll( v0, v1 );  // Invoke in-parallel, distributed over Vecs v0 and v1
> >>
> >>    This code will be distributed 'round the cluster, and run at
> >>    memory-bandwidth speeds (on compressed data!) with no further ado.
> >>    There's a lot of mileage possible here that I'm only touching on
> >>    lightly.  Filtering, subsetting, writing results into temp arrays
> >>    that are used on later passes, uniques on billions of rows, and
> >>    ddply-style group-by operations all work in this Map/Reduce
> >>    framework - and all by writing plain old Java.
> >>      o */Scala, and a note on API cleanliness/*: We fully acknowledge
> >>        Java's weaknesses here - this is the Java6-flavor coding style;
> >>        the Java7 style is nicer, but still not as nice as some other
> >>        languages.  We fully embrace & support alternative syntaxes
> >>        over our engine.  In particular, we have an engineer working on
> >>        an in-process Scala interface (amongst others).  We are shifting
> >>        our focus now from the excellent backend to the API side of
> >>        things.  This is a work-in-progress for us, and we are looking
> >>        forward to much improvement over the next year.
> >>
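> >>    The Map/Reduce contract above (map: Type A -> Type B; reduce: two
> >>    Type B's -> one Type B) can be exercised single-node as a plain-Java
> >>    sketch - the class below is a made-up, runnable stand-in for the
> >>    MyMR example, not the actual H2O MRTask API:

```java
public class SumsMR {
    // Type B: a POJO of partial sums. map folds in one row (Type A);
    // reduce combines two partial results, in any order.
    double sum0, sum1, sq_sum0;

    void map(double d0, double d1) { sum0 += d0; sum1 += d1; sq_sum0 += d0 * d0; }

    void reduce(SumsMR my) { sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0; }

    public static void main(String[] args) {
        double[] v0 = {1, 2, 3}, v1 = {4, 5, 6};
        // Split the rows across two "nodes", then combine the partials.
        SumsMR left = new SumsMR(), right = new SumsMR();
        left.map(v0[0], v1[0]); left.map(v0[1], v1[1]);
        right.map(v0[2], v1[2]);
        left.reduce(right);   // the order of reduces doesn't change the result
        System.out.println(left.sum0 + " " + left.sum1 + " " + left.sq_sum0);
    }
}
```

> >>    Because reduce is associative and commutative here, the framework is
> >>    free to split the rows across nodes and combine the partials in
> >>    whatever order the cluster finishes them.
> >>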
> >>  * *Pre-Baked Algorithms Layer*: We have the following algorithms
> >>    pre-baked, fully optimized and full-featured: Generalized Linear
> >>    Modeling, including Logistic Regression plus Gaussian, Gamma,
> >>    Poisson, and Tweedie distributions.  Neural Nets.  Random Forest
> >>    (which scales *out* to all the data in the cluster).  Gradient
> >>    Boosted Machine (again, in-parallel & fully distributed).  PCA.
> >>    KMeans (& variants).  Quantiles (any quantile, computed *exactly*,
> >>    in milliseconds).  All these algorithms support Confusion Matrices
> >>    (with adjustable thresholds), AUC & ROC metrics, and incremental
> >>    test-dataset results on partially trained models during the build
> >>    process.  Within each algorithm, we support a full range of options
> >>    that you'd find in the similar R or SAS package.
> >>      o */A note on some Mahout algorithms/*: We're clearly well suited
> >>        to e.g. SSVD and Co-occurrence and have talked with Ted Dunning
> >>        at length on how they would be implemented in H2O.
> >>
> >>  * *REST / JSON / R / python / Excel / REPL*: The system is externally
> >>    drivable via URLs / REST API calls, with JSON responses.  We use
> >>    REST/JSON from Python to drive our entire testing harness.  We have
> >>    a very nice R package with H2O integrated behind R - you can issue R
> >>    commands to an H2O-backed R "data.frame" - and have all the Big Math
> >>    work on the Big Data in a cluster - including 90% of the typical
> >>    "data munging" workflow.  This same REST/JSON interface also works
> >>    with, e.g., Excel (yes, we have a demo) or shell scripts.  We have
> >>    an R-like language REPL.  We have a pretty web GUI over the
> >>    REST/JSON layer that is suitable for lightweight modeling tasks.
> >>
> >>
> >> Cliff
> >>
> >>
> >>
>



-- 
CEO & co-founder, 0xdata Inc <http://www.0xdata.com/>
+1-408.316.8192
