hadoop-mapreduce-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?
Date Wed, 21 Jan 2015 05:12:50 GMT
bq. Is Apache Spark good as a general database

I don't think Spark itself is a general database, though there are
connectors to various NoSQL databases, including HBase.
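For illustration, a minimal sketch of reading an HBase table into a
Spark RDD through the stock TableInputFormat (the table name "users"
below is just a placeholder, not something from this thread):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

// Point the HBase input format at the table to scan.
// "users" is a hypothetical table name for this example.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users")

// Each record comes back as a (row key, Result) pair.
val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"row count: ${rows.count()}")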

bq. using its graph database features?

Sure. Take a look at http://spark.apache.org/graphx/
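As a rough sketch of the GraphX API for a user network (the vertices,
edges and names below are made-up placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setAppName("graphx-demo"))

// Hypothetical user vertices: (id, username)
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Hypothetical "follows" relationships between users.
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// Rank users by how connected they are within the network.
val ranks = graph.pageRank(tol = 0.001).vertices
ranks.join(users).collect().foreach { case (_, (rank, name)) =>
  println(f"$name%-8s $rank%.3f")
}

connectedComponents() and triangleCount() are available on the same
Graph object if you end up needing them.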

Cheers

On Tue, Jan 20, 2015 at 9:02 PM, Alec Taylor <alec.taylor6@gmail.com> wrote:

> Small amounts in a one-node cluster (at first).
>
> As it scales I'll be looking at running various O(nk) algorithms,
> where n is the number of distinct users and k is the number of
> overlapping features I want to consider.
>
> Is Apache Spark good as a general database, as well as for its
> fancier features? E.g.: considering I'm building a network, maybe
> using its graph database features?
>
> On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > Apache Spark supports integration with HBase (which has REST API).
> >
> > What's the amount of data you want to store in this system ?
> >
> > Cheers
> >
> > On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <alec.taylor6@gmail.com>
> wrote:
> >>
> >> I am architecting a platform incorporating: recommender systems,
> >> information retrieval (ML), sequence mining, and Natural Language
> >> Processing.
> >>
> >> Additionally I have the generic CRUD and authentication components,
> >> with everything exposed RESTfully.
> >>
> >> For the storage layer(s), there are a few options which immediately
> >> present themselves:
> >>
> >> Generic CRUD layer (high speed needed here, though I suppose I could use
> >> Redis…)
> >>
> >> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
> >> SQL layer atop
> >> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
> >> - MongoDB (or a similar document-store), a graph-database, or even
> >> something like Postgres
> >>
> >> Analytics layer (to enable Big Data / Data-intensive computing features)
> >>
> >> - Apache Spark
> >> - Hadoop with MapReduce and/or utilising some other Apache /
> >> non-Apache project with integration
> >> - Disco (from Nokia)
> >>
> >> ________________________________
> >>
> >> Should I prefer one layer (e.g. on HDFS) over multiple disparate
> >> layers? The advantage here is obvious, but I am certain there are
> >> disadvantages. (And yes, I know there are various ways, automated
> >> and manual, to push data from non-HDFS-backed stores to HDFS.)
> >>
> >> Also, as a bonus answer, which stack would you recommend for this
> >> user-network I'm building?
> >
> >
>
