Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of alec.taylor6@gmail.com
 designates 209.85.216.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALte62yOv+1etagzn9o5Ggd92oix7bCVA7oU6atu2vRkOQbvbg@mail.gmail.com>
References: 
 <CAO+9iGe4sVcNrLrJbqyk77OCkJmuNRUX2awLoAcB25OpE+DKwQ@mail.gmail.com>
	<CALte62yOv+1etagzn9o5Ggd92oix7bCVA7oU6atu2vRkOQbvbg@mail.gmail.com>
Date: Wed, 21 Jan 2015 16:02:21 +1100
Message-ID: 
 <CAO+9iGeF-x0F+TOCP6UxD10tCjRu5XAAjUif2uvQfvfx_Z3kcQ@mail.gmail.com>
Subject: Re: Low-latency queries, HDFS exclusively or should I go,
 e.g.: MongoDB?
From: Alec Taylor <alec.taylor6@gmail.com>
To: user@hadoop.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Small amounts, on a single-node cluster (at first).

As it scales I'll be looking at running various O(nk) algorithms,
where n is the number of distinct users and k is the number of
overlapping features I want to consider.
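
As a rough illustration of that O(nk) shape, here is a minimal Python
sketch. It is not code from the platform itself: the function name, the
sample users, and the feature names are all hypothetical, and it assumes
each user's features fit in an in-memory set. It makes one pass over the
n users and checks each of the k features of interest, building a
feature -> users index.

```python
from collections import defaultdict


def build_feature_index(users, features):
    """Map each feature of interest to the set of users who have it.

    Hypothetical helper: `users` maps a user id to a set of features,
    `features` is the list of k features to consider. The nested loops
    perform n * k set-membership checks, i.e. O(nk) on average.
    """
    index = defaultdict(set)
    for user_id, user_features in users.items():  # n iterations
        for feature in features:                  # k iterations
            if feature in user_features:
                index[feature].add(user_id)
    return index


# Made-up sample data, purely for illustration.
users = {
    "alice": {"hadoop", "spark"},
    "bob": {"spark", "mongodb"},
    "carol": {"hadoop"},
}
features = ["hadoop", "spark"]

index = build_feature_index(users, features)
```

Users who share a feature end up in the same set, which is one simple
starting point for finding overlaps between users in a network.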

Is Apache Spark suitable as a general-purpose database, in addition to
its fancier features? For example, since I'm building a network, could
I use its graph database features?

On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> Apache Spark supports integration with HBase (which has a REST API).
>
> What's the amount of data you want to store in this system?
>
> Cheers
>
> On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <alec.taylor6@gmail.com> wrote:
>>
>> I am architecting a platform incorporating: recommender systems,
>> information retrieval (ML), sequence mining, and Natural Language
>> Processing.
>>
>> Additionally I have the generic CRUD and authentication components,
>> with everything exposed RESTfully.
>>
>> For the storage layer(s), there are a few options which immediately
>> present themselves:
>>
>> Generic CRUD layer (high speed needed here, though I suppose I could use
>> Redis…)
>>
>> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> SQL layer atop
>> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> - MongoDB (or a similar document-store), a graph-database, or even
>> something like Postgres
>>
>> Analytics layer (to enable Big Data / Data-intensive computing features)
>>
>> - Apache Spark
>> - Hadoop with MapReduce and/or utilising some other Apache /
>> non-Apache project with integration
>> - Disco (from Nokia)
>>
>> ________________________________
>>
>> Should I prefer one layer (e.g. on HDFS) over multiple disparate
>> layers? The advantage here is obvious, but I am certain there are
>> disadvantages. (And yes, I know there are various ways, automated and
>> manual, to push data from non-HDFS-backed stores to HDFS.)
>>
>> Also, as a bonus answer, which stack would you recommend for this
>> user-network I'm building?