Date: Wed, 21 Jan 2015 16:02:21 +1100
Subject: Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?
From: Alec Taylor
To: user@hadoop.apache.org

Small amounts in a one-node cluster (at first). As it scales I'll be
looking at running various O(nk) algorithms, where n is the number of
distinct users and k is the number of overlapping features I want to
consider.

Is Apache Spark good as a general database as well as for its fancier
features? E.g., considering I'm building a network, maybe using its
graph database features?

On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu wrote:
> Apache Spark supports integration with HBase (which has a REST API).
>
> What's the amount of data you want to store in this system?
>
> Cheers
>
> On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor wrote:
>>
>> I am architecting a platform incorporating: recommender systems,
>> information retrieval (ML), sequence mining, and Natural Language
>> Processing.
>>
>> Additionally I have the generic CRUD and authentication components,
>> with everything exposed RESTfully.
>>
>> For the storage layer(s), there are a few options which immediately
>> present themselves:
>>
>> Generic CRUD layer (high speed needed here, though I suppose I could use
>> Redis…)
>>
>> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> SQL layer atop
>> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> - MongoDB (or a similar document store), a graph database, or even
>> something like Postgres
>>
>> Analytics layer (to enable Big Data / data-intensive computing features)
>>
>> - Apache Spark
>> - Hadoop with MapReduce and/or utilising some other Apache /
>> non-Apache project with integration
>> - Disco (from Nokia)
>>
>> ________________________________
>>
>> Should I prefer one layer (e.g., on HDFS) over multiple disparate
>> layers? The advantage here is obvious, but I am certain there are
>> disadvantages. (And yes, I know there are various ways, automated and
>> manual, to push data from non-HDFS-backed stores to HDFS.)
>>
>> Also, as a bonus answer, which stack would you recommend for this
>> user-network I'm building?