From: Chris Embree
Reply-To: chris@embree.us
To: user@hadoop.apache.org, Raj Hadoop
Date: Mon, 20 May 2013 13:51:42 -0400
Subject: Re: Low latency data access Vs High throughput of data

I'll take a swing at this one.

Low latency data access: I hit the enter key (or submit button) and I expect results within seconds at most. My database query time should be sub-second.

High throughput of data: I want to scan millions of rows of data and count or sum some subset. I expect this will take a few minutes (or much longer, depending on complexity) to complete. Think batch-style jobs.

Caveats: This is really a map/reduce issue as well. Setting up and processing M/R jobs carries a bit of overhead. There are a couple of projects working now to move toward lower latency data access.

Also, HDFS stores data in blocks and distributes them across many nodes. This means that there will (almost) always be some network data transfer required to get the final answer, and that "slows" things down a bit, depending on throughput and various other factors.

Hope that helps. :)

On Mon, May 20, 2013 at 10:48 AM, Raj Hadoop wrote:
> Hi,
>
> I have a basic question on HDFS. I was reading that HDFS doesn't work well
> with low latency data access. Rather, it is designed for the high throughput
> of data. Can you please explain in simple words the difference between "Low
> latency data access Vs High throughput of data".
>
> Thanks,
> Raj
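
P.S. A rough way to see the latency/throughput distinction from the reply in numbers. This is only a back-of-the-envelope sketch: the `System` class, the two example systems, and every figure in it (startup times, MB/s rates) are made-up assumptions for illustration, not measurements from HDFS or any real cluster. It models total time as a fixed startup cost plus data size divided by sustained throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """Toy model: total time = fixed startup cost + size / sustained rate."""
    name: str
    startup_s: float        # assumed fixed delay before any result comes back
    throughput_mb_s: float  # assumed sustained read rate once data is flowing

    def seconds_for(self, data_mb: float) -> float:
        return self.startup_s + data_mb / self.throughput_mb_s


# Hypothetical figures, chosen only to make the contrast visible.
LOW_LATENCY = System("low-latency store", startup_s=0.01, throughput_mb_s=50.0)
BATCH = System("batch scan", startup_s=30.0, throughput_mb_s=2000.0)


def faster(data_mb: float) -> str:
    """Return the name of whichever toy system finishes first."""
    return min((LOW_LATENCY, BATCH), key=lambda s: s.seconds_for(data_mb)).name


if __name__ == "__main__":
    for size_mb in (1, 1_000_000):
        print(f"{size_mb} MB:")
        for system in (LOW_LATENCY, BATCH):
            print(f"  {system.name}: {system.seconds_for(size_mb):.2f} s")
        print(f"  faster: {faster(size_mb)}")
```

Under these assumed numbers, the small request is dominated by the startup cost, so the low-latency system wins easily. The large scan is dominated by the transfer rate, so the batch system wins despite its slow start. That is the trade-off the reply describes.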