Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set?
From: stack <saint.ack@gmail.com>
To: general@hadoop.apache.org
Date: Sat, 12 Dec 2009 14:50:53 -0800

You might also consider HBase, particularly if you find that your data is
being updated with some regularity, and especially if the updates are
randomly distributed over the data set. See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
for how to do a fast bulk load of your billions of rows of data.
Yours,
St.Ack

On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote:
> Hi Xueling,
>
> In that case, I would recommend the following:
>
> 1) Put all of your data on HDFS
> 2) Write a MapReduce job that sorts the data by position of match
> 3) As a second output of this job, you can write a "sparse index" -
> basically a set of entries like this:
>
>
>
> where you're basically giving offsets into the file every 10K records or
> so. If you index every 10K records, then 5 billion total will mean
> 500,000 index entries. Each index entry shouldn't be more than 20 bytes,
> so 500,000 entries will be about 10MB. This is super easy to fit into
> memory. (You could probably index every 100th record instead and end up
> with around 1GB, which is still feasible to keep in memory.)
>
> Then to satisfy your count-range query, you can simply scan your
> in-memory sparse index. Some of the indexed blocks will be completely
> included in the range, in which case you just add up the "number of
> entries following" column. The start and finish blocks will be
> partially covered, so you can use the file offset info to load that
> file off HDFS, start reading at that offset, and finish the count.
>
> Total time per query should be <100ms, no problem.
>
> -Todd
>
> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu wrote:
> > Hi Todd:
> >
> > Thank you for your reply.
> >
> > The datasets won't be updated often, but queries against a data set are
> > frequent. The quicker the query, the better. For example, we have done
> > testing on a MySQL database (5 billion records randomly scattered into
> > 24 tables) and the slowest query against the biggest table (400,000,000
> > records) is around 12 minutes. So if any Hadoop product can speed up
> > the search, then that product is what we are looking for.
> >
> > Cheers,
> > Xueling
> >
> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon wrote:
> >
> >> Hi Xueling,
> >>
> >> One important question that can really change the answer:
> >>
> >> How often does the dataset change? Can the changes be merged in bulk
> >> every once in a while, or do you need to actually update them
> >> randomly very often?
> >>
> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,
> >> or 10ms?
> >>
> >> Thanks
> >> -Todd
> >>
> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu wrote:
> >> > Hi there:
> >> >
> >> > I am researching Hadoop to see which of its products suits our need
> >> > for quick queries against large data sets (billions of records per
> >> > set).
> >> >
> >> > The queries will be performed against chip sequencing data. Each
> >> > record is one line in a file. To be clear, a sample record from the
> >> > data set is shown below.
> >> >
> >> > One line (record) looks like: 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC
> >> > U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >
> >> > The highlighted field is called "position of match" and the query we
> >> > are interested in is the # of sequences in a certain range of this
> >> > "position of match". For instance the range can be "position of
> >> > match" > 200 and "position of match" + 36 < 200,000.
> >> >
> >> > Any suggestions on the Hadoop product I should start with to
> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >
> >> > Thanks!
> >> >
> >> > Xueling
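
Todd's sparse-index scheme is concrete enough to sketch in code. The Java
below is a minimal, illustrative version only: it assumes the records have
already been sorted by "position of match" by the MapReduce job, the class
and field names (SparseIndexSketch, IndexEntry, countByScanning) are invented
for this example rather than taken from any Hadoop API, and the HDFS read of
the two boundary blocks is left as a stub instead of a real FileSystem call.

    import java.util.ArrayList;
    import java.util.List;

    // A minimal sketch of the in-memory "sparse index": one entry per block
    // of ~10K sorted records, keyed by "position of match".
    public class SparseIndexSketch {

        // One index entry: where a block starts and how many records it holds.
        static class IndexEntry {
            final long startPosition;   // "position of match" of the block's first record
            final long fileOffset;      // byte offset of that record in the sorted HDFS file
            final long recordsInBlock;  // "number of entries following" up to the next entry

            IndexEntry(long startPosition, long fileOffset, long recordsInBlock) {
                this.startPosition = startPosition;
                this.fileOffset = fileOffset;
                this.recordsInBlock = recordsInBlock;
            }
        }

        private final List<IndexEntry> index = new ArrayList<>(); // sorted by startPosition

        void add(IndexEntry e) { index.add(e); }

        // Count records with lo <= position < hi. Blocks fully inside the range
        // are counted from the index alone; the two boundary blocks would be
        // re-read from HDFS at their fileOffset (stubbed out here).
        long rangeCount(long lo, long hi) {
            long count = 0;
            for (int i = 0; i < index.size(); i++) {
                IndexEntry cur = index.get(i);
                long blockEnd = (i + 1 < index.size())
                        ? index.get(i + 1).startPosition
                        : Long.MAX_VALUE;
                if (blockEnd <= lo || cur.startPosition >= hi) {
                    continue;                       // block entirely outside the range
                } else if (cur.startPosition >= lo && blockEnd <= hi) {
                    count += cur.recordsInBlock;    // block entirely inside: use the index
                } else {
                    count += countByScanning(cur, lo, hi); // boundary block: scan from offset
                }
            }
            return count;
        }

        // Placeholder for opening the sorted file on HDFS at e.fileOffset and
        // counting the records whose position falls in [lo, hi).
        long countByScanning(IndexEntry e, long lo, long hi) {
            return 0; // real code would read at most ~10K records here
        }

        public static void main(String[] args) {
            SparseIndexSketch idx = new SparseIndexSketch();
            // Illustrative entries only: one per 10K records of the sorted data set.
            idx.add(new IndexEntry(0, 0, 10_000));
            idx.add(new IndexEntry(150_000, 512_000, 10_000));
            idx.add(new IndexEntry(320_000, 1_024_000, 10_000));
            System.out.println(idx.rangeCount(100_000, 400_000));
        }
    }

The point of the design is that blocks wholly inside the query range are
counted from the in-memory index alone; only the two partially covered blocks
at the ends need a seek into the sorted file on HDFS, which is what keeps the
per-query cost in the range Todd estimates.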