From: Todd Lipcon
Date: Wed, 6 Jan 2010 11:32:38 -0800
Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set?
To: general@hadoop.apache.org

Hi Xueling,

Here's a general outline:

My guess is that your "position of match" field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each with an approximately equal number of records.

Next, write a simple MR job which takes in a line of data and outputs the
same line, but with the position-of-match as the key. This will get
partitioned by the above function, so each reducer ends up receiving all of
the records in a given range. In the reducer, simply output every 1000th
position into your "sparse" output file (along with its offset in the
non-sparse output file), and every position into the non-sparse output
file.

In your realtime query server (not part of Hadoop), load the "sparse" file
into RAM and perform binary search, etc. - find the "bins" which the range
endpoints land in, and then open the non-sparse output on HDFS to finish
the count.

Hope that helps.

Thanks
-Todd
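To make the Partitioner step above concrete, here is a rough, untested
sketch. The class name, the LongWritable/Text key-value types, and the
assumed upper bound on "position of match" (roughly the length of the human
genome) are illustrative guesses rather than anything specified in the
thread:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Splits the bounded "position of match" key space into numReduceTasks
 * equal-width ranges, so each reducer receives one contiguous slice of
 * positions. MAX_POSITION is an assumed upper bound, not a value from
 * the thread.
 */
public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {

  private static final long MAX_POSITION = 3200000000L; // assumed bound on "position of match"

  @Override
  public int getPartition(LongWritable position, Text record, int numReduceTasks) {
    long rangeWidth = (MAX_POSITION / numReduceTasks) + 1;  // width of each reducer's slice
    int partition = (int) (position.get() / rangeWidth);
    return Math.min(partition, numReduceTasks - 1);         // clamp if a position exceeds the bound
  }
}

It would be registered with job.setPartitionerClass(PositionRangePartitioner.class).
Equal-width ranges only balance the reducers if positions are roughly
uniform; if they turn out to be heavily skewed, sampling the data to pick
the range boundaries (as Hadoop's TotalOrderPartitioner does) is the usual
alternative.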
On Tue, Jan 5, 2010 at 5:26 PM, Xueling Shu wrote:

> To rephrase the sentence "Or what APIs I should start with for my
> testing?": I mean "Which HDFS APIs should I start to look into for my
> testing?"
>
> Thanks,
> Xueling
>
> On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu wrote:
>
> > Hi Todd:
> >
> > After finishing some tasks I have finally gotten back to HDFS testing.
> >
> > One question about your last reply to this thread: are there any code
> > examples close to your second and third recommendations? Or which APIs
> > should I start with for my testing?
> >
> > Thanks.
> > Xueling
> >
> > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote:
> >
> >> Hi Xueling,
> >>
> >> In that case, I would recommend the following:
> >>
> >> 1) Put all of your data on HDFS.
> >> 2) Write a MapReduce job that sorts the data by position of match.
> >> 3) As a second output of this job, you can write a "sparse index" -
> >>    basically a set of entries like this:
> >>
> >>    <position of match>  <file offset>  <number of entries following>
> >>
> >> where you're basically giving offsets into every 10K records or so. If
> >> you index every 10K records, then 5 billion total will mean 500,000
> >> index entries. Each index entry shouldn't be more than 20 bytes, so
> >> 500,000 entries will be about 10MB. This is super easy to fit into
> >> memory. (You could probably index every 1000th record instead and end
> >> up with around 100MB, still easy to fit in memory.)
> >>
> >> Then, to satisfy your count-range query, you can simply scan your
> >> in-memory sparse index. Some of the indexed blocks will be completely
> >> included in the range, in which case you just add up the "number of
> >> entries following" column. The start and finish blocks will only be
> >> partially covered, so you can use the file offset info to load that
> >> file off HDFS, start reading at that offset, and finish the count.
> >>
> >> Total time per query should be <100ms, no problem.
> >>
> >> -Todd
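To make the lookup described in the quoted message above concrete, here is
a rough, untested sketch of the query-server side: load the sparse index
into RAM, binary-search for the bin containing each range endpoint, add up
the counts of the fully covered bins, and finish the partial bin by seeking
into the sorted file on HDFS. The class and method names, the tab-separated
index layout, the assumption that the sorted output is a single file with
the position as its first field, and the no-straddling assumption noted in
the comments are all illustrative guesses, not part of the thread:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Query-server side: count records whose "position of match" lies in a range. */
public class RangeCounter {

  /** One sparse-index entry: first position in the bin, byte offset of that
   *  record in the sorted data file, and how many records the bin covers. */
  private static class IndexEntry {
    final long position, fileOffset, count;
    IndexEntry(long position, long fileOffset, long count) {
      this.position = position; this.fileOffset = fileOffset; this.count = count;
    }
  }

  private final List<IndexEntry> index = new ArrayList<IndexEntry>();
  private final FileSystem fs;
  private final Path sortedData;  // assumed: a single sorted output file

  public RangeCounter(Configuration conf, Path sparseIndex, Path sortedData) throws IOException {
    this.fs = FileSystem.get(conf);
    this.sortedData = sortedData;
    // Load the whole sparse index into RAM; per the estimate above it is only a few MB.
    BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(sparseIndex)));
    String line;
    while ((line = r.readLine()) != null) {
      String[] f = line.split("\t");  // assumed layout: position \t offset \t count
      index.add(new IndexEntry(Long.parseLong(f[0]), Long.parseLong(f[1]), Long.parseLong(f[2])));
    }
    r.close();
  }

  /** Records with lo <= position < hi. */
  public long countInRange(long lo, long hi) throws IOException {
    return countBelow(hi) - countBelow(lo);
  }

  /** Number of records with position strictly less than x. Assumes records
   *  with equal positions never straddle a bin boundary (the reducer can
   *  enforce this when it writes the index). */
  private long countBelow(long x) throws IOException {
    // Binary search for the bin containing x: last entry whose start position <= x.
    int b = -1, lo = 0, hi = index.size() - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (index.get(mid).position <= x) { b = mid; lo = mid + 1; } else { hi = mid - 1; }
    }
    if (b < 0) return 0;  // x precedes the first indexed position

    long total = 0;
    for (int i = 0; i < b; i++) {  // bins entirely below x: just add their counts
      total += index.get(i).count; // (could precompute prefix sums, but ~500K adds is already fast)
    }

    // Partial bin: seek into the sorted data on HDFS and count records with position < x.
    IndexEntry bin = index.get(b);
    FSDataInputStream in = fs.open(sortedData);
    in.seek(bin.fileOffset);
    BufferedReader r = new BufferedReader(new InputStreamReader(in));
    long seen = 0;
    String line;
    while (seen < bin.count && (line = r.readLine()) != null) {
      long pos = Long.parseLong(line.split("\t", 2)[0]); // assumed: position is the first field
      if (pos >= x) break;
      total++;
      seen++;
    }
    r.close();
    return total;
  }
}

A range count then costs one pass over the in-memory index plus at most two
short reads from HDFS, one per partially covered bin.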
> >> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu wrote:
> >>
> >> > Hi Todd:
> >> >
> >> > Thank you for your reply.
> >> >
> >> > The datasets won't be updated often, but queries against a data set
> >> > are frequent, and the quicker the query the better. For example, we
> >> > have done testing on a MySQL database (5 billion records randomly
> >> > scattered across 24 tables), and the slowest query against the
> >> > biggest table (400,000,000 records) takes around 12 minutes. So if
> >> > any Hadoop product can speed up the search, that product is what we
> >> > are looking for.
> >> >
> >> > Cheers,
> >> > Xueling
> >> >
> >> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon wrote:
> >> >
> >> >> Hi Xueling,
> >> >>
> >> >> One important question that can really change the answer:
> >> >>
> >> >> How often does the dataset change? Can the changes be merged in
> >> >> bulk every once in a while, or do you need to actually update them
> >> >> randomly very often?
> >> >>
> >> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1
> >> >> second, or 10ms?
> >> >>
> >> >> Thanks
> >> >> -Todd
> >> >>
> >> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org> wrote:
> >> >>
> >> >> > Hi there:
> >> >> >
> >> >> > I am researching Hadoop to see which of its products suits our
> >> >> > need for quick queries against large data sets (billions of
> >> >> > records per set).
> >> >> >
> >> >> > The queries will be performed against chip sequencing data. Each
> >> >> > record is one line in a file. To be clear, below is a sample
> >> >> > record from the data set.
> >> >> >
> >> >> > One line (record) looks like:
> >> >> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >> >
> >> >> > The highlighted field is called "position of match", and the
> >> >> > query we are interested in is the number of sequences in a
> >> >> > certain range of this "position of match". For instance, the
> >> >> > range can be "position of match" > 200 and "position of match"
> >> >> > + 36 < 200,000.
> >> >> >
> >> >> > Any suggestions on the Hadoop product I should start with to
> >> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Xueling
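Finally, to connect the record format above to the MapReduce plan at the
top of the thread, here is a minimal, untested sketch of a map function
that pulls out the "position of match" and emits it as the key. The class
name, the 0-based field index, and the assumption that the asterisks around
103570835 are just highlighting rather than part of the data are guesses
from the sample line, not anything confirmed in the thread:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Emits each record keyed by its "position of match" so a range
 * partitioner like the one sketched earlier can route it to the right
 * reducer. The field index is a guess based on the sample record above;
 * adjust it to the real layout.
 */
public class PositionKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private static final int POSITION_FIELD = 7; // 0-based column of the position (assumed)
  private final LongWritable outKey = new LongWritable();

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    // Strip any surrounding '*' characters, which look like highlighting
    // in the sample rather than part of the data.
    String position = fields[POSITION_FIELD].replace("*", "");
    outKey.set(Long.parseLong(position));
    context.write(outKey, line);
  }
}

Wired into a Job together with the partitioner sketched earlier (and
LongWritable/Text as the map output types), this covers the routing half of
the plan; the reducer that writes the sparse and non-sparse outputs is left
out here.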