From: Jack Levin
Subject: Re: Using external indexes in an HBase Map/Reduce job...
Date: Tue, 12 Oct 2010 13:06:37 -0700
To: user@hbase.apache.org

What I would do is load the 100K key values into Hive and do a simple join to produce your output to a local file or another Hive table, which can be auto-loaded into HBase. This process will run as MapReduce, will be quick, and won't do any scans.

-Jack

On Oct 12, 2010, at 12:53 PM, Michael Segel wrote:

>
> All,
>
> Let me clarify ...
>
> The ultimate data we want to process is in HBase.
>
> The data qualifiers are not part of the row key, so you would have to do a full table scan to get the data.
> (A full table scan of 1 billion rows just to find a subset of 100K rows?)
>
> So the idea is: what if I got the set of row keys that I want to process from an external source?
> I don't mention the source, because it's not important.
>
> What I am looking at is that at the start of my program, I have this Java List object that contains my 100K record keys for the records I want to fetch.
>
> So how can I write an M/R job that allows me to split and fetch based on an object, and not a file or an HFile, for input?
>
> Let me give you a concrete but imaginary example...
>
> I have combined all of the DMV vehicle registrations for all of the US.
>
> I want to find all of the cars that are registered to somebody with the last name Smith.
>
> Since the owner's last name isn't part of the row key, I have to do a full table scan. (Not really efficient.)
>
> Suppose I have an external index. I get the list of row keys in a List object.
>
> Now I want to process the list in an M/R job.
>
> So what's the best way to do it?
>
> Can you use an object to feed into an M/R job? (And that's the key point I'm trying to solve.)
>
> Does that make sense?
>
> -Mike
>
>> Subject: Re: Using external indexes in an HBase Map/Reduce job...
>> From: mdl@mlogiciels.com
>> Date: Tue, 12 Oct 2010 11:53:11 -0700
>> To: user@hbase.apache.org
>>
>> I've been reading this thread, and I'm still not clear on what the problem is. I saw your original post, but was unclear then as well.
>>
>> Please correct me if I'm wrong. It sounds like you want to run an M/R job on some data that resides in a table in HBase. But, since the table is so large, the M/R job would take a long time to process the entire table, so you want to process only the relevant subset. It also sounds like, since you need M/R, the relevant subset is too large to fit in memory and needs a distributed solution. Is this correct so far?
>>
>> A solution exists: scan filters. The individual region servers filter the data.
>>
>> When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as an input. The Scan object can have a filter which is run on the individual region server to limit the data that gets sent to the job. I've written my own filters as well, which are quite simple. But it is a bit of a pain, because you have to make sure the custom filter is in the classpath of the servers. I've used it to randomly select a subset of data from HBase for quick test runs of new M/R jobs.
>>
>> You might be able to use existing filters. I recommend taking a look at the RowFilter as a starting point. I haven't used it, but it takes a WritableByteArrayComparable, which could possibly be extended to be based on a bloom filter or a list.
>>
>> -Matthew
>>
>> On Oct 12, 2010, at 10:55 AM, jason wrote:
>>
>>>> What I can say is that I have a billion rows of data.
>>>> I want to pull a specific 100K rows from the table.
>>>
>>> Michael, I think I have exactly the same use case. Even the numbers are the same.
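A minimal sketch of the region-side filtering Matthew describes, assuming the 2010-era HBase Java client: the Scan handed to TableMapReduceUtil.initTableMapperJob carries a RowFilter whose WritableByteArrayComparable is extended, as he suggests, to match against a list of keys. The table name "vehicles" (from Mike's DMV example), the KeySetComparator class, and loadKeysFromExternalIndex() are illustrative placeholders, not anything from the thread, and the comparator jar has to be on every region server's classpath, as Matthew warns.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilteredScanJob {

  /** Keeps only rows whose key is in a fixed set; must be on the region servers' classpath. */
  public static class KeySetComparator extends WritableByteArrayComparable {
    private Set<String> keys = new HashSet<String>();

    public KeySetComparator() { }                         // no-arg constructor needed for Writable
    public KeySetComparator(Set<String> keys) { this.keys = keys; }

    @Override
    public int compareTo(byte[] rowKey) {
      // RowFilter(EQUAL, comparator) keeps a row exactly when this returns 0.
      return keys.contains(Bytes.toString(rowKey)) ? 0 : 1;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(keys.size());
      for (String k : keys) out.writeUTF(k);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      keys = new HashSet<String>();
      for (int i = 0, n = in.readInt(); i < n; i++) keys.add(in.readUTF());
    }
  }

  /** Only rows that survive the region-side filter ever reach the mapper. */
  public static class ExtractMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      String key = Bytes.toString(row.get(), row.getOffset(), row.getLength());
      ctx.write(new Text(key), new Text(value.toString()));
    }
  }

  // Placeholder for however your external index hands you the 100K keys.
  private static Set<String> loadKeysFromExternalIndex() {
    return new HashSet<String>();
  }

  public static void main(String[] args) throws Exception {
    Scan scan = new Scan();
    scan.setCaching(500);                                 // fewer scanner round trips
    scan.setFilter(new RowFilter(CompareOp.EQUAL,
        new KeySetComparator(loadKeysFromExternalIndex())));

    Job job = new Job(HBaseConfiguration.create(), "filtered-scan-100k");
    job.setJarByClass(FilteredScanJob.class);
    TableMapReduceUtil.initTableMapperJob("vehicles", scan,
        ExtractMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0);                             // map-only extract
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that this still scans every region; the filter only saves the network and map-side cost of the non-matching rows, and shipping 100K keys inside the filter to every region is a fair amount of Writable payload, which is presumably why Matthew also floats the bloom-filter variant.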
>>>
>>> I posted a similar question a couple of weeks ago, but unfortunately
>>> did not get a definite answer:
>>>
>>> http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%3CAANLkTi=8kQzTVppQyMfAPssLMmsEemL-Cwwq38h=d87E@mail.gmail.com%3E
>>>
>>> So far, I decided to put HBase aside and experiment with Hadoop
>>> directly, using its BloomMapFile and its ability to quickly discard
>>> files that do not contain requested keys.
>>> This implies that I have to have a custom InputFormat for that, many
>>> input map files, and a sorted list of input keys.
>>>
>>> I do not have any performance numbers yet to compare this approach to
>>> the full scan, but I am writing tests as we speak.
>>>
>>> Please keep me posted if you find a good solution for this problem in
>>> general (M/R scanning through a random key subset, either based on
>>> HBase or Hadoop).
>>>
>>>
>>> On 10/12/10, Michael Segel wrote:
>>>>
>>>> Dave,
>>>>
>>>> It's a bit more complicated than that.
>>>>
>>>> What I can say is that I have a billion rows of data.
>>>> I want to pull a specific 100K rows from the table.
>>>>
>>>> The row keys are not contiguous, and you could say they are 'random' such
>>>> that if I were to do a table scan, I'd have to scan the entire table (all
>>>> regions).
>>>>
>>>> Now if I had a list of the 100K rows, from a single client I could just
>>>> create 100 threads and grab rows from HBase one at a time in each thread.
>>>>
>>>> But in an M/R job, I can't really do that. (I want to do processing on the data
>>>> I get returned.)
>>>>
>>>> So given a List object with the row keys, how do I do a map/reduce with this
>>>> list as the starting point?
>>>>
>>>> Sure, I could write it to HDFS and then do an M/R job reading from the file and
>>>> setting my own splits to control parallelism.
>>>> But I'm hoping for a more elegant solution.
>>>>
>>>> I know that it's possible, but I haven't thought it out... Was hoping someone
>>>> else had this solved.
>>>>
>>>> thx
>>>>
>>>>> From: buttler1@llnl.gov
>>>>> To: user@hbase.apache.org
>>>>> Date: Tue, 12 Oct 2010 08:35:25 -0700
>>>>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>>>>>
>>>>> Sorry, I am not clear on exactly what you are trying to accomplish here.
>>>>> I have a table roughly of that size, and it doesn't seem to cause me any
>>>>> trouble. I also have a few separate Solr indexes for data in the table
>>>>> for query -- the Solr query syntax is sufficient for my current needs.
>>>>> This setup allows me to do three things efficiently:
>>>>> 1) batch processing of all records (e.g. tagging records that match a
>>>>> particular criterion)
>>>>> 2) search/lookup from a UI in an online manner
>>>>> 3) it is also fairly easy to insert a bunch of records (keeping track of
>>>>> their keys), and then run various batch processes only over those new
>>>>> records -- essentially doing what you suggest: create a file of keys and
>>>>> split the map task over that file.
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>>> Sent: Tuesday, October 12, 2010 5:36 AM
>>>>> To: hbase-user@hadoop.apache.org
>>>>> Subject: Using external indexes in an HBase Map/Reduce job...
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....
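Going back to the Hadoop-only route jason describes above, a rough sketch of the kind of per-file skipping BloomMapFile enables, assuming Text keys and values and one BloomMapFile per part directory; the paths and types are assumptions for illustration, not details from his mail.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class BloomLookup {

  /** Probes every part file for each wanted key; the bloom filter makes misses cheap. */
  public static void lookupAll(Configuration conf, Path dataDir, List<String> sortedKeys)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus part : fs.listStatus(dataDir)) {        // one BloomMapFile per part
      BloomMapFile.Reader reader =
          new BloomMapFile.Reader(fs, part.getPath().toString(), conf);
      try {
        Text value = new Text();
        for (String key : sortedKeys) {                     // sorted keys keep the reads roughly sequential
          // get() consults the bloom filter first; a definite miss returns null
          // without touching the MapFile index or data blocks.
          if (reader.get(new Text(key), value) != null) {
            System.out.println(key + "\t" + value);         // stand-in for real processing
          }
        }
      } finally {
        reader.close();
      }
    }
  }
}

In jason's setup this per-file skipping would sit behind the custom InputFormat he mentions, so the probing runs in parallel across mappers rather than in one client.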
>>>>>
>>>>> Came across this problem and I was wondering how others solved it.
>>>>>
>>>>> Suppose you have a really large table with 1 billion rows of data.
>>>>> Since HBase really doesn't have any indexes built in (Don't get me started
>>>>> about the contrib/transactional stuff...), you're forced to use some sort
>>>>> of external index, or roll your own index table.
>>>>>
>>>>> The net result is that you end up with a list object that contains your
>>>>> result set.
>>>>>
>>>>> So the question is... what's the best way to feed the list object in?
>>>>>
>>>>> One option I thought about is writing the object to a file and then using
>>>>> it as the file input and controlling the splits. Not the most efficient,
>>>>> but it would work.
>>>>>
>>>>> Was trying to find a more 'elegant' solution, and I'm sure that anyone
>>>>> using SOLR or LUCENE or whatever... has come across this problem too.
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> Thx
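Finally, a rough sketch of the "write the list to a file and control the splits" option Mike describes above, assuming a Hadoop build that ships the new-API NLineInputFormat and the 2010-era HBase client; the table name "vehicles", the "d:owner" column, and the 1,000-keys-per-mapper figure are made-up placeholders.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyListFetchJob {

  public static class GetMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HTable table;

    @Override
    protected void setup(Context ctx) throws IOException {
      table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()), "vehicles");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Each input line is one row key written out from the external index.
      String key = line.toString().trim();
      Result row = table.get(new Get(Bytes.toBytes(key)));
      if (!row.isEmpty()) {
        byte[] owner = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("owner"));
        ctx.write(new Text(key), new Text(owner == null ? "" : Bytes.toString(owner)));
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0]: HDFS text file with one row key per line; args[1]: output directory.
    Job job = new Job(HBaseConfiguration.create(), "fetch-by-key-list");
    job.setJarByClass(KeyListFetchJob.class);
    job.setMapperClass(GetMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    NLineInputFormat.setNumLinesPerSplit(job, 1000);      // ~100 mappers for 100K keys
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper issues point Gets for its slice of the key file, so the billion-row scan is replaced by roughly 100K random reads, and the lines-per-split setting is the knob for the parallelism Mike wants to control.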