From: Patrick Schless
Date: Mon, 4 Nov 2013 17:03:08 -0600
Subject: Scanner Caching with wildly varying row widths
To: user

We have an application where a row can contain anywhere between 1 and 3,600,000 cells (there's only 1 column family). In practice, most rows have under 100 cells.

Now we want to run some MapReduce jobs that touch every cell within a range (e.g., count how many cells we have). With scanner caching set to something like 250, the job will chug along for a long time until it hits a row with a lot of data, and then it will die. Setting the cache size down to 1 (row) would presumably work, but would take forever to run.

We have addressed this by writing some jobs that use coprocessors, which allow us to pull back sets of cells instead of sets of rows, but this means we can't use any of the built-in jobs that come with HBase (e.g., copyTable).

Is there any way around this? Have other people had to deal with such high variability in their row sizes?
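
To make the failure mode concrete, here is a rough back-of-the-envelope model in Python (not HBase code) of how many cells a single scanner fetch can carry. It assumes that the caching setting counts results per fetch, and that a per-result cell limit (as `Scan.setBatch` is understood to provide) splits a wide row into several results. The function name `cells_per_fetch`, the batch size of 1000, and the sample row widths are illustrative assumptions, not figures from the post beyond the 250 caching value and the 3,600,000-cell maximum.

```python
from typing import Iterable, List, Optional


def cells_per_fetch(row_widths: Iterable[int], caching: int,
                    batch: Optional[int] = None) -> List[int]:
    """Return the number of cells carried by each simulated scanner fetch.

    Simplified model (an assumption, not the HBase client's internals):
    - each fetch returns up to `caching` results;
    - without `batch`, one result is one whole row;
    - with `batch`, a row is split into results of at most `batch` cells.
    """
    if caching < 1:
        raise ValueError("caching must be >= 1")
    if batch is not None and batch < 1:
        raise ValueError("batch must be >= 1 when given")

    fetches: List[int] = []
    results_in_fetch = 0
    cells_in_fetch = 0
    for width in row_widths:
        remaining = width
        while remaining > 0:
            # Take the whole row, or at most `batch` cells of it.
            take = remaining if batch is None else min(batch, remaining)
            remaining -= take
            cells_in_fetch += take
            results_in_fetch += 1
            if results_in_fetch == caching:
                # The fetch is full: record it and start a new one.
                fetches.append(cells_in_fetch)
                results_in_fetch = 0
                cells_in_fetch = 0
    if results_in_fetch:
        fetches.append(cells_in_fetch)
    return fetches


# Hypothetical data: 249 narrow rows followed by one maximally wide row.
rows = [50] * 249 + [3_600_000]

print(max(cells_per_fetch(rows, caching=250)))              # 3612450
print(max(cells_per_fetch(rows, caching=250, batch=1000)))  # 250000
```

Under this model, row-based caching lets one wide row dominate a fetch, while a per-result cell cap bounds every fetch at caching × batch cells. Whether jobs that expect one result per row tolerate rows being split this way is a separate question that the model does not address.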