From: Fabian Hueske
To: user@flink.apache.org
Date: Mon, 8 Jun 2015 23:29:57 +0200
Subject: Re: Reading from HBase problem

Hi Hilmi,

I see two possible reasons:

1) The data source / InputFormat is not working properly, so not all HBase records are read or forwarded, or
2) The aggregation / count is buggy.

Robert's suggestion uses an alternative mechanism to do the count. In fact, you can count with groupBy(0).sum(1) and with accumulators at the same time.
If both counts are the same, this indicates that the aggregation is correct and hints that the HBase format is faulty.
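
As a minimal sketch of this cross-check in plain Java (deliberately not Flink code; the class name `CountCrossCheck`, the method `countBothWays`, and the use of a `LongAdder` as a stand-in for the accumulator are assumptions made only for illustration), the same rows are counted once through a grouped sum on a dummy key and once through an independent side counter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Plain-Java stand-in for the cross-check; it does not use any Flink API.
final class CountCrossCheck {

    // Returns { count from the grouped sum, count from the side counter }.
    static long[] countBothWays(List<String> rows) {
        // Stand-in for the accumulator: incremented once per row,
        // independently of the aggregation below.
        LongAdder sideCounter = new LongAdder();

        // Stand-in for groupBy(0).sum(1): every row contributes 1L under
        // the dummy key "a", and the values are summed per key.
        Map<String, Long> summed = rows.parallelStream()
                .peek((String row) -> sideCounter.increment())
                .collect(Collectors.groupingBy(
                        (String row) -> "a",
                        Collectors.summingLong((String row) -> 1L)));

        return new long[] { summed.getOrDefault("a", 0L), sideCounter.sum() };
    }

    public static void main(String[] args) {
        long[] counts = countBothWays(Arrays.asList("row-1", "row-2", "row-3"));
        System.out.println("grouped sum = " + counts[0]
                + ", side counter = " + counts[1]);
        // Equal counts point at the input side; a mismatch points at the aggregation.
        System.out.println(counts[0] == counts[1] ? "counts agree" : "counts differ");
    }
}
```

In the real job the side counter would be an accumulator registered inside the flatMap function, and the comparison would be made after the job has finished.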

In any case, it would be very good to know your findings. Please keep us updated.

One more hint: if you want to do a full aggregate, you don't have to use a "dummy" key like "a". Instead, you can work with Tuple1<Long> and directly call sum(0) without doing the groupBy().
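
The idea behind this hint can be sketched in plain Java (again not Flink code; `FullAggregate` and `countRows` are hypothetical names): each row contributes a single value of 1, and one global sum replaces the grouping on a constant key.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a full (ungrouped) aggregate; no Flink API involved.
final class FullAggregate {

    // Stand-in for mapping every row to a one-field tuple holding 1L and
    // summing field 0 over the whole data set, with no grouping key.
    static long countRows(List<String> rows) {
        return rows.stream().mapToLong((String row) -> 1L).sum();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row-1", "row-2", "row-3", "row-4");
        System.out.println("row count = " + countRows(rows));
    }
}
```

Because no key is involved, there is no dummy value to carry along with each record.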

Best, Fabian

2015-06-08 17:36 GMT+02:00 Robert Metzger <rmetzger@apache.org>:
Hi Hilmi,

if you just want to count the number of elements, you can also use accumulators, as described here [1].
They are much more lightweight.

So you need to make your flatMap function a RichFlatMapFunction, then call getRuntimeContext() to register an accumulator.
Use a long accumulator to count the elements.

If the results with the accumulator are consistent (the exact element count), then there is a severe bug in Flink. But I suspect that the accumulator will give you the same result (off by +-5).

Best,
Robert

[1]: http://slideshare.net/robertmetzger1/apache-flink-hands-on

On Mon, Jun 8, 2015 at 3:04 PM, Hilmi Yildirim <hilmi.yildirim@neofonie.de> wrote:
Hi,
I implemented a simple Flink batch job which reads from an HBase cluster of 13 machines with nearly 100 million rows. The HBase version is 1.0.0-cdh5.4.1, so I imported hbase-client 1.0.0-cdh5.4.1.
I implemented a flatMap which creates a tuple ("a", 1L) for each row. Then I use groupBy(0).sum(1).writeAsText. The result should be the number of rows, but it is not correct. I ran the job multiple times and the result fluctuates by +-5. I also ran the job on a smaller table with 100,000 rows, and there the result is correct.

Does anyone know the reason for that?

Best Regards,
Hilmi

--
Hilmi Yildirim
Software Developer R&D

http://www.neofonie.de

Visit the Neo Tech Blog for users:
http://blog.neofonie.de/

Follow us:
https://plus.google.com/+neofonie
http://www.linkedin.com/company/neofonie-gmbh
https://www.xing.com/companies/neofoniegmbh

Neofonie GmbH | Robert-Koch-Platz 4 | 10115 Berlin
Commercial Register Berlin-Charlottenburg: HRB 67460
Managing Director: Thomas Kitlitschko


