From: Bob Cook
Date: Sun, 16 Oct 2016 22:10:15 -0400
Subject:
Re: Fwd: Extracting ALL Data using multiple java processes
To: user@accumulo.apache.org

Josh,

Thanks. I was able to get TimestampFilter to work for my needs. But I
originally wanted "createdDate", since our application creates that date;
it is known to the user and may differ from the Accumulo timestamp,
depending on when the data actually got processed into Accumulo.

So if I wanted to use the ColumnFamily "createdDate" and its value, what
Java code would I have to write?

I looked at the AccumuloInputFormat class, but I'm confused about how to
specify the "range" for the date range I'm interested in.

Would I use the TimestampFilter class the same way I'm using it with
"scanner.addScanIterator", but calling
"AccumuloInputFormat.addIterator(job, is)" instead, as below?

IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
TimestampFilter.setRange(is, startDate, endDate);
AccumuloInputFormat.addIterator(job, is);

Or could I use

is.addOption("start", startDate);
is.addOption("end", endDate);

NOTE: for me, neither "TimestampFilter.setRange" nor
"TimestampFilter.setStart and TimestampFilter.setEnd" seemed to work.

On Sun, Oct 16, 2016 at 2:05 PM, Josh Elser wrote:
> The TimestampFilter will return only the Keys whose timestamp falls in the
> range you specify. The timestamp is an attribute on every Key, a long value
> which, when not set by the client at write time, is the number of millis
> since the epoch. You specify the numeric range of timestamps you want. This
> is a post-filter operation -- Accumulo must still read all of the data in
> the table.
>
> You need to tell *us* what time component you're actually filtering
> on: the timestamp on each Key, or the createdDate column in each row.
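A minimal sketch of the timestamp-based variant of the snippet above,
assuming the Accumulo 1.x client API (class and parameter names here are
illustrative, not from the thread). One likely reason the setters "didn't
seem to work": the String overloads of setRange/setStart/setEnd expect a
specific date format, while the long overload takes epoch millis directly
and formats the iterator options for you:

```java
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.iterators.user.TimestampFilter;
import org.apache.hadoop.mapreduce.Job;

public class TimestampRangeJob {
    // Attach a TimestampFilter to an AccumuloInputFormat job so only Keys
    // whose timestamp falls in [startMillis, endMillis] are returned.
    public static void configure(Job job, long startMillis, long endMillis) {
        IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
        // Long overload: epoch millis, no date-string parsing involved.
        TimestampFilter.setRange(is, startMillis, endMillis);
        AccumuloInputFormat.addIterator(job, is);
    }
}
```

As Josh notes, this is still a post-filter: every Key in the table is read
server-side and only the matching ones come back.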
>
> MapReduce is likely more efficient for this batch processing (as
> MapReduce is a batch-processing system). See the AccumuloInputFormat class.
>
> Bob Cook wrote:
>
>> All,
>>
>> I'm new to Accumulo and inherited this project to extract all data from
>> Accumulo (assembled as a "document" by RowID) into another web service.
>>
>> So I started with SimpleReadClient.java to "scan" all data, and built a
>> "document" based on the RowID, ColumnFamily and Value, sending this
>> "document" to the service.
>>
>> Example data:
>>
>> ID       CF           CV
>> RowID_1  createdDate  "2015-01-01:00:00:01 UTC"
>> RowID_1  data         "this is a test"
>> RowID_1  title        "My test title"
>>
>> RowID_2  createdDate  "2015-01-01:12:01:01 UTC"
>> RowID_2  data         "this is test 2"
>> RowID_2  title        "My test2 title"
>>
>> ...
>>
>> So my table is pretty simple: RowID, ColumnFamily and Value (no
>> ColumnQualifier).
>>
>> I need to process one billion "old" unique RowIDs (a year's worth of
>> data) on a live system that is ingesting new data at a rate of about
>> 4 million RowIDs a day -- i.e., I need to process data from September
>> 2015 through September 2016, without worrying about new data coming in.
>>
>> So I'm thinking I need to run multiple processes to extract ALL the data
>> in this "date range" to be more efficient. It may also let me run the
>> processes at a lower priority, and at off-hours when traffic is lighter.
>>
>> My issue is how to specify the "range" to scan:
>>
>> 1. Is using the "createdDate" a good idea? If so, how would I specify
>> the range for it?
>>
>> 2. How about the TimestampFilter? If I set start and end to "equal" a
>> day (about 4 million unique RowIDs), will that get me all ColumnFamily
>> and Values for a given RowID? Or could I miss something because its
>> timestamp was the next day? I don't really understand timestamps wrt
>> Accumulo.
>>
>> 3. Does a map-reduce job make sense? If so, how would I specify it?
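One self-contained piece of question 2 -- expressing "a day" as a
timestamp range -- can be sketched with java.time. This assumes the
timestamps were assigned in UTC and uses an end-exclusive range, so
consecutive days tile the year without gaps or overlap:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DayRange {
    // Converts an ISO date like "2015-01-01" into a [start, end)
    // epoch-millis range covering that whole UTC day, suitable for
    // feeding a timestamp-based filter.
    public static long[] utcDayRange(String isoDate) {
        LocalDate day = LocalDate.parse(isoDate);
        long start = day.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long end = day.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return new long[] { start, end };
    }
}
```

Note this only partitions Keys by their own timestamps; as question 2
fears, two columns of the same RowID written on different days would land
in different partitions, so timestamp ranges split the scan but do not
keep rows intact.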
>>
>> Thanks,
>>
>> Bob
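For the createdDate-column approach asked about in the thread, a minimal
client-side sketch, assuming the Accumulo 1.x API (the table name,
helper name and date handling here are illustrative): scan only the
createdDate column family and compare values in the client. Like the
TimestampFilter, this still reads the whole column; a server-side Filter
iterator could push the comparison to the tablet servers instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class CreatedDateScan {
    // Collects RowIDs whose createdDate value falls in [start, end).
    // The lexicographic compare matches chronological order only because
    // the stored dates are zero-padded strings like
    // "2015-01-01:00:00:01 UTC".
    public static List<String> rowIdsInRange(Connector conn, String table,
            String start, String end) throws Exception {
        List<String> rowIds = new ArrayList<>();
        Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
        scanner.fetchColumnFamily(new Text("createdDate"));
        for (Entry<Key, Value> e : scanner) {
            String created = e.getValue().toString();
            if (created.compareTo(start) >= 0 && created.compareTo(end) < 0) {
                rowIds.add(e.getKey().getRow().toString());
            }
        }
        return rowIds;
    }
}
```

Each returned RowID can then be fetched whole with a second scan (or a
BatchScanner over the collected rows), which sidesteps the row-splitting
problem that per-Key timestamps have.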