From: Bob Cook
Date: Sun, 16 Oct 2016 22:10:15 -0400
Subject:
Re: Fwd: Extracting ALL Data using multiple java processes
To: user@accumulo.apache.org

Josh,

Thanks. I was able to get TimestampFilter to work for my needs. But I
originally wanted "createdDate", since our application creates that date;
it is known to the user and may differ from the Accumulo timestamp,
depending on when the data actually got processed into Accumulo.

So if I wanted to use the ColumnFamily "createdDate" and its value, what
Java code would I have to write?

I looked at the AccumuloInputFormat class, but I'm confused about how to
specify the "range" for the date range I'm interested in.

Would I use the TimestampFilter class the same way I'm using it with
"scanner.addScanIterator", but calling
"AccumuloInputFormat.addIterator(job, is)" instead, as below?

IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
TimestampFilter.setRange(is, startDate, endDate);
AccumuloInputFormat.addIterator(job, is);

Or could I use

is.addOption("start", startDate);
is.addOption("end", endDate);

NOTE: for me, neither "TimestampFilter.setRange" nor
"TimestampFilter.setStart and TimestampFilter.setEnd" seemed to work.

On Sun, Oct 16, 2016 at 2:05 PM, Josh Elser wrote:
> The TimestampFilter will return only the Keys whose timestamp falls in the
> range you specify. The timestamp is an attribute on every Key, a long value
> which, when not set by the client at write time, is the number of millis
> since the epoch. You specify the numeric range of timestamps you want. This
> is a post-filter operation -- Accumulo must still read all of the data in
> the table.
>
> You need to tell *us* what time component you're actually filtering
> on: the timestamp on each Key, or the createdDate column in each row.
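A minimal sketch of the timestamp-based variant of the snippet above,
assuming the Accumulo 1.x client API (class and parameter names here are
illustrative, not from the thread). One likely reason the setters "didn't
seem to work": the String overloads of setRange/setStart/setEnd expect a
specific date format, while the long overload takes epoch millis directly
and formats the iterator options for you:

```java
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.iterators.user.TimestampFilter;
import org.apache.hadoop.mapreduce.Job;

public class TimestampRangeJob {
    // Attach a TimestampFilter to an AccumuloInputFormat job so only Keys
    // whose timestamp falls in [startMillis, endMillis] are returned.
    public static void configure(Job job, long startMillis, long endMillis) {
        IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
        // Long overload: epoch millis, no date-string parsing involved.
        TimestampFilter.setRange(is, startMillis, endMillis);
        AccumuloInputFormat.addIterator(job, is);
    }
}
```

As Josh notes, this is still a post-filter: every Key in the table is read
server-side and only the matching ones come back.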
>
> MapReduce is likely more efficient for this batch processing (as
> MapReduce is a batch-processing system). See the AccumuloInputFormat class.
>
> Bob Cook wrote:
>
>> All,
>>
>> I'm new to Accumulo and inherited this project to extract all data from
>> Accumulo (assembled as a "document" by RowID) into another web service.
>>
>> So I started with SimpleReadClient.java to "scan" all data, and built a
>> "document" based on the RowID, ColumnFamily and Value, sending this
>> "document" to the service.
>>
>> Example data:
>>
>> ID       CF           CV
>> RowID_1  createdDate  "2015-01-01:00:00:01 UTC"
>> RowID_1  data         "this is a test"
>> RowID_1  title        "My test title"
>>
>> RowID_2  createdDate  "2015-01-01:12:01:01 UTC"
>> RowID_2  data         "this is test 2"
>> RowID_2  title        "My test2 title"
>>
>> ...
>>
>> So my table is pretty simple: RowID, ColumnFamily and Value (no
>> ColumnQualifier).
>>
>> I need to process one billion "old" unique RowIDs (a year's worth of
>> data) on a live system that is ingesting new data at a rate of about
>> 4 million RowIDs a day -- i.e., I need to process data from September
>> 2015 through September 2016, without worrying about new data coming in.
>>
>> So I'm thinking I need to run multiple processes to extract ALL the data
>> in this "date range" to be more efficient. It may also let me run the
>> processes at a lower priority, and at off-hours when traffic is lighter.
>>
>> My issue is how to specify the "range" to scan:
>>
>> 1. Is using the "createdDate" a good idea? If so, how would I specify
>> the range for it?
>>
>> 2. How about the TimestampFilter? If I set start and end to "equal" a
>> day (about 4 million unique RowIDs), will that get me all ColumnFamily
>> and Values for a given RowID? Or could I miss something because its
>> timestamp was the next day? I don't really understand timestamps wrt
>> Accumulo.
>>
>> 3. Does a map-reduce job make sense? If so, how would I specify it?
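One self-contained piece of question 2 -- expressing "a day" as a
timestamp range -- can be sketched with java.time. This assumes the
timestamps were assigned in UTC and uses an end-exclusive range, so
consecutive days tile the year without gaps or overlap:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DayRange {
    // Converts an ISO date like "2015-01-01" into a [start, end)
    // epoch-millis range covering that whole UTC day, suitable for
    // feeding a timestamp-based filter.
    public static long[] utcDayRange(String isoDate) {
        LocalDate day = LocalDate.parse(isoDate);
        long start = day.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long end = day.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return new long[] { start, end };
    }
}
```

Note this only partitions Keys by their own timestamps; as question 2
fears, two columns of the same RowID written on different days would land
in different partitions, so timestamp ranges split the scan but do not
keep rows intact.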
>>
>> Thanks,
>>
>> Bob
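For the createdDate-column approach asked about in the thread, a minimal
client-side sketch, assuming the Accumulo 1.x API (the table name,
helper name and date handling here are illustrative): scan only the
createdDate column family and compare values in the client. Like the
TimestampFilter, this still reads the whole column; a server-side Filter
iterator could push the comparison to the tablet servers instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class CreatedDateScan {
    // Collects RowIDs whose createdDate value falls in [start, end).
    // The lexicographic compare matches chronological order only because
    // the stored dates are zero-padded strings like
    // "2015-01-01:00:00:01 UTC".
    public static List<String> rowIdsInRange(Connector conn, String table,
            String start, String end) throws Exception {
        List<String> rowIds = new ArrayList<>();
        Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
        scanner.fetchColumnFamily(new Text("createdDate"));
        for (Entry<Key, Value> e : scanner) {
            String created = e.getValue().toString();
            if (created.compareTo(start) >= 0 && created.compareTo(end) < 0) {
                rowIds.add(e.getKey().getRow().toString());
            }
        }
        return rowIds;
    }
}
```

Each returned RowID can then be fetched whole with a second scan (or a
BatchScanner over the collected rows), which sidesteps the row-splitting
problem that per-Key timestamps have.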