accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Fwd: Extracting ALL Data using multiple java processes
Date Sun, 16 Oct 2016 18:05:21 GMT
The TimestampFilter will return only the Keys whose timestamps fall in 
the range you specify. The timestamp is an attribute on every Key: a 
long value which, when not set by the client at write time, is the 
number of milliseconds since the epoch. You specify the numeric range of 
timestamps you want. This is a post-filter operation -- Accumulo must 
still read all of the data in the table.
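As a minimal sketch of the above (assuming an Accumulo 1.x client; the table name "mytable", the iterator name/priority, and the Connector setup are placeholders, not from the original thread):

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.TimestampFilter;
import org.apache.accumulo.core.security.Authorizations;

public class DayScan {
  // Scan one day's worth of Keys by attaching a TimestampFilter to a Scanner.
  public static void scanDay(Connector conn, long dayStartMillis) throws Exception {
    // End of the same day, inclusive (TimestampFilter bounds are inclusive by default).
    long dayEndMillis = dayStartMillis + (24L * 60 * 60 * 1000) - 1;

    IteratorSetting is = new IteratorSetting(50, "tsfilter", TimestampFilter.class);
    TimestampFilter.setRange(is, dayStartMillis, dayEndMillis);

    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
    scanner.addScanIterator(is);
    for (Entry<Key,Value> e : scanner) {
      // Assemble the "document" for e.getKey().getRow() here.
    }
  }
}
```

Note that the filter is applied per Key, not per row: if the columns of one 
row were written on different days, a single day's range can return only 
some of that row's columns.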

You need to tell *us* which time component you're actually filtering 
on: the timestamp on each Key, or the createdDate column in each row.

MapReduce is likely a more efficient way to do this batch processing (as 
MapReduce is a batch-processing system). See the AccumuloInputFormat class.
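A rough sketch of the job setup (Accumulo 1.x MapReduce API; the instance 
name, zookeeper hosts, credentials, table name, and the example timestamp 
bounds are all placeholders):

```java
import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.iterators.user.TimestampFilter;
import org.apache.hadoop.mapreduce.Job;

public class ExtractJob {
  public static Job configure() throws Exception {
    Job job = Job.getInstance();
    job.setInputFormatClass(AccumuloInputFormat.class);

    AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("pass"));
    AccumuloInputFormat.setZooKeeperInstance(job,
        ClientConfiguration.loadDefault()
            .withInstance("myInstance").withZkHosts("zk1:2181"));
    AccumuloInputFormat.setInputTableName(job, "mytable");

    // Push the timestamp filter server-side so the mappers only
    // receive Keys from the desired range (bounds here are
    // 2015-09-01 00:00:00.000 to 23:59:59.999 UTC, as an example).
    IteratorSetting is = new IteratorSetting(50, "tsfilter", TimestampFilter.class);
    TimestampFilter.setRange(is, 1441065600000L, 1441151999999L);
    AccumuloInputFormat.addIterator(job, is);

    return job;
  }
}
```

The input format creates one split per tablet, so the work is parallelized 
across mappers without any manual range bookkeeping.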

Bob Cook wrote:
> All,
>
> I'm new to accumulo and inherited this project to extract all data from
> accumulo (assembled as a "document" by RowID) into another web service.
>
> So I started with SimpleReadClient.java to "scan" all data, and built a
> "document" based on the RowID, ColumnFamily and Value. Sending
> this "document" to the service.
> Example data.
> ID CF Value
> RowID_1 createdDate "2015-01-01:00:00:01 UTC"
> RowID_1 data "this is a test"
> RowID_1 title "My test title"
>
> RowID_2 createdDate "2015-01-01:12:01:01 UTC"
> RowID_2 data "this is test 2"
> RowID_2 title "My test2 title"
>
> ...
>
> So my table is pretty simple: RowID, ColumnFamily and Value (no
> ColumnQualifier).
>
> I need to process one billion "OLD" unique RowIDs (a year's worth of
> data) on a live system that is ingesting new data at a rate of about
> 4 million RowIDs a day.
> i.e. I need to process data from September 2015 - September 2016, not
> worrying about new data coming in.
>
> So I'm thinking I need to run multiple processes to extract ALL the data
> in this "data range" to be more efficient.
> Also, it may allow me to run the processes at a lower priority and at
> off-hours of the day when traffic is less.
>
> My issue is how do I specify the "range" to scan.
>
> 1. Is using the "createdDate" a good idea? If so, how would I specify the
> range for it?
>
> 2. How about the TimestampFilter? If I specify my start to end to
> "equal" a day (about 4 million unique RowIDs),
> will this get me all ColumnFamily and Values for a given RowID? Or
> could I miss something because its timestamp
> was the next day? I don't really understand timestamps wrt Accumulo.
>
> 3. Does a map-reduce job make sense? If so, how would I specify it?
>
>
> Thanks,
>
> Bob
>
