accumulo-user mailing list archives

From Josh Elser <>
Subject Re: Fwd: Extracting ALL Data using multiple java processes
Date Mon, 17 Oct 2016 03:59:27 GMT
There should be a static setRanges(Configuration, Collection<Range>) 
method somewhere in the type hierarchy of AccumuloInputFormat which lets 
you specify the Range[s].

If you cannot use the TimestampFilter (because the Key timestamp is not
the value you want to filter on), you have two options for filtering
rows based on the value of a column.

1) Perform the filter on the client side. If you have significant 
amounts of data, this will be slow. Even with MapReduce, this may 
present a significant overhead in processing.

2) Implement a custom Accumulo Iterator which can perform this filtering 
in Accumulo itself. I would recommend using the WholeRowIterator in 
conjunction with this filter you would implement.
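As a rough illustration of option 1, the sketch below groups consecutive
entries that share a row ID into one "document" and keeps only the rows
that pass a caller-supplied test. It uses plain Java collections as a
stand-in for the entries a Scanner would return (sorted by row). The
class name, the {rowId, columnFamily, value} entry layout, and the
predicate are assumptions made for illustration; none of them are
Accumulo API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper, not part of Accumulo. Each entry stands in for one
// scanner result, given as {rowId, columnFamily, value} and sorted by rowId.
final class ClientSideRowFilterSketch {

    // Returns rowId -> (columnFamily -> value) for every row accepted by 'keep'.
    static Map<String, Map<String, String>> groupAndFilter(
            List<String[]> entries, Predicate<Map<String, String>> keep) {
        Map<String, Map<String, String>> kept = new LinkedHashMap<>();
        String currentRow = null;
        Map<String, String> columns = null;
        for (String[] e : entries) {
            if (!e[0].equals(currentRow)) {
                // Row boundary: decide on the finished row before starting a new one.
                if (columns != null && keep.test(columns)) {
                    kept.put(currentRow, columns);
                }
                currentRow = e[0];
                columns = new LinkedHashMap<>();
            }
            columns.put(e[1], e[2]);
        }
        // Decide on the final row, which has no following boundary.
        if (columns != null && keep.test(columns)) {
            kept.put(currentRow, columns);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"RowID_1", "createdDate", "2015-01-01:00:00:01 UTC"});
        entries.add(new String[] {"RowID_1", "data", "this is a test"});
        entries.add(new String[] {"RowID_2", "createdDate", "2015-01-01:12:01:01 UTC"});
        entries.add(new String[] {"RowID_2", "data", "this is test 2"});

        // Assumption: createdDate is fixed-width, so string order matches time order.
        String cutoff = "2015-01-01:12:00:00 UTC";
        Map<String, Map<String, String>> kept = groupAndFilter(entries,
                cols -> cols.getOrDefault("createdDate", "").compareTo(cutoff) >= 0);
        System.out.println(kept.keySet());
    }
}
```

Every entry still crosses the network before the test runs, which is
why this approach gets slow on large tables.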

At a high level, configure the WholeRowIterator to aggregate all of the
Keys in one row into a single Key-Value pair. Then, implement and
configure a custom Iterator (ideally, extend the abstract Filter
iterator) which deserializes that single Key-Value pair back into many,
extracts the createdDate column, and decides whether or not the row
should be returned to the client.

On the client, you would then unpack the serialized row into many 
key-value pairs again.
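The per-row decision such a filter has to make can be sketched in plain
Java, with no Accumulo dependency. In a real iterator the map below
would be built from the output of WholeRowIterator.decodeRow(Key, Value);
here a Map of column family to value stands in for the decoded row. The
class and method names are hypothetical, and the date pattern is an
assumption inferred from the sample values in the thread.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the accept/reject logic only; not an Accumulo iterator.
final class CreatedDateRowDecisionSketch {

    // Assumed format, based on sample values such as "2015-01-01:00:00:01 UTC".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd:HH:mm:ss 'UTC'");

    static Instant parseCreatedDate(String value) {
        return LocalDateTime.parse(value, FORMAT).toInstant(ZoneOffset.UTC);
    }

    // 'row' maps column family -> value for one decoded row.
    // Accepts rows whose createdDate lies in [startInclusive, endExclusive).
    static boolean acceptRow(Map<String, String> row,
                             Instant startInclusive, Instant endExclusive) {
        String raw = row.get("createdDate");
        if (raw == null) {
            return false; // no createdDate column: drop the row
        }
        Instant created;
        try {
            created = parseCreatedDate(raw);
        } catch (DateTimeParseException ex) {
            return false; // unparseable date: drop the row
        }
        return !created.isBefore(startInclusive) && created.isBefore(endExclusive);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("createdDate", "2015-01-01:12:01:01 UTC");
        row.put("data", "this is test 2");
        Instant start = Instant.parse("2015-01-01T12:00:00Z");
        Instant end = Instant.parse("2015-01-02T00:00:00Z");
        System.out.println(acceptRow(row, start, end));
    }
}
```

Dropping rows with a missing or malformed createdDate is one possible
policy; a job that must not lose data might instead count or log them.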

Bob Cook wrote:
> Josh,
> Thanks. I was able to get TimestampFilter to work for my needs. But I
> originally wanted "createdDate" as our application creates that date
> which is known to the user
> and may be different than accumulo timestamp due to when the data
> actually got processed into accumulo.
> So if I wanted to use the ColumnFamily "createdDate" and its value,
> what Java code would I have to write?
> I looked at the AccumuloInputFormat class, but I'm confused about how to
> specify the "range" for the date range that I'm interested in.
> So would I use the TimestampFilter class similar to how I'm using it in
> "scanner.addScanIterator", but instead using
> "AccumuloInputFormat.addIterator(job, is)", as below?
> IteratorSetting is = new IteratorSetting(30, TimestampFilter.class);
> TimestampFilter.setRange(is, startDate, endDate);
> AccumuloInputFormat.addIterator(job, is);
> Or could I use
> is.addOption("start", startDate);
> is.addOption("end", endDate);
> NOTE: for me, neither "TimestampFilter.setRange" nor
> "TimestampFilter.setStart" and "TimestampFilter.setEnd" seemed to work.
> On Sun, Oct 16, 2016 at 2:05 PM, Josh Elser <> wrote:
>     The TimestampFilter will return only the Keys whose timestamps fall
>     in the range you specify. The timestamp is an attribute on every
>     Key, a long value which, when not set by the client at write time,
>     is the number of millis since the epoch. You specify the numeric
>     range of timestamps you want. This is a post-filter operation --
>     Accumulo must still read all of the data in the table.
>     You need to tell *us* which time component you're actually
>     filtering on: the timestamp on each Key, or the createdDate column
>     in each row.
>     MapReduce is likely more efficient to do this batch processing (as
>     MapReduce is a batch processing system). See the AccumuloInputFormat
>     class.
>     Bob Cook wrote:
>         All,
>         I'm new to accumulo and inherited this project to extract all
>         data from
>         accumulo (assembled as a "document" by RowID) into another web
>         service.
>         So I started with to "scan" all data, and
>         built a
>         "document" based on the RowID, ColumnFamily and Value. Sending
>         this "document" to the service.
>         Example data.
>         ID CF CV
>         RowID_1 createdDate "2015-01-01:00:00:01 UTC"
>         RowID_1 data "this is a test"
>         RowID_1 title "My test title"
>         RowID_2 createdDate "2015-01-01:12:01:01 UTC"
>         RowID_2 data "this is test 2"
>         RowID_2 title "My test2 title"
>         ...
>         So my table is pretty simple,  RowID, ColumnFamily and Value (no
>         ColumnQualifier)
>         I need to process one billion "OLD" unique RowIDs (a year's worth of
>         data) on a live system that is ingesting "new data" at a rate of
>         about 4 million RowIDs a day.
>         i.e. I need to process data from September 2015 - September
>         2016, not
>         worrying about new data coming in.
>         So I'm thinking I need to run multiple processes to extract ALL
>         the data
>         in this "data range" to be more efficient.
>         Also, it may allow me to run the processes at a lower priority
>         and at
>         off-hours of the day when traffic is less.
>         My issue is how do I specify the "range" to scan, and how do I
>         specify.
>         1. Is using the "createdDate" a good idea, if so how would I
>         specify the
>         range for it.
>         2. How about the TimestampFilter?   If I specify my start and end to
>         "equal" a day (about 4 million unique RowIDs),
>         will this get me all ColumnFamilies and Values for a given RowID?  Or
>         could I miss something because its timestamp
>         was the next day?  I don't really understand timestamps wrt
>         Accumulo.
>         3. Does a MapReduce job make sense?  If so, how would I specify it?
>         Thanks,
>         Bob
