accumulo-user mailing list archives

From "Parise, Jonathan" <Jonathan.Par...@gd-ms.com>
Subject RE: Scanning In Timestamp Order
Date Wed, 02 Sep 2015 21:14:34 GMT
I was pretty sure this was the answer.

Yes, it makes sense to me. I was expecting this response. I was hoping for some magic I didn't
know about, but I wasn't really expecting it.

Thanks,

Jon

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com] 
Sent: Wednesday, September 02, 2015 5:13 PM
To: user@accumulo.apache.org
Subject: Re: Scanning In Timestamp Order

Jon,

Short answer: no.

In RDBMS parlance, Accumulo has a single index. That index is the "row"
portion of the Key class. This is the reason you see putting the timestamp in the row id
described as a "standard practice". Any other attempt to fetch data based on another component
of the key (ignoring locality groups/column family subtleties) is an exhaustive scan of your
dataset.
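
To illustrate the single-index point, here is a minimal sketch in plain Java. It is a toy model, not the Accumulo API: a `TreeMap` sorted by row stands in for a table, and the class name, method names, and sample rows are all made up for the example. A row-range lookup can seek directly into the sorted map, while a timestamp lookup has to examine every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: a sorted map standing in for a table sorted by row.
// Nothing here is part of the Accumulo API; all names are hypothetical.
public class SingleIndexSketch {

    // Row-range lookup: the sort order lets us jump straight to the range.
    public static List<String> rowsInRange(NavigableMap<String, Long> table,
                                           String startRow, String endRow) {
        return new ArrayList<>(table.subMap(startRow, true, endRow, true).keySet());
    }

    // Timestamp lookup: the sort order says nothing about timestamps,
    // so every entry in the table has to be examined.
    public static List<String> rowsNewerThan(NavigableMap<String, Long> table,
                                             long minTimestamp) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Long> entry : table.entrySet()) {
            if (entry.getValue() >= minTimestamp) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    // Made-up sample data: row id -> timestamp.
    public static NavigableMap<String, Long> sampleTable() {
        NavigableMap<String, Long> table = new TreeMap<>();
        table.put("row_a", 300L);
        table.put("row_b", 100L);
        table.put("row_c", 500L);
        table.put("row_d", 200L);
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = sampleTable();
        System.out.println(rowsInRange(table, "row_b", "row_c"));
        System.out.println(rowsNewerThan(table, 300L));
    }
}
```

The cost of `rowsNewerThan` grows with the size of the whole table, which is the "exhaustive scan" described above.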

If you are going to support this application for any length of time, it is a good idea to
take the penalty once and rewrite your old data into the new format, so that all of your queries
are fast from then on. If you have so much data that you want to avoid running
a large MapReduce job, you'll likely not want to make your users wait to read all of that
data to answer every query :)
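
As a rough sketch of what such a one-time rewrite might produce, the plain Java below re-keys a toy table so that the newest entries sort first. The row layout (a zero-padded reversed timestamp followed by the original row id) is only an assumption chosen for illustration, as are the class and method names; it is not taken from this thread or from the Accumulo API, and it assumes non-negative timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: shows the effect of moving the timestamp into the row id.
// All names and the row layout are hypothetical.
public class RekeySketch {

    // Assumed layout: <19-digit reversed timestamp>_<original row>.
    // Subtracting from Long.MAX_VALUE makes newer timestamps sort first.
    public static String newRowId(String originalRow, long timestamp) {
        return String.format(Locale.ROOT, "%019d_%s",
                Long.MAX_VALUE - timestamp, originalRow);
    }

    // The one-time pass over the old data: new row id -> original row id.
    public static NavigableMap<String, String> rekey(Map<String, Long> oldTable) {
        NavigableMap<String, String> rekeyed = new TreeMap<>();
        for (Map.Entry<String, Long> entry : oldTable.entrySet()) {
            rekeyed.put(newRowId(entry.getKey(), entry.getValue()), entry.getKey());
        }
        return rekeyed;
    }

    // After re-keying, "most recent" is just the first n entries in sort order.
    public static List<String> mostRecent(NavigableMap<String, String> rekeyed, int n) {
        List<String> result = new ArrayList<>();
        for (String originalRow : rekeyed.values()) {
            if (result.size() >= n) {
                break;
            }
            result.add(originalRow);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> oldTable = new TreeMap<>();
        oldTable.put("row_a", 300L);
        oldTable.put("row_b", 100L);
        oldTable.put("row_c", 500L);
        oldTable.put("row_d", 200L);
        System.out.println(mostRecent(rekey(oldTable), 2));
    }
}
```

The full pass in `rekey` is paid once; afterwards each "most recent" query reads only the head of the sorted data instead of all of it.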

Does that make sense?

- Josh

Parise, Jonathan wrote:
> Hi,
>
> I was wondering if there is a way to scan a table based on the 
> timestamps. For example, is there a way to set a range based on the 
> timestamp portion of the key?
>
> I know that standard practice is to add a timestamp as part of the row 
> id, but in this particular case I probably cannot use that technique.
> The reason I can't use it is that I need to find the most recent data 
> in a preexisting Accumulo instance. Not all of the information was 
> stored with timestamps appended to the row id. I can't go back and 
> change the data, I just have to work with what is there.
>
> So, given a large amount of preexisting data without time information 
> in the row id, column family or column qualifier, how would you scan 
> for the most recent data?
>
> Specifically, is there any way to scan/sort by the timestamp portion 
> of the key? I did not see any way to make a Range with times.
>
> I also really do not want to run a job over all the data to make a new 
> copy of the table that is sorted. I have a lot of data here and such a 
> replication would take a very long time.
>
> Thanks,
>
> Jon
>
