accumulo-user mailing list archives

From Adam Fuchs <afu...@apache.org>
Subject Re: Scanning In Timestamp Order
Date Wed, 02 Sep 2015 21:16:13 GMT
Jon,

There is some magic, but unfortunately it's not yet implemented:
ACCUMULO-652

Want to take over that project?

Adam

On Wed, Sep 2, 2015 at 5:14 PM, Parise, Jonathan <Jonathan.Parise@gd-ms.com>
wrote:

> I was pretty sure this was the answer.
>
> Yes, it makes sense to me. I was expecting this response. I was hoping for
> some magic I didn't know about, but not really expecting it.
>
> Thanks,
>
> Jon
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Wednesday, September 02, 2015 5:13 PM
> To: user@accumulo.apache.org
> Subject: Re: Scanning In Timestamp Order
>
> Jon,
>
> Short answer: no.
>
> In RDBMS parlance, Accumulo has a single index. That index is the "row"
> portion of the Key class. This is the reason you see that as a "standard
> practice". Any other attempt to fetch data based on another component of
> the key (ignoring locality groups/column family subtleties) is an
> exhaustive scan of your dataset.
>
> If you are going to support this application for any duration of time, it
> is a good idea to take the penalty once in rewriting your old data into the
> new format to make all of your queries henceforth fast. If you have such a
> significant amount of data that you want to avoid running a large mapreduce
> task, you'll likely not want to make your users wait to read all of that
> data to answer every query :)
>
> Does that make sense?
>
> - Josh
>
> Parise, Jonathan wrote:
> > Hi,
> >
> > I was wondering if there is a way to scan a table based on the
> > timestamps. For example, is there a way to set a range based on the
> > timestamp portion of the key?
> >
> > I know that standard practice is to add a timestamp as part of the row
> > id, but in this particular case I probably cannot use that technique.
> > The reason I can't use it is that I need to find the most recent data
> > in a preexisting Accumulo instance. Not all of the information was
> > stored with timestamps appended to the row id. I can't go back and
> > change the data, I just have to work with what is there.
> >
> > So, given a large amount of preexisting data without time information
> > in the row id, column family or column qualifier, how would you scan
> > for the most recent data?
> >
> > Specifically, is there any way to scan/sort by the timestamp portion
> > of the key? I did not see any way to make a Range with times.
> >
> > I also really do not want to run a job over all the data to make a new
> > copy of the table that is sorted. I have a lot of data here and such a
> > replication would take a very long time.
> >
> > Thanks,
> >
> > Jon
> >
>

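Editorial sketch: Josh's point that the row is the only index can be illustrated with a toy model. The Python below is not the Accumulo API; the table contents, the names ENTRIES, scan_row_range and scan_timestamp_range, and the timestamps are all hypothetical. It only assumes what the thread states: entries are sorted by row, so a row range can be located by seeking, while a timestamp range forces a look at every entry.

```python
import bisect

# Toy model of a table sorted by row (hypothetical data, NOT the Accumulo API).
# Each entry is (row, column, timestamp, value).
ENTRIES = sorted([
    ("row_a", "cf:q", 1441200000, "v1"),
    ("row_b", "cf:q", 1441228573, "v2"),
    ("row_c", "cf:q", 1441100000, "v3"),
    ("row_d", "cf:q", 1441228000, "v4"),
])
ROWS = [entry[0] for entry in ENTRIES]


def scan_row_range(start_row, end_row):
    """Seek by row: binary search finds the range without reading other rows.

    Returns (matching entries, number of entries examined).
    """
    lo = bisect.bisect_left(ROWS, start_row)
    hi = bisect.bisect_right(ROWS, end_row)
    hits = ENTRIES[lo:hi]
    return hits, len(hits)


def scan_timestamp_range(min_ts, max_ts):
    """Filter by timestamp: the sort order does not help, so every entry is read.

    Returns (matching entries, number of entries examined).
    """
    examined = 0
    hits = []
    for entry in ENTRIES:
        examined += 1
        if min_ts <= entry[2] <= max_ts:
            hits.append(entry)
    return hits, examined
```

In this model, scan_row_range("row_b", "row_c") examines only the two matching entries, whereas scan_timestamp_range examines all four regardless of how many match. Accumulo does ship a TimestampFilter iterator that applies this kind of filter on the server side, but, as Josh describes, it still has to read the whole dataset.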