accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@uw.edu>
Subject Re: how to maintain versioning in D4M schema?
Date Mon, 30 Nov 2015 20:22:21 GMT
>
> 2. Instead of putting each "value" in its own field, you could combine
> them into an ordered set: field|{time1:value1,time2:value2,time3:value3}.
> For this to work well, you'd have to write a custom combining iterator that
> kept only the most recent 3 during scans and compactions, based on time (or
> whatever you use to denote version).
>
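
For reference, the merge step that the combining iterator in option 2 needs could look roughly like the Python sketch below. It is not Accumulo code. The text encoding "{time:value,...}" is taken from the quoted suggestion, while the function names, the integer times, and the assumption that values contain no "," or ":" are hypothetical.

```python
from typing import Dict, Iterable


def parse_versions(encoded: str) -> Dict[int, str]:
    """Parse "{10:a,20:b}" into {10: "a", 20: "b"} (hypothetical encoding)."""
    body = encoded.strip()[1:-1]  # drop the surrounding braces
    versions: Dict[int, str] = {}
    if not body:
        return versions
    for pair in body.split(","):
        time, value = pair.split(":", 1)
        versions[int(time)] = value
    return versions


def combine_versions(encoded_sets: Iterable[str], max_versions: int = 3) -> str:
    """Merge several encoded sets and keep only the most recent max_versions."""
    if max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    merged: Dict[int, str] = {}
    for encoded in encoded_sets:
        merged.update(parse_versions(encoded))
    newest = sorted(merged)[-max_versions:]  # ascending by time
    return "{" + ",".join(f"{t}:{merged[t]}" for t in newest) + "}"
```

Calling combine_versions(["{1:a,2:b}", "{3:c}", "{4:d}"]) merges the three partial sets and drops the oldest pair, so the stored value never grows beyond three versions.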

If you don't mind writing a custom iterator, then you can write an iterator
for the original schema (where colq is "field|value1") which acts as
follows.  Don't forget that entries include a timestamp field.  Let V be
the total number of versions you want to retain, and let ARR be an array of
values and timestamps of size V.

   1. Save the "field" of the first entry in the column qualifier.  Store
   the value and timestamp as the first entry in ARR.
   2. While the next entry has the same "field", store its value and
   timestamp in the next empty entry in ARR.
      1. If there are no more empty slots in ARR, then remove the entry
      with the least recent timestamp from ARR, and add the new value and
      timestamp to ARR (or don't add the new entry if it has the least recent
      timestamp).
   3. When the next entry does not have the same "field", emit all the
   entries in ARR, clear ARR, and go back to step 1 with the new entry.
   4. When there are no more entries, emit ARR and then nothing more (so
   that hasTop() returns false).
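
The steps above can be sketched outside Accumulo as follows. This is a minimal Python sketch of the buffering logic only, not an Accumulo iterator: the (colq, timestamp) tuple shape and the name keep_latest_versions are assumptions made for illustration, and the input is assumed to arrive grouped by "field", as it would within one sorted row.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical entry shape: (colq, timestamp), where colq is "field|value".
Entry = Tuple[str, int]


def keep_latest_versions(entries: Iterable[Entry], v: int) -> Iterator[Entry]:
    """Yield at most v entries per field, keeping the most recent timestamps."""
    if v < 1:
        raise ValueError("v must be at least 1")
    current_field = None
    arr: List[Entry] = []  # the ARR buffer from the steps above, size <= v

    for colq, ts in entries:
        field = colq.split("|", 1)[0]
        if field != current_field:
            # Step 3: the field changed, so emit the buffer and start over.
            yield from arr
            arr = []
            current_field = field
        if len(arr) < v:
            # Steps 1 and 2: fill the next empty slot.
            arr.append((colq, ts))
        else:
            # Step 2.1: evict the oldest buffered entry if the new one is newer.
            oldest = min(range(len(arr)), key=lambda i: arr[i][1])
            if ts > arr[oldest][1]:
                # Delete, then append, so the buffer stays in arrival order.
                del arr[oldest]
                arr.append((colq, ts))
    # Step 4: no more entries, emit what is left.
    yield from arr
```

Deleting the oldest entry and appending the new one, rather than overwriting the slot in place, keeps the buffer in the order the entries arrived, which matters if the output must stay sorted by key.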

This approach works because a row is guaranteed to be stored on the same
tablet server and, since keys are sorted, we see all the entries for a
"field" consecutively.  Let us know how it works for you if you choose to
go this route.

Regards, Dylan

On Mon, Nov 30, 2015 at 10:58 AM, Christopher <ctubbsii@apache.org> wrote:

> I can think of two options:
>
> 1. Instead of "field|value", use "field<version>|value", where version
> behaves similarly to Accumulo's timestamp field, and add a custom iterator
> which achieves the same effect as the VersioningIterator using this part of
> the colq.
>
> 2. Instead of putting each "value" in its own field, you could combine
> them into an ordered set: field|{time1:value1,time2:value2,time3:value3}.
> For this to work well, you'd have to write a custom combining iterator that
> kept only the most recent 3 during scans and compactions, based on time (or
> whatever you use to denote version).
>
> Of the two, I think the second is simpler and fits best within the
> existing D4M schema. At the most, it just adds some structure to the value,
> which can be processed with an additional combining iterator, but doesn't
> fundamentally change the table structure.
>
> On Sun, Nov 29, 2015 at 11:10 PM shweta.agrawal <shweta.agrawal@orkash.com>
> wrote:
>
>> The example which I am working is:
>>
>> rowid        colf          colq          value
>>    id                        field|value1      1
>>    id                        field|value2      1
>>    id                        field|value3      1
>>    id                        field|value4      1
>>    id                        field|value5      1
>>    id                        field|value6      1
>>
>> This is my schema in D4M style. Here one field has multiple values. I
>> want to keep the latest 3 values and have the other values deleted
>> automatically, as the versioning iterator does.
>>
>> So after versioning my table should look like this:
>>
>> rowid        colf          colq          value
>>    id                        field|value1      1
>>    id                        field|value2      1
>>    id                        field|value3      1
>>
>> Thanks
>> Shweta
>>
>> On Friday 27 November 2015 07:15 PM, Jeremy Kepner wrote:
>> > Can you provide a made up specific example?  I think that will
>> > make the discussion easier.
>> >
>> >
>> > On Fri, Nov 27, 2015 at 02:46:33PM +0530, shweta.agrawal wrote:
>> >> Thanks for the answer.
>> >> But I am asking about versioning in D4M style. How can I use the
>> >> versioning iterator in D4M style? In D4M style the id is stored in
>> >> the row ID and field|value is stored in the column qualifier. Because
>> >> the value is stored in the column qualifier, I cannot maintain
>> >> versions through the versioning iterator. So how do I maintain
>> >> versioning in D4M style?
>> >>
>> >> Thanks
>> >> Shweta
>> >>
>> >> On Friday 27 November 2015 12:45 PM, Dylan Hutchison wrote:
>> >>> In order to store five versions of a key but return only one of
>> >>> them during a scan, set the minc and majc VersioningIterator to 5
>> >>> and set the scan VersioningIterator to 1.  You can set scanning
>> >>> iterators on a per-scan basis if this helps.
>> >>>
>> >>> It is not necessary to put the timestamp in the column family if
>> >>> you are going with the VersioningIterator approach.
>> >>>
>> >>> There are many ways to achieve versioning in Accumulo. As the
>> >>> designer/programmer, you must choose one that fits your
>> >>> application, of which we do not know the full details. It sounds
>> >>> like you have narrowed your choice to (1) putting the timestamp in
>> >>> the column family, or (2) not putting the timestamp anywhere else
>> >>> but instead changing the VersioningIterator such that Accumulo
>> >>> stores more versions than the latest version of a
>> >>> (row,colfam,colqual,colvis) key.
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Nov 26, 2015 at 8:45 PM, mohit.kaushik
>> >>> <mohit.kaushik@orkash.com <mailto:mohit.kaushik@orkash.com>>
>> >>> wrote:
>> >>>
>> >>>     David,
>> >>>
>> >>>     But this is the case when we store versions based on timestamp
>> >>>     field. The point is, in D4M schema we can not achieve it by doing
>> >>>     this. In this case we are considering CF to store timestamp in
>> >>>     reverse order as described by Dylan. Then how can we configure
>> >>>     Accumulo to return only latest version and store only 5 versions?
>> >>>
>> >>>     Thanks
>> >>>     Mohit Kaushik
>> >>>
>> >>>     On 11/27/2015 09:54 AM, David Medinets wrote:
>> >>>>      From the user manual:
>> >>>>
>> >>>>     user@myinstance mytable> config -t mytable -s table.iterator.scan.vers.opt.maxVersions=5
>> >>>>     user@myinstance mytable> config -t mytable -s table.iterator.minc.vers.opt.maxVersions=5
>> >>>>     user@myinstance mytable> config -t mytable -s table.iterator.majc.vers.opt.maxVersions=5
>> >>>>
>> >>>>     On Thu, Nov 26, 2015 at 11:10 PM, shweta.agrawal
>> >>>>     <shweta.agrawal@orkash.com <mailto:shweta.agrawal@orkash.com>> wrote:
>> >>>>
>> >>>>         I want to maintain 5 versions only and user can enter any
>> >>>>         number of versions but I want to keep only 5 latest version.
>> >>>>
>> >>>>
>> >>>>         On Friday 27 November 2015 09:38 AM, David Medinets wrote:
>> >>>>>         Do you want five versions of every entry or will the number
>> >>>>>         of versions vary?
>> >>>>>
>> >>>>>         On Thu, Nov 26, 2015 at 10:53 PM, shweta.agrawal
>> >>>>>         <shweta.agrawal@orkash.com
>> >>>>>         <mailto:shweta.agrawal@orkash.com>> wrote:
>> >>>>>
>> >>>>>             Thanks Dylan and David.
>> >>>>>             I can store version information in column family. But my
>> >>>>>             problem is when I have many versions of the same key how
>> >>>>>             will I manage that. In Accumulo versioning I can specify
>> >>>>>             that how many versions I want to manage.
>> >>>>>
>> >>>>>             Suppose I have 10 versions and I only want 5 versions to
>> >>>>>             store, how to manage this in a big table?
>> >>>>>
>> >>>>>             Thanks
>> >>>>>             Shweta
>> >>>>>
>> >>>>>             On Thursday 26 November 2015 10:22 PM, David Medinets wrote:
>> >>>>>>             What are the query patterns? If you are versioning for
>> >>>>>>             auditing then changing the VersioningIterator seems the
>> >>>>>>             easiest approach. You could also store
>> >>>>>>             application-specific version information in the column
>> >>>>>>             family. One of the reasons that D4M does not use it is
>> >>>>>>             to allow application-specific uses. Using the CF means
>> >>>>>>             that any applications that understand D4M would not
>> >>>>>>             need to change their queries to adjust for the version
>> >>>>>>             information.
>> >>>>>>
>> >>>>>>             On Thu, Nov 26, 2015 at 4:26 AM, shweta.agrawal
>> >>>>>>             <shweta.agrawal@orkash.com
>> >>>>>>             <mailto:shweta.agrawal@orkash.com>> wrote:
>> >>>>>>
>> >>>>>>                 Hi,
>> >>>>>>
>> >>>>>>                 I have my data stored in D4M style. I also want to
>> >>>>>>                 maintain versions of different value on the basis
>> >>>>>>                 of time.  As in D4M style  data is only in rowid
>> >>>>>>                 and colQualifier only.
>> >>>>>>
>> >>>>>>                 Is there any way to achieve versioning in D4M schema?
>> >>>>>>
>> >>>>>>                 Thanks
>> >>>>>>                 Shweta
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>>
>>
