accumulo-dev mailing list archives

From "Josh Elser" <josh.el...@gmail.com>
Subject Re: Review Request 19790: ACCUMULO-378 Design document
Date Mon, 31 Mar 2014 16:18:53 GMT


> On March 28, 2014, 6:53 p.m., kturner wrote:
> > docs/src/main/resources/design/ACCUMULO-378-design.mdtext, line 119
> > <https://reviews.apache.org/r/19790/diff/1/?file=539855#file539855line119>
> >
> >     Seems like Accumulo should have a public API for querying what needs to be replicated,
notifying it when something has been replicated, and methods for importing replicated data.
 I am thinking of something different than a plugin, more like the import/export table API.
 How the replication happens is up to the user.  We could provide a default implementation that
does replication as you mentioned.  Some users may want to occasionally replicate large batches
using MapReduce.  Others may want to continually replicate files using distributed queueing
solutions.
> 
> Josh Elser wrote:
>     My initial thoughts were to provide something at a public api layer due to the likely
desire to integrate WALs as a part of said API. Opening up an API might prove difficult to
implement well -- we would have to design something that scales out to adequately support
the ingest rates Accumulo will support.
>     
>     Not saying I'm against it, but it would be difficult to get right. Hooking into it
would also likely be difficult to implement.
> 
> kturner wrote:
>     I agree we would not want to expose internals of walogs to users.  However, I think
this API would just expose URIs that need to be replicated.  The user would not have to care
about what the actual data pointed to by the URI is.
>     
>     I am going about this all wrong.  I should outline what I would like to see Accumulo
do instead of some incomplete "how" to do it.  Stepping back, I would like to see this feature
designed to empower admins.
>     
>     ZFS is a file system I really like that empowers admins.  One way it empowers admins
is by providing a really flexible, easy-to-use mechanism for replicating file systems. W/ ZFS
an admin can do something like the following to initially replicate a file system.
>     
>      # zfs snapshot tank/home@snap1 
>      # zfs send tank/home@snap1 | ssh host2 zfs recv newtank/home
>     
>     After some period of time they can easily replicate the changes to the file system
w/ the following commands.
>     
>      # zfs snapshot tank/home@snap2 
>      # zfs send -i tank/home@snap1 tank/home@snap2 | ssh host2 zfs recv newtank/home
>     
>     What I like about this is that zfs send writes to stdout, so the admin could write
to a file, send over the network, write to tape, etc.   Whenever and however the admin wants
to move the data, the ZFS API makes it super easy for them to do it.    Of course we cannot
do exactly what ZFS does, but we can make it easy for admins to move data between clusters
in different ways and on different schedules.

So, wrapping something around (ranges of) WALs and RFiles is definitely desirable here. I believe
with that, we can better separate the logic into discrete pieces: 1) Generate data, 2) Transmit
data, 3) Apply data.
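
A rough sketch of that separation, in Python only for brevity. Every name below is hypothetical and none of it is an Accumulo API; the point is just that each stage can be swapped out without touching the other two.

```python
from typing import Callable, Iterable, List

# Hypothetical stage signatures; none of these are Accumulo APIs.
Generate = Callable[[], Iterable[bytes]]   # 1) produce units of data to replicate
Transmit = Callable[[bytes], bytes]        # 2) move one unit to the peer
Apply = Callable[[bytes], None]            # 3) apply one unit on the peer


def replicate(generate: Generate, transmit: Transmit, apply: Apply) -> int:
    """Run the three stages in order and return the number of units applied."""
    count = 0
    for unit in generate():
        apply(transmit(unit))
        count += 1
    return count


# Toy stages: an in-memory source, a pass-through "network", a list as the peer.
peer: List[bytes] = []
applied = replicate(
    generate=lambda: [b"mutation-1", b"mutation-2"],
    transmit=lambda unit: unit,
    apply=peer.append,
)
```

A batch job and a continuous queue-based shipper would differ only in what gets passed as `transmit`.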

The more agnostic we can make the implementations of the underlying data, the better. The
wrapper around WALs and RFiles would need to support some semantics like ordering
(WAL1 needs to be applied before WAL2), verification/validation on the remote side (checksum?),
and the ability to efficiently replay this data.
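
To make those semantics concrete, here is a minimal sketch assuming a per-source sequence number for ordering and a SHA-256 digest for validation. Both choices, and the names `ReplicationUnit`, `wrap` and `Receiver`, are assumptions for illustration and not part of the design document.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReplicationUnit:
    """Hypothetical wrapper around a WAL range or an RFile; not an Accumulo class."""
    sequence: int   # ordering: lower sequence numbers are applied first
    payload: bytes  # opaque bytes; the transport never inspects them
    checksum: str   # SHA-256 hex digest of the payload, computed at the source


def wrap(sequence: int, payload: bytes) -> ReplicationUnit:
    return ReplicationUnit(sequence, payload, hashlib.sha256(payload).hexdigest())


class Receiver:
    """Applies units strictly in sequence order, buffering any that arrive early."""

    def __init__(self) -> None:
        self.next_sequence = 0
        self.pending: Dict[int, ReplicationUnit] = {}
        self.applied: List[bytes] = []

    def receive(self, unit: ReplicationUnit) -> None:
        if hashlib.sha256(unit.payload).hexdigest() != unit.checksum:
            raise ValueError(f"checksum mismatch for unit {unit.sequence}")
        if unit.sequence < self.next_sequence:
            return  # already applied; resending a unit is harmless
        self.pending[unit.sequence] = unit
        # Drain every unit that is now contiguous with what was already applied.
        while self.next_sequence in self.pending:
            self.applied.append(self.pending.pop(self.next_sequence).payload)
            self.next_sequence += 1


receiver = Receiver()
receiver.receive(wrap(1, b"WAL2"))  # arrives early, held back
receiver.receive(wrap(0, b"WAL1"))  # unblocks both
```

Because already-applied sequence numbers are ignored, the sender can resend from any earlier point after a failure without the receiver applying data twice.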

Thinking further, you could even generalize the problem of how to get from #1 to #2 as a FIFO
queue backed by a table.
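
A small sketch of what a table-backed FIFO could look like. A plain dict stands in for a sorted key-value table, and the row format (zero-padded counter) and the class name are assumptions, not something the design specifies.

```python
from typing import Dict, Optional, Tuple


class TableBackedQueue:
    """FIFO queue stored as rows of a sorted key-value table (hypothetical).

    Row IDs are zero-padded sequence numbers so that lexicographic row order
    matches insertion order.
    """

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}
        self.next_id = 0

    def enqueue(self, uri: str) -> str:
        row = f"{self.next_id:016d}"
        self.table[row] = uri
        self.next_id += 1
        return row

    def peek(self) -> Optional[Tuple[str, str]]:
        if not self.table:
            return None
        row = min(self.table)  # first row in sorted order is the oldest entry
        return row, self.table[row]

    def ack(self, row: str) -> None:
        # Delete only after the consumer confirms the unit was transmitted.
        del self.table[row]


queue = TableBackedQueue()
queue.enqueue("wal-a")  # placeholder names, not real WAL URIs
queue.enqueue("wal-b")
first = queue.peek()
```

Keeping `peek` and `ack` separate means an entry stays in the table until transmission is confirmed, so a crash between the two leaves the work visible to the next consumer.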


- Josh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19790/#review38927
-----------------------------------------------------------


On March 28, 2014, 5:54 p.m., kturner wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19790/
> -----------------------------------------------------------
> 
> (Updated March 28, 2014, 5:54 p.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-378
>     https://issues.apache.org/jira/browse/ACCUMULO-378
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> ACCUMULO-378 Design document.  Posting for review here, not meant for commit.  Final
version of document should be posted on issue.
> 
> 
> Diffs
> -----
> 
>   docs/src/main/resources/design/ACCUMULO-378-design.mdtext PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19790/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> kturner
> 
>

