accumulo-dev mailing list archives

From ke...@deenlo.com
Subject Re: Review Request 19790: ACCUMULO-378 Design document
Date Mon, 31 Mar 2014 15:57:26 GMT


> On March 28, 2014, 6:53 p.m., kturner wrote:
> > docs/src/main/resources/design/ACCUMULO-378-design.mdtext, line 139
> > <https://reviews.apache.org/r/19790/diff/1/?file=539855#file539855line139>
> >
> >     A walog or bulk imported file could be referenced by multiple tablets.  I am
> >     wondering if it would be better to move this info out of the tablet and do
> >     something like ~del markers in the metadata table.  Like a ~repl_hdfs://foo/a.rf
> >     row in the metadata table.  This row could store replication status.  If the
> >     ~repl row exists, then the file would not be deleted.  The ~repl marker could not
> >     be removed until the file is replicated and there are no more refs in the tablet
> >     metadata (is this sufficient to prevent adding a repl marker for something that
> >     has already been replicated?).  Could possibly update repl markers using
> >     conditional mutations, since multiple tablets and the master may mutate it.
> 
> Josh Elser wrote:
>     Yeah, this ties into what Mike had asked about. Having it in a completely separate
>     table would be best from a "screwing up other things" perspective. I don't have any
>     example of why these markers would need to be in the same row as the tablets. I
>     need to read some of that code again.

I was thinking about this some more; putting those markers in the same row makes updates to
the metadata table atomic.  The same mutation that adds a file can also add the replication
marker.  Keeping track of the replication status of a file will still need to be done in a
single place.  That per-file replication status could be tracked in a ~repl section of the
metadata table, using file suffixes after the ~repl prefix, like ~repl<suffix>.
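
To make the atomic update concrete, here is a minimal sketch of the mutations involved.  The
"repl" column family and "~repl" row prefix are the proposal above, not an existing part of
the metadata schema, and the values are made up.

 import org.apache.accumulo.core.data.Mutation;
 import org.apache.accumulo.core.data.Value;
 import org.apache.hadoop.io.Text;

 // One mutation adds the file and its replication marker, so the update is atomic.
 Mutation tablet = new Mutation(new Text("2;endrow"));  // a tablet's metadata row
 tablet.put(new Text("file"), new Text("/t-0001/a.rf"), new Value("1000,10000".getBytes()));
 tablet.put(new Text("repl"), new Text("/t-0001/a.rf"), new Value("pending".getBytes()));

 // Per-file replication status is tracked in a single place, in the ~repl section,
 // keyed by the file suffix.
 Mutation status = new Mutation(new Text("~repl/t-0001/a.rf"));
 status.put(new Text("status"), new Text("peer1"), new Value("pending".getBytes()));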

I would be wary of using another table.  Having a tablet write to a sibling in the tree of
tablets has always been troublesome; writing up the tree avoids problems.  But this depends
on the implementation.  If some other process scans the repl column families, dedupes them,
and adds new entries to another table that tracks files to replicate, that would probably be
ok.  If the tablets themselves are writing to the deduped list of files to replicate, it
would be best to write up the tree.
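
For example, that separate scan-and-dedupe process might look something like the following.
This is just a sketch; the "repl" column family and the "replication.work" table name are
assumptions for illustration.

 import java.util.Map;
 import org.apache.accumulo.core.client.BatchWriter;
 import org.apache.accumulo.core.client.BatchWriterConfig;
 import org.apache.accumulo.core.client.Connector;
 import org.apache.accumulo.core.client.Scanner;
 import org.apache.accumulo.core.data.Key;
 import org.apache.accumulo.core.data.Mutation;
 import org.apache.accumulo.core.data.Value;
 import org.apache.accumulo.core.metadata.MetadataTable;
 import org.apache.accumulo.core.security.Authorizations;
 import org.apache.hadoop.io.Text;

 void copyReplMarkers(Connector conn) throws Exception {
   Scanner scanner = conn.createScanner(MetadataTable.NAME, Authorizations.EMPTY);
   scanner.fetchColumnFamily(new Text("repl"));
   BatchWriter bw = conn.createBatchWriter("replication.work", new BatchWriterConfig());
   for (Map.Entry<Key,Value> entry : scanner) {
     // The qualifier holds the file; using it as the row dedupes multiple tablet refs.
     Mutation m = new Mutation(entry.getKey().getColumnQualifier());
     m.put(new Text("status"), new Text(""), entry.getValue());
     bw.addMutation(m);
   }
   bw.close();
 }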

It would be nice to walk through the end-to-end process in the design doc and outline what
mutations are written to the metadata table (and possibly other tables).  My thinking is that
this would be an easy way to help us reason about the correctness of the design.


> On March 28, 2014, 6:53 p.m., kturner wrote:
> > docs/src/main/resources/design/ACCUMULO-378-design.mdtext, line 261
> > <https://reviews.apache.org/r/19790/diff/1/?file=539855#file539855line261>
> >
> >     Tablets could support an atomic operation that marks all of their current files
> >     as needing replication and appropriately handles new data coming in.  The master
> >     would go through all tablets in a table calling this operation.  Tablets could
> >     write something to the metadata table when the operation is successful.  This
> >     allows the master to know which tablets are done.
> 
> Josh Elser wrote:
>     That's a possibility. Like you mentioned earlier, depending on the amount of data
>     to be replicated, exporttable and distcp might be (wildly) more efficient. Is it
>     worth trying to do this now, or should we leave pre-existing data replication as
>     a follow-on?

An important consideration now, I think, is to make sure we don't do anything that precludes
this.


> On March 28, 2014, 6:53 p.m., kturner wrote:
> > docs/src/main/resources/design/ACCUMULO-378-design.mdtext, line 345
> > <https://reviews.apache.org/r/19790/diff/1/?file=539855#file539855line345>
> >
> >     Or maybe the replication info could be stored externally, if that information
> >     applies to the entire wal file.
> 
> Josh Elser wrote:
>     I was thinking of some sort of serialized object inside the Value of that repl
>     column. This would serve as a way to hide the details of WAL vs. RFile. Do you
>     have other thoughts about this?

What information would be stored per key/value?


> On March 28, 2014, 6:53 p.m., kturner wrote:
> > docs/src/main/resources/design/ACCUMULO-378-design.mdtext, line 119
> > <https://reviews.apache.org/r/19790/diff/1/?file=539855#file539855line119>
> >
> >     Seems like accumulo should have a public API for querying what needs to be
> >     replicated, notifying it when something has been replicated, and methods for
> >     importing replicated data.  I am thinking of something different than a plugin,
> >     more like the import/export table API.  How the replication happens is up to
> >     the user.  We could provide a default implementation that does replication as
> >     you mentioned.  Some users may want to occasionally replicate large batches
> >     using map reduce.  Others may want to continually replicate files using
> >     distributed queueing solutions.
> 
> Josh Elser wrote:
>     My initial thoughts were to provide something at a public API layer due to the
>     likely desire to integrate WALs as a part of said API. Opening up an API might
>     prove difficult to implement well -- we would have to design something that scales
>     out to adequately support the ingest rates Accumulo will support.
>     
>     Not saying I'm against it, but it would be difficult to get right. Hooking into
>     it would also likely be difficult to implement.

I agree we would not want to expose the internals of walogs to users.  However, I think this
API would just expose URIs that need to be replicated.  The user would not have to care about
the actual data pointed to by the URI.
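
Something as small as the following might be sufficient.  This is a rough sketch with
made-up names, not a proposed final API; the point is that only URIs cross the boundary.

 import java.net.URI;

 // Hypothetical public API: exposes file URIs only, never walog or rfile internals.
 public interface ReplicationWork {

   // URIs of files containing data that still needs to be replicated for a table.
   Iterable<URI> pendingFiles(String tableName);

   // The user's replication mechanism (map reduce, distributed queueing, etc.) calls
   // this when a file has been fully replicated, so Accumulo can clean up its markers.
   void fileReplicated(String tableName, URI file);
 }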

I am going about this all wrong.  I should outline what I would like to see Accumulo do
instead of some incomplete "how" to do it.  Stepping back, I would like to see this feature
designed to empower admins.

ZFS is a file system I really like because it empowers admins.  One way it does so is by
providing a really flexible, easy-to-use mechanism for replicating file systems.  W/ ZFS an
admin can do something like the following to initially replicate a file system.

 # zfs snapshot tank/home@snap1 
 # zfs send tank/home@snap1 | ssh host2 zfs recv newtank/home

After some period of time they can easily replicate the changes to the file system w/ the
following commands.

 # zfs snapshot tank/home@snap2 
 # zfs send -i tank/home@snap1 tank/home@snap2 | ssh host2 zfs recv newtank/home

What I like about this is that zfs send writes to stdout, so the admin could write to a
file, send it over the network, write it to tape, etc.  Whenever and however the admin wants
to move the data, the ZFS API makes it super easy to do.  Of course we cannot do exactly
what ZFS does, but we can make it easy for admins to move data between clusters in different
ways and on different schedules.


- kturner


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19790/#review38927
-----------------------------------------------------------


On March 28, 2014, 5:54 p.m., kturner wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19790/
> -----------------------------------------------------------
> 
> (Updated March 28, 2014, 5:54 p.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-378
>     https://issues.apache.org/jira/browse/ACCUMULO-378
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> ACCUMULO-378 Design document.  Posting for review here, not meant for commit.  Final
version of document should be posted on issue.
> 
> 
> Diffs
> -----
> 
>   docs/src/main/resources/design/ACCUMULO-378-design.mdtext PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19790/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> kturner
> 
>

