subversion-dev mailing list archives

From Blair Zajac <bl...@orcaware.com>
Subject Re: FSv2 (was: FREE Apache Subversion Meetup...)
Date Tue, 19 Oct 2010 17:12:08 GMT
On 10/19/2010 01:31 AM, Greg Stein wrote:
> On Mon, Oct 18, 2010 at 23:51, Blair Zajac<blair@orcaware.com>  wrote:
>> On 10/04/2010 06:45 AM, C. Michael Pilato wrote:
>>>
>>> There, you can learn more about what the Meetups tend to look like, what
>>> other Meetups are planned for this year's conference, and so on.  You'll
>>> also find a link to the Subversion Meetup wiki page:
>>>
>>>         http://subversion.open.collab.net/wiki/ApacheConNA2010Meetup
>>
>> That's the first mention I've seen of FSv2.  What ideas are going into it?
>> What problems is it primarily meant to solve?
>
> FSv2 is a hand-wave.
>
> Personally, I see it as a broad swath of API changes to align our
> needs with the underlying storage. Trowbridge noted that our current
> API makes it *really* difficult to implement an effective backend. I'd
> also like to see a backend that allows for parallel PUTs during the
> commit process. Hyrum sees FSv2 as some kind of super-key-value
> storage with layers on top, allowing for various types of high-scaling
> mechanisms.

How would that API look?  The current API is pretty clear.

Background for my wish list.

We use Subversion as a backend for a versioned asset management system.
We get up to 5 commits per second from render processes generating new
assets and artists saving assets.  We have interactive GUI users that do
asset lookups all the time.

While the immutability of svn has allowed us to cache revision data, and
our servers can push 4,000 lookups per second to our render farm, which
does lookups on a particular revision, interactive users that do HEAD
lookups suffer because of the high commit rate.  We cache data by node-id
in memcached, but because the root node gets a new node-id on every
commit and because the first thing interactive users do is get a list of
folders of the root node, we always get cache misses.  I don't really
want svn to change the way new node-ids are assigned to parent nodes all
the way to the root.
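
A minimal sketch of the cache-miss pattern described above, using a toy
in-memory tree rather than the real svn_fs API.  The Node class, commit()
and list_dir() are hypothetical names invented for illustration, and a
plain dict stands in for memcached.  It only shows why a cache keyed by
node-id keeps hitting for pinned revisions and untouched siblings but
always misses on the HEAD root.

```python
import itertools

_ids = itertools.count(1)  # toy node-id generator (assumption, not svn's scheme)

class Node:
    """Immutable toy directory node; every rewrite gets a fresh node-id."""
    def __init__(self, children=None):
        self.node_id = next(_ids)
        self.children = dict(children or {})

def commit(root, path):
    """Return a new root after modifying the node at `path` (list of names).

    Each node along the path is copied and receives a new node-id;
    untouched siblings are shared with the old tree and keep their ids.
    """
    if not path:
        return Node(root.children)
    name, rest = path[0], path[1:]
    new_children = dict(root.children)
    new_children[name] = commit(root.children[name], rest)
    return Node(new_children)

cache = {}  # stand-in for memcached: node_id -> directory listing

def list_dir(node):
    """Return (listing, was_cache_hit) for a directory node."""
    if node.node_id in cache:
        return cache[node.node_id], True
    listing = sorted(node.children)
    cache[node.node_id] = listing
    return listing, False

# Build /assets/tree and /shots, warm the cache, then commit under /assets.
r1 = Node({"assets": Node({"tree": Node()}), "shots": Node()})
list_dir(r1)
list_dir(r1.children["shots"])
r2 = commit(r1, ["assets", "tree"])

root_hit = list_dir(r2)[1]                     # new HEAD root: miss
shots_hit = list_dir(r2.children["shots"])[1]  # untouched sibling: hit
old_root_hit = list_dir(r1)[1]                 # pinned old revision: hit
print(root_hit, shots_hit, old_root_hit)       # prints: False True True
```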

1) Scalability to 30,000 child nodes in a single directory.

Currently, a single change to a node in a directory with 20,000 child 
nodes causes a new revision file in fsfs to use around 960 kB.  With a 
commit rate of 1.5 commits per second in a repository, the disk usage is 
very high.  We introduced a hidden layer of "hash:DD" directories, 30 in 
our case, that our internal Subversion server hashes path elements to. 
This makes the revision files much smaller, but now when getting a list 
of nodes in a directory, we have up to 30 child directories to index, 
increasing lookup times.
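
The trade-off above can be sketched roughly as follows.  Everything here
is an assumption made for illustration: CRC32 stands in for whatever hash
the internal server really uses, the store is a plain dict, and the size
estimate assumes the revision file grows linearly with the number of
directory entries rewritten (960 kB / 20,000 entries, about 48 bytes
each) and that names spread evenly over the buckets.

```python
import zlib

NUM_BUCKETS = 30  # bucket count taken from the message

def bucket_for(name):
    """Map a path element to a hidden "hash:DD" directory.

    CRC32 is an illustrative choice, not the real hash function.
    """
    return "hash:%02d" % (zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS)

def add_child(store, parent, name):
    """Record `name` under `parent`, inside its hash bucket."""
    store.setdefault(parent, {}).setdefault(bucket_for(name), set()).add(name)

def list_children(store, parent):
    """List all children; every non-empty bucket has to be read."""
    names, buckets_read = [], 0
    for _bucket, members in store.get(parent, {}).items():
        buckets_read += 1
        names.extend(members)
    return sorted(names), buckets_read

# Rough size model (assumption: cost is linear in entries rewritten).
BYTES_PER_ENTRY = 960_000 / 20_000  # about 48 bytes per directory entry

def estimated_commit_kb(entries_rewritten):
    return entries_rewritten * BYTES_PER_ENTRY / 1000

flat_kb = estimated_commit_kb(20_000)
# With buckets, one commit rewrites one bucket plus the 30-entry parent.
bucketed_kb = estimated_commit_kb(20_000 / NUM_BUCKETS + NUM_BUCKETS)
# Unbucketed disk growth at 1.5 commits per second, in GB per day.
flat_gb_per_day = 960 * 1.5 * 86_400 / 1_000_000

store = {}
for i in range(20_000):
    add_child(store, "/assets", "asset%05d" % i)
names, buckets_read = list_children(store, "/assets")
print(len(names), buckets_read, flat_kb, round(bucketed_kb, 1))
```

Under these assumptions a commit shrinks from roughly 960 kB to a few
tens of kB, at the price of reading up to 30 buckets for every listing.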

If we could remove the need to hash directories, then the lookup on the 
root node would be much faster and interactive users would be happier.

2) I would like to ensure that the new backend supports multiple 
modifications to the same node.  I don't know if this was designed into 
the current backend, but given I expose svn_fs.h over RPC, clients can 
make any one or multiple modifications to the tree, so the new backend 
should support this.

And while we're discussing wants.

3) Pools are painful to use.  We have repository, revision and 
transaction C++ objects stored in an LRU cache.  They cache revision and 
transaction roots for improved performance.  Using the wrong pool for an
RPC method can cause memory leaks (we just found one on Monday that caused a
backend server to run out of memory).  Constructing and destroying pools 
in the wrong order can cause the process to crash.  This is hard to get 
right, so using a different model would be very useful.  I haven't had 
the cycles to look at Hyrum's new C++ object and see how that would help.
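
To make the two failure modes concrete, here is a toy model of pool
lifetimes.  It is not APR or the svn API: Pool, rpc_leaky() and
rpc_scoped() are hypothetical names, and a RuntimeError stands in for
what would be a crash in C.  It only illustrates that allocating
per-call temporaries in a long-lived pool grows without bound, and that
destroying a subpool after its parent is an ordering error.

```python
class Pool:
    """Toy stand-in for a pool: allocations and subpools die together."""

    def __init__(self, parent=None):
        if parent is not None and not parent.alive:
            raise RuntimeError("parent pool already destroyed")
        self.parent = parent
        self.children = []
        self.allocations = []
        self.alive = True
        if parent is not None:
            parent.children.append(self)

    def alloc(self, obj):
        if not self.alive:
            raise RuntimeError("allocation from destroyed pool")
        self.allocations.append(obj)
        return obj

    def destroy(self):
        if not self.alive:
            # In C this would be a double free / crash, not an exception.
            raise RuntimeError("pool destroyed twice")
        for child in list(self.children):
            child.destroy()
        self.allocations.clear()
        self.alive = False
        if self.parent is not None:
            self.parent.children.remove(self)

def rpc_leaky(cache_pool, n):
    """Wrong pool: temporaries live as long as the cached object."""
    for i in range(n):
        cache_pool.alloc(("tmp", i))

def rpc_scoped(cache_pool, n):
    """Per-call scratch subpool, destroyed before returning."""
    scratch = Pool(cache_pool)
    try:
        for i in range(n):
            scratch.alloc(("tmp", i))
    finally:
        scratch.destroy()

leaky_pool, scoped_pool = Pool(), Pool()
for _ in range(1000):
    rpc_leaky(leaky_pool, 10)
    rpc_scoped(scoped_pool, 10)
print(len(leaky_pool.allocations), len(scoped_pool.allocations))  # prints: 10000 0

# Wrong destruction order: the parent already destroyed the child.
parent = Pool()
child = Pool(parent)
parent.destroy()
try:
    child.destroy()
    wrong_order_detected = False
except RuntimeError:
    wrong_order_detected = True
```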

Blair