From: Amit Jain <amit.j76@gmail.com>
Date: Wed, 30 Oct 2013 16:04:50 +0530
Subject: Re: Strategies around storing blobs in Mongo
To: oak-dev@jackrabbit.apache.org

>> So even adding a 2 MB chunk on a sharded system over a remote
>> connection would block reads for that complete duration. So at a
>> minimum we should be avoiding that.

I guess if there are read replicas in the shard replica set, that will
mitigate the effect to some extent.

On Wed, Oct 30, 2013 at 3:04 PM, Chetan Mehrotra wrote:
> > sounds reasonable. what is the impact of such a design when it comes
> > to map-reduce features? I was thinking that we could use it e.g. for
> > garbage collection, but I don't know if this is still an option when data
> > is spread across multiple databases.
>
> Will investigate that aspect further.
>
> > connecting to a second server would add quite some complexity to
> Yup.
> Option was just provided for completeness' sake; something
> like this would probably never be required.
>
> > that was one of my initial thoughts as well, but I was wondering what
> > the impact of such a deployment is on data store garbage collection.
>
> Probably we can make a shadow node for the binary in the blobs
> collection and keep the binary content within the DataStore itself.
> Garbage collection would be performed on the shadow nodes, and the
> logic would use the results from that to perform the actual deletions.
>
> Chetan Mehrotra
>
> On Wed, Oct 30, 2013 at 1:13 PM, Marcel Reutegger wrote:
> > Hi,
> >
> >> Currently we are storing blobs by breaking them into small chunks and
> >> then storing those chunks in MongoDB as part of the blobs collection.
> >> This approach would cause issues, as Mongo maintains a global exclusive
> >> write lock at the database level [1]. So even writing multiple small
> >> chunks of, say, 2 MB each would lead to write lock contention.
> >
> > so far we observed high lock contention primarily when there are a lot of
> > updates. inserts were not that big of a problem, because you can batch
> > them. it would probably be good to have a test to see how big the
> > impact is when blobs come into play.
> >
> >> Mongo also provides GridFS [2]. However, it uses a similar chunking
> >> strategy to the one we are currently using, and the support is built
> >> into the driver; to the server the chunks are just collection entries.
> >>
> >> So to minimize write lock contention for use cases where big assets
> >> are being stored in Oak, we can opt for the following strategies:
> >>
> >> 1. Store the blobs collection in a different database. As Mongo write
> >> locks [1] are taken at the db level, storing the blobs in a different
> >> db would allow reads/writes of node data (the majority use case) to
> >> continue.
> >
> > sounds reasonable. what is the impact of such a design when it comes
> > to map-reduce features? I was thinking that we could use it e.g. for
> > garbage collection, but I don't know if this is still an option when data
> > is spread across multiple databases.
> >
> >> 2. For more asset/binary-heavy use cases, use a separate database
> >> server to serve the binaries.
> >
> > connecting to a second server would add quite some complexity to
> > the system. wouldn't it be easier to just leverage standard mongodb
> > sharding to distribute the load?
> >
> >> 3. Bring back the JR2 DataStore implementation and just save the
> >> metadata related to binaries in Mongo. We already have an S3 based
> >> implementation there, and it would continue to work with Oak as well.
> >
> > that was one of my initial thoughts as well, but I was wondering what
> > the impact of such a deployment is on data store garbage collection.
> >
> > regards
> > marcel
> >
> >> Chetan Mehrotra
> >> [1] http://docs.mongodb.org/manual/faq/concurrency/#how-granular-are-locks-in-mongodb
> >> [2] http://docs.mongodb.org/manual/core/gridfs/
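
To make option 1 above a bit more concrete, here is a minimal sketch of
a blob store that writes chunks into a database separate from the node
data, using the 2.x Java driver. The database/collection names ("oak",
"oak-blobs"), the SHA-1 based ids and the 2 MB chunk size are only
illustrative assumptions, not what DocumentMK actually does:

import java.security.MessageDigest;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

/**
 * Sketch of option 1: keep the blob chunks in a database separate
 * from the node data, so that the per-database write lock taken by
 * chunk inserts does not block node reads/writes.
 */
public class SeparateDbBlobStore {

    private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2 MB chunks

    private final DBCollection nodes; // kept here just to show the split
    private final DBCollection blobs;

    public SeparateDbBlobStore(MongoClient client) {
        DB nodeDb = client.getDB("oak");       // node data (majority use case)
        DB blobDb = client.getDB("oak-blobs"); // chunks behind a different db lock
        this.nodes = nodeDb.getCollection("nodes");
        this.blobs = blobDb.getCollection("blobs");
    }

    /** Splits the binary into chunks and inserts them into the blob db. */
    public String writeBlob(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        String blobId = toHex(md.digest(data));
        for (int i = 0, seq = 0; i < data.length; i += CHUNK_SIZE, seq++) {
            int len = Math.min(CHUNK_SIZE, data.length - i);
            byte[] chunk = new byte[len];
            System.arraycopy(data, i, chunk, 0, len);
            // each insert only takes the write lock on "oak-blobs",
            // leaving "oak" free for node operations
            blobs.insert(new BasicDBObject("_id", blobId + "#" + seq)
                    .append("data", chunk));
        }
        return blobId;
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}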
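
And a rough sketch of how garbage collection over the shadow nodes
could work as a mark-and-sweep on the blobs collection. The
"lastMarked" field and the ExternalStore interface are hypothetical
placeholders for the real DataStore integration; this assumes every
shadow document gets a lastMarked timestamp when it is created, so
never-referenced binaries also age out:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

/**
 * Sketch of the shadow-node idea: the blobs collection holds one
 * small reference document per binary, while the content itself
 * lives in an external (JR2-style) data store. GC marks the shadow
 * documents still referenced from node data and sweeps the rest.
 */
public class ShadowNodeGc {

    /** Hypothetical stand-in for the backing store (e.g. FS or S3). */
    public interface ExternalStore {
        void delete(String blobId);
    }

    private final DBCollection shadows;
    private final ExternalStore store;

    public ShadowNodeGc(DBCollection shadows, ExternalStore store) {
        this.shadows = shadows;
        this.store = store;
    }

    /** Mark phase: called for every blob reference found in node data. */
    public void mark(String blobId, long gcStartTime) {
        shadows.update(new BasicDBObject("_id", blobId),
                new BasicDBObject("$set",
                        new BasicDBObject("lastMarked", gcStartTime)));
    }

    /** Sweep phase: delete shadow docs (and binaries) not marked in this run. */
    public void sweep(long gcStartTime) {
        DBObject unreferenced = new BasicDBObject("lastMarked",
                new BasicDBObject("$lt", gcStartTime));
        DBCursor cursor = shadows.find(unreferenced);
        try {
            while (cursor.hasNext()) {
                String blobId = (String) cursor.next().get("_id");
                store.delete(blobId); // remove the actual binary first
                shadows.remove(new BasicDBObject("_id", blobId));
            }
        } finally {
            cursor.close();
        }
    }
}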