jackrabbit-dev mailing list archives

From "Pablo Rios" <pr...@bea.com>
Subject RE: [jira] Resolved: (JCR-926) Global data store for binaries
Date Wed, 03 Oct 2007 16:04:50 GMT
Hi Thomas,

How does the GC handle the scenario below, described by Esteban?

[time]   [user session]             [GC session]
t0       node.setProperty(binary)
t1                                  gc.start
t2                                  gc.stop
t3       node.save

In this scenario the GC will delete the binary content created at t0 (t0
< t1) and the repository will have a binary property persisted at t3
with its value deleted.

It seems that the GC needs more information from the Data Store to do
its job. One approach may be to distinguish "transient" and "persisted"
binary content, so that the GC doesn't delete "transient" binary
content. In the above scenario the binary content would be created in
the "transient" state (at t0), and when the property is persisted it
would be changed to the "persisted" state (at t3). The GC can use
observation to be notified of property creation, as it currently does
to detect when nodes are moved during the GC scan. We could use a mark
file for the binary content state in the FileDataStore implementation,
and an additional column in the binary content table for the
DatabaseDataStore.
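A minimal sketch of the mark-file idea for the FileDataStore (all names
here, like markTransient/markPersisted, are invented for illustration;
this is not actual Jackrabbit API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed mark-file scheme for the FileDataStore.
// All names are invented for illustration.
public class TransientMarkSketch {

    // At t0 (record created): put "<file>.transient" next to the data file.
    public static void markTransient(Path dataFile) throws IOException {
        Files.createFile(markerFor(dataFile));
    }

    // At t3 (property persisted, e.g. seen via a PROPERTY_ADDED event):
    // remove the marker so normal GC rules apply again.
    public static void markPersisted(Path dataFile) throws IOException {
        Files.deleteIfExists(markerFor(dataFile));
    }

    // The GC skips any record that still carries a marker, even if its
    // modified time is older than the GC cut point.
    public static boolean isTransient(Path dataFile) {
        return Files.exists(markerFor(dataFile));
    }

    private static Path markerFor(Path dataFile) {
        return dataFile.resolveSibling(dataFile.getFileName() + ".transient");
    }

    public static boolean demo() {
        try {
            Path data = Files.createTempFile("record", ".bin");
            markTransient(data);
            boolean skippedByGc = isTransient(data);   // true at t0..t2
            markPersisted(data);
            boolean normalRules = !isTransient(data);  // true from t3 on
            return skippedByGc && normalRules;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```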

Does that make sense?
What is the overhead of a PROPERTY_ADDED + PROPERTY_CHANGED listener in
Jackrabbit?

Regards,
Pablo


-----Original Message-----
From: Esteban Franqueiro [mailto:efranque@bea.com] 
Sent: Tuesday, October 02, 2007 4:12 PM
To: dev@jackrabbit.apache.org
Subject: RE: [jira] Resolved: (JCR-926) Global data store for binaries

> Hi,
> 
> > > What about RepositoryException?
> > Yes, that would work too. But we wanted to be able to identify the
> > specific exception thrown from the DS. In a few places we wrapped
> > DSE inside a RE.

Yes, I changed that.

> What about if DataStoreException extends RepositoryException? 
> Like that you don't need to catch the exception and re-throw 
> a different one; but identifying is possible.
> 
> > > What about updating the time when getLength() is called?
> >
> > Sorry, I don't understand this.
> 
> Currently the modified date/time is updated when
> modifiedDateOnRead is set and getStream() is called. You said
> you want to avoid calling getStream().close(). So I suggest
> updating the modified date/time when modifiedDateOnRead is
> set and getLength() is called.
> 
> > > There is already DataStore.updateModifiedDateOnRead(long
> > > before); a separate method is not required in my view.
> > This didn't work in our testing.
> 
> Sorry, what does not work, and why?
> 
> > The FileDataRecord always queries the file object for its
> > properties.
> 
> There is only getLength and getStream. Caching the length is possible.
> The length could be part of the identifier. For example, you 
> could do that for the DatabaseDataStore (where accessing this 
> information would be slow). For the FileDataStore, I think it 
> is not required as new
> File(..).length() is quite fast; but I may be wrong. So 
> basically the DataIdentifier for DatabaseDataStore could be 
> the digest plus the length.

For the database data store we're not using digests as IDs; we're using
UUIDs. The thing about digests is that you have to get in between the
sender and the server. Initially we used something like the FileDataStore
to store the binary, compute the digest, and then upload it to the DB.
Then we decided to store directly in the DB, so we couldn't use digests
anymore.
Certainly, we could compute the digests between the client and the
server, but we would need a two-step upload process, since after
computing the digest we would have to update the row to insert it there.
Using UUIDs is much easier since they don't depend on the content. Yes,
I know you lose the de-duplication property, but it was something we
could live with at the time. Anyway, if we can come up with another way,
that would be fine for us.
Also, storing the binary in the filesystem for digest computation,
instead of storing directly in the DB, has some impact on performance.
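For what it's worth, the digest can also be accumulated in the same pass
that streams the content to the database, using a DigestInputStream. The
digest is only known at end-of-stream, so the row would still need a
temporary key that gets updated afterwards (the two-step process
mentioned above). A sketch, with the DB write simulated:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: accumulate a SHA-1 digest while streaming content, in one pass.
// The actual write to the BLOB column is only simulated here.
public class StreamingDigestSketch {

    public static String digestWhileUploading(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (DigestInputStream din = new DigestInputStream(in, md)) {
            byte[] buf = new byte[8192];
            // A real DatabaseDataStore would stream buf into the row here;
            // the digest accumulates as a side effect of reading.
            while (din.read(buf) != -1) { /* write buf to the BLOB */ }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString(); // the row can then be re-keyed by this digest
    }

    public static boolean demo() {
        try {
            String a = digestWhileUploading(
                    new ByteArrayInputStream("hello".getBytes("UTF-8")));
            String b = digestWhileUploading(
                    new ByteArrayInputStream("hello".getBytes("UTF-8")));
            // SHA-1 is 20 bytes -> 40 hex chars, and content-determined
            return a.length() == 40 && a.equals(b);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```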

> > Well, we found that it is necessary, since the GC runs on a different
> > session than the users' (we're using system sessions for this).
> 
> For the data store, it doesn't matter what session created or 
> accessed the data record. Also it doesn't matter when save() 
> was called - the data record is stored before that. As long 

Yes, the data record is stored, but the nodes referencing that record
are not.

> as the GC has access to all nodes it will touch all 'not 
> recently used' data records. New data records (that were 
> created after the GC was started) will have a more recent 
> modified date / time and therefore the GC will not delete them.
> There is probably a misunderstanding somewhere. It would help 
> if you could post your current source code, maybe there is a 
> bug. Did you update the modified date / time in addRecord() 
> when the object already exists, as done in the file data store?

Yes, perhaps there's a misunderstanding. I will send you my code, but in
the meantime, let me explain what I mean.
The naive GC reads all the nodes in the workspace to mark them (i.e.,
update their access time). So if the user is uploading binaries in one
session and the GC is iterating the nodes in another, then until the
user calls save(), the GC session won't see the changes in persistent
storage. The modified time of the binaries may fall before or after the
GC's cut point, depending on timing.
If it falls before, then when the GC does a deleteAllOlderThan(), those
unmarked binaries will be deleted (they're unmarked because the nodes
that reference them could not be read, because they don't exist yet).
I don't know if using the new PM interface will change any of this.
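To make the race concrete, here is a toy in-memory model of it (all
names are invented for illustration; this is not Jackrabbit code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the mark-and-sweep race: a record written before the GC
// starts, but only referenced by a node saved after the GC finishes,
// gets swept.
public class GcRaceSketch {

    final Map<String, Long> store = new HashMap<>();   // record id -> modified time
    final Set<String> persistedRefs = new HashSet<>(); // ids visible to the GC scan

    void addRecord(String id, long now) { store.put(id, now); } // t0
    void save(String id) { persistedRefs.add(id); }             // t3

    void runGc(long gcStart, long now) {
        // Mark: touch every record reachable from *persisted* nodes.
        for (String id : persistedRefs) {
            if (store.containsKey(id)) store.put(id, now);
        }
        // Sweep: deleteAllOlderThan(gcStart).
        store.values().removeIf(t -> t < gcStart);
    }

    public static boolean demo() {
        GcRaceSketch r = new GcRaceSketch();
        r.addRecord("binary1", 0L); // t0: setProperty writes the record
        r.runGc(1L, 2L);            // t1-t2: GC runs before save()
        r.save("binary1");          // t3: node.save() - the value is already gone
        return !r.store.containsKey("binary1");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true: the persisted property lost its value
    }
}
```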

> > So, the user adds a binary property in one session; after the file is
> > uploaded but before the user save()s the session, the GC on the system
> > session starts reading all nodes from the workspace. Since the changes
> > are not yet written to persistent storage, the file is assumed to be a
> > deleted property, and is in fact deleted.
> 
> Is this using the database data store or the file data store? 
> If this is the case, it would be a bug.

Sorry, but I don't see the difference.

> > We needed to make RepositoryImpl.getWorkspaceNames() public for this
> > to work.
> 
> Yes, that would be good. Did you have a look at
> PersistenceManagerIteratorTest.testGetAllNodeIds()?
> getWorkspaceNames is called there using reflection:
> RepositoryImpl.class.getDeclaredMethod("getWorkspaceNames",
> new Class[0]).setAccessible(true).

Yes, I've seen it. But it would be easier to just make them public, or
at least export some of those methods through a utility class.
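For reference, the reflection workaround looks roughly like this
(RepositoryStub is a stand-in for RepositoryImpl so the sketch is
self-contained):

```java
import java.lang.reflect.Method;

// Sketch of calling a non-public getWorkspaceNames() via reflection,
// in the style of PersistenceManagerIteratorTest. RepositoryStub is a
// stand-in for RepositoryImpl.
public class ReflectionSketch {

    static class RepositoryStub {
        // package-private, like the unpatched RepositoryImpl method
        String[] getWorkspaceNames() {
            return new String[] { "default", "security" };
        }
    }

    public static String[] callGetWorkspaceNames(Object repo) throws Exception {
        Method m = repo.getClass()
                .getDeclaredMethod("getWorkspaceNames", new Class[0]);
        m.setAccessible(true); // bypass the access check
        return (String[]) m.invoke(repo, new Object[0]);
    }

    public static boolean demo() {
        try {
            String[] names = callGetWorkspaceNames(new RepositoryStub());
            return names.length == 2 && names[0].equals("default");
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```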

> > Will this change have any effect on the issue I just mentioned?
> 
> I don't know - first I need to understand the issue...
> 
> > Ok then, our current implementation is against a patched-up 1.3.1. 
> > What do you think is the best way to isolate the code?

Maybe we should discuss further after you get the code.
Since the copy of Jackrabbit I'm working on has a lot of patches, it
would be pretty complicated to isolate those changes in a patch. Do you
mind if I send you a zip file with the implementations of the
interfaces, the tests, the configurations and parsers, and the
initialization routines?

Regards,

Esteban Franqueiro
esteban.franqueiro@bea.com

Notice:  This email message, together with any attachments, may contain
information  of  BEA Systems,  Inc.,  its subsidiaries  and  affiliated
entities,  that may be confidential,  proprietary,  copyrighted  and/or
legally privileged, and is intended solely for the use of the individual
or entity named in this message. If you are not the intended recipient,
and have received this message in error, please immediately return this
by email and then delete it.

