jackrabbit-dev mailing list archives

From "Thomas Mueller" <thomas.tom.muel...@gmail.com>
Subject Re: [jira] Resolved: (JCR-926) Global data store for binaries
Date Tue, 09 Oct 2007 08:29:42 GMT
Hi,

I'm sorry about the delay. I was away last week.

> What about if DataStoreException extends RepositoryException?
> What about updating the time when getLength() is called?

Are you OK with that? If yes I will change it.

> For the database data store we're not using digests as ID's. We're using
> UUID's.

That's technically OK, but I think it can be avoided using
DigestInputStream (see below).

> The thing about digests is that you have to get in the way
> between the sender and the server.

Yes, that's true.

> Initially we used something like the
> FileDS to store the binary, get the digest and then upload it to the DB.

That's a problem, of course, and it must be avoided.

> Then we decided to store directly in the DB,

Sure, that's much faster.

> so we couldn't use digests anymore.

I think there is still a way to get the digest. If you wrap the
InputStream like this:

public DataRecord addRecord(InputStream input)
    throws IOException, NoSuchAlgorithmException {
  MessageDigest digest = MessageDigest.getInstance(DIGEST);
  // Wrap the stream so the digest is updated as the data is read
  InputStream in = new DigestInputStream(input, digest);
  ...
}
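
To make the idea concrete, here is a minimal, self-contained sketch of computing the digest while the data is being copied to its destination, so the content is read only once. The class and method names are made up for the example, the OutputStream is only a stand-in for wherever the data store writes the binary, and SHA-1 is an assumed value for DIGEST:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestWhileStoring {

    /**
     * Copies the input to the sink in a single pass and returns the
     * hex-encoded digest of everything that was read.
     * "SHA-1" is an assumption; the original DIGEST constant is not shown.
     */
    public static String storeAndDigest(InputStream input, OutputStream sink)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        // Every byte read through the wrapper also updates the digest.
        try (DigestInputStream in = new DigestInputStream(input, digest)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        }
        // The digest is only complete once the stream has been fully consumed.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The point is that the digest becomes available only after the last byte has been written, which is why the record cannot be stored under its final key from the start.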

> Certainly, we could compute the digests in between the client and the
> server, but we would need a two step upload process since after
> computing the digest we would have to update the row to insert it there.

A two-step process must be avoided, of course. I hope the
DigestInputStream approach can solve that.

> Using UUIDs is much easier since they don't depend on the content.

Sure. What about (pseudo code):

Example schema:
CREATE TABLE DATASTORE(
  KEY BINARY(..) PRIMARY KEY,
  TEMP BIT,
  LENGTH BIGINT,
  DATA BLOB
);

while(true) {
  randomId = secureRandom.nextBytes(..);
  try {
    INSERT INTO DATASTORE(KEY, TEMP) VALUES(randomId, TRUE);
    break;
  } catch(UniqueKeyException ) {
    // very, very rare: retry with a new random id
  }
}
UPDATE DATASTORE
  SET DATA=DigestInputStream(..)
  WHERE KEY=randomId
digest = ...
try {
  updateCount = UPDATE DATASTORE SET
    KEY=digest, TEMP=FALSE
    WHERE KEY=randomId AND TEMP=TRUE
    AND NOT EXISTS(
      SELECT KEY FROM DATASTORE
      WHERE KEY=digest
      AND TEMP=FALSE AND LENGTH=...
    )
} catch(duplicate key) {
  throw exception("duplicate key with different length");
}
if(updateCount == 0) {
  DELETE FROM DATASTORE WHERE KEY=randomId AND TEMP=true
}
return digest;

I hope you get the idea. The second UPDATE statement will return 0
update count if a record with the same digest exists. I didn't set the
LENGTH everywhere.
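
As a rough illustration of the same flow in Java, here is a small sketch in which an in-memory map stands in for the DATASTORE table. The class name, the "temp:" key prefix, and SHA-1 are assumptions made for the example; a real database data store would issue the INSERT / UPDATE / DELETE statements instead of map operations:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InMemoryDataStoreSketch {
    // Stand-in for the DATASTORE table: key -> data
    private final Map<String, byte[]> rows = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    public String addRecord(InputStream input)
            throws IOException, NoSuchAlgorithmException {
        // Step 1: reserve a row under a random temporary key; retry on collision.
        String tempKey;
        do {
            byte[] id = new byte[16];
            random.nextBytes(id);
            tempKey = "temp:" + toHex(id);
        } while (rows.putIfAbsent(tempKey, new byte[0]) != null);

        // Step 2: store the data and compute the digest in one pass.
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        try (InputStream in = new DigestInputStream(input, md)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                data.write(buf, 0, n);
            }
        }
        byte[] bytes = data.toByteArray();
        rows.put(tempKey, bytes);
        String digestKey = toHex(md.digest());

        // Step 3: move the row to its digest key unless that key already exists,
        // then drop the temporary row either way.
        byte[] existing = rows.putIfAbsent(digestKey, bytes);
        rows.remove(tempKey);
        if (existing != null && existing.length != bytes.length) {
            throw new IllegalStateException("duplicate key with different length");
        }
        return digestKey;
    }

    public int size() {
        return rows.size();
    }

    private static String toHex(byte[] value) {
        StringBuilder hex = new StringBuilder();
        for (byte b : value) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Adding the same content twice returns the same key and leaves a single row, which is the deduplication the second UPDATE is meant to provide.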

> [time]     [user session]                       [GC session]
> t0         node.setProperty(binary)
> t1                                        gc.start
> t2                                        gc.stop
> t3         node.save

Is this the problem? I didn't think about that so far... In my view it
is rare because the garbage collection usually will take some time,
and the time between node.setProperty and node.save is (hopefully)
short. But it needs to be solved. I will write a test case. A simple
solution is to only delete records when the repository is stopped (or
started). Obviously this is not a solution for long running
repositories. Another idea is to keep transient large binaries in a
WeakReferenceHashMap, and before deleting check that the record is not
in there.
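
A minimal sketch of that last idea, using java.util.WeakHashMap as a stand-in for the "WeakReferenceHashMap" mentioned above. The class names and the RecordId type are hypothetical and not part of Jackrabbit; the sketch only shows the check the garbage collector would make before deleting a record:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class TransientRecordGuard {

    /** Hypothetical identifier type; a session keeps a strong reference to it. */
    public static final class RecordId {
        public final String value;

        public RecordId(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof RecordId && ((RecordId) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }
    }

    // Keys are held weakly: an entry can disappear once no session
    // references the identifier object any more.
    private final Map<RecordId, Boolean> inUse =
            Collections.synchronizedMap(new WeakHashMap<RecordId, Boolean>());

    /** Called when a transient binary is created (e.g. on setProperty). */
    public void markInUse(RecordId id) {
        inUse.put(id, Boolean.TRUE);
    }

    /** Called by the data store garbage collector before deleting a record. */
    public boolean mayDelete(RecordId id) {
        return !inUse.containsKey(id);
    }
}
```

As long as the session still holds its RecordId, mayDelete returns false for that record, so a garbage collection run between setProperty and save would skip it.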

> > > We needed to make RepositoryImpl.getWorkspaceNames() public
> it would be easier to just make them public.
> Or at least export some of those methods thru an utils class.

I will make it public.

> Do you mind
> if I send you a zip file with the implementations of the interfaces, the
> tests, the configurations and parsers, and the initialization routines?

There is no hurry, but please don't send it via email. The preferred
way is to attach the code to the bug:
http://issues.apache.org/jira/browse/JCR-1154 (you will be asked to
'Grant license to ASF for inclusion in ASF works').

Thanks,
Thomas
