chemistry-dev mailing list archives

From Sascha Homeier <shome...@meyle-mueller.de>
Subject RE: document 'uniqueness'
Date Sun, 22 Jun 2014 09:37:26 GMT
Hi Tim,

you said you need to migrate the documents from FileNet to a CMIS-compliant server.
Is the CMIS-compliant server your own implementation?
If so, you could calculate a hash such as MD5 over the content stream and set it as the object
ID.
According to the CMIS spec, this object ID needs to be unique. So it must be ensured that no two
objects with the same object ID exist in the same CMIS repository, which, with a hash-based ID,
is equivalent to ensuring that no two objects have the same content stream.
This approach would also prevent identical documents from being added in the future, after the
migration is done.
Nevertheless, you still need a performant way of determining whether an object with a given
ID already exists (and a solution for the case where the hash changes only because of a timestamp
inside the content stream, etc.).
With about two million objects you may need to add RAM to the migration machine to
keep that many objects in memory and compare them using HashMaps and Hashtables with your own
implementations of equals() and hashCode() ;)
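
To make the hashing idea concrete, here is a minimal sketch, assuming the content streams are
available as java.io.InputStream and that the digests fit into the heap of the migration process.
The class ContentHashDeduplicator and its methods are hypothetical names for illustration only;
they are not part of the CMIS spec or of any OpenCMIS API. It only shows how a digest could be
computed and checked in memory, not how a server would use it as an object ID.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: tracks MD5 digests of content streams seen so far. */
public class ContentHashDeduplicator {

    // Hex digests of every content stream processed so far (kept in memory).
    private final Set<String> seenDigests = new HashSet<>();

    /** Streams the content through MD5 and returns the digest as lowercase hex. */
    public static String md5Hex(InputStream content) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = content.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true if this content was not seen before, and records its digest. */
    public boolean markIfNew(InputStream content) throws IOException {
        return seenDigests.add(md5Hex(content));
    }

    /** Convenience overload for content that is already in memory. */
    public boolean markIfNew(byte[] content) throws IOException {
        return markIfNew(new ByteArrayInputStream(content));
    }
}
```

Storing only the hex digest as a String means the default equals() and hashCode() of String are
enough for the HashSet. Note that an in-memory set is lost on a JVM crash, so it would have to be
rebuilt or persisted to survive a restart.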

Anyway, a stimulating task. I'm curious what ideas others here have for solving it in a performant
way ;)

Cheers
Sascha

-----Original Message-----
From: Tim Webster [mailto:tim.webster@gmail.com]
Sent: Saturday, 21 June 2014 17:55
To: dev@chemistry.apache.org
Subject: Re: document 'uniqueness'

Hello,

yes, thanks for the suggestion - it sort of does that already with the Spring Batch progress
tracking, but that still won't prevent a document being added to the repository that is
identical to a previous one if the job somehow fails - like a JVM crash or power failure.  Because
there is no transaction management for the CMIS part, you can't really ensure this, except
with a constraint in the repository itself.

Anyway, yeah I think you're right and I need to look at FileNet specifically.  I just wasn't
sure if I missed something and there was something in the CMIS spec that I could use (e.g.
some property or something).

Thanks,

Tim



On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca> wrote:

> I'm sure you've already thought of this, but couldn't your migration 
> process just persist the legacy ids in a separate location (e.g. 
> database table, possibly cached in memory for performance)? Then you 
> would just need to check that for each document being migrated, to 
> make sure that the same doc hasn't been seen previously.
>
> Not a CMIS related solution, but seems like it would work fine...
>
> The other option, as you suggest, is to see if FileNet supports a 
> 'uniqueness' constraint for custom metadata properties. I believe 
> Sharepoint does but not sure about FileNet.
>
> Thanks
> michael lucas  |  Senior Software Developer  |  Great-West Life
>
>
> -----Original Message-----
> From: Tim Webster [mailto:tim.webster@gmail.com]
> Sent: June 20, 2014 8:15 AM
> To: dev@chemistry.apache.org
> Subject: document 'uniqueness'
>
> Hi,
>
> I am developing a migration process (using Spring Batch) to migrate 
> documents from a legacy CMS into a CMIS-compliant system, and I need 
> to ensure that duplicate documents are not created accidentally.
>
> However, our CMIS system (IBM FileNet) allows the addition of 
> documents with the same name.  Documents with identical values for 
> cmis:name or cmis:contentStreamFilename are allowed.  Even if this 
> could be disabled (I don't know if it can or cannot), it is a business 
> requirement and I wouldn't be able to.
>
> The only thing I can think of to prevent this is to save the 'legacy' 
> ID of the document in a new CMIS property and somehow check that it 
> doesn't already exist when adding a new document. However this will be 
> very inefficient and slow down the migration (we're talking about up 
> to 2 million documents).
>
> Ideally the 'uniqueness constraint' would be checked on the server and 
> would throw an exception, which I could then deal with.
>
> Does anyone know of an easier way to do this, or is there anything I 
> can make use of in the CMIS spec to help?
>
> Thanks,
>