chemistry-dev mailing list archives

From Ron DiFrango <rdifra...@captechconsulting.com>
Subject Re: AW: document 'uniqueness'
Date Sun, 22 Jun 2014 14:23:20 GMT
Tim,

The suggestion below from Sascha is a good one.  The other approach I've
taken before is to search the repo for a given document and insert it
only if it does not exist; otherwise, perform an update or just log it
as an "error".
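
A minimal sketch of that search-then-insert flow, for illustration only.
DocumentStore, InMemoryStore and migrate() are hypothetical names, not part
of CMIS or OpenCMIS; with a real server the lookup would presumably be a
query on a custom property holding the legacy ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class UpsertSketch {

    /** Hypothetical minimal view of a repository; not a CMIS or OpenCMIS API. */
    interface DocumentStore {
        Optional<String> findIdByLegacyId(String legacyId);
        String insert(String legacyId, byte[] content);
        void update(String objectId, byte[] content);
    }

    enum Outcome { INSERTED, UPDATED }

    /** Search first; insert only if nothing is found, otherwise update. */
    static Outcome migrate(DocumentStore store, String legacyId, byte[] content) {
        Optional<String> existing = store.findIdByLegacyId(legacyId);
        if (existing.isPresent()) {
            // Alternatively, log this as an "error" instead of updating.
            store.update(existing.get(), content);
            return Outcome.UPDATED;
        }
        store.insert(legacyId, content);
        return Outcome.INSERTED;
    }

    /** In-memory stand-in for the repository (assumption, for the sketch only). */
    static class InMemoryStore implements DocumentStore {
        private final Map<String, String> idByLegacyId = new HashMap<>();
        private final Map<String, byte[]> contentById = new HashMap<>();
        private int nextId = 1;

        public Optional<String> findIdByLegacyId(String legacyId) {
            return Optional.ofNullable(idByLegacyId.get(legacyId));
        }

        public String insert(String legacyId, byte[] content) {
            String objectId = "obj-" + nextId++;
            idByLegacyId.put(legacyId, objectId);
            contentById.put(objectId, content);
            return objectId;
        }

        public void update(String objectId, byte[] content) {
            contentById.put(objectId, content);
        }

        int size() {
            return contentById.size();
        }
    }
}
```

Note that the check and the insert are two separate calls, so this does not
by itself cover a crash between them; it only makes a re-run safe as long as
the lookup sees what was written before.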

Thanks,

Ron DiFrango       
Director / Architect  |  CapTech



On 6/22/14, 5:37 AM, "Sascha Homeier" <shomeier@meyle-mueller.de> wrote:

>Hi Tim,
>
>you said you need to migrate the documents from FileNet to a CMIS
>compliant server.
>Is the CMIS compliant server your implementation?
>If so, you could calculate a hash such as MD5 over the content stream and
>set it as the object ID.
>According to the CMIS spec, this object ID needs to be unique. So it must
>be ensured that no two objects with the same object ID exist in the same
>CMIS repository, which in this scheme is equivalent to having no two
>objects with the same content stream.
>This approach would also prevent identical documents from being added in
>the future, after the migration is done.
>Nevertheless, here you also need to find a performant way of determining
>whether an object with a given ID already exists (and find a solution for
>the case where the hash changes only because of a timestamp inside the
>content stream, etc.).
>With about two million objects you may need to extend the RAM on the
>migration machine to keep that many objects in memory and compare them
>using HashMaps and Hashtables with your own implementations of equals()
>and hashCode() ;)
>
>Anyway a stimulating task. I'm curious about the ideas of others here to
>solve it in a performant way ;)
>
>Cheers
>Sascha
>
>-----Original Message-----
>From: Tim Webster [mailto:tim.webster@gmail.com]
>Sent: Saturday, June 21, 2014 17:55
>To: dev@chemistry.apache.org
>Subject: Re: document 'uniqueness'
>
>Hello,
>
>yes thanks for the suggestion - it sort of does that already with the
>Spring Batch progress tracking, but it still won't prevent another
>document being added to the repository that is identical to a previous
>one if it somehow failed - like a JVM crash or power failure.  Because
>there is no transaction management for the CMIS part, you can't really
>ensure this, except for a constraint in the repository itself.
>
>Anyway, yeah I think you're right and I need to look at FileNet
>specifically.  I just wasn't sure if I missed something and there was
>something in the CMIS spec that I could use (e.g. some property or
>something).
>
>Thanks,
>
>Tim
>
>
>
>On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca> wrote:
>
>> I'm sure you've already thought of this, but couldn't your migration
>> process just persist the legacy ids in a separate location (e.g.
>> database table, possibly cached in memory for performance)? Then you
>> would just need to check that for each document being migrated, to
>> make sure that the same doc hasn't been seen previously.
>>
>> Not a CMIS related solution, but seems like it would work fine...
>>
>> The other option, as you suggest, is to see if FileNet supports a
>> 'uniqueness' constraint for custom metadata properties. I believe
>> Sharepoint does but not sure about FileNet.
>>
>> Thanks
>> michael lucas  |  Senior Software Developer  |  Great-West Life
>>
>>
>> -----Original Message-----
>> From: Tim Webster [mailto:tim.webster@gmail.com]
>> Sent: June 20, 2014 8:15 AM
>> To: dev@chemistry.apache.org
>> Subject: document 'uniqueness'
>>
>> Hi,
>>
>> I am developing a migration process (using Spring Batch) to migrate
>> documents from a legacy CMS into a CMIS-compliant system, and I need
>> to ensure that duplicate documents are not created accidentally.
>>
>> However, our CMIS system (IBM FileNet) allows the addition of
>> documents with the same name.  Documents with identical values for
>> cmis:name or cmis:contentStreamFilename are allowed.  Even if this
>> could be disabled (I don't know whether it can), allowing duplicate
>> names is a business requirement, so I wouldn't be able to disable it.
>>
>> The only thing I can think of to prevent this is to save the 'legacy'
>> ID of the document in a new CMIS property and somehow check that it
>> doesn't already exist when adding a new document. However this will be
>> very inefficient and slow down the migration (we're talking about up
>> to 2 million documents).
>>
>> Ideally the 'uniqueness constraint' would be checked on the server and
>> would throw an exception, which I could then deal with.
>>
>> Does anyone know of an easier way to do this, or is there anything I
>> can make use of in the CMIS spec to help?
>>
>> Thanks,
>>
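
For illustration, the content-hash idea suggested by Sascha earlier in this
thread could be sketched roughly as below. This is only a sketch under
assumptions: ContentHashSketch, md5Hex() and markIfNew() are hypothetical
names, the "seen" set lives purely in memory, and nothing here talks to a
CMIS server. Only standard JDK classes are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ContentHashSketch {

    // Hex-encoded MD5 digests of every content stream seen so far.
    private final Set<String> seen = new HashSet<>();

    /** Streams the content through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns true if this content was not seen before, false for a duplicate. */
    boolean markIfNew(InputStream content) {
        return seen.add(md5Hex(content));
    }
}
```

A caller would skip (or log) any document for which markIfNew() returns
false. As noted in the thread, two files that differ only by an embedded
timestamp would still hash differently, so this catches byte-identical
content only.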

