chemistry-dev mailing list archives

From Ron DiFrango <rdifra...@captechconsulting.com>
Subject Re: AW: document 'uniqueness'
Date Mon, 23 Jun 2014 17:23:44 GMT
Tim,

Just curious: have you tried the search-before-insert method to see
what impact it has on performance?
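
To put a number on that, here is a minimal sketch of search-before-insert with a throughput helper. `TargetRepo`, `InMemoryRepo` and `SearchBeforeInsert` are hypothetical names invented for illustration, not OpenCMIS or FileNet types. Against a real repository, each `existsByLegacyId` call would be a round trip (for example a CMIS query on the property holding the legacy ID), and that round trip is the cost worth measuring.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the target repository; not an OpenCMIS type. */
interface TargetRepo {
    boolean existsByLegacyId(String legacyId); // the "search" round trip
    void create(String legacyId);              // the "insert" round trip
}

/** In-memory fake so the sketch runs without a server. */
final class InMemoryRepo implements TargetRepo {
    private final Set<String> stored = new HashSet<>();

    public boolean existsByLegacyId(String legacyId) {
        return stored.contains(legacyId);
    }

    public void create(String legacyId) {
        stored.add(legacyId);
    }
}

final class SearchBeforeInsert {
    /** Inserts each ID only if the search finds nothing; returns the number of inserts. */
    static int migrate(TargetRepo repo, List<String> legacyIds) {
        int inserted = 0;
        for (String id : legacyIds) {
            if (!repo.existsByLegacyId(id)) {
                repo.create(id);
                inserted++;
            }
        }
        return inserted;
    }

    /** Documents per second for one timed run, to compare against a throughput target. */
    static double docsPerSecond(int docs, long elapsedNanos) {
        return docs / (elapsedNanos / 1_000_000_000.0);
    }
}
```

Timing one batch with `System.nanoTime()` with and without the `existsByLegacyId` call, then passing the elapsed time to `docsPerSecond`, would show how far the extra search moves you from the 30 documents/second you mentioned.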

Thanks,

Ron DiFrango       
Director / Architect  |  CapTech
(804) 855-9196  |  rdifrango@captechconsulting.com

On 6/22/14, 2:42 PM, "Tim Webster" <tim.webster@gmail.com> wrote:

>Hi,
>
>Thanks for the advice guys...:-)
>
>Unfortunately the target CMIS repository isn't my own implementation; it's
>FileNet P8.  The 'source' system is FileNet Content Services (not sure of the
>version, but it's non-CMIS compliant and about to become unsupported by
>IBM, hence the migration).
>
>So...what that means is I can't really do anything server-side about this.
>
>Sascha raises an interesting option - I didn't realize I could set the
>ObjectId myself.  If I did that, and multiple documents had the same
>ObjectId, surely the server would throw an exception, meaning I wouldn't
>need to check if it already existed in the server?
>
>I could maybe throw some other bits into the hash computation (like
>creation date or something) to ensure uniqueness...?
>
>The 'search before insert' is of course an option, but it would slow
>everything down so much.  For single-version documents I'm able to add 30
>documents/second, which is the minimum requirement.
>
>On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>rdifrango@captechconsulting.com> wrote:
>
>> Tim,
>>
>> The suggestion below from Sascha is a good one.  The other approach I've
>> taken before is to perform a search in the repo for a given document and
>> insert it only if it does not exist; otherwise perform an update
>> or just log it as an "error".
>>
>> Thanks,
>>
>> Ron DiFrango
>> Director / Architect  |  CapTech
>>
>>
>>
>> On 6/22/14, 5:37 AM, "Sascha Homeier" <shomeier@meyle-mueller.de> wrote:
>>
>> >Hi Tim,
>> >
>> >You said you need to migrate the documents from FileNet to a
>> >CMIS-compliant server.
>> >Is the CMIS-compliant server your own implementation?
>> >If so, you could calculate a hash such as MD5 over the content stream and
>> >set it as the object ID.
>> >According to the CMIS spec, this object ID needs to be unique, so the
>> >repository must ensure that no two objects with the same object ID exist
>> >in it. With a content hash as the ID, that is equivalent to rejecting two
>> >objects with the same content stream.
>> >This approach would also prevent identical documents from being added in
>> >the future, after the migration is done.
>> >Nevertheless, you also need to find a performant way of determining
>> >whether an object with a given ID already exists (and a solution for the
>> >case where the hash differs only because of a timestamp inside the
>> >content stream, etc.).
>> >With about two million objects, you may need to extend the RAM on the
>> >migration machine to keep that many objects in memory and compare them
>> >using HashMaps and Hashtables with your own implementations of equals()
>> >and hashCode() ;)
>> >
>> >Anyway, a stimulating task. I'm curious about others' ideas for
>> >solving it in a performant way ;)
>> >
>> >Cheers
>> >Sascha
>> >
>> >-----Original Message-----
>> >From: Tim Webster [mailto:tim.webster@gmail.com]
>> >Sent: Saturday, June 21, 2014 17:55
>> >To: dev@chemistry.apache.org
>> >Subject: Re: document 'uniqueness'
>> >
>> >Hello,
>> >
>> >yes thanks for the suggestion - it sort of does that already with the
>> >Spring Batch progress tracking, but it still won't prevent another
>> >document being added to the repository that is identical to a previous
>> >one if it somehow failed - like a JVM crash or power failure.  Because
>> >there is no transaction management for the CMIS part, you can't really
>> >ensure this, except for a constraint in the repository itself.
>> >
>> >Anyway, yeah I think you're right and I need to look at FileNet
>> >specifically.  I just wasn't sure if I missed something and there was
>> >something in the CMIS spec that I could use (e.g. some property or
>> >something).
>> >
>> >Thanks,
>> >
>> >Tim
>> >
>> >
>> >
>> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca> wrote:
>> >
>> >> I'm sure you've already thought of this, but couldn't your migration
>> >> process just persist the legacy ids in a separate location (e.g.
>> >> database table, possibly cached in memory for performance)? Then you
>> >> would just need to check that for each document being migrated, to
>> >> make sure that the same doc hasn't been seen previously.
>> >>
>> >> Not a CMIS related solution, but seems like it would work fine...
>> >>
>> >> The other option, as you suggest, is to see if FileNet supports a
>> >> 'uniqueness' constraint for custom metadata properties. I believe
>> >> SharePoint does, but I'm not sure about FileNet.
>> >>
>> >> Thanks
>> >> michael lucas  |  Senior Software Developer  |  Great-West Life
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Tim Webster [mailto:tim.webster@gmail.com]
>> >> Sent: June 20, 2014 8:15 AM
>> >> To: dev@chemistry.apache.org
>> >> Subject: document 'uniqueness'
>> >>
>> >> Hi,
>> >>
>> >> I am developing a migration process (using Spring Batch) to migrate
>> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>> >> to ensure that duplicate documents are not created accidentally.
>> >>
>> >> However, our CMIS system (IBM FileNet) allows the addition of
>> >> documents with the same name.  Documents with identical values for
>> >> cmis:name or cmis:contentStreamFileName are allowed.  Even if this
>> >> could be disabled (I don't know if it can or cannot), it is a business
>> >> requirement and I wouldn't be able to.
>> >>
>> >> The only thing I can think of to prevent this is to save the 'legacy'
>> >> ID of the document in a new CMIS property and somehow check that it
>> >> doesn't already exist when adding a new document. However this will be
>> >> very inefficient and slow down the migration (we're talking about up
>> >> to 2 million documents).
>> >>
>> >> Ideally the 'uniqueness constraint' would be checked on the server and
>> >> would throw an exception, which I could then deal with.
>> >>
>> >> Does anyone know of an easier way to do this, or is there anything I
>> >> can make use of in the CMIS spec to help?
>> >>
>> >> Thanks,
>> >>
>>
>>
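
The hash-as-identifier idea from the thread (an MD5 over the content stream, optionally mixed with other fields such as the creation date) can be sketched roughly as below. `DedupKey` is a hypothetical helper name, and the choice of extra fields is an assumption. The thread does not establish whether FileNet P8 accepts a client-supplied object ID, so the resulting string may only be usable as a lookup key or custom property value.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a deduplication key from content plus optional fields. */
final class DedupKey {
    /** MD5 over the content bytes followed by each extra field, as lowercase hex. */
    static String of(byte[] content, String... extraFields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(content);
            for (String field : extraFields) {
                // Zero-byte separator so ("ab", "c") and ("a", "bc") hash differently.
                md.update((byte) 0);
                md.update(field.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }
}
```

Calling `DedupKey.of(bytes)` keys on content alone, so byte-identical documents collide by design. Calling `DedupKey.of(bytes, creationDate)` treats identical content created at different times as distinct documents, which is the trade-off raised in the thread.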

