chemistry-dev mailing list archives

From Tim Webster <tim.webs...@gmail.com>
Subject Re: AW: document 'uniqueness'
Date Sun, 22 Jun 2014 18:42:43 GMT
Hi,

Thanks for the advice guys...:-)

Unfortunately the target CMIS repository isn't my own implementation - it's
FileNet P8.  The 'source' system is FileNet Content Services (I'm not sure of
the version, but it isn't CMIS-compliant and is about to become unsupported
by IBM - hence the migration).

So...what that means is I can't really do anything server-side about this.

Sascha raises an interesting option - I didn't realize I could set the
ObjectId myself.  If I did that, and multiple documents had the same
ObjectId, surely the server would throw an exception, meaning I wouldn't
need to check if it already existed in the server?

I could maybe throw some other bits into the hash computation (like
creation date or something) to ensure uniqueness...?
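
A minimal sketch of what "throwing other bits into the hash" could look like,
using only the JDK. The class name DedupKey, the compute() method, and the
choice of legacy ID plus creation timestamp as the extra inputs are
assumptions made for illustration; none of this is OpenCMIS or FileNet API.
Whether FileNet P8 accepts a client-supplied object ID, or rejects a
duplicate one, is not confirmed anywhere in this thread, so the resulting key
might just as well end up in a custom property.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of OpenCMIS or FileNet.
class DedupKey {

    // Returns a hex MD5 digest over legacy ID, creation time and content.
    static String compute(InputStream content, String legacyId, long creationMillis) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Metadata goes in first; a zero byte separates the ID from the date.
            md.update(legacyId.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0);
            md.update(ByteBuffer.allocate(Long.BYTES).putLong(creationMillis).array());
            // Stream the content in chunks so large documents are never fully in memory.
            byte[] buf = new byte[8192];
            int n;
            while ((n = content.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "same bytes".getBytes(StandardCharsets.UTF_8);
        // Made-up legacy IDs and timestamp, for demonstration only.
        System.out.println(compute(new ByteArrayInputStream(body), "LEGACY-1", 1403462563000L));
        System.out.println(compute(new ByteArrayInputStream(body), "LEGACY-2", 1403462563000L));
    }
}
```

Note the trade-off: once the legacy ID is part of the input, the key
identifies the source document rather than its content, so two distinct
legacy documents with identical bytes no longer collide.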

The 'search before insert' is of course an option, but it would slow
everything down so much.  For single-version documents I'm able to add 30
documents/second, which is the minimum requirement.
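
For comparison, the local check Mike suggested further down the thread avoids
the per-document round trip to the repository. A rough sketch follows; the
class LegacyIdRegistry and its claim() method are hypothetical names, and the
set lives in memory only. It therefore does not close the crash window
described in the quoted message below; for that, the IDs would have to be
persisted somewhere (e.g. the database table Mike mentions).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Spring Batch or OpenCMIS.
class LegacyIdRegistry {

    // Concurrent set, so several worker threads can share one registry.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true the first time a legacy ID is offered, false for repeats.
    boolean claim(String legacyId) {
        return seen.add(legacyId);
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        LegacyIdRegistry registry = new LegacyIdRegistry();
        // Made-up legacy IDs, for demonstration only.
        for (String id : new String[] {"DOC-1", "DOC-2", "DOC-1"}) {
            if (registry.claim(id)) {
                System.out.println("migrate " + id);
            } else {
                System.out.println("skip duplicate " + id);
            }
        }
    }
}
```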

On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
rdifrango@captechconsulting.com> wrote:

> Tim,
>
> The suggestion below from Sascha is a good one.  The other approach I've
> taken before is to search the repo for a given document and insert it
> only if it does not exist; otherwise, perform an update or just log it
> as an "error".
>
> Thanks,
>
> Ron DiFrango
> Director / Architect  |  CapTech
>
>
>
> On 6/22/14, 5:37 AM, "Sascha Homeier" <shomeier@meyle-mueller.de> wrote:
>
> >Hi Tim,
> >
> >you said you need to migrate the documents from FileNet to a CMIS
> >compliant server.
> >Is the CMIS compliant server your implementation?
> >If so, you could calculate a hash such as MD5 over the content stream and
> >set it as the object ID.
> >According to the CMIS spec, this object ID needs to be unique, so no two
> >objects with the same object ID can exist in the same CMIS repository.
> >With a content hash as the ID, that is equivalent to saying that no two
> >objects can have the same content stream.
> >This approach would also prevent identical documents from being added in
> >the future, after the migration is done.
> >Nevertheless, you would still need a performant way of determining
> >whether an object with a given ID already exists (and a solution for the
> >case where the hash changes only because of a timestamp inside the
> >content stream, etc.).
> >With about two million objects you may need to add RAM to the
> >migration machine to keep that many objects in memory and compare them
> >using HashMaps and Hashtables with your own implementations of equals()
> >and hashCode() ;)
> >
> >Anyway a stimulating task. I'm curious about the ideas of others here to
> >solve it in a performant way ;)
> >
> >Cheers
> >Sascha
> >
> >-----Original Message-----
> >From: Tim Webster [mailto:tim.webster@gmail.com]
> >Sent: Saturday, June 21, 2014 17:55
> >To: dev@chemistry.apache.org
> >Subject: Re: document 'uniqueness'
> >
> >Hello,
> >
> >yes thanks for the suggestion - it sort of does that already with the
> >Spring Batch progress tracking, but it still won't prevent another
> >document being added to the repository that is identical to a previous
> >one if it somehow failed - like a JVM crash or power failure.  Because
> >there is no transaction management for the CMIS part, you can't really
> >ensure this, except for a constraint in the repository itself.
> >
> >Anyway, yeah I think you're right and I need to look at FileNet
> >specifically.  I just wasn't sure if I missed something and there was
> >something in the CMIS spec that I could use (e.g. some property or
> >something).
> >
> >Thanks,
> >
> >Tim
> >
> >
> >
> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca> wrote:
> >
> >> I'm sure you've already thought of this, but couldn't your migration
> >> process just persist the legacy ids in a separate location (e.g.
> >> database table, possibly cached in memory for performance)? Then you
> >> would just need to check that for each document being migrated, to
> >> make sure that the same doc hasn't been seen previously.
> >>
> >> Not a CMIS related solution, but seems like it would work fine...
> >>
> >> The other option, as you suggest, is to see if FileNet supports a
> >> 'uniqueness' constraint for custom metadata properties. I believe
> >> SharePoint does, but I'm not sure about FileNet.
> >>
> >> Thanks
> >> michael lucas  |  Senior Software Developer  |  Great-West Life
> >>
> >>
> >> -----Original Message-----
> >> From: Tim Webster [mailto:tim.webster@gmail.com]
> >> Sent: June 20, 2014 8:15 AM
> >> To: dev@chemistry.apache.org
> >> Subject: document 'uniqueness'
> >>
> >> Hi,
> >>
> >> I am developing a migration process (using Spring Batch) to migrate
> >> documents from a legacy CMS into a CMIS-compliant system, and I need
> >> to ensure that duplicate documents are not created accidentally.
> >>
> >> However, our CMIS system (IBM FileNet) allows the addition of
> >> documents with the same name.  Documents with identical values for
> >> cmis:name or cmis:contentStreamFilename are allowed.  Even if this
> >> could be disabled (I don't know whether it can), allowing it is a
> >> business requirement, so I wouldn't be able to disable it.
> >>
> >> The only thing I can think of to prevent this is to save the 'legacy'
> >> ID of the document in a new CMIS property and somehow check that it
> >> doesn't already exist when adding a new document. However this will be
> >> very inefficient and slow down the migration (we're talking about up
> >> to 2 million documents).
> >>
> >> Ideally the 'uniqueness constraint' would be checked on the server and
> >> would throw an exception, which I could then deal with.
> >>
> >> Does anyone know of an easier way to do this, or is there anything I
> >> can make use of in the CMIS spec to help?
> >>
> >> Thanks,
> >>
>
>
