chemistry-dev mailing list archives

From Tim Webster <tim.webs...@gmail.com>
Subject Re: AW: document 'uniqueness'
Date Mon, 23 Jun 2014 18:26:01 GMT
I have to confess I haven't. I was assuming that it would slow things down
too much, so it would really be a last resort.

I should at least try it out and see how it impacts performance before I
dismiss it though.
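
For reference, a search-before-insert check could be sketched roughly as
below. This only builds the CMIS query string. The property name legacyId
is a hypothetical custom property holding the old Content Services ID, and
the class and method names are made up for illustration. With OpenCMIS the
statement would then, as far as I know, be passed to
Session.query(statement, false), and the insert skipped if any row comes
back.

```java
final class LegacyIdQuery {

    private LegacyIdQuery() {}

    /** Escapes backslashes and single quotes the way CMIS QL string literals expect. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    /**
     * Builds a query that returns the object ID of any document whose
     * (hypothetical) legacy-ID property equals the given value.
     */
    static String byLegacyId(String typeId, String propertyId, String legacyId) {
        return "SELECT cmis:objectId FROM " + typeId
                + " WHERE " + propertyId + " = '" + escape(legacyId) + "'";
    }

    public static void main(String[] args) {
        // "legacyId" and "CS-0001" are placeholder values for illustration only.
        System.out.println(byLegacyId("cmis:document", "legacyId", "CS-0001"));
    }
}
```

Timing one such query per document against the 30 documents/second target
would show whether the extra round trip is affordable.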




On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <
rdifrango@captechconsulting.com> wrote:

> Tim,
>
> Just curious, have you tried the search-before-insert method to see
> what impact it has on performance?
>
> Thanks,
>
> Ron DiFrango
> Director / Architect  |  CapTech
> (804) 855-9196  |  rdifrango@captechconsulting.com
>
>
>
>
>
> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webster@gmail.com> wrote:
>
> >Hi,
> >
> >Thanks for the advice guys...:-)
> >
> >Unfortunately the target CMIS repository isn't my own implementation -
> >it's FileNet P8.  The 'source' system is FileNet Content Services (not
> >sure of the version, but it's non-CMIS compliant and about to become
> >unsupported by IBM - hence the migration).
> >
> >So...what that means is I can't really do anything server-side about this.
> >
> >Sascha raises an interesting option - I didn't realize I could set the
> >ObjectId myself.  If I did that, and multiple documents had the same
> >ObjectId, surely the server would throw an exception, meaning I wouldn't
> >need to check if it already existed in the server?
> >
> >I could maybe throw some other bits into the hash computation (like
> >creation date or something) to ensure uniqueness...?
> >
> >The 'search before insert' is of course an option, but it would slow
> >everything down so much.  For single-version documents I'm able to add 30
> >documents/second, which is the minimum requirement.
> >
> >
> >
> >
> >
> >
> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
> >rdifrango@captechconsulting.com> wrote:
> >
> >> Tim,
> >>
> >> The suggestion below from Sascha is a good one.  The other approach I've
> >> taken before is to perform a search in the repo for a given document and
> >> only if it does not exist would I insert it, otherwise perform an update
> >> or just log it as an "error".
> >>
> >> Thanks,
> >>
> >> Ron DiFrango
> >> Director / Architect  |  CapTech
> >>
> >>
> >>
> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shomeier@meyle-mueller.de> wrote:
> >>
> >> >Hi Tim,
> >> >
> >> >you said you need to migrate the documents from FileNet to a CMIS
> >> >compliant server.
> >> >Is the CMIS compliant server your implementation?
> >> >If so you could calculate a hash like MD5 over the content stream and
> >> >set it as the object ID.
> >> >According to the CMIS spec this object ID needs to be unique. So it
> >> >must be ensured that no two objects with the same object ID exist in
> >> >the same CMIS repository, which is equivalent to having no two objects
> >> >with the same content stream.
> >> >This approach would also ensure that identical documents are not added
> >> >in the future, after the migration is done.
> >> >Nevertheless, here you also need to find a performant way of determining
> >> >whether an object with a given ID already exists (and find a solution
> >> >for the case where the hash changes only because of a timestamp inside
> >> >the content stream etc.)
> >> >With about two million objects you may need to extend the RAM on the
> >> >migration machine to keep that many objects in memory and compare them
> >> >using HashMaps and Hashtables with your own implementations of equals()
> >> >and hashCode() ;)
> >> >
> >> >Anyway, a stimulating task. I'm curious about the ideas of others here
> >> >to solve it in a performant way ;)
> >> >
> >> >Cheers
> >> >Sascha
> >> >
> >> >-----Original Message-----
> >> >From: Tim Webster [mailto:tim.webster@gmail.com]
> >> >Sent: Saturday, 21 June 2014 17:55
> >> >To: dev@chemistry.apache.org
> >> >Subject: Re: document 'uniqueness'
> >> >
> >> >Hello,
> >> >
> >> >yes thanks for the suggestion - it sort of does that already with the
> >> >Spring Batch progress tracking, but it still won't prevent another
> >> >document being added to the repository that is identical to a previous
> >> >one if it somehow failed - like a JVM crash or power failure.  Because
> >> >there is no transaction management for the CMIS part, you can't really
> >> >ensure this, except for a constraint in the repository itself.
> >> >
> >> >Anyway, yeah I think you're right and I need to look at FileNet
> >> >specifically.  I just wasn't sure if I missed something and there was
> >> >something in the CMIS spec that I could use (e.g. some property or
> >> >something).
> >> >
> >> >Thanks,
> >> >
> >> >Tim
> >> >
> >> >
> >> >
> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca> wrote:
> >> >
> >> >> I'm sure you've already thought of this, but couldn't your migration
> >> >> process just persist the legacy ids in a separate location (e.g.
> >> >> database table, possibly cached in memory for performance)? Then you
> >> >> would just need to check that for each document being migrated, to
> >> >> make sure that the same doc hasn't been seen previously.
> >> >>
> >> >> Not a CMIS related solution, but seems like it would work fine...
> >> >>
> >> >> The other option, as you suggest, is to see if FileNet supports a
> >> >> 'uniqueness' constraint for custom metadata properties. I believe
> >> >> Sharepoint does but not sure about FileNet.
> >> >>
> >> >> Thanks
> >> >> michael lucas  |  Senior Software Developer  |  Great-West Life
> >> >>
> >> >>
> >> >> -----Original Message-----
> >> >> From: Tim Webster [mailto:tim.webster@gmail.com]
> >> >> Sent: June 20, 2014 8:15 AM
> >> >> To: dev@chemistry.apache.org
> >> >> Subject: document 'uniqueness'
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am developing a migration process (using Spring Batch) to migrate
> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
> >> >> to ensure that duplicate documents are not created accidentally.
> >> >>
> >> >> However, our CMIS system (IBM FileNet) allows the addition of
> >> >> documents with the same name.  Documents with identical values for
> >> >> cmis:name or cmis:contentStreamFileName are allowed.  Even if this
> >> >> could be disabled (I don't know whether it can), allowing it is a
> >> >> business requirement, so I wouldn't be able to turn it off.
> >> >>
> >> >> The only thing I can think of to prevent this is to save the 'legacy'
> >> >> ID of the document in a new CMIS property and somehow check that it
> >> >> doesn't already exist when adding a new document. However, this will
> >> >> be very inefficient and slow down the migration (we're talking about
> >> >> up to 2 million documents).
> >> >>
> >> >> Ideally the 'uniqueness constraint' would be checked on the server
> >> >> and would throw an exception, which I could then deal with.
> >> >>
> >> >> Does anyone know of an easier way to do this, or is there anything
> >> >> I can make use of in the CMIS spec to help?
> >> >>
> >> >> Thanks,
> >> >>
> >>
> >>
>
>
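
Mike's suggestion of keeping the legacy IDs in a separate place, cached in
memory, might look something like the sketch below. The class and method
names are invented for illustration, and it covers only the in-memory part:
the set is lost on a JVM crash, so it would have to be preloaded on restart
from whatever durable store (for example a database table) records the IDs
that were actually migrated.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SeenLegacyIds {

    // Thread-safe set, so parallel batch steps can share one instance.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Reloads IDs recorded by an earlier run (the durable store is assumed, not shown). */
    void preload(Collection<String> alreadyMigrated) {
        seen.addAll(alreadyMigrated);
    }

    /** Returns true the first time an ID is offered and false for every repeat. */
    boolean markIfNew(String legacyId) {
        return seen.add(legacyId);
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        SeenLegacyIds ids = new SeenLegacyIds();
        // "CS-0001" is a placeholder legacy ID.
        System.out.println(ids.markIfNew("CS-0001"));
        System.out.println(ids.markIfNew("CS-0001"));
    }
}
```

A document would only be sent to the repository when markIfNew returns
true. This does not close the crash window Tim describes, since the
repository write and the ID record are still not in one transaction.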

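Sascha's idea of hashing the content stream, combined with Tim's idea of
mixing in something like the creation date, can be sketched as follows.
This is a minimal illustration using only the JDK. The class name, method
name and the "salt" parameter are assumptions, and whether a particular
repository accepts a client-supplied object ID is a question the thread
leaves open.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentFingerprint {

    private ContentFingerprint() {}

    /**
     * Streams the content through MD5 without loading it fully into memory.
     * If a salt is given (for example a creation date), its UTF-8 bytes are
     * appended to the hashed data so that byte-identical documents that are
     * legitimately distinct get different fingerprints.
     */
    static String md5Hex(InputStream content, String salt) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = content.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (salt != null) {
            digest.update(salt.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Placeholder content and date, for illustration only.
        byte[] bytes = "example content".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(new ByteArrayInputStream(bytes), "2014-06-22"));
    }
}
```

The resulting string could serve as the key for any of the duplicate
checks discussed above, whether or not it is also used as the object ID.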