chemistry-dev mailing list archives

From Tim Webster <tim.webs...@gmail.com>
Subject Re: AW: document 'uniqueness'
Date Tue, 24 Jun 2014 21:03:34 GMT
If anyone is still interested, here's what I've since found out.

Ron's suggestion of querying the repository for the existing document does
impact performance, but not by as much as I thought it would.  With a batch
of 50,000 documents, my rate of migration dropped from about 29 docs/second
to about 22 docs/second.
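
To put those two rates in perspective for the full job, here is a rough back-of-the-envelope sketch in Java. It assumes the rate stays constant for the whole run and uses the "up to 2 million documents" figure from the original question; the class and method names are made up for illustration.

```java
import java.time.Duration;

public class MigrationEstimate {

    // Time needed to migrate `documents` at a steady `docsPerSecond` rate.
    // Assumption: throughput does not change as the repository fills up.
    public static Duration estimate(long documents, double docsPerSecond) {
        return Duration.ofSeconds(Math.round(documents / docsPerSecond));
    }

    public static void main(String[] args) {
        long total = 2_000_000L; // upper bound mentioned in the original question
        System.out.println("without lookup: " + estimate(total, 29.0));
        System.out.println("with lookup:    " + estimate(total, 22.0));
    }
}
```

At a steady rate that works out to roughly 19 hours without the lookup versus roughly 25 hours with it, so the check makes the whole run about a third longer.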

You cannot set the objectId of an object because it is read-only, so instead
I used the cmis:name property as the test, setting it to the document's ID
from the old FileNet (source) system as a unique identifier.  If the document
already exists then I just throw an exception and tell Spring Batch to skip it.
Funnily enough, I wanted to do this more for catastrophic failures, but
under load I found that FileNet can add the document and then fail to send
a correct response back (CmisConnectionException, XML parser errors, etc.),
which led to attempts to migrate duplicates.  I am hammering it pretty
hard though (15 concurrent threads), so I'm not expecting it to behave
perfectly.
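
The check-then-skip logic described above can be sketched roughly as follows. This is only an illustration, not the actual migration code: TargetRepository, InMemoryRepository, DuplicateDocumentException and the "FN-0001" ID are hypothetical names, and the in-memory fake stands in for the real repository, where the existence check would presumably be a CMIS query on cmis:name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckBeforeInsert {

    /** Hypothetical stand-in for the target repository; not an OpenCMIS type. */
    public interface TargetRepository {
        boolean existsByName(String name);        // assumed to be backed by a query on cmis:name
        void create(String name, byte[] content);
    }

    /** Thrown for documents that are already present, so the batch job can skip them. */
    public static class DuplicateDocumentException extends RuntimeException {
        public DuplicateDocumentException(String legacyId) {
            super("Document already migrated: " + legacyId);
        }
    }

    /** Writes one document, using the legacy ID as its name in the target system. */
    public static void migrate(TargetRepository repo, String legacyId, byte[] content) {
        if (repo.existsByName(legacyId)) {
            throw new DuplicateDocumentException(legacyId);
        }
        repo.create(legacyId, content);
    }

    /** In-memory fake, used only to exercise the logic without a server. */
    public static class InMemoryRepository implements TargetRepository {
        private final Set<String> names = ConcurrentHashMap.newKeySet();
        public boolean existsByName(String name) { return names.contains(name); }
        public void create(String name, byte[] content) { names.add(name); }
        public int size() { return names.size(); }
    }

    public static void main(String[] args) {
        InMemoryRepository repo = new InMemoryRepository();
        migrate(repo, "FN-0001", new byte[0]);
        try {
            // Simulates a retry after the first response was lost.
            migrate(repo, "FN-0001", new byte[0]);
        } catch (DuplicateDocumentException e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

In Spring Batch the exception class would typically be registered as skippable on a fault-tolerant step. Note that the check and the create are two separate calls, so this only protects against retries of the same document; it would not stop two threads that were handed the same legacy ID at the same moment.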

I still don't know how things will play out once we get into the hundreds
of thousands range, but at least I know this is a viable approach.

Anyway, thought you'd like to know - thanks again for the help...:-)

Tim



On Mon, Jun 23, 2014 at 7:26 PM, Tim Webster <tim.webster@gmail.com> wrote:

> I have to confess I haven't, I was making an assumption that it would slow
> it down too much, and it would really be a last resort.
>
> I should at least try it out and see how it impacts performance before I
> dismiss it though.
>
>
>
>
> On Mon, Jun 23, 2014 at 6:23 PM, Ron DiFrango <
> rdifrango@captechconsulting.com> wrote:
>
>> Tim,
>>
>> Just curious: have you tried the search-before-insert method to see
>> what impact it has on performance?
>>
>> Thanks,
>>
>> Ron DiFrango
>> Director / Architect  |  CapTech
>> (804) 855-9196  |  rdifrango@captechconsulting.com
>>
>>
>>
>>
>>
>> On 6/22/14, 2:42 PM, "Tim Webster" <tim.webster@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >Thanks for the advice guys...:-)
>> >
>> >Unfortunately the target CMIS repository isn't my own implementation -
>> >it's FileNet P8.  The 'source' system is FileNet Content Services (not
>> >sure the version - but it's non-CMIS compliant and about to become
>> >unsupported by IBM - hence the migration).
>> >
>> >So...what that means is I can't really do anything server-side about
>> >this.
>> >
>> >Sascha raises an interesting option - I didn't realize I could set the
>> >ObjectId myself.  If I did that, and multiple documents had the same
>> >ObjectId, surely the server would throw an exception, meaning I wouldn't
>> >need to check if it already existed in the server?
>> >
>> >I could maybe throw some other bits into the hash computation (like
>> >creation date or something) to ensure uniqueness...?
>> >
>> >The 'search before insert' is of course an option, but it would slow
>> >everything down so much.  For single-version documents I'm able to add 30
>> >documents/second, which is the minimum requirement.
>> >
>> >
>> >
>> >
>> >
>> >
>> >On Sun, Jun 22, 2014 at 3:23 PM, Ron DiFrango <
>> >rdifrango@captechconsulting.com> wrote:
>> >
>> >> Tim,
>> >>
>> >> The suggestion below from Sascha is a good one.  The other approach
>> >> I've taken before is to perform a search in the repo for a given
>> >> document and only if it does not exist would I insert it, otherwise
>> >> perform an update or just log it as an "error".
>> >>
>> >> Thanks,
>> >>
>> >> Ron DiFrango
>> >> Director / Architect  |  CapTech
>> >>
>> >>
>> >>
>> >> On 6/22/14, 5:37 AM, "Sascha Homeier" <shomeier@meyle-mueller.de> wrote:
>> >>
>> >> >Hi Tim,
>> >> >
>> >> >you said you need to migrate the documents from FileNet to a CMIS
>> >> >compliant server.
>> >> >Is the CMIS compliant server your implementation?
>> >> >If so you could calculate a hash like MD5 over the content stream and
>> >> >set it as the object ID.
>> >> >According to the CMIS spec this object ID needs to be unique, so it
>> >> >must be ensured that no two objects with the same object ID exist in
>> >> >the same CMIS repository, which is then equivalent to ensuring that no
>> >> >two objects have the same content stream.
>> >> >This approach would also ensure that no identical documents are added
>> >> >in the future, after the migration is done.
>> >> >Nevertheless, here you also need to find a performant way of
>> >> >determining whether an object with a given ID already exists (and find
>> >> >a solution if the hash is changed only by a timestamp inside the
>> >> >content stream etc.)
>> >> >With about two million objects you may need to extend the RAM on the
>> >> >migration machine to keep that many objects in memory and compare
>> >> >them using HashMaps and Hashtables with your own implementations of
>> >> >equals() and hashCode() ;)
>> >> >
>> >> >Anyway a stimulating task. I'm curious about the ideas of others here
>> >> >to solve it in a performant way ;)
>> >> >
>> >> >Cheers
>> >> >Sascha
>> >> >
>> >> >-----Original Message-----
>> >> >From: Tim Webster [mailto:tim.webster@gmail.com]
>> >> >Sent: Saturday, 21 June 2014 17:55
>> >> >To: dev@chemistry.apache.org
>> >> >Subject: Re: document 'uniqueness'
>> >> >
>> >> >Hello,
>> >> >
>> >> >yes thanks for the suggestion - it sort of does that already with the
>> >> >Spring Batch progress tracking, but it still won't prevent another
>> >> >document being added to the repository that is identical to a previous
>> >> >one if it somehow failed - like a JVM crash or power failure.  Because
>> >> >there is no transaction management for the CMIS part, you can't really
>> >> >ensure this, except for a constraint in the repository itself.
>> >> >
>> >> >Anyway, yeah I think you're right and I need to look at FileNet
>> >> >specifically.  I just wasn't sure if I missed something and there was
>> >> >something in the CMIS spec that I could use (e.g. some property or
>> >> >something).
>> >> >
>> >> >Thanks,
>> >> >
>> >> >Tim
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <Mike.Lucas@gwl.ca>
>> >> >wrote:
>> >> >
>> >> >> I'm sure you've already thought of this, but couldn't your migration
>> >> >> process just persist the legacy ids in a separate location (e.g.
>> >> >> database table, possibly cached in memory for performance)? Then you
>> >> >> would just need to check that for each document being migrated, to
>> >> >> make sure that the same doc hasn't been seen previously.
>> >> >>
>> >> >> Not a CMIS related solution, but seems like it would work fine...
>> >> >>
>> >> >> The other option, as you suggest, is to see if FileNet supports a
>> >> >> 'uniqueness' constraint for custom metadata properties. I believe
>> >> >> Sharepoint does but not sure about FileNet.
>> >> >>
>> >> >> Thanks
>> >> >> michael lucas  |  Senior Software Developer  |  Great-West Life
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Tim Webster [mailto:tim.webster@gmail.com]
>> >> >> Sent: June 20, 2014 8:15 AM
>> >> >> To: dev@chemistry.apache.org
>> >> >> Subject: document 'uniqueness'
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am developing a migration process (using Spring Batch) to migrate
>> >> >> documents from a legacy CMS into a CMIS-compliant system, and I need
>> >> >> to ensure that duplicate documents are not created accidentally.
>> >> >>
>> >> >> However, our CMIS system (IBM FileNet) allows the addition of
>> >> >> documents with the same name.  Documents with identical values for
>> >> >> cmis:name or cmis:contentStreamFileName are allowed.  Even if this
>> >> >> could be disabled (I don't know if it can or cannot), it is a
>> >> >> business requirement and I wouldn't be able to.
>> >> >>
>> >> >> The only thing I can think of to prevent this is to save the 'legacy'
>> >> >> ID of the document in a new CMIS property and somehow check that it
>> >> >> doesn't already exist when adding a new document. However this will
>> >> >> be very inefficient and slow down the migration (we're talking about
>> >> >> up to 2 million documents).
>> >> >>
>> >> >> Ideally the 'uniqueness constraint' would be checked on the server
>> >> >> and would throw an exception, which I could then deal with.
>> >> >>
>> >> >> Does anyone know of an easier way to do this, or is there anything I
>> >> >> can make use of in the CMIS spec to help?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >>
>> >>
>>
>>
>
