From: Emmanuel Lécharny
Reply-To: elecharny@apache.org
Date: Mon, 26 Nov 2012 16:03:33 +0100
To: Apache Directory Developers List
Subject: Replication producer potential blocking issue, and a few improvement proposals

Hi,

I spent the weekend reviewing the replication code in M8, and I found a few areas that can be improved, and a potentially serious problem that needs to be fixed.

First of all, the few tests we conducted last week showed that replication works pretty well, *if* there is no communication loss between the consumers and the server. We have some issues when a consumer is disconnected and then reconnects (I still have to investigate this bug).

Otherwise, there are a few areas for improvement, but this is not urgent:

- The journal does not have to be cleaned up entry by entry. This is an extremely costly operation, requiring a lot of writes on disk. There is a better way to manage the old elements: we can simply have a rotating journal, and keep a current and an old journal. When the current journal is full, it becomes the old journal, and the previous old journal is simply deleted.
- I think having one journal instead of one journal per consumer is a better idea. This is worth discussing for a future implementation, but right now, I'm fine with the one-consumer/one-journal approach.

- The consumer RID should not be created by the producer. It is only used by the consumer to distinguish between two different replication configurations declared on the consumer. The producer does not have to keep such information locally. That also means we may have more than one journal for a server, as we may declare more than one replication consumer from a server A to a server B.

- More critical: the way the EventInterceptor is implemented, we have no way to check the authorization. We must find a way to go through the authorization interceptor in order to check that each entry is allowed to be sent to a consumer.

- Another big issue: we don't filter the AttributeTypes we send to the consumers, AFAICT.

Now, the problem:

- As we depend on MINA 2 to send the entries to the consumers, we have to be extremely careful when we do things like:

    private void sendResult( SearchResultEntry searchResultEntry, Entry entry, EventType eventType,
        SyncStateValue syncStateValue )
    {
        searchResultEntry.addControl( syncStateValue );

        LOG.debug( "sending event {} of entry {}", eventType, entry.getDn() );
        WriteFuture future = session.getIoSession().write( searchResultEntry );

        // Now, send the entry to the consumer
        handleWriteFuture( future, entry, eventType );
    }

with:

    private void handleWriteFuture( WriteFuture future, Entry entry, EventType event )
    {
        // Let the operation be executed.
        // Note : we wait 10 seconds max
        future.awaitUninterruptibly( 10000L );

        if ( !future.isWritten() )
        {
            LOG.error( "Failed to write to the consumer {} during the event {} on entry {}", new Object[]
                { consumerMsgLog.getId(), event, entry.getDn() } );
            LOG.error( "", future.getException() );

            // set realtime push to false, will be set back to true when the client
            // comes back and sends another request this flag will be set to true
            pushInRealTime = false;
        }
        else
        {
            try
            {
                // if successful update the last sent CSN
                consumerMsgLog.setLastSentCsn( entry.get( SchemaConstants.ENTRY_CSN_AT ).getString() );
            }
            catch ( Exception e )
            {
                // should never happen
                LOG.error( "No entry CSN attribute found", e );
            }
        }
    }

If the consumer is disconnected, the current thread will be blocked for up to 10 seconds (that is the case when the consumer wasn't gracefully disconnected...). For 10 seconds, the current thread will do nothing but wait. We don't have hundreds of threads, so at some point this can become problematic...

The best way to fix that would be to have a separate thread per consumer, and to use a queue where the events are pushed, a queue that will be read by the consumer's thread.

As we have a queue in the middle, and a thread per consumer, we can guarantee that handling a modification is done fast enough on the local server and propagated efficiently, and that a consumer's disconnection will be handled without blocking any of the server's threads.

I'm continuing my investigations!

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com
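PS: to make the queue-plus-thread-per-consumer proposal above a bit more concrete, here is a minimal sketch in plain Java. It is not the ApacheDS or MINA code: the ConsumerPusher class and its blockingSend callback are hypothetical names, and blockingSend merely stands in for the write followed by the 10 second wait. The sketch also assumes that events dropped after a failed write can be replayed from the journal when the consumer reconnects.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical sketch: one event queue and one sender thread per replication consumer. */
final class ConsumerPusher<E> implements Runnable {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    // Stand-in for "write to the session, then wait for the future"; returns true if written.
    private final Predicate<E> blockingSend;
    private final Thread worker;
    private volatile boolean running = true;
    private volatile boolean pushInRealTime = true;

    ConsumerPusher(String consumerId, Predicate<E> blockingSend) {
        this.blockingSend = blockingSend;
        this.worker = new Thread(this, "replication-push-" + consumerId);
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Called by the server thread handling the modification; returns immediately. */
    void push(E event) {
        if (pushInRealTime) {
            events.add(event); // unbounded queue, so add() never blocks
        }
        // Otherwise the event is skipped here (assumption: the journal still holds it).
    }

    boolean isPushInRealTime() {
        return pushInRealTime;
    }

    /** Only this dedicated thread ever waits on a slow or dead consumer. */
    @Override
    public void run() {
        try {
            while (running || !events.isEmpty()) {
                E event = events.poll(100, TimeUnit.MILLISECONDS);
                if (event == null) {
                    continue; // nothing queued yet, check the running flag again
                }
                if (!blockingSend.test(event)) {
                    // Same reaction as in handleWriteFuture: stop the realtime push
                    pushInRealTime = false;
                    events.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Lets the worker drain what is already queued, then waits for it to exit. */
    void stop() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With something like this, sendResult would only call push( ... ) and return, so a consumer that went away costs its own thread 10 seconds, not a server thread. The queue is unbounded here to keep the sketch short; a real implementation would probably need a bound and a policy for when it fills up.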