Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4FEF89743 for ; Tue, 13 Mar 2012 18:07:03 +0000 (UTC) Received: (qmail 99732 invoked by uid 500); 13 Mar 2012 18:07:02 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 99576 invoked by uid 500); 13 Mar 2012 18:07:02 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 99566 invoked by uid 99); 13 Mar 2012 18:07:01 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Mar 2012 18:07:01 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Mar 2012 18:06:58 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id B436D1E240 for ; Tue, 13 Mar 2012 18:06:37 +0000 (UTC) Date: Tue, 13 Mar 2012 18:06:37 +0000 (UTC) From: "James Dyer (Commented) (JIRA)" To: dev@lucene.apache.org Message-ID: <1207925776.8624.1331661997739.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <1410407962.18208.1325938899242.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (SOLR-3011) DIH MultiThreaded bug MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228549#comment-13228549 ] James Dyer commented on SOLR-3011: ---------------------------------- I'm just starting to look at the DIH threaded code and the history behind it (SOLR-1352, etc). It seems like a lot of work has been put into this by Noble, Shalin, you and others. Yet, I can't help but wonder if we've gone down a bad path with this one. That is, DIH is essentially a collection of loosely-coupled components: DataSource, Entity, Transformer, etc. So it seems that for this to work, not only does the core (DocBuilder etc) need to be thread-safe, but every component in a given DIH configuration needs to be also. There also is quite a bit of code duplication in DocBuilder and classes like ThreadedEntityProcessorWrapper for threaded configurations. In the past, this seems to have caused threaded-only problems like SOLR-2668. Also, the DocBuilder duplication only covers full-imports. The delta-import doesn't look like it supports threading at all? Finally, users can get confused because specifying threads="1" actually does something: it puts the whole import through the threaded code branches instead of the single-thread code branches. Then there is the issue of tests. Mikhail, you've just noticed that MockDataSource was not designed to test a multi-threaded scenario in a valid fashion. But I'm not sure even the tests that are just for threads are all that valid. Take a look at TestDocBuilderThreaded. Most of the configurations have "threads=1". And what about this comment: /* * This test fails in TestEnviroment, but works in real Live */ @Test public void testEvaluator() throws Exception I'm starting to worry we're playing wack-a-mole with this, and maybe we need a different approach. What if we tried this as a path forward: 1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases there, as possible. 2. In 4.0 (or a separate branch), remove threading from DIH. 3. Refactor code and improve tests. 4. Make DocBuilder, etc threadsafe. 5. Create a marker interface or annotation that can be put on DataSources, Entities, Transformers, SolrWriters, etc signifying they can be used safely with threads. 6. Re-implement the "threads" parameter. Maybe be a bit more thoughtful about how it will work & document it. Do we have it be a thread-per-document (like we have now) or a thread-per-entity (run child threads in parallel, but the root entity is single-threaded)? Can we design it so that both can be supported? 7. One-by-one, make the DataSources, Entities, etc threadsafe and implement the marker interface, the new annotation, etc. I realize that #1-2 present a problem with what has been done here already. The SOLR-3011 patches work on 4.x and it would be a lot of work to backport 3.x. Yet I am proposing removing the current threading entirely from 4.x and "fixing" only 3.x. But I can probably help with porting (some of?) this patch back to 3.x. We recently had this comment come from one of our PMC members: "If we would have a better alternative without locale, threading and unicode bugs, I would svn rm." But what I'm seeing is that while Lucene and Solr-core have had a lot of hands in refactoring and improving the code, DIH has only had features piled up onto it. It was mostly written by 1 or 2 people, with limited community support from a code-maintenance perspective. In short, it hasn't gotten the TLC it needs to thrive long-term. Maybe now's the time. Comments? Does this seem like a viable way forward? > DIH MultiThreaded bug > --------------------- > > Key: SOLR-3011 > URL: https://issues.apache.org/jira/browse/SOLR-3011 > Project: Solr > Issue Type: Sub-task > Components: contrib - DataImportHandler > Affects Versions: 3.5, 4.0 > Reporter: Mikhail Khludnev > Priority: Minor > Fix For: 4.0 > > Attachments: SOLR-3011.patch, SOLR-3011.patch, patch-3011-EntityProcessorBase-iterator.patch, patch-3011-EntityProcessorBase-iterator.patch > > > current DIH design is not thread safe. see last comments at SOLR-2382 and SOLR-2947. I'm going to provide the patch makes DIH core threadsafe. Mostly it's a SOLR-2947 patch from 28th Dec. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org