lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Dyer (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3011) DIH MultiThreaded bug
Date Tue, 13 Mar 2012 18:06:37 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228549#comment-13228549
] 

James Dyer commented on SOLR-3011:
----------------------------------

I'm just starting to look at the DIH threaded code and the history behind it (SOLR-1352, etc).
 It seems like a lot of work has been put into this by Noble, Shalin, you and others.  Yet,
I can't help but wonder if we've gone down a bad path with this one.  That is, DIH is essentially
a collection of loosely-coupled components:  DataSource, Entity, Transformer, etc.  So it
seems that for this to work, not only does the core (DocBuilder etc) need to be thread-safe,
but every component in a given DIH configuration needs to be also.

There also is quite a bit of code duplication in DocBuilder and classes like ThreadedEntityProcessorWrapper
for threaded configurations.  In the past, this seems to have caused threaded-only problems
like SOLR-2668.  Also, the DocBuilder duplication only covers full-imports.  The delta-import
doesn't look like it supports threading at all?  Finally, users can get confused because specifying
threads="1" actually does something:  it puts the whole import through the threaded code branches
instead of the single-thread code branches.

Then there is the issue of tests.  Mikhail, you've just noticed that MockDataSource was not
designed to test a multi-threaded scenario in a valid fashion.  But I'm not sure even the
tests that are just for threads are all that valid.  Take a look at TestDocBuilderThreaded.
 Most of the configurations have "threads=1".  And what about this comment:

<code>
/*
  * This test fails in TestEnviroment, but works in real Live
  */
  @Test
  public void testEvaluator() throws Exception
<code>

I'm starting to worry we're playing wack-a-mole with this, and maybe we need a different approach.
 What if we tried this as a path forward:

1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases there, as possible.
2. In 4.0 (or a separate branch), remove threading from DIH.
3. Refactor code and improve tests.
4. Make DocBuilder, etc threadsafe.
5. Create a marker interface or annotation that can be put on DataSources, Entities, Transformers,
SolrWriters, etc signifying they can be used safely with threads.
6. Re-implement the "threads" parameter.  Maybe be a bit more thoughtful about how it will
work & document it.  Do we have it be a thread-per-document (like we have now) or a thread-per-entity
(run child threads in parallel, but the root entity is single-threaded)?  Can we design it
so that both can be supported?
7. One-by-one, make the DataSources, Entities, etc threadsafe and implement the marker interface,
the new annotation, etc.

I realize that #1-2 present a problem with what has been done here already.  The SOLR-3011
patches work on 4.x and it would be a lot of work to backport 3.x.  Yet I am proposing removing
the current threading entirely from 4.x and "fixing" only 3.x.  But I can probably help with
porting (some of?) this patch back to 3.x.

We recently had this comment come from one of our PMC members:  "If we would have a better
alternative without locale, threading and unicode bugs, I would svn rm."  But what I'm seeing
is that while Lucene and Solr-core have had a lot of hands in refactoring and improving the
code, DIH has only had features piled up onto it.  It was mostly written by 1 or 2 people,
with limited community support from a code-maintenance perspective.  In short, it hasn't gotten
the TLC it needs to thrive long-term.  Maybe now's the time.

Comments?  Does this seem like a viable way forward?
                
> DIH MultiThreaded bug
> ---------------------
>
>                 Key: SOLR-3011
>                 URL: https://issues.apache.org/jira/browse/SOLR-3011
>             Project: Solr
>          Issue Type: Sub-task
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.5, 4.0
>            Reporter: Mikhail Khludnev
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: SOLR-3011.patch, SOLR-3011.patch, patch-3011-EntityProcessorBase-iterator.patch,
patch-3011-EntityProcessorBase-iterator.patch
>
>
> current DIH design is not thread safe. see last comments at SOLR-2382 and SOLR-2947.
I'm going to provide the patch makes DIH core threadsafe. Mostly it's a SOLR-2947 patch from
28th Dec. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message