Message-ID: <52FF2E48.5010301@elyograg.org>
Date: Sat, 15 Feb 2014 02:07:20 -0700
From: Shawn Heisey
To: solr-user@lucene.apache.org
Subject: Re: DIH

On 2/14/2014 10:45 PM, William Bell wrote:
> On virtual cores the DIH handler is really slow. On a 12 core box it only
> uses 1 core while indexing.
>
> Does anyone know how to do Java threading from a SQL query into Solr?
> Examples?
>
> I can use SolrJ to do it, or I might be able to modify DIH to enable
> threading.
>
> At some point in 3.x threading was enabled in DIH, but it was removed since
> people were having issues with it (we never did).

If you know how to fix DIH so it can do multiple indexing threads safely, please open an issue and upload a patch.

I'm still using DIH for full rebuilds, but I'd actually like to replace it with a rebuild routine written in SolrJ. I currently achieve decent speed by running DIH on all my shards at the same time.

I do use SolrJ for once-a-minute index maintenance, but the code that I've written to pull data out of SQL and write it to Solr is not able to index millions of documents in a single thread as fast as DIH does.

I have been building a multithreaded design in my head, but I haven't had a chance to write real code and see whether it's actually a good design.

For me, the bottleneck is definitely Solr, not the database. I recently wrote a test program that uses my current SolrJ indexing method. If I skip the "server.add(docs)" line, it can read all 91 million docs from the database and build SolrInputDocument objects for them in 2.5 hours or less, all with a single thread. When I do a real rebuild with DIH, it takes a little more than 4.5 hours -- and that is inherently multithreaded, because it's doing all the shards simultaneously. I have no idea how long it would take with a single-threaded SolrJ program.

Thanks,
Shawn
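
For reference, here is a rough sketch of the kind of multithreaded feeder discussed above: one thread reads rows and groups them into batches, and a small pool of workers sends the batches. It is not tested against Solr and it is not how DIH works. Everything named here is an assumption made for illustration: the class `ParallelIndexer`, the method `indexAll`, the batch size and the thread count are all hypothetical. The `sink` callback stands in for the "server.add(docs)" call, and the `Iterator` stands in for whatever walks the SQL result set. Only the Java standard library is used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper, not part of SolrJ or DIH.
final class ParallelIndexer {

    /**
     * Reads items from source on the calling thread, groups them into
     * batches of batchSize, and hands each batch to sink on a worker thread.
     * Returns the number of items handed off once every batch has finished.
     */
    static <T> long indexAll(Iterator<T> source, int batchSize, int threads,
                             Consumer<List<T>> sink)
            throws InterruptedException, ExecutionException {
        // Bounded queue plus CallerRunsPolicy: when every worker is busy and
        // the queue is full, the reading thread sends the batch itself, so
        // the reader cannot run arbitrarily far ahead of the indexing side.
        ExecutorService pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(threads * 2),
                new ThreadPoolExecutor.CallerRunsPolicy());
        List<Future<?>> pending = new ArrayList<>();
        long total = 0;
        try {
            List<T> batch = new ArrayList<>(batchSize);
            while (source.hasNext()) {
                batch.add(source.next());
                if (batch.size() == batchSize) {
                    final List<T> full = batch;
                    pending.add(pool.submit(() -> sink.accept(full)));
                    total += full.size();
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                // Send the final, partially filled batch.
                final List<T> last = batch;
                pending.add(pool.submit(() -> sink.accept(last)));
                total += last.size();
            }
            // Wait for every batch; a failure inside sink is rethrown here.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for rows read from the database.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10_500; i++) {
            rows.add(i);
        }
        AtomicLong delivered = new AtomicLong();
        // In a real program the lambda would be where the batch is sent to Solr.
        long submitted = indexAll(rows.iterator(), 1000, 4,
                docs -> delivered.addAndGet(docs.size()));
        System.out.println(submitted + " submitted, " + delivered.get() + " delivered");
    }
}
```

Whether this helps at all depends on where the time goes. If Solr is the bottleneck, as it is for me, extra sender threads only pay off as long as the Solr side can actually absorb concurrent updates. The list of pending futures also grows with the number of batches, so a real implementation would probably want to drain it as it goes.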