lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Data Import
Date Fri, 17 Mar 2017 13:53:47 GMT
I feel DIH is better suited to prototyping than to production, even
though people do use it in production. If you do want to use DIH, you
may benefit from reviewing the DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (you may need to change
luceneMatchVersion in solrconfig.xml first).

CSV and similar formats can be useful if you want to keep a history of
past imports, which again helps during development as you evolve the
schema.

SolrJ may actually be the easiest and best option for production, since
you already have a Java stack.
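
As a rough illustration of the parallel, batched indexing Shawn describes
in the quoted message below, here is a minimal sketch using only the Java
standard library. ParallelIndexer, BatchSender, partition and indexAll are
hypothetical names made up for this example; they are not part of SolrJ.
The BatchSender callback stands in for whatever actually sends a batch,
which with SolrJ would presumably wrap
SolrClient.add(Collection<SolrInputDocument>). Batch size and thread
count are assumptions you would tune for your own setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of SolrJ: it only shows the batching and threading pattern.
class ParallelIndexer {

    // Stand-in for the call that sends one batch to Solr (assumption:
    // with SolrJ this would wrap SolrClient.add(Collection<SolrInputDocument>)).
    interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    // Splits docs into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    // Sends all batches through a fixed pool of worker threads and
    // returns the total number of documents handed to the sender.
    static <T> int indexAll(List<T> docs, int batchSize, int threads,
                            BatchSender<T> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                results.add(pool.submit(() -> {
                    sender.send(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : results) {
                // get() rethrows a failed batch as an ExecutionException
                sent += result.get();
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

The sender you pass in must be safe to call from several threads at once.
This only speeds up the Solr side; if reading from the database is the
slow part, more sender threads will not help much.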

The choice is yours in the end.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 08:56, Shawn Heisey <apache@elyograg.org> wrote:
> On 3/17/2017 3:04 AM, vishal jain wrote:
>> I am new to Solr and am trying to move data from my RDBMS to Solr.
>> I know the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>>
>> I want to know what is the recommended way for Data import in
>> production environment. Will sending data via SolrJ in batches be
>> faster than posting a csv using POST tool?
>
> I've heard that CSV import runs EXTREMELY fast, but I have never tested
> it.  The same threading problem that I discuss below would apply to
> indexing this way.
>
> DIH is extremely powerful, but it has one glaring problem:  It's
> single-threaded, which means that only one stream of data is going into
> Solr, and each batch of documents to be inserted must wait for the
> previous one to finish inserting before it can start.  I do not know if
> DIH batches documents or sends them in one at a time.  If you have a
> manually sharded index, you can run DIH on each shard in parallel, but
> each one will be single-threaded.  That single thread is pretty
> efficient, but it's still only one thread.
>
> Sending multiple index updates to Solr in parallel (multi-threading) is
> how you radically speed up the Solr part of indexing.  This is usually
> done with a custom indexing program, which might be written with SolrJ
> or even in a completely different language.
>
> One thing to keep in mind with ANY indexing method:  Once the situation
> is examined closely, most people find that it's not Solr that makes
> their indexing slow.  The bottleneck is usually the source system -- how
> quickly the data can be retrieved.  It usually takes a lot longer to
> obtain the data than it does for Solr to index it.
>
> Thanks,
> Shawn
>
