lucene-solr-user mailing list archives

From Amit Jha <shanuu....@gmail.com>
Subject Re: How fast indexing?
Date Tue, 22 Mar 2016 01:48:41 GMT
When I run the same SQL directly on the DB, it takes only 1 second. Yet only 6-7 documents
are getting indexed per second.

As I have a 4-node SolrCloud setup, can I run 4 import handlers to index the same data? Will
they not overwrite each other?

10-20K is a very high number. Where can I get the actual size of a document?

Rgds
AJ

> On 22-Mar-2016, at 05:32, Shawn Heisey <apache@elyograg.org> wrote:
> 
>> On 3/20/2016 6:11 PM, Amit Jha wrote:
>> In my case I am using DIH to index the data, and the query has 2 join
>> statements. To index 70K documents it is taking 3-4 hours. Document size
>> would be around 10-20KB. The DB is MSSQL, and I am using solr4.2.10 in
>> cloud mode.
> 
> My source data is in a MySQL database.  I use DIH for full rebuilds and
> SolrJ for maintenance.
> 
> My index is sharded, but I'm not running SolrCloud.  When using DIH, all
> of my shards build at once, and each one achieves about 750 docs per
> second.  With six large shards, rebuilding a 146 million document index
> takes 9-10 hours.  It produces a total index size in the ballpark of 170GB.
> 
> DIH has a performance limitation -- it's single-threaded.  I obtain the
> speeds that I do because all of my shards import at the same time -- six
> dataimport instances running at the same time, each one with a single
> thread, importing a little more than 24 million documents.  I have
> discovered that Solr is the bottleneck on my setup.  The data retrieval
> from MySQL can proceed much faster than Solr can handle with a single
> indexing thread.  My situation is a little bit unusual -- as Erick
> mentioned, usually the bottleneck is data retrieval, not Solr.
> 
> At this point, if I want to make bulk indexing go faster, I need to
> build a SolrJ application that can index with multiple threads to each
> Solr core at the same time.  This is on my roadmap, but it's not going
> to be a trivial project.
> 
> At 10-20K, your documents are large, but not excessively so.  If 70000
> documents take 3-4 hours, then one of a few problems is happening:
> 
> 1) Your database is VERY slow.
> 2) Your analysis chain in schema.xml is running SUPER slow analysis
> components.
> 3) Your server or its configuration is not providing enough resources
> (CPU/RAM/IO) so Solr can run efficiently.
> 
> #2 seems rather unlikely, so I would suspect one of the other two.
> 
> ----
> 
> I have seen one situation related to the Microsoft side of your setup
> that might cause a problem like this.  If any of your machines are
> running on Windows Server 2012 and you have bridged NICs (usually for
> failover in the event of a switch failure), then you will need to break
> the bridge and just run one NIC.
> 
> The performance improvement on the network when a bridged NIC is removed
> from Server 2012 is enough to blow your mind, especially if the access
> is over a high-latency network link, like a VPN or WAN connection.  The
> same setup on Server 2003 or Server 2008 has very good performance.
> Microsoft seems to have a bug with bridged NICs in Server 2012.  Last
> time I tried to figure out whether it could be fixed, I ran into this
> problem:
> 
> https://xkcd.com/979/
> 
> Thanks,
> Shawn
> 
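As a rough check on the figures quoted in this thread, the throughput
arithmetic can be sketched as below. The function names are made up for
illustration; the inputs are only the numbers given in the two messages
(70K documents in 3-4 hours, and 146 million documents across six shards
at about 750 docs per second each).

```python
# Back-of-the-envelope throughput check using only figures from the thread.
# Function names are hypothetical helpers, not part of Solr or DIH.

def docs_per_second(doc_count, hours):
    """Average indexing rate for doc_count documents over the given hours."""
    return doc_count / (hours * 3600)


def hours_to_index(doc_count, rate):
    """Hours needed to index doc_count documents at rate docs per second."""
    return doc_count / rate / 3600


# Amit's case: 70K documents in 3-4 hours.
slow_low = docs_per_second(70_000, 4)
slow_high = docs_per_second(70_000, 3)

# Shawn's case: 146 million documents over six shards, ~750 docs/s per shard.
per_shard = 146_000_000 / 6
shard_hours = hours_to_index(per_shard, 750)

print(f"70K docs in 3-4 hours: {slow_low:.1f} to {slow_high:.1f} docs/s")
print(f"Documents per shard: {per_shard:,.0f}")
print(f"Hours per shard at 750 docs/s: {shard_hours:.1f}")
```

The first result lands in the same range as the 6-7 documents per second
reported above, and the per-shard estimate is consistent with the 9-10 hour
rebuild, since all six shards import in parallel.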
