lucene-solr-user mailing list archives

From Shawn Heisey
Subject Re: data import handler clarifications/ pros and cons.
Date Mon, 06 Oct 2014 12:56:20 GMT
On 10/6/2014 5:09 AM, Karunakar Reddy wrote:
> Please suggest me effective way of using data import handler.
> Here is my use case.
> I have different kind of items which needs to be indexed in solr . Eg(
> books, shoes,electronics etc... ) each one has in different relational
> table.
> I have only one core as of now which is been used for public search and for
> other search pages like (book search page/ electronics search page..)
> and updates are happening through indexing script which we are maintaining
> internally  .
> We are planning to use DIH(data import handler).
> 1)Is it best way to use DIH/over indexing script? any pros and cons of
> using DIH?
> 2) How can we index different type of documents(books,electronic..  the
> data is there in different tables in mysql ) through document import
> handler?
> 3)What is the best way to do delta-import.? how do we fire delta-import
> request? is there any thing like auto delta import like autocommit?

If you already have an effective indexing method that does everything
you need, I would suggest sticking with it.

I think of DIH as a stopgap feature, a way to get started with Solr when
using a structured data store, until you can write your own indexing
procedure that is highly tailored to your situation.  I'm actually still
using DIH for full reindexes, controlled with SolrJ, but I have grand
designs for replacing it with a multi-threaded approach that hopefully
will be much faster.

DIH is a fairly efficient single-threaded way of accessing a single flat
table space from a database.  As soon as you try to make it include
multiple and/or nested entities, its performance will often drop
significantly.  If you can reduce all of your interaction with the
database to a single SELECT call (using joins, a stored procedure, or
something similar), then you MIGHT be able to use DIH effectively.  The
DIH handler on each of my shards uses exactly one SELECT call.
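
As a rough illustration of the "single SELECT" idea, here is a minimal
sketch that flattens one table per item type into a single row shape
with UNION ALL.  Everything in it is an assumption made for the example:
the table and column names (books, shoes, doc_id, item_type, maker) are
hypothetical, and it uses Python's built-in sqlite3 module instead of
MySQL so that it runs on its own.  It is not the query from my own
setup.

```python
import sqlite3

# Hypothetical schema: one table per item type, as in the question.
SCHEMA = """
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
CREATE TABLE shoes (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
"""

# One SELECT that returns every item type in the same flat row shape.
# The 'book-'/'shoe-' prefixes keep the unique key distinct across tables.
# Note: '||' is string concatenation in SQLite; MySQL would need CONCAT().
FLAT_SELECT = """
SELECT 'book-' || id AS doc_id, 'book' AS item_type,
       title AS name, author AS maker
FROM books
UNION ALL
SELECT 'shoe-' || id, 'shoe', name, brand
FROM shoes
ORDER BY doc_id
"""


def fetch_flat_rows(conn):
    """Run the single flat SELECT and return one dict per document."""
    cur = conn.execute(FLAT_SELECT)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]


def demo_rows():
    """Build a throwaway in-memory database and return its flat rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.execute("INSERT INTO books VALUES (1, 'Dune', 'Frank Herbert')")
        conn.execute("INSERT INTO shoes VALUES (1, 'Runner', 'Acme')")
        return fetch_flat_rows(conn)
    finally:
        conn.close()
```

A query of this shape could serve as the one SELECT behind a single DIH
entity, with item_type indexed as a field so that each search page can
filter on it.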

There is currently no DIH scheduler built into Solr.  There are two
reasons that the idea has met with resistance:

1) There is already a built-in scheduling apparatus on *every* modern
operating system, one that has been tested, debugged, and is generally
bulletproof.  If a feature like that is built into Solr, users will be
unhappy if it doesn't work as advertised because we made a mistake in
the code.  I'd rather rely on an OS feature that's been around for
multiple decades.

2) As a group, the developers are resistant to features that would cause
Solr to make changes in the index without being *told* to do it by an
outside force.  There is already an issue in Jira for a DIH scheduler,
but the patch hasn't been committed.  Some developers would like to
include it.
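
To make the "use the OS scheduler" suggestion concrete, here is a
minimal sketch of a script that cron (or any other scheduler) could run
to request a delta-import over HTTP.  The host, port, core name
("items"), and script path are assumptions for illustration, and the
handler path /dataimport is only the conventional name; it must match
whatever is registered in your solrconfig.xml.

```python
import urllib.parse
import urllib.request


def build_dih_url(base_url, core, command="delta-import", commit=True):
    """Build the request URL for a DIH command on one core.

    Assumes the handler is registered at /dataimport in solrconfig.xml.
    """
    params = {"command": command, "commit": "true" if commit else "false"}
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


def trigger_import(url, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


# Example call (hypothetical host and core name):
#   trigger_import(build_dih_url("http://localhost:8983/solr", "items"))
#
# Example crontab entry running a script that makes this call
# every 15 minutes (the path is hypothetical):
#   */15 * * * * /usr/bin/python3 /opt/scripts/delta_import.py
```

The request only starts the import; DIH runs it in the background, so a
scheduled job should leave enough time between runs for the previous
import to finish.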

