lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: What is the best way of Indexing different formats of documents?
Date Tue, 07 Apr 2015 16:59:56 GMT
The disadvantages of DIH are:
1> it's a black box; debugging it isn't easy.
2> it puts all the work on the Solr node. Parsing documents in various
formats can be pretty heavyweight and steal cycles from indexing, and
2a> the extracting request handler also puts all the load on Solr, FWIW.

Personally I prefer an external program (and I was gratified to see
Yavar's reference to the indexing with SolrJ article...). But then I'm
a Java programmer by training, so that seems easy...
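
For reference, a minimal sketch of what such an external indexing program
looks like. It builds a Solr <add> update message and POSTs it to the
/update handler using only the JDK's HTTP client; in a real program, SolrJ's
HttpSolrClient and SolrInputDocument would wrap this plumbing for you. The
core name, URL, and field names here are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// External-indexer sketch: build a Solr <add> update payload and POST it.
// SolrJ would normally do this for you; URL and fields are hypothetical.
public class ExternalIndexer {

    // Build an <add><doc>...</doc></add> payload from field name/value pairs.
    static String buildAddXml(String[][] fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (String[] f : fields) {
            sb.append("<field name=\"").append(escape(f[0])).append("\">")
              .append(escape(f[1])).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // Escape XML special characters that may appear in extracted text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        String xml = buildAddXml(new String[][] {
            {"id", "doc-1"},
            {"content", "text extracted by Tika, JDBC, etc."}
        });
        // Hypothetical local core; commit=true makes the doc visible at once.
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/mycore/update?commit=true"))
            .header("Content-Type", "text/xml")
            .POST(HttpRequest.BodyPublishers.ofString(xml))
            .build();
        // Sending requires a running Solr instance, so it is gated on a flag;
        // without it the sketch just prints the payload it would send.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
        } else {
            System.out.println(xml);
        }
    }
}
```

The pre-processing step (Tika extraction, JDBC fetch, etc.) would populate
the field pairs before the POST.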


On Tue, Apr 7, 2015 at 7:41 AM, Dan Davis <> wrote:
> Sangeetha,
> You can also run Tika directly from data import handler, and Data Import
> Handler can be made to run several threads if you can partition the input
> documents by directory or database id.   I've done 4 "threads" by having a
> base configuration that does an Oracle query like this:
>       SELECT * FROM (SELECT id, url, ..., MOD(rownum, 4) AS threadid FROM ...
> WHERE ...) WHERE threadid = %d
> A bash/sed script writes several data import handler XML files.
> I can then index several threads at a time.
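
The modulo partitioning above can be sketched in plain Java; each row lands
in exactly one bucket, so the four generated configs (threadid = 0..3)
together cover every row exactly once. Row count and thread count here are
illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates MOD(rownum, threads): disjoint buckets that cover all rows.
public class ModuloPartition {

    static List<List<Long>> partition(long rowCount, int threads) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < threads; i++) buckets.add(new ArrayList<>());
        // Oracle's ROWNUM starts at 1; MOD(rownum, threads) picks the bucket.
        for (long rowNum = 1; rowNum <= rowCount; rowNum++) {
            buckets.get((int) (rowNum % threads)).add(rowNum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Long>> buckets = partition(10, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("threadid=" + i + " -> " + buckets.get(i));
        }
    }
}
```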
> Each of these threads can then use all the transformers,
> e.g. TemplateTransformer.
> XML can be transformed via XSLT.
> The Data Import Handler has other entities that go out to the web and then
> index the document via Tika.
> If you are indexing generic HTML, you may want to figure out an approach to
> SOLR-3808 and SOLR-2250. These can be resolved by recompiling Solr and Tika
> locally, because Boilerpipe has a bug that has been fixed but not pushed
> to Maven Central. Without that, the ASF cannot include the fix, but
> distributions such as LucidWorks Solr Enterprise can.
> I can drop some configs in if I clean them up to obfuscate
> host names, passwords, and such.
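
As a rough illustration of the setup Dan describes, one of the generated DIH
config files might look like the sketch below. The connection details, table,
and field names are hypothetical, and the %d has already been substituted
(threadid = 2 in this variant):

```xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@dbhost:1521:ORCL"
              user="solr" password="..."/>
  <document>
    <!-- This file is one of the four generated variants (threadid = 2). -->
    <entity name="doc"
            query="SELECT * FROM (SELECT id, url, title,
                          MOD(rownum, 4) AS threadid
                   FROM documents) WHERE threadid = 2"
            transformer="TemplateTransformer">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- TemplateTransformer can synthesize constant fields. -->
      <field column="source" template="oracle-docs"/>
    </entity>
  </document>
</dataConfig>
```

Each variant is registered as its own request handler, so the four imports
can run concurrently.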
> On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain <> wrote:
>> Well, I have indexed heterogeneous sources including a variety of NoSQL
>> stores, RDBMSs and rich documents (PDF, Word, etc.) using SolrJ. The only
>> prerequisite for using SolrJ is that you should have an API to fetch data
>> from your data source (say JDBC for an RDBMS, Tika for extracting text
>> content from rich documents, etc.); then SolrJ is so damn great and simple.
>> It's as simple as downloading the jar and writing a few lines of code to
>> send data to your Solr server after pre-processing it. More details here:
>> Cheers,
>> Yavar
>> On Tue, Apr 7, 2015 at 4:18 PM, <> wrote:
>> > Hi,
>> >
>> > I am a newbie to Solr and basically from a database background. We have a
>> > requirement of indexing files of different formats (X12, EDIFACT, CSV,
>> > XML). The files which come in can be of any format, and we need to do a
>> > content-based search on them.
>> >
>> > From the web I understand we can use the Tika processor to extract the
>> > content and store it in Solr. What I want to know is: is there any better
>> > approach for indexing files in Solr? Can we index the documents by
>> > streaming directly from the application? If so, what is the disadvantage
>> > of using it (against DIH, which fetches from the database)? Could someone
>> > share some insight on this? Are there any web links I can refer to for
>> > some idea on it? Please do help.
>> >
>> > Thanks,
>> > Sangeetha
>> >
