lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Additional information from external database
Date Wed, 22 Feb 2017 15:10:39 GMT
There really isn't a _Tika_ database connector; Tika parses
structured files. A typical JDBC connector can connect to a DB. You
might be thinking of the Data Import Handler (DIH).

Here's a program that both uses Tika and connects to a DB that might
give you a hint. It uses an older version of Solr, but should be
fairly easily modifiable.

https://lucidworks.com/2012/02/14/indexing-with-solrj/
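
To make the lookup step concrete, here is a minimal sketch in Python of the
"external script" route mentioned in the question. It is not taken from the
program linked above. It looks up the file number for each file name and
builds a JSON body of Solr atomic updates that could be POSTed to a core's
/update handler. Everything named here is an assumption for illustration: the
table `file_numbers`, its columns, the Solr field `file_number`, and the idea
that the Solr document id equals the file path (the ids in a Datafari index
may look different). An in-memory SQLite database stands in for the MSSQL
server so that the sketch runs as-is; with a real MSSQL connection (for
example through pyodbc, which also uses `?` placeholders) only the connection
setup should need to change.

```python
import json
import sqlite3

# Hypothetical table and column names; the real MSSQL schema is not given in this thread.
LOOKUP_SQL = "SELECT file_number FROM file_numbers WHERE file_name = ?"


def lookup_file_number(conn, file_name):
    """Return the file number stored for file_name, or None if there is no row."""
    row = conn.execute(LOOKUP_SQL, (file_name,)).fetchone()
    return row[0] if row else None


def build_atomic_update(doc_id, file_number, field="file_number"):
    """Build a Solr atomic-update document that sets one field on an existing doc."""
    # Assumption: the Solr document id is the file path itself.
    return {"id": doc_id, field: {"set": file_number}}


def build_update_payload(conn, file_names):
    """Build the JSON body for a POST to a Solr core's /update handler."""
    docs = []
    for name in file_names:
        number = lookup_file_number(conn, name)
        if number is not None:  # skip files the database does not know
            docs.append(build_atomic_update(name, number))
    return json.dumps(docs)


def demo():
    """Run the lookup against an in-memory stand-in database and return the payload."""
    # In-memory SQLite stands in for the MSSQL server so the sketch runs anywhere.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_numbers (file_name TEXT PRIMARY KEY, file_number TEXT)"
    )
    conn.execute(
        "INSERT INTO file_numbers VALUES (?, ?)",
        ("/data/1003234234.docx", "SUI-G-25-A"),
    )
    return build_update_payload(conn, ["/data/1003234234.docx", "/data/unknown.pdf"])


if __name__ == "__main__":
    print(demo())
```

Note that Solr atomic updates only work if the other fields of the document
are stored (or have docValues), and a later re-crawl that reindexes the whole
document would drop the field again unless the lookup is repeated. Doing the
lookup at index time, as in the SolrJ program above, avoids that.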

Best,
Erick

On Wed, Feb 22, 2017 at 4:14 AM, Wilhelm Eger <wilhelm.eger@gmail.com> wrote:
> Hi!
>
> I am using a setup of Datafari (www.datafari.com), which more or less combines
> a ManifoldCF file index with Solr as a search engine.
>
> My setup consists of ~350000 files, which are mainly doc(x), xls(x), msg and
> pdf files. The pdf files are OCR'd externally before they are added to the
> ManifoldCF index. Only the remaining image files (png, jpg) are OCR'd on the
> fly when they are imported.
>
> The files are actually part of an external file management system (files in the
> literal meaning of files, not files in the meaning of entities saved on the hard
> disk), which is not related to ManifoldCF/Solr at all. This system
> unfortunately does not provide a proper full text search, hence I implemented
> it as outlined above.
>
> However, the users are used to certain file numbers provided by this file
> management system. These file numbers are stored in a MSSQL database, which is
> accessible from the host my setup is running on. I can easily get the file
> number by sending a respective SQL statement based on the file name (of the
> entity saved on the hard disk) to the SQL Server. Hence, for each file name,
> there is a file number stored in the database. I would like these file
> numbers to be stored in a specific field of the Solr index so that they are
> shown in the (Tomcat) output, e.g.:
>
> File name: /data/1003234234.docx
> Content: "This is the content. You searched for _text_."
> File name belongs to file number: SUI-G-25-A
>
> Is there any way to achieve that? Did I understand correctly that this could
> happen either in ManifoldCF during indexing or in Solr during importing?
>
> I know that there is a Tika plugin to talk to databases, which could be fed
> with a SQL statement. But how do I connect it with the data retrieved from
> the file crawler?
>
> Alternatively, I could also call an external script (bash, python) to retrieve
> the respective data from the database using bsqldb.
>
> Any hint in the right direction is very much appreciated.
>
> Thanks in advance,
>
> Wilhelm
>
