manifoldcf-user mailing list archives

From Wilhelm Eger <>
Subject Re: Additional information from external database
Date Wed, 22 Feb 2017 12:29:48 GMT
Hi Karl,

Thanks a lot for your help.

My Datafari setup uses a file system crawler to crawl files (repository
connector -> job), from which text is extracted via the Tika plugin. The
result is then sent to Solr via the Solr output connector.

I am already using a transformation connector to add a field based on the name 
of the job (using the file system repository connector) to distinguish the 
origin of the indexed file later.

Actually, I had arrived at the same solution you suggested (I did not
mention it beforehand so as not to bias the answers): writing my own
transformation connector to retrieve the information from the database. The
connector should:

- know the file name
- compile a SQL statement from the file name
- send this SQL statement to the database
- retrieve the file number
- add it to a certain field
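
To make the steps above concrete, here is a minimal sketch in plain Java with JDBC. It only covers the lookup itself, not the ManifoldCF connector wiring. The table `file_index`, its columns `file_name` and `file_number`, and the target field name `file_number` are hypothetical placeholders, and I am assuming the database stores the bare file name without the directory part.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;
import java.util.Optional;

class FileNumberLookup {
    // Hypothetical table and column names; adjust to the real schema.
    static final String QUERY =
        "SELECT file_number FROM file_index WHERE file_name = ?";

    // Step 1: know the file name (strip the directory part of the path).
    static String baseName(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(slash + 1);
    }

    // Steps 2-4: build a parameterized statement, send it to the database,
    // and read the file number from the first result row, if there is one.
    static Optional<String> lookup(Connection conn, String fileName)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setString(1, fileName);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next()
                    ? Optional.ofNullable(rs.getString(1))
                    : Optional.empty();
            }
        }
    }

    // Step 5: add the file number to a metadata field (hypothetical name).
    // Nothing is added when the database had no match.
    static void addFileNumber(Map<String, String> metadata,
                              Optional<String> fileNumber) {
        fileNumber.ifPresent(n -> metadata.put("file_number", n));
    }
}
```

Using a parameterized `PreparedStatement` instead of concatenating the file name into the SQL string avoids quoting problems with unusual file names. Inside a real transformation connector, the metadata map would be replaced by whatever document object the connector receives.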

I know little to nothing about Java, but I am able to teach myself if
necessary. Is there a good starting point for developing my own
transformation connector?

Thanks in advance,


On Wednesday, 22 February 2017, 13:15:23 CET, Karl Wright wrote:
> Hi Wilhelm,
> I don't know anything about how datafari uses ManifoldCF to crawl.  All I
> can do is describe how ManifoldCF works, and then maybe you can see how it
> integrates with datafari.
> MCF gets documents from a repository using one of many kinds of repository
> connector.  It then can transform the document in many different ways,
> before sending the (transformed) document to one of many output
> connectors.  I gather that datafari injects documents primarily into Solr.
> Each job in MCF has its own "pipeline", which describes the flow of a
> document through the system for that job.
> The transformations that are available in MCF include:
> - ability to extract metadata from the document (using Tika)
> - ability to modify or add metadata properties (you specify this in the job
> UI)
> - OpenNLP metadata extraction
> - Filter out documents based on characteristics of the document
> Writing connectors is relatively straightforward and there are online
> materials available to help you do this. I can provide a link, if you need
> it.  Without any more information as to what exactly you are using for a
> repository connector, and what that connector provides as part of the
> document information, I can't really give you the best approach here, but
> it may be possible to write a transformation connector that would look up
> the information you want to add as metadata from your database and include
> that in the document that gets sent to Solr.
> Please let us know how we can help.
> Thanks,
> Karl
> On Wed, Feb 22, 2017 at 7:01 AM, Wilhelm Eger <>
> wrote:
> > Hi!
> > 
> > I am using a setup of datafari (, which more or less combines
> > a ManifoldCF file index with SolR as a search engine.
> > 
> > My setup consists of ~350000 files, which are composed mainly of doc(x),
> > xls(x), msg and pdf files. pdf files are ocr'd externally before they are
> > added to the ManifoldCF index. Only remaining image files (png, jpg) are
> > ocr'd on-the-fly, when being imported.
> > 
> > The files are actually part of an external file management system (files
> > in the literal meaning of files, not files in the meaning of entities
> > saved on the hard disk), which is not related to ManifoldCF/SolR at all.
> > This system unfortunately does not provide a proper full text search,
> > hence I implemented it as outlined above.
> > 
> > However, the users are used to certain file numbers provided by this file
> > management system. These file numbers are stored in a MSSQL database,
> > which is accessible from the host my setup is running on. I can easily
> > get the file number by sending a respective SQL statement based on the
> > file name (of the entity saved on the hard disk) to the SQL Server. Hence,
> > for each file name, there is a file number stored in the database. I
> > would like to have these file numbers to be stored in a specific field of
> > the solr index to be shown by the (tomcat) output, e.g:
> > 
> > File name: /data/1003234234.docx
> > Content: "This is the content. You searched for _text_."
> > File name belongs to file number: SUI-G-25-A
> > 
> > Is there any possibility to achieve that? Did I understand it correctly
> > that this could happen either in ManifoldCF during indexing or in SolR
> > during importing?
> > 
> > I know that there is a tika plugin to talk to databases, which could be
> > fed with a SQL statement. But how to connect it with the data retrieved
> > from the files crawler?
> > 
> > Alternatively, I could also call an external script (bash, python) to
> > retrieve the respective data from the database using bsqldb.
> > 
> > Any hint in the right direction is very much appreciated.
> > 
> > Thanks in advance,
> > 
> > Wilhelm
