manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Testing Pipelines. Conclusions so far and Some Doubts
Date Mon, 30 Jun 2014 14:53:01 GMT
Splitters are fine, but since the data in a job comes from only one
source, aggregation is only possible when the goal is to augment a
document from the repository. Indeed, I don't think anything beyond
that is possible on the source side.
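
For illustration, an augmenting transformation stage might look roughly
like the sketch below. The class and method signatures here are typed
from memory of the 1.7 transformation connector API, so treat them as
approximate rather than as reference code:

import java.io.IOException;

import org.apache.manifoldcf.agents.interfaces.IOutputAddActivity;
import org.apache.manifoldcf.agents.interfaces.RepositoryDocument;
import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
import org.apache.manifoldcf.agents.transformation.BaseTransformationConnector;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.VersionContext;

public class AugmentingConnector extends BaseTransformationConnector {

  @Override
  public int addOrReplaceDocumentWithException(String documentURI,
      VersionContext pipelineDescription, RepositoryDocument document,
      String authorityNameString, IOutputAddActivity activities)
      throws ManifoldCFException, ServiceInterruption, IOException {
    // Augment the incoming repository document with extra metadata;
    // a stage could just as well replace its content entirely.
    document.addField("augmented_by", new String[] { "AugmentingConnector" });
    // Pass the (single) modified document down the rest of the pipeline.
    return activities.sendDocument(documentURI, document);
  }
}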

Karl

Sent from my Windows Phone
From: Rafa Haro
Sent: 6/30/2014 10:14 AM
To: Karl Wright; dev@manifoldcf.apache.org
Subject: Re: Testing Pipelines. Conclusions so far and Some Doubts
Hi Karl,

I could go on at length about the reasons, but the short summary is that
we need more complex pipelines, supporting for example splitters and
aggregators, not only sequential components. Of course, everything can
be hacked around, and we have decided to change our current approach by
implementing some transformation connectors, but for upcoming versions
of our product we will be using our own processor architecture.

Thanks. Rafa

On 6/30/14 16:05, Karl Wright wrote:
> Hi Rafa,
>
> I am out of town at the moment, but frankly I can see no reason why
> the architecture as implemented would not meet your use case. A
> transformation connection is not limited to passing along the input
> repository document object; it can modify it extensively and even
> replace it.
>
> Karl
>
> Sent from my Windows Phone
> From: Rafa Haro
> Sent: 6/30/2014 6:48 AM
> To: dev@manifoldcf.apache.org
> Subject: Testing Pipelines. Conclusions so far and Some Doubts
> Hi,
>
> I have spent a couple of hours testing the pipelines in ManifoldCF 1.7.
> Before describing the problems I have encountered and asking some
> questions, I would like to explain the kind of tests I have performed
> so far:
>
> 1. Testing with the File System connector, for simplicity
>
> 2. Using two Solr output connectors to test multiple outputs. Both
> point at the same Solr instance, and each output connector has been
> configured with a different Solr core (collection1 and collection2)
>
> 3. Using Allowed Documents and Tika Extractor as Transformation
> connectors. Allowed Documents has been configured to allow only PDF
> files (mimetype + extension)
>
> 4. The processing pipeline I wanted to configure is quite simple: filter
> and extract content (with Tika) for collection1, and a plain crawl for
> collection2. To be more precise: both transformation connectors were
> configured for the collection1 Solr output, and no transformation
> connectors were configured for collection2. I have two files in the
> configured repository path for the File System connector: a PDF file and
> an ODS file. I was expecting only the PDF file to be indexed in
> collection1 and both files in collection2.
>
> The result of the experiment has been the following:
>
> 1. All the files have been indexed in both collections. Apparently the
> Allowed Documents transformation connector doesn't work with the file
> system repository connector.
>
> 2. For the collection1 output connector, I first changed the update
> handler from /update/extract to /update, because the Tika Extractor was
> going to be configured for it. This change produces an error in Solr
> while indexing (Unsupported ContentType: application/octet-stream Not in:
> [application/xml, text/csv, text/json, application/csv,
> application/javabin, text/xml, application/json]); see the sketch after
> this list for why.
>
> 3. Therefore, I set the update handler back to /update/extract. Because
> exactly the same content is being indexed into both cores, I have no way
> to tell whether the Tika transformation connector is working properly
> or not.
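>
> As a sanity check outside ManifoldCF, a small standalone snippet can
> show both what Tika would extract from the PDF and why the plain /update
> handler only accepts already-extracted content. The core URL and field
> names below are placeholders, not our real configuration:
>
> import java.io.File;
>
> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.tika.Tika;
>
> public class TikaUpdateCheck {
>   public static void main(String[] args) throws Exception {
>     File pdf = new File(args[0]);
>
>     // Roughly what the Tika Extractor stage should produce for this file.
>     String text = new Tika().parseToString(pdf);
>     System.out.println(text);
>
>     // /update accepts only structured documents (XML, JSON, javabin, ...),
>     // which is why posting the raw binary fails with "Unsupported
>     // ContentType: application/octet-stream". The extracted text wrapped
>     // in a document indexes fine.
>     HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", pdf.getName());
>     doc.addField("content", text);
>     solr.add(doc);
>     solr.commit();
>   }
> }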
>
> So much for the testing outcomes. Now I would like to offer some
> conclusions from the point of view of our use case. Although the
> pipeline approach is great, as far as I have understood it, we still
> can't use it for our purposes. Specifically, what we would like is to
> somehow create different repository documents at any point in the chain
> and send them to different output connectors. Let me give an easy use
> case:
>
> We want to process the documents to extract named entities: persons,
> places and organizations. The first transformation of the pipeline can
> use any NER system to extract the named entities. Then I want to have
> separate repositories (outputs): one for the raw content and one for
> each type of entity, let's say 4 different Solr cores. Of course, with
> the current approach I could send the same repository document to all
> the outputs and filter accordingly in each one, but that doesn't sound
> like a good solution to me.
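>
> Just to make the shape of what we are after concrete, here is a purely
> hypothetical sketch; nothing like NerSplitter or EntitySink exists in
> ManifoldCF, it only illustrates one incoming document fanning out into
> several documents for several outputs:
>
> import java.util.List;
> import java.util.Map;
>
> // Hypothetical: stands in for one output connection (one Solr core).
> interface EntitySink {
>   void index(String id, String content);
> }
>
> // Hypothetical splitter stage: one incoming repository document becomes
> // several outgoing documents, each routed to its own output.
> class NerSplitter {
>   private final EntitySink raw;
>   private final Map<String, EntitySink> sinksByType; // PERSON, PLACE, ORG
>
>   NerSplitter(EntitySink raw, Map<String, EntitySink> sinksByType) {
>     this.raw = raw;
>     this.sinksByType = sinksByType;
>   }
>
>   void process(String docId, String content,
>       Map<String, List<String>> entitiesByType) {
>     raw.index(docId, content); // the raw content keeps its own core
>     for (Map.Entry<String, EntitySink> e : sinksByType.entrySet()) {
>       List<String> entities = entitiesByType.get(e.getKey());
>       if (entities == null)
>         continue;
>       for (String entity : entities) {
>         // One new "document" per extracted entity.
>         e.getValue().index(docId + "#" + e.getKey() + "#" + entity, entity);
>       }
>     }
>   }
> }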
>
> Cheers,
> Rafa
