manifoldcf-user mailing list archives

From Olivier Tavard <olivier.tav...@francelabs.com>
Subject Logging and Document filter transformation connector
Date Thu, 11 Oct 2018 13:31:02 GMT
Hello,

I have a question about the Document filter transformation connector and how its activity
is logged.
I would like to see all the documents excluded by the rules configured in the Document
filter transformation connector, either in the Simple history or in the MCF log, but so far
this is not easy.

Let’s say that I want to crawl a website and index HTML pages only. I configure a web
repository connector with a Document filter transformation connector, and I create a rule
with only one allowed MIME type and one allowed file extension. So far so good, the job
works well. But if I want to see, in the MCF log or in the Simple history, all the files
that were excluded by the transformation connector, it quickly gets complicated: I have to
search manually for all the files that were fetched but neither processed by the Tika
transformation connector nor ingested by the output connector.
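
To illustrate the manual search described above, here is a minimal Python sketch. It assumes
the Simple history rows have already been exported as (activity, identifier) pairs; that
export format, the function name excluded_documents, and the activity names "fetch" and
"document ingest" are assumptions made for the example and may not match what a given MCF
setup actually records.

```python
from typing import Iterable, List, Tuple


def excluded_documents(
    history: Iterable[Tuple[str, str]],
    fetch_activity: str = "fetch",              # assumed activity name
    ingest_prefix: str = "document ingest",     # assumed activity name prefix
) -> List[str]:
    """Return identifiers that were fetched but never ingested.

    `history` is a hypothetical export of Simple history rows as
    (activity, identifier) pairs, in any order.
    """
    fetched: List[str] = []   # fetched identifiers, in first-seen order
    seen = set()              # guards against duplicate fetch rows
    ingested = set()          # identifiers with at least one ingest row

    for activity, identifier in history:
        if activity == fetch_activity:
            if identifier not in seen:
                seen.add(identifier)
                fetched.append(identifier)
        elif activity.startswith(ingest_prefix):
            ingested.add(identifier)

    # Anything fetched without a matching ingest row is a candidate exclusion.
    return [doc for doc in fetched if doc not in ingested]


if __name__ == "__main__":
    rows = [
        ("fetch", "http://example.com/index.html"),
        ("document ingest (solr)", "http://example.com/index.html"),
        ("fetch", "http://example.com/logo.png"),
    ]
    for doc in excluded_documents(rows):
        print(doc)
```

This only lists candidates: a document can also be missing from the output for other reasons
(a fetch error, for example), which is why an explicit result code would be more reliable.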

From my understanding of the code, the Document filter transformation connector can
communicate its exclusion rules directly to the repository connector, so the documents that
need to be excluded are not processed by the Document filter transformation connector but
are excluded directly by the web repository connector.
So in the Simple history, a document that will be excluded only shows a "fetch" activity
and that’s it; there is no additional information about it.
Would it be possible to add a log entry with an explicit result code, such as "excluded by
document filter connector" or something similar, when the document is excluded by the
repository connector?
 
Thank you,
Best regards,
Olivier 

