manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF SOLR request default Content-Type
Date Wed, 15 Jan 2014 11:18:13 GMT
Hi Paul,

That ticket applies only to the JCIFS connector, and other connectors that
have to map extensions to mime types.  The Web connector does not have to
do that.

The Web connector has certain mime types it knows it can extract links
from, but as far as content, it leaves that up to the output connection.
Here's the code:

>>>>>>
    // There are presumably mime types we can extract links from that we can't index?
    if (interestingMimeTypeMap.get(contentType) != null)
      return true;

    boolean rval = activities.checkMimeTypeIndexable(contentType);
    if (rval == false && Logging.connectors.isDebugEnabled())
      Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not fetching because output connector does not want mimetype '"+contentType+"'");
    return rval;
<<<<<<
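
For anyone who wants to experiment with that decision outside ManifoldCF,
here is a minimal standalone sketch of the same two-step check.  Everything
in it is an assumption made for illustration: the class name, the contents
of the "interesting" map, and the Predicate that stands in for
activities.checkMimeTypeIndexable().  It is not the connector's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the Web connector's mime type decision.
class MimeTypeCheckSketch {
    // Assumed set of mime types that links can be extracted from; the real
    // connector's list may differ. HashMap is used because get(null) is legal.
    private static final Map<String, String> INTERESTING = new HashMap<>();
    static {
        INTERESTING.put("text/html", "text/html");
        INTERESTING.put("application/xhtml+xml", "application/xhtml+xml");
    }

    // Returns true if the document should be fetched: either links can be
    // extracted from it, or the output connection says it can index it.
    static boolean isContentInteresting(String contentType,
                                        Predicate<String> outputAccepts) {
        if (INTERESTING.get(contentType) != null)
            return true;
        // Stand-in for activities.checkMimeTypeIndexable(contentType).
        return outputAccepts.test(contentType);
    }

    public static void main(String[] args) {
        // Assumed output connection that only accepts PDF.
        Predicate<String> outputAccepts =
            ct -> ct != null && ct.equals("application/pdf");

        System.out.println(isContentInteresting("text/html", outputAccepts));
        System.out.println(isContentInteresting("application/pdf", outputAccepts));
        System.out.println(isContentInteresting(
            "application/vnd.google-earth.kml+xml", outputAccepts));
        // A null content type falls through to the output connection's check.
        System.out.println(isContentInteresting(null, outputAccepts));
    }
}
```

In this sketch a KML mime type is rejected only because the assumed output
predicate does not list it, which mirrors the situation the debug message
reports.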

You can tell whether this is what is happening to your document by turning
on connector debug logging in properties.xml:
<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
But if you are using the Solr connector, you can select the desired mime
types on one of the job tabs.

Karl




On Wed, Jan 15, 2014 at 5:39 AM, Paul Bieles <paulbieles@hotmail.com> wrote:

> Many thanks for the reply Karl...
>
> I discovered the following issue:
> https://issues.apache.org/jira/i#browse/CONNECTORS-768
> Extending this might help us resolve the problem.  Would it be a good idea
> to have this list in a config file, so that it could be extended more easily?
>
> Paul
>
>  ------------------------------
> Date: Tue, 14 Jan 2014 12:36:20 -0500
> Subject: Re: ManifoldCF SOLR request default Content-Type
> From: daddywri@gmail.com
> To: user@manifoldcf.apache.org
>
>
>  Hi Paul,
>
> When there is no content type on a web crawl, the ManifoldCF web connector
> does not default anything -- it sets null as the content type.
>
> The Solr output connector also does not default anything; it returns null
> to SolrJ when SolrJ requests the content type.  What SolrJ does under those
> conditions is anyone's guess, but I suspect that that is where the
> application/octet-stream content type is getting set.  I'd have to look at that
> code to be sure.
>
> Karl
>
>
>
> On Tue, Jan 14, 2014 at 12:29 PM, Paul Bieles <paulbieles@hotmail.com> wrote:
>
>  Does ManifoldCF default Content-Type to application/octet-stream for
> file types that it doesn't know? If so, is there a way to set it to
> something else? The reason I ask is I've got a load of kml files that I'm
> pushing into solr.
>
> Cheers,
>
> Paul
>
>
>
