manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: SOLR
Date Tue, 15 Mar 2011 06:57:22 GMT
No, all retrieval is being done by ManifoldCF.  Solr Cell does not fetch the
document itself; it only receives the content that ManifoldCF sends to it.
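
As a rough illustration of that push model (a sketch only, not how the
ManifoldCF Solr connector is actually written), the crawler side already
holds the document bytes and posts them to the extracting handler. The
/update/extract path is the default mentioned later in this thread and
literal.id is a Solr Cell parameter; the host, port, document URL, and the
build_extract_request helper are assumptions made up for the example. The
request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, doc_id, content, content_type):
    """Build (but do not send) a POST that pushes already-fetched bytes to Solr Cell."""
    # literal.id supplies the document's unique key as a literal field value.
    query = urlencode({"literal.id": doc_id})
    return Request(
        url=f"{solr_base}/update/extract?{query}",
        data=content,  # the document body travels inside the request
        headers={"Content-Type": content_type},
        method="POST",
    )


# Bytes the crawler has already fetched from the document's URL (made-up page).
page = b"<html><body><a href='/login'>Login</a><p>Article text</p></body></html>"

# Hypothetical Solr location and document identifier.
req = build_extract_request(
    "http://localhost:8983/solr", "http://example.com/story1", page, "text/html"
)
print(req.get_method(), req.full_url)
```

Whatever is in those bytes, including any login or navigation markup on
the page, is what the handler gets to extract text from.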

The RSS connector retrieves content from an RSS feed, which is
essentially a list of references to documents.  The feed itself is not
indexed, but the documents it refers to are.  If those documents, when
you bring them up in a browser, contain login and navigation elements,
you may well see those in the index.

The RSS connector can be configured to index only the description or
content that the feed supplies for each document, and not the document
itself, but that requires you to change one of the settings for the
job.
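
To make the difference between the two modes concrete, here is a small
sketch using only the Python standard library. It is not ManifoldCF code:
the sample feed, the example.com URLs, and the feed_items helper are all
invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up two-item RSS 2.0 feed.
FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First story</title>
    <link>http://example.com/story1</link>
    <description>Summary of the first story.</description>
  </item>
  <item>
    <title>Second story</title>
    <link>http://example.com/story2</link>
    <description>Summary of the second story.</description>
  </item>
</channel></rss>"""


def feed_items(feed_xml):
    """Return the link and description of every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {"link": item.findtext("link"), "description": item.findtext("description")}
        for item in root.iter("item")
    ]


items = feed_items(FEED)

# Mode 1: treat the feed as a list of references. Each link is fetched as a
# full web page, so page chrome (navigation, login links) reaches the index.
urls_to_fetch = [i["link"] for i in items]

# Mode 2: index only what the feed itself says about each document.
feed_only_docs = {i["link"]: i["description"] for i in items}

print(urls_to_fetch)
print(feed_only_docs)
```

In the first mode the indexed text comes from whatever the linked pages
contain; in the second it is limited to the short text carried in the feed.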

Karl

On Tue, Mar 15, 2011 at 12:37 AM, Fuad Efendi <fuad@efendi.ca> wrote:
> Hi Karl,
>
> My only guess is that we submit the URI of a document to SOLR Cell, and Solr Cell
> retrieves it from the Internet (probably using HttpClient, and maybe using its own
> robot signature?)
> Even in case of RSS...
> Only this can explain why I have "navigation" and "login" in SOLR index...
>
> Am I right?
>
>
> Thanks
>
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-15-11 12:26 AM
> To: connectors-user@incubator.apache.org
> Subject: RE: SOLR
>
> UPDATE:
> SOLR 1.4.1 (june-2010) works fine with ManifoldCF trunk.
> SOLR trunk doesn't work, and I suspect bugs in TIKA...
>
> But it is strange :)
>
> I am looking at SOLR: each document contains a huge array of "links",
> including many links to Yahoo login... something weird (it doesn't look like
> RSS)... but searchable.
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-14-11 10:50 PM
> To: 'connectors-user@incubator.apache.org'
> Subject: RE: SOLR
>
>
> I just noticed:
> Currently, the default for ManifoldCF is /update/extract, which corresponds to
> the SOLR Cell request handler.
>
> So...
> It is EXTREMELY generic...
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
>
>
