manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: RSS Connector
Date Mon, 03 Jun 2013 14:25:23 GMT
Hi Stephane,

(1) ManifoldCF always uses the URL of a document as the primary ID when it
indexes it.  This is the standard treatment and has been since Day 1.

(2) For the "creation date" attribute, the RSS connector uses the date in
the feed, if there is one.  This date is in ISO format and comes out as
the metadata value "pubdateiso".  There is also an attribute called
"pubdate", in milliseconds since the epoch, which is EITHER the date in
the feed (if present) or, if not, the date the document was fetched.
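
To make the fallback concrete, here is a rough sketch of the rule described
in (2).  It is not the connector's actual code: the function name, the
exact ISO layout, and the choice to omit "pubdateiso" when the feed has no
date are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def pubdate_fields(feed_date, fetch_time):
    """Build the two date metadata values for one feed item.

    feed_date  -- timezone-aware datetime from the feed, or None if absent
    fetch_time -- timezone-aware datetime at which the document was fetched
    (hypothetical helper; names are not taken from ManifoldCF)
    """
    fields = {}

    # "pubdate": feed date if present, otherwise the fetch time,
    # expressed as whole milliseconds since the epoch.
    effective = feed_date if feed_date is not None else fetch_time
    fields["pubdate"] = (effective - EPOCH) // timedelta(milliseconds=1)

    # "pubdateiso": only derived from the feed date (assumption: left out
    # entirely when the feed carries no date).
    if feed_date is not None:
        utc = feed_date.astimezone(timezone.utc)
        fields["pubdateiso"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")

    return fields
```

With this reading, an item without a feed date still gets a "pubdate" value
but no "pubdateiso", which would explain a missing ISO date in the index.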

As for your other question, "chromed" data comes from the URLs referenced
by the items in the feed, and "dechromed" data comes from either the
content or description field that's actually in the feed, whichever you
specify.
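
Put another way, the two modes differ only in where the document body
comes from.  The sketch below is illustrative and not taken from the
connector: the function, the mode strings, and the returned dictionary are
hypothetical, and it assumes a plain RSS <item> with un-namespaced <link>
and <description> elements.

```python
import xml.etree.ElementTree as ET


def document_source(item, mode, dechromed_field="description"):
    """Decide where the indexed content for one feed item comes from.

    item            -- an ElementTree element for a single <item>
    mode            -- "chromed" or "dechromed" (hypothetical labels)
    dechromed_field -- which feed field supplies the body when dechromed
    """
    # The item's URL is the document ID in both modes, as in point (1).
    link = item.findtext("link")

    if mode == "chromed":
        # Body is whatever is fetched from the referenced URL later on.
        return {"id": link, "fetch_url": link, "body": None}

    if mode == "dechromed":
        # Body is taken directly from the field inside the feed itself.
        return {"id": link, "fetch_url": None,
                "body": item.findtext(dechromed_field)}

    raise ValueError("mode must be 'chromed' or 'dechromed'")


sample = ET.fromstring(
    "<item>"
    "<link>http://example.com/post</link>"
    "<description>Short summary from the feed</description>"
    "</item>"
)
chromed = document_source(sample, "chromed")
dechromed = document_source(sample, "dechromed")
```

In chromed mode the index only ever contains the linked pages, which
matches what you saw: the feed entries themselves are not indexed as
documents.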

All of this is described in the end-user documentation, although I do
notice that "pubdateiso" is missing from the metadata listed there.

http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository

Karl



On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <stephane@gamard.net> wrote:

>
> Hi all,
>
>
> I'm trying to use the RSS connector for the following feed:
> http://blog.mikemccandless.com/feeds/posts/default
>
> After setting the job up and ingesting documents I have 2 pending
> questions:
> - why is the connector using the URL as ID instead of the atom ID tag?
> - I have no creation and/or modified date in my Solr document; why is
> that?
>
> Overall I am a bit confused as to where the crawler gets its
> information (chromed vs dechromed). I've downloaded the feed and tried to
> find the entries in my index but could not do so (could only find
> pages which are linked from the RSS entry).
>
> Sorry for the hassle, I'm reading over and over trying to piece it all
> together.
>
> Cheers,
>
> _Stephane
>
