manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Field mapping for RSS feed
Date Tue, 02 Aug 2011 19:04:19 GMT
Hi Kate,

Many news RSS feeds put the full article in either the item
description or the item content field, while the document described by
the url field is not just straight content but contains navigation and
advertising "chrome".  In such cases it's often preferable to generate
an index based on the description or content field contents rather
than the actual document with all of that chrome.  The Dechromed
Content options allow you to set up that behavior for a specific job.
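
As a rough illustration of that choice (a sketch only, not the RSS connector's actual code), the selection can be pictured as below. The function name `dechromed_body` and the mode strings "none", "description" and "content" are hypothetical stand-ins for the job's Dechromed Content setting, and the feed is invented sample data.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for <content:encoded> in RSS 2.0 feeds.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def dechromed_body(item, mode):
    """Return the text to index for one <item>, or None to fetch the URL instead.

    `mode` is a hypothetical stand-in for the job's dechromed setting:
    "none", "description", or "content".
    """
    if mode == "description":
        return item.findtext("description")
    if mode == "content":
        return item.findtext(CONTENT_NS + "encoded")
    return None  # "none": index the document behind <link>, chrome and all


# Invented sample feed, used only for this illustration.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example</title>
    <description>channel description</description>
    <item>
      <title>Story</title>
      <link>http://example.com/story</link>
      <description>item description</description>
      <content:encoded>full article text</content:encoded>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(FEED).find("channel/item")
for mode in ("none", "description", "content"):
    print(mode, "->", dechromed_body(item, mode))
```

With a mode of "none" the sketch returns nothing from the feed itself, which mirrors the case where the linked page is indexed instead.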

Thanks for opening the ticket; I'll propose a solution shortly.

Karl


On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
> Hi Karl,
>
> Thank you for your quick response. I've opened a Jira ticket for this,
> though I don't really understand what sort of solution you had in mind so I
> didn't propose anything.
>
> I'm afraid I don't understand exactly what the Dechromed Content options do
> either. I read about them in the End User Documentation, but there wasn't
> much there yet.
>
> I find it odd that I would be the first person to have this problem. You'd
> think it would be very common.
>
>
> Kate
>
>
> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I just looked at the code.  It's not a bug so much as an oversight of
>> sorts.  The "description" or "content" fields are indexed as the
>> primary content of the document if the "chrome" mode is selected
>> accordingly.  If "None" is the "chrome" mode, then the item-level
>> description field is ignored even when present.
>>
>> So I recommend simply adding a new kind of "description" field for
>> when the "chrome" mode is set to "None".  "item/description" may be
>> its name, or maybe the full XPath, your choice.  Propose something in
>> the ticket and I'll respond.
>>
>> Thanks!
>> Karl
>>
>>
>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > Hi Kate,
>> >
>> > The field mapping won't do the trick because the RSS connector is
>> > currently very selective about what fields it extracts - it by no
>> > means extracts all of them, so the ones that it *does* extract from
>> > the feed are "special".
>> >
>> > The behavior you describe sounds like a bug to me.  I'll go spelunking
>> > through the code at first opportunity.  In the meantime, could you
>> > create a Jira ticket describing the behavior you see vs. the behavior
>> > you want?
>> >
>> > Thanks!
>> > Karl
>> >
>> > On Tue, Aug 2, 2011 at 11:41 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> I'm trying to use ManifoldCF to index an RSS feed into Solr.  It sort of
>> >> works, but my main problem at the moment is that the *channel* description
>> >> from the RSS feed is written to the "description" field in Solr when I would
>> >> really like the *item* description to be written instead.
>> >>
>> >> I have a typical RSS feed with the general structure:
>> >>
>> >> <rss>
>> >>     <channel>
>> >>         <title></title>
>> >>         <link></link>
>> >>         <description> *** the description I don't want *** </description>
>> >>         <item>
>> >>             <title></title>
>> >>             <link></link>
>> >>             <pubDate></pubDate>
>> >>             <description> *** the description I do want *** </description>
>> >>             <author></author>
>> >>             <category></category>
>> >>         </item>
>> >>     </channel>
>> >> </rss>
>> >>
>> >> I tried setting up the field mapping on the job with the XPath address of
>> >> the second description, i.e. "/rss/channel/item/description" as the source,
>> >> but that did not work.
>> >>
>> >> I suspect I'm overlooking something simple, but I've spent 2 days trying to
>> >> solve it.  I would be grateful for any help.
>> >>
>> >>
>> >> Kate McGonigal
>> >>
>> >>
>> >>
>> >
>
>
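
For reference, the two description elements in the feed structure quoted above differ only in their path. The sketch below uses Python's standard library rather than ManifoldCF, so it says nothing about how the connector or its field mapping resolves paths; the placeholder text inside the elements is invented.

```python
import xml.etree.ElementTree as ET

# Same shape as the feed quoted in the thread, with placeholder text filled in.
FEED = """<rss>
    <channel>
        <title></title>
        <link></link>
        <description>channel description</description>
        <item>
            <title></title>
            <link></link>
            <pubDate></pubDate>
            <description>item description</description>
            <author></author>
            <category></category>
        </item>
    </channel>
</rss>"""

root = ET.fromstring(FEED)  # root is the <rss> element itself

# /rss/channel/description: one per feed
channel_description = root.findtext("channel/description")

# /rss/channel/item/description: one per item
item_descriptions = [d.text for d in root.findall("channel/item/description")]

print("channel:", channel_description)
print("items:  ", item_descriptions)
```

Paths passed to ElementTree are relative to the root element, which is why the leading "/rss" is dropped here.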
