manifoldcf-dev mailing list archives

From "Kate McGonigal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-235) item description element not indexed
Date Wed, 03 Aug 2011 22:31:28 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079085#comment-13079085 ]

Kate McGonigal commented on CONNECTORS-235:
-------------------------------------------

I'm afraid these problems still exist for me. 

A few hours ago I built the latest from trunk. It is running on PostgreSQL.

Just in case, I also started from a fresh install of Solr 3.3.0. I'm using the example that
comes with the distribution. It is thus running on Derby. I realize the schema is not optimal
for RSS feeds, but it does include a "description" field, which is what I'm interested in
at the moment.

Problem 1) When I try running the example job with "Dechromed Content" set to "No dechromed
content", what shows up in the description field (for all documents) is "Jazz radio show from
Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue.", which is not the item description from the
RSS feed's XML but the description metadata element from the website's HTML. I have tried
another RSS feed, with the same result.

Problem 2) When I try running the example job (see original post) with "Dechromed Content"
set to "if present, in 'description' field" it still hangs with the log file showing:
{quote}
FATAL 2011-08-03 16:08:21,703 (Worker thread '10') - Error tossed: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
	at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:611)
	at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
	at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1221)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:824)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
{quote}

And just to be clear on what I am ultimately trying to do: I'd like to be able to show my
searchers the "description" from the RSS feed for each of the documents that match their searches.
I actually only need to index the item-description field (as opposed to what is at the item
link) since my RSS feeds are of scientific papers that will have a detailed abstract in the
item-description.
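
To make the distinction concrete, here is a minimal Python sketch of the value I expect to see
indexed for each document. It is not ManifoldCF code and says nothing about how the RSS
connector works internally; the sample feed, its example.org URLs, and the function name
item_descriptions are all made up for illustration, and only the element layout follows the
feed structure given in the issue description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; only the element layout mirrors the issue description.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <link>http://example.org/</link>
    <description>Channel-level description</description>
    <item>
      <title>First paper</title>
      <link>http://example.org/papers/1</link>
      <description>Abstract of the first paper.</description>
    </item>
    <item>
      <title>Second paper</title>
      <link>http://example.org/papers/2</link>
      <description>Abstract of the second paper.</description>
    </item>
  </channel>
</rss>"""


def item_descriptions(feed_xml):
    """Map each item's link to the text of that item's own <description>."""
    root = ET.fromstring(feed_xml)
    result = {}
    # "channel/item" selects only <item> children of <channel>, so the
    # channel-level <description> is never picked up.
    for item in root.findall("channel/item"):
        link = (item.findtext("link") or "").strip()
        description = (item.findtext("description") or "").strip()
        result[link] = description
    return result


if __name__ == "__main__":
    for link, description in item_descriptions(SAMPLE_FEED).items():
        print(link, "->", description)
```

In other words, the text I would like in the Solr "description" field is the per-item value
returned here, not the channel description and not the metadata from the linked HTML page.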

> item description element not indexed
> ------------------------------------
>
>                 Key: CONNECTORS-235
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-235
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: RSS connector
>    Affects Versions: ManifoldCF 0.2
>            Reporter: Kate McGonigal
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.3
>
>
> The RSS feed's *item* description is not written to any field in the Solr index. 
> I have a typical RSS feed with the general structure:
> <rss>
>     <channel>
>         <title></title>
>         <link></link>
>         <description></description>
>         <item>
>             <title></title>
>             <link></link>
>             <pubDate></pubDate>
>             <description> *** the description I do want *** </description>
>             <author></author>
>             <category></category>
>         </item>
>     </channel>
> </rss>
> Example:
> For the RSS feed: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
> the rss/channel/item/description field is not indexed into Solr.
> Example notes:
>   - what does get written to the Solr "description" field is the description metadata
>     from the website, i.e. "Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue."
>     in this case.
>   - on the "Dechromed Content" tab of the job, "No dechromed content" is selected. I'm
>     not sure if that is relevant.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
