manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: RSS Connector
Date Mon, 03 Jun 2013 15:04:56 GMT
Hi Stephane,


First, there is no point in selecting dechromed content from the feed's
description field if the feed has no description field.  (In that case, by
default, the connector falls back to using the actual content from the
document link.)

Second, for this kind of feed, the connector looks for either "published"
or "updated" and takes the latter of the two if both are found.  However,
the ISO 8601 date parser we are using currently accepts only the Z (Zulu)
timezone designator, and your dates carry a -04:00 offset instead; that is
the problem.  I'll create a ticket to deal with that issue.
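
To illustrate the kind of parsing that would accept your dates, here is a
minimal sketch using the standard java.time API (Java 8 or later).  It is not
the connector's actual code: the class name AtomDateSketch and both method
names are hypothetical, and it assumes that "the latter of the two" means the
more recent timestamp.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Illustrative only; not the ManifoldCF RSS connector's date handling. */
final class AtomDateSketch {

    /**
     * Parses an ISO 8601 timestamp that carries either "Z" or a numeric
     * offset such as "-04:00". Returns null when the text is missing or
     * cannot be parsed, so a caller can simply omit the date.
     */
    static Instant parseTimestamp(String text) {
        if (text == null) {
            return null;
        }
        try {
            return OffsetDateTime
                .parse(text.trim(), DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                .toInstant();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    /**
     * Returns the more recent of the two instants, tolerating a missing
     * value on either side (assumption: "latter" means later in time).
     */
    static Instant later(Instant published, Instant updated) {
        if (published == null) {
            return updated;
        }
        if (updated == null) {
            return published;
        }
        return updated.isAfter(published) ? updated : published;
    }
}
```

With the two values from your entry, parseTimestamp accepts both the
"published" and the "updated" strings, and later(...) would pick the
"updated" one, since it is about six seconds more recent.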

Karl



On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <stephane@gamard.net> wrote:

> Hi Karl,
>
>
> Thank you for the prompt reply. Agreed on #1; the URL is perfectly fine and
> well used :). As for #2, I am still puzzled by the following. Here's an
> excerpt from the feed XML:
>
>
> <entry>
>   <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>   <published>2013-05-21T18:23:00.000-04:00</published>
>   <updated>2013-05-21T18:23:06.451-04:00</updated>
>   <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>   <title type="text">Dynamic faceting with Lucene</title>
>   <content type="html">Lucene's [...] Happy faceting!</content>
>   <link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
>   <link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
>   <link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>   <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>   <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
>   <author>
>     <name>Michael McCandless</name>
>     <uri>https://plus.google.com/112759599082866346694</uri>
>     <email>noreply@blogger.com</email>
>     <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
>   </author>
>   <thr:total>0</thr:total>
> </entry>
>
>
> Below is the document as ingested in Solr (retrieved with the query
> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
> Note that I use a catch-all field (<dynamicField name="*" type="string"
> indexed="true" multiValued="true" stored="true" omitNorms="true"/>) to
> save all submitted fields.
>
>
> There are two things I don't understand:
>
> - I've selected the option "Dechromed content, if present, in
> 'description' field", and yet I have no description field.
>
> - I have no pubDate or publication date field available.
>
>
> Here's the attached Solr output:
>
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> <lst name="params">
> <str name="fl">*</str>
> <str name="q">
> id:http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> </lst>
> </lst>
> <result name="response" numFound="1" start="0">
> <doc>
> <arr name="link">
> <str>http://blog.mikemccandless.com/favicon.ico</str>
> <str>icon</str>
> <str>image/x-icon</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>canonical</str>
> <str>alternate</str>
> <str>application/atom+xml</str>
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> <str>alternate</str>
> <str>application/rss+xml</str>
> <str>
> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
> </str>
> <str>service.post</str>
> <str>application/atom+xml</str>
> <str>
> http://www.blogger.com/feeds/8623074010562846957/posts/default
> </str>
> <str>EditURI</str>
> <str>application/rsd+xml</str>
> <str>
> http://www.blogger.com/rsd.g?blogID=8623074010562846957
> </str>
> <str>alternate</str>
> <str>application/atom+xml</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>publisher</str>
> <str>text/css</str>
> <str>stylesheet</str>
> <str>
> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
> </str>
> <str>text/css</str>
> <str>stylesheet</str>
> <str>
> //www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
> </str>
> </arr>
> <arr name="meta">
> <str>viewport</str>
> <str>width=1100</str>
> <str>stream_source_info</str>
> <str>docname</str>
> <str>stream_content_type</str>
> <str>text/html; charset=UTF-8</str>
> <str>stream_size</str>
> <str>80779</str>
> <str>Content-Encoding</str>
> <str>UTF-8</str>
> <str>stream_name</str>
> <str>docname</str>
> <str>generator</str>
> <str>blogger</str>
> <str>MSSmartTagsPreventParsing</str>
> <str>true</str>
> <str>Content-Type</str>
> <str>text/html; charset=UTF-8</str>
> <str>resourceName</str>
> <str>docname</str>
> <str>dc:title</str>
> <str>Changing Bits: Dynamic faceting with Lucene</str>
> </arr>
> <arr name="false">
> <str>rect</str>
> <str>http://blog.mikemccandless.com/</str>
> <str>rect</str>
> <str>6579597884362535238</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
> </str>
> <str>rect</str>
> <str>http://jirasearch.mikemccandless.com</str>
> <str>rect</str>
> <str>
> http://www.elasticsearch.org/guide/reference/api/search/facets/
> </str>
> <str>rect</str>
> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
> <str>rect</str>
> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
> <str>rect</str>
> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
> <str>rect</str>
> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
> <str>rect</str>
> <str>http://jirasearch.mikemccandless.com</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>bookmark</str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
> <str>tag</str>
> <str>rect</str>
> <str>comments</str>
> <str>rect</str>
> <str>comment-form</str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
> </str>
> <str>rect</str>
> <str>links</str>
> <str>rect</str>
> <str/>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>application/atom+xml</str>
> <str>rect</str>
> <str>
>
> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> <str>rect</str>
> <str>
>
> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
> </str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_05_01_archive.html
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_07_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_06_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_07_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_06_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_07_01_archive.html
> </str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
> </str>
> <str>rect</str>
> <str>http://www.blogger.com</str>
> <str>rect</str>
> <str>
> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
> </str>
> </arr>
> <arr name="img">
> <str/>
> <str>13</str>
> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
> </str>
> <str>18</str>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
> </str>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
> </str>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str>My Photo</str>
> <str>80</str>
> <str>
> //lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
> </str>
> <str>80</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>187</str>
> <str>
>
> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
> </str>
> <str>150</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> </arr>
> <arr name="iframe">
> <str>0</str>
> <str>auto</str>
> <str>410</str>
> <str>comment-editor</str>
> <str/>
> <str>100%</str>
> </arr>
> <str name="filename">docname</str>
> <str name="mimetype">text/html; charset=UTF-8</str>
> <arr name="source">
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> </arr>
> <arr name="category">
> <str>Lucene</str>
> </arr>
> <str name="id">
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <arr name="source_type">
> <str>rss</str>
> </arr>
> <arr name="title">
> <str>Dynamic faceting with Lucene</str>
> </arr>
> <arr name="title_search">
> <str>Dynamic faceting with Lucene</str>
> </arr>
> <arr name="viewport">
> <str>width=1100</str>
> </arr>
> <arr name="stream_source_info">
> <str>docname</str>
> </arr>
> <arr name="stream_content_type">
> <str>text/html; charset=UTF-8</str>
> </arr>
> <arr name="stream_size">
> <str>80779</str>
> </arr>
> <arr name="content_encoding">
> <str>UTF-8</str>
> </arr>
> <arr name="stream_name">
> <str>docname</str>
> </arr>
> <arr name="generator">
> <str>blogger</str>
> </arr>
> <arr name="mssmarttagspreventparsing">
> <str>true</str>
> </arr>
> <arr name="content_type">
> <str>text/html; charset=UTF-8</str>
> </arr>
> <arr name="resourcename">
> <str>docname</str>
> </arr>
> <arr name="dc_title">
> <str>Changing Bits: Dynamic faceting with Lucene</str>
> </arr>
> <arr name="content">
> <str>
> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21,
> 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great
> improvements recently: sizable (nearly 4X) speedups and new features like
> DrillSideways . The Jira issues search example showcases a number of facet
> features. Here I'll describe two recently committed facet features:
> sorted-set doc-values faceting, already available in 4.3, and dynamic range
> faceting, coming in the next (4.4) release. To understand these features,
> and why they are important, we first need a little background. Lucene's
> facet module does most of its work at indexing time: for each indexed
> document, it examines every facet label, each of which may be hierarchical,
> and maps each unique label in the hierarchy to an integer id, and then
> encodes all ids into a binary doc values field. A separate taxonomy index
> stores this mapping, and ensures that, even across segments, the same label
> gets the same id. At search time, faceting cost is minimal: for each
> matched document, we visit all integer ids and aggregate counts in an
> array, summarizing the results in the end, for example as top N facet
> labels by count. This is in contrast to purely dynamic faceting
> implementations like ElasticSearch 's and Solr 's, which do all work at
> search time. Such approaches are more flexible: you need not do anything
> special during indexing, and for every query you can pick and choose
> exactly which facets to compute. However, the price for that flexibility is
> slower searching, as each search must do more work for every matched
> document. Furthermore, the impact on near-real-time reopen latency can be
> horribly costly if top-level data-structures, such as Solr's
> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
> by the facet module means no extra work needs to be done on each
> near-real-time reopen. Enough background, now on to our two new features!
> Sorted-set doc-values faceting These features bring two dynamic
> alternatives to the facet module, both computing facet counts from
> previously indexed doc-values fields. The first feature, sorted-set
> doc-values faceting (see LUCENE-4795 ), allows the application to index a
> normal sorted-set doc-values field, for example: doc.add(new
> SortedSetDocValuesField("foo")); doc.add(new
> SortedSetDocValuesField("bar")); and then to compute facet counts at search
> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
> This feature does not use the taxonomy index, since all state is stored in
> the doc-values, but the tradeoff is that on each near-real-time reopen, a
> top-level data-structure is recomputed to map per-segment integer ordinals
> to global ordinals. The good news is this should be relatively low cost
> since it's just merge-sorting already sorted terms, and it doesn't need to
> visit the documents (unlike UnInvertedField). At search time there is also
> a small performance hit (~25%, depending on the query) since each
> per-segment ord must be re-mapped to the global ord space. Likely this
> could be improved (no time was spend optimizing). Furthermore, this feature
> currently only works with non-hierarchical facet fields, though this should
> be fixable (patches welcome!). Dynamic range faceting The second new
> feature, dynamic range faceting, works on top of a numeric doc-values field
> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
> You create a RangeFacetRequest, providing custom ranges with their labels.
> Each matched document is checked against all ranges and the count is
> incremented when there is a match. The range-test is a naive simple linear
> search, which is probably OK since there are usually only a few ranges, but
> we could eventually upgrade this to an interval tree to get better
> performance (patches welcome!). Likewise, this new feature does not use the
> taxonomy index, only a numeric doc-values field. This feature is especially
> useful with time-based fields. You can see it in action in the Jira issues
> search example in the Updated field. Happy faceting! Posted by Michael
> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
> Atom Comments About Me Michael McCandless Michael loves building software;
> he's been building search engines for more than a decade. In 1999 he
> co-founded iPhrase Technologies, a startup providing a user-centric
> enterprise search application, written primarily in Python and C. After IBM
> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
> committer in 2006 and PMC member in 2008. Michael has remained an active
> committer, helping to push Lucene to new places in recent years. He's
> co-author of Lucene in Action, 2nd edition. In his spare time Michael
> enjoys building his own computers, writing software to control his house
> (mostly in Python), encoding videos and tinkering with all sorts of other
> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
> template. Powered by Blogger .
> </str>
> </arr>
> <arr name="content_search">
> <str>[same text as the "content" field above]</str>
> </arr>
> <arr name="language">
> <str>en</str>
> </arr>
> <arr name="url">
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> </arr>
> <arr name="snippet">
> <str>
> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21,
> 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great
> improvements recently: sizable (nearly 4X) speedups and new features like
> DrillSideways ....At search time, faceting cost is minimal: for each
> matched document, we visit all integer ids and aggregate counts in an
> array, summarizing the results in the end, for example as top N facet
> labels by count....The range-test is a naive simple linear search, which is
> probably OK since there are usually only a few ranges, but we could
> eventually upgrade this to an interval tree to get better performance
> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
> McCandless Michael loves building software; he's been building search
> engines for more than a decade....View my complete profile Blog Archive ▼
> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
> Followers Follow by Email Simple template.
> </str>
> </arr>
> <arr name="host">
> <str>blog.mikemccandless.com</str>
> </arr>
> <arr name="path">
> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
> </arr>
> <long name="_version_">1436832383182569472</long>
> </doc>
> </result>
> </response>
>
>
>
> I can see that the entry has published and updated elements, and yet
> neither date field (pubDate or publications) is present in the Solr document.
>
>
> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>
> Hi Stephane,
>
> (1) ManifoldCF always uses the URL of a document as the primary ID when it
> indexes it.  This is the standard treatment and has been since Day 1.
>
> (2) For the "creation date" attribute, the RSS connector uses the date in
> the feed, if there is one.  This is a date in ISO format, and comes out as
> the metadata value "pubdateiso".  There is also an attribute called
> "pubdate", in milliseconds since epoch, which is the date in the feed if
> one is present, or otherwise the date the document is fetched.
>
> As for your other question, "chromed" data comes from the URLs referenced
> by the items in the feed, and "dechromed" data comes from either the
> content or description field that's actually in the feed, whichever you
> specify.
>
> All of this is described in the end-user-documentation, although I do
> notice that "pubdateiso" is missing from the metadata listed.
>
>
> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <stephane@gamard.net> wrote:
>
>>
>> Hi all,
>>
>>
>> I'm trying to use the RSS connector for the following feed:
>> http://blog.mikemccandless.com/feeds/posts/default
>>
>> After setting the job up and ingesting documents I have 2 pending
>> questions:
>> - why is the connector using the URL as ID instead of the atom ID tag?
>> - I have no creation and/or modified date in my Solr document; why is
>> that?
>>
>> Overall I am a bit confused as to where the crawler gets its
>> information (chromed vs dechromed). I've downloaded the feed and tried to
>> find its entries in my index but could not do so (I could only find
>> pages which are linked from the rss entry).
>>
>> Sorry for the hassle, I'm reading over and over trying to piece it all
>> together.
>>
>> Cheers,
>>
>> _Stephane
>>
>
>
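
As an illustration of the "pubdate" behaviour described in the quoted thread (the feed date if present, otherwise the fetch time, expressed in milliseconds since epoch), here is a minimal sketch. It is not the connector's actual code: the class and method names are hypothetical, and it uses the JDK's java.time parser purely as a stand-in to show that a timestamp with a -04:00 offset and its Z (zulu) equivalent denote the same instant.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical sketch; none of these names come from ManifoldCF.
public class PubDateSketch {

    // Returns the feed date in milliseconds since epoch if it is present
    // and parseable as an ISO 8601 date-time with an offset ("Z" or
    // "+hh:mm"/"-hh:mm"); otherwise falls back to the supplied fetch time.
    static long pubDateMillis(String feedDateIso, long fetchTimeMillis) {
        if (feedDateIso == null) {
            return fetchTimeMillis;
        }
        try {
            return OffsetDateTime.parse(feedDateIso.trim())
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            // Assumption: an unparseable date is treated like a missing one.
            return fetchTimeMillis;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Same instant written with a -04:00 offset and in zulu time.
        System.out.println(pubDateMillis("2013-05-21T18:23:00.000-04:00", now));
        System.out.println(pubDateMillis("2013-05-21T22:23:00.000Z", now));
        // No feed date: the fetch time is used instead.
        System.out.println(pubDateMillis(null, now) == now);
    }
}
```

The first two calls should print the same value, since both strings name the same instant; the fallback branch mirrors the "date the document is fetched" case.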
