manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: RSS Connector
Date Mon, 03 Jun 2013 17:09:30 GMT
CONNECTORS-700 has now been resolved.

Karl


On Mon, Jun 3, 2013 at 11:12 AM, Karl Wright <daddywri@gmail.com> wrote:

> I've created CONNECTORS-700 for the date parsing issue.
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 11:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Stephane,
>>
>>
>> First, you would not want to select dechromed content from the
>> feed description field if there is no feed description field.  (In that
>> case, by default the connector falls back to using the actual content from
>> the document link.)
>>
>> Second, for this kind of feed, the connector looks for either "published"
>> or "updated" and takes the latter of the two if both are found.  However,
>> the ISO 8601 date parser we are using currently accepts no timezone
>> designator other than Z (Zulu), and your dates carry a -04:00 offset
>> instead; that is the problem.  I'll create a ticket to deal with that issue.
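
A minimal sketch of the behaviour described above, not the connector's actual code: it parses the two timestamps from the feed excerpt with the JDK's java.time classes (Java 8 or later assumed) and keeps the later one. The class and method names are hypothetical, and reading "the latter of the two" as the later timestamp is an assumption. ISO_OFFSET_DATE_TIME accepts both Z and colon-separated offsets such as -04:00; it does not accept the compact -0400 form.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; not part of ManifoldCF.
final class AtomEntryDates {

    // Parses an ISO 8601 timestamp carrying either "Z" or an offset like "-04:00".
    static Instant parse(String text) {
        return OffsetDateTime
                .parse(text.trim(), DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                .toInstant();
    }

    // Returns the later of <published> and <updated>; either may be null if absent.
    static Instant latest(String published, String updated) {
        Instant p = (published == null) ? null : parse(published);
        Instant u = (updated == null) ? null : parse(updated);
        if (p == null) {
            return u;
        }
        if (u == null) {
            return p;
        }
        return u.isAfter(p) ? u : p;
    }

    public static void main(String[] args) {
        // Values copied from the <published> and <updated> elements in the excerpt.
        System.out.println(latest("2013-05-21T18:23:00.000-04:00",
                                  "2013-05-21T18:23:06.451-04:00"));
    }
}
```

Converting to Instant normalises both values to UTC, so timestamps with different offsets compare correctly.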
>>
>> Karl
>>
>>
>>
>> On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <stephane@gamard.net> wrote:
>>
>>> Hi Karl,
>>>
>>>
>>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>>> well used :). As for #2, I am still puzzled about the following. Here's an
>>> excerpt from the feed XML:
>>>
>>>
>>> <entry>
>>>   <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>>>   <published>2013-05-21T18:23:00.000-04:00</published>
>>>   <updated>2013-05-21T18:23:06.451-04:00</updated>
>>>   <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>>>   <title type="text">Dynamic faceting with Lucene</title>
>>>   <content type="html">Lucene's [...] Happy faceting!</content>
>>>   <link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
>>>   <link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
>>>   <link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>>   <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>>   <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
>>>   <author>
>>>     <name>Michael McCandless</name>
>>>     <uri>https://plus.google.com/112759599082866346694</uri>
>>>     <email>noreply@blogger.com</email>
>>>     <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
>>>   </author>
>>>   <thr:total>0</thr:total>
>>> </entry>
>>>
>>>
>>> Below is the document once ingested in Solr (retrieved with the query
>>> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
>>> Note that I use a catch-all field (<dynamicField name="*" type="string"
>>> indexed="true" multiValued="true" stored="true" omitNorms="true"/>) to
>>> save all submitted fields.
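
For anyone reproducing the lookup above, here is a small sketch of how that select URL can be assembled. It assumes the same local Solr base URL as in the message; the class and method names are hypothetical. Only the document id is percent-encoded, which mirrors the URL as quoted; escaping of query-parser special characters is not attempted.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building the lookup URL quoted in the message.
final class SolrIdQuery {

    // Builds <solrBase>/select?q=id:<encoded id>&fl=* for a single document id.
    static String selectUrl(String solrBase, String documentId) {
        try {
            String encodedId = URLEncoder.encode(documentId, "UTF-8");
            return solrBase + "/select?q=id:" + encodedId + "&fl=*";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM must support, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl(
                "http://localhost:8983/lucene",
                "http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"));
    }
}
```

Because fl=* returns every stored field, a date field would show up in the response if the connector had sent one.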
>>>
>>>
>>> There are two things I don't understand:
>>>
>>> - I've selected the option "Dechromed content, if present, in
>>> 'description' field", and yet I have no description field
>>>
>>> - I have no pubDate (publication date) field available
>>>
>>>
>>> Here's the attached Solr output:
>>>
>>>
>>> <response>
>>> <lst name="responseHeader">
>>> <int name="status">0</int>
>>> <int name="QTime">1</int>
>>> <lst name="params">
>>> <str name="fl">*</str>
>>> <str name="q">
>>> id:
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> </lst>
>>> </lst>
>>> <result name="response" numFound="1" start="0">
>>> <doc>
>>> <arr name="link">
>>> <str>http://blog.mikemccandless.com/favicon.ico</str>
>>> <str>icon</str>
>>> <str>image/x-icon</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>canonical</str>
>>> <str>alternate</str>
>>> <str>application/atom+xml</str>
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> <str>alternate</str>
>>> <str>application/rss+xml</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
>>> </str>
>>> <str>service.post</str>
>>> <str>application/atom+xml</str>
>>> <str>
>>> http://www.blogger.com/feeds/8623074010562846957/posts/default
>>> </str>
>>> <str>EditURI</str>
>>> <str>application/rsd+xml</str>
>>> <str>
>>> http://www.blogger.com/rsd.g?blogID=8623074010562846957
>>> </str>
>>> <str>alternate</str>
>>> <str>application/atom+xml</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>publisher</str>
>>> <str>text/css</str>
>>> <str>stylesheet</str>
>>> <str>
>>> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
>>> </str>
>>> <str>text/css</str>
>>> <str>stylesheet</str>
>>> <str>
>>> //www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
>>> </str>
>>> </arr>
>>> <arr name="meta">
>>> <str>viewport</str>
>>> <str>width=1100</str>
>>> <str>stream_source_info</str>
>>> <str>docname</str>
>>> <str>stream_content_type</str>
>>> <str>text/html; charset=UTF-8</str>
>>> <str>stream_size</str>
>>> <str>80779</str>
>>> <str>Content-Encoding</str>
>>> <str>UTF-8</str>
>>> <str>stream_name</str>
>>> <str>docname</str>
>>> <str>generator</str>
>>> <str>blogger</str>
>>> <str>MSSmartTagsPreventParsing</str>
>>> <str>true</str>
>>> <str>Content-Type</str>
>>> <str>text/html; charset=UTF-8</str>
>>> <str>resourceName</str>
>>> <str>docname</str>
>>> <str>dc:title</str>
>>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="false">
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/</str>
>>> <str>rect</str>
>>> <str>6579597884362535238</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>http://jirasearch.mikemccandless.com</str>
>>> <str>rect</str>
>>> <str>
>>> http://www.elasticsearch.org/guide/reference/api/search/facets/
>>> </str>
>>> <str>rect</str>
>>> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
>>> <str>rect</str>
>>> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
>>> <str>rect</str>
>>> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
>>> <str>rect</str>
>>> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
>>> <str>rect</str>
>>> <str>http://jirasearch.mikemccandless.com</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>bookmark</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
>>> <str>tag</str>
>>> <str>rect</str>
>>> <str>comments</str>
>>> <str>rect</str>
>>> <str>comment-form</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
>>> </str>
>>> <str>rect</str>
>>> <str>links</str>
>>> <str>rect</str>
>>> <str/>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>application/atom+xml</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_06_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_06_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
>>> </str>
>>> <str>rect</str>
>>> <str>http://www.blogger.com</str>
>>> <str>rect</str>
>>> <str>
>>> //www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
>>> </str>
>>> </arr>
>>> <arr name="img">
>>> <str/>
>>> <str>13</str>
>>> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
>>> </str>
>>> <str>18</str>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>>> </str>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>>> </str>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str>My Photo</str>
>>> <str>80</str>
>>> <str>
>>> //lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
>>> </str>
>>> <str>80</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>187</str>
>>> <str>
>>>
>>> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
>>> </str>
>>> <str>150</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> </arr>
>>> <arr name="iframe">
>>> <str>0</str>
>>> <str>auto</str>
>>> <str>410</str>
>>> <str>comment-editor</str>
>>> <str/>
>>> <str>100%</str>
>>> </arr>
>>> <str name="filename">docname</str>
>>> <str name="mimetype">text/html; charset=UTF-8</str>
>>> <arr name="source">
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> </arr>
>>> <arr name="category">
>>> <str>Lucene</str>
>>> </arr>
>>> <str name="id">
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <arr name="source_type">
>>> <str>rss</str>
>>> </arr>
>>> <arr name="title">
>>> <str>Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="title_search">
>>> <str>Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="viewport">
>>> <str>width=1100</str>
>>> </arr>
>>> <arr name="stream_source_info">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="stream_content_type">
>>> <str>text/html; charset=UTF-8</str>
>>> </arr>
>>> <arr name="stream_size">
>>> <str>80779</str>
>>> </arr>
>>> <arr name="content_encoding">
>>> <str>UTF-8</str>
>>> </arr>
>>> <arr name="stream_name">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="generator">
>>> <str>blogger</str>
>>> </arr>
>>> <arr name="mssmarttagspreventparsing">
>>> <str>true</str>
>>> </arr>
>>> <arr name="content_type">
>>> <str>text/html; charset=UTF-8</str>
>>> </arr>
>>> <arr name="resourcename">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="dc_title">
>>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="content">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways . The Jira issues search example showcases a number of
>>> facet features. Here I'll describe two recently committed facet features:
>>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>>> faceting, coming in the next (4.4) release. To understand these features,
>>> and why they are important, we first need a little background. Lucene's
>>> facet module does most of its work at indexing time: for each indexed
>>> document, it examines every facet label, each of which may be hierarchical,
>>> and maps each unique label in the hierarchy to an integer id, and then
>>> encodes all ids into a binary doc values field. A separate taxonomy index
>>> stores this mapping, and ensures that, even across segments, the same label
>>> gets the same id. At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count. This is in contrast to purely dynamic faceting
>>> implementations like ElasticSearch 's and Solr 's, which do all work at
>>> search time. Such approaches are more flexible: you need not do anything
>>> special during indexing, and for every query you can pick and choose
>>> exactly which facets to compute. However, the price for that flexibility is
>>> slower searching, as each search must do more work for every matched
>>> document. Furthermore, the impact on near-real-time reopen latency can be
>>> horribly costly if top-level data-structures, such as Solr's
>>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>>> by the facet module means no extra work needs to be done on each
>>> near-real-time reopen. Enough background, now on to our two new features!
>>> Sorted-set doc-values faceting These features bring two dynamic
>>> alternatives to the facet module, both computing facet counts from
>>> previously indexed doc-values fields. The first feature, sorted-set
>>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>>> normal sorted-set doc-values field, for example: doc.add(new
>>> SortedSetDocValuesField("foo")); doc.add(new
>>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>>> This feature does not use the taxonomy index, since all state is stored in
>>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>>> top-level data-structure is recomputed to map per-segment integer ordinals
>>> to global ordinals. The good news is this should be relatively low cost
>>> since it's just merge-sorting already sorted terms, and it doesn't need to
>>> visit the documents (unlike UnInvertedField). At search time there is also
>>> a small performance hit (~25%, depending on the query) since each
>>> per-segment ord must be re-mapped to the global ord space. Likely this
>>> could be improved (no time was spend optimizing). Furthermore, this feature
>>> currently only works with non-hierarchical facet fields, though this should
>>> be fixable (patches welcome!). Dynamic range faceting The second new
>>> feature, dynamic range faceting, works on top of a numeric doc-values field
>>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>>> You create a RangeFacetRequest, providing custom ranges with their labels.
>>> Each matched document is checked against all ranges and the count is
>>> incremented when there is a match. The range-test is a naive simple linear
>>> search, which is probably OK since there are usually only a few ranges, but
>>> we could eventually upgrade this to an interval tree to get better
>>> performance (patches welcome!). Likewise, this new feature does not use the
>>> taxonomy index, only a numeric doc-values field. This feature is especially
>>> useful with time-based fields. You can see it in action in the Jira issues
>>> search example in the Updated field. Happy faceting! Posted by Michael
>>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>>> Atom Comments About Me Michael McCandless Michael loves building software;
>>> he's been building search engines for more than a decade. In 1999 he
>>> co-founded iPhrase Technologies, a startup providing a user-centric
>>> enterprise search application, written primarily in Python and C. After IBM
>>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>>> committer in 2006 and PMC member in 2008. Michael has remained an active
>>> committer, helping to push Lucene to new places in recent years. He's
>>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>>> enjoys building his own computers, writing software to control his house
>>> (mostly in Python), encoding videos and tinkering with all sorts of other
>>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>>> template. Powered by Blogger .
>>> </str>
>>> </arr>
>>> <arr name="content_search">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways . The Jira issues search example showcases a number of
>>> facet features. Here I'll describe two recently committed facet features:
>>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>>> faceting, coming in the next (4.4) release. To understand these features,
>>> and why they are important, we first need a little background. Lucene's
>>> facet module does most of its work at indexing time: for each indexed
>>> document, it examines every facet label, each of which may be hierarchical,
>>> and maps each unique label in the hierarchy to an integer id, and then
>>> encodes all ids into a binary doc values field. A separate taxonomy index
>>> stores this mapping, and ensures that, even across segments, the same label
>>> gets the same id. At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count. This is in contrast to purely dynamic faceting
>>> implementations like ElasticSearch 's and Solr 's, which do all work at
>>> search time. Such approaches are more flexible: you need not do anything
>>> special during indexing, and for every query you can pick and choose
>>> exactly which facets to compute. However, the price for that flexibility is
>>> slower searching, as each search must do more work for every matched
>>> document. Furthermore, the impact on near-real-time reopen latency can be
>>> horribly costly if top-level data-structures, such as Solr's
>>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>>> by the facet module means no extra work needs to be done on each
>>> near-real-time reopen. Enough background, now on to our two new features!
>>> Sorted-set doc-values faceting These features bring two dynamic
>>> alternatives to the facet module, both computing facet counts from
>>> previously indexed doc-values fields. The first feature, sorted-set
>>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>>> normal sorted-set doc-values field, for example: doc.add(new
>>> SortedSetDocValuesField("foo")); doc.add(new
>>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>>> This feature does not use the taxonomy index, since all state is stored in
>>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>>> top-level data-structure is recomputed to map per-segment integer ordinals
>>> to global ordinals. The good news is this should be relatively low cost
>>> since it's just merge-sorting already sorted terms, and it doesn't need to
>>> visit the documents (unlike UnInvertedField). At search time there is also
>>> a small performance hit (~25%, depending on the query) since each
>>> per-segment ord must be re-mapped to the global ord space. Likely this
>>> could be improved (no time was spend optimizing). Furthermore, this feature
>>> currently only works with non-hierarchical facet fields, though this should
>>> be fixable (patches welcome!). Dynamic range faceting The second new
>>> feature, dynamic range faceting, works on top of a numeric doc-values field
>>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>>> You create a RangeFacetRequest, providing custom ranges with their labels.
>>> Each matched document is checked against all ranges and the count is
>>> incremented when there is a match. The range-test is a naive simple linear
>>> search, which is probably OK since there are usually only a few ranges, but
>>> we could eventually upgrade this to an interval tree to get better
>>> performance (patches welcome!). Likewise, this new feature does not use the
>>> taxonomy index, only a numeric doc-values field. This feature is especially
>>> useful with time-based fields. You can see it in action in the Jira issues
>>> search example in the Updated field. Happy faceting! Posted by Michael
>>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>>> Atom Comments About Me Michael McCandless Michael loves building software;
>>> he's been building search engines for more than a decade. In 1999 he
>>> co-founded iPhrase Technologies, a startup providing a user-centric
>>> enterprise search application, written primarily in Python and C. After IBM
>>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>>> committer in 2006 and PMC member in 2008. Michael has remained an active
>>> committer, helping to push Lucene to new places in recent years. He's
>>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>>> enjoys building his own computers, writing software to control his house
>>> (mostly in Python), encoding videos and tinkering with all sorts of other
>>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>>> template. Powered by Blogger .
>>> </str>
>>> </arr>
>>> <arr name="language">
>>> <str>en</str>
>>> </arr>
>>> <arr name="url">
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> </arr>
>>> <arr name="snippet">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways ....At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count....The range-test is a naive simple linear search, which is
>>> probably OK since there are usually only a few ranges, but we could
>>> eventually upgrade this to an interval tree to get better performance
>>> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
>>> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
>>> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
>>> McCandless Michael loves building software; he's been building search
>>> engines for more than a decade....View my complete profile Blog Archive ▼
>>> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
>>> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
>>> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
>>> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
>>> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
>>> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
>>> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
>>> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
>>> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
>>> Followers Follow by Email Simple template.
>>> </str>
>>> </arr>
>>> <arr name="host">
>>> <str>blog.mikemccandless.com</str>
>>> </arr>
>>> <arr name="path">
>>> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
>>> </arr>
>>> <long name="_version_">1436832383182569472</long>
>>> </doc>
>>> </result>
>>> </response>
>>>
>>>
>>>
>>> I can see there is published and updated markup in the entry, and yet
>>> none of those fields (pubDate or publications) is present in the Solr document.
>>>
>>>
>>> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>>>
>>> Hi Stephane,
>>>
>>> (1) ManifoldCF always uses the URL of a document as the primary ID when
>>> it indexes it.  This is the standard treatment and has been since Day 1.
>>>
>>> (2) For the "creation date" attribute, the RSS connector uses the date
>>> in the feed, if there is one.  This is a date in ISO format, and comes out
>>> as the metadata value "pubdateiso".  There is also an attribute called
>>> "pubdate", in milliseconds since epoch, which is either the date in the
>>> feed (if present) or, if not, the date the document was fetched.
>>>
>>> As for your other question, "chromed" data comes from the URLs
>>> referenced by the items in the feed, and "dechromed" data comes from either
>>> the content or description field that's actually in the feed, whichever you
>>> specify.
>>>
>>> All of this is described in the end-user-documentation, although I do
>>> notice that "pubdateiso" is missing from the metadata listed.
>>>
>>>
>>> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>>>
>>> Karl
>>>
>>>
>>>
>>> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <stephane@gamard.net> wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> I'm trying to use the RSS connector for the following feed:
>>>> http://blog.mikemccandless.com/feeds/posts/default
>>>>
>>>> After setting the job up and ingesting documents I have 2 pending
>>>> questions:
>>>> - why is the connector using the URL as ID instead of the atom ID tag?
>>>> - I have no creation and/or modified date in my Solr document; why is
>>>> that?
>>>>
>>>> Overall I am a bit confused as to where the crawler gets its
>>>> information (chromed vs dechromed). I've downloaded the feed and tried to
>>>> find the entries in my index but could not do so (I could only find
>>>> pages which are linked from the rss entry).
>>>>
>>>> Sorry for the hassle, I'm reading over and over trying to piece it all
>>>> together.
>>>>
>>>> Cheers,
>>>>
>>>> _Stephane
>>>>
>>>
>>>
>>
>
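
The "pubdate" / "pubdateiso" behaviour Karl describes in the quoted thread can be sketched roughly as follows. This is only an illustration in Python, not the connector's actual code: the function names, the returned dictionary, the UTC "Z" rendering of "pubdateiso", and the handling of unparseable or zone-less dates are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Milliseconds since epoch, using integer arithmetic to avoid float rounding."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def pubdate_fields(feed_date: Optional[str], fetch_time: datetime) -> dict:
    """Hypothetical helper: build the date metadata for one feed entry.

    feed_date  -- the entry's ISO 8601 date string, or None if the feed has none
    fetch_time -- timezone-aware time at which the document was fetched
    """
    parsed = None
    if feed_date:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            parsed = datetime.fromisoformat(feed_date.strip().replace("Z", "+00:00"))
        except ValueError:
            parsed = None  # assumption: an unparseable date counts as missing
    if parsed is None:
        # No usable feed date: "pubdate" falls back to the fetch time.
        return {"pubdate": to_epoch_ms(fetch_time)}
    if parsed.tzinfo is None:
        # Assumption: a date without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    utc = parsed.astimezone(timezone.utc)
    iso = utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
    return {"pubdate": to_epoch_ms(utc), "pubdateiso": iso}


if __name__ == "__main__":
    fetched = datetime(2013, 6, 3, 14, 0, tzinfo=timezone.utc)
    print(pubdate_fields("2013-05-21T18:23:06.451-04:00", fetched))
    print(pubdate_fields(None, fetched))
```

With the "updated" value from the feed excerpt, a date carrying a -04:00 offset and the same instant written with Z produce the same "pubdate", which is the behaviour one would expect from a parser that accepts offsets other than Z.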
