manifoldcf-user mailing list archives

From Stephane Gamard <steph...@gamard.net>
Subject Re: RSS Connector
Date Mon, 03 Jun 2013 14:48:56 GMT
Hi Karl, 

Thank you for the prompt reply. Agreed on #1, url is perfectly fine and well used :). As for
#2, I am still puzzled about the following. Here's an excerpt from the feed XML:



	<entry>
		<id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
		<published>2013-05-21T18:23:00.000-04:00</published>
		<updated>2013-05-21T18:23:06.451-04:00</updated>
		<category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
		<title type="text">Dynamic faceting with Lucene</title>
		<content type="html">Lucene's [...] Happy faceting!</content>
		<link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default"
title="Post Comments"/>
		<link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form"
title="0 Comments"/>
		<link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
		<link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
		<link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"
title="Dynamic faceting with Lucene"/>
		<author>
			<name>Michael McCandless</name>
			<uri>https://plus.google.com/112759599082866346694</uri>
			<email>noreply@blogger.com</email>
			<gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
		</author>
		<thr:total>0</thr:total>
	</entry>


Below is the document once ingested in Solr (retrieved with the query http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
Note that I use a catch-all field (<dynamicField name="*"  type="string"  indexed="true"
 multiValued="true" stored="true" omitNorms="true"/>) to store every submitted field.
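
For anyone reproducing the lookup, here is a minimal Python sketch of how the query URL above can be put together. The helper name solr_select_url is hypothetical; the base URL and document ID are the ones from this message, and the only point is that the ID has to be percent-encoded before it goes into the q parameter.

```python
from urllib.parse import quote


def solr_select_url(core_url: str, doc_id: str) -> str:
    """Build a select URL that looks up one document by its id field.

    core_url and doc_id are supplied by the caller; the id is percent-encoded
    in full (safe="") so that ':' and '/' do not break the query string.
    """
    encoded_id = quote(doc_id, safe="")
    return f"{core_url}/select?q=id:{encoded_id}&fl=*"


url = solr_select_url(
    "http://localhost:8983/lucene",
    "http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html",
)
print(url)
```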

There are two things I don't understand:
- I've selected the option "Dechromed content, if present, in 'description' field", and
yet I have no description field
- I have no pubDate or publications field available

Here's the attached Solr output:


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*</str>
<str name="q">
id:http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="link">
<str>http://blog.mikemccandless.com/favicon.ico</str>
<str>icon</str>
<str>image/x-icon</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>canonical</str>
<str>alternate</str>
<str>application/atom+xml</str>
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
<str>alternate</str>
<str>application/rss+xml</str>
<str>
http://blog.mikemccandless.com/feeds/posts/default?alt=rss
</str>
<str>service.post</str>
<str>application/atom+xml</str>
<str>
http://www.blogger.com/feeds/8623074010562846957/posts/default
</str>
<str>EditURI</str>
<str>application/rsd+xml</str>
<str>
http://www.blogger.com/rsd.g?blogID=8623074010562846957
</str>
<str>alternate</str>
<str>application/atom+xml</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>publisher</str>
<str>text/css</str>
<str>stylesheet</str>
<str>
//www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
</str>
<str>text/css</str>
<str>stylesheet</str>
<str>
//www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
</str>
</arr>
<arr name="meta">
<str>viewport</str>
<str>width=1100</str>
<str>stream_source_info</str>
<str>docname</str>
<str>stream_content_type</str>
<str>text/html; charset=UTF-8</str>
<str>stream_size</str>
<str>80779</str>
<str>Content-Encoding</str>
<str>UTF-8</str>
<str>stream_name</str>
<str>docname</str>
<str>generator</str>
<str>blogger</str>
<str>MSSmartTagsPreventParsing</str>
<str>true</str>
<str>Content-Type</str>
<str>text/html; charset=UTF-8</str>
<str>resourceName</str>
<str>docname</str>
<str>dc:title</str>
<str>Changing Bits: Dynamic faceting with Lucene</str>
</arr>
<arr name="false">
<str>rect</str>
<str>http://blog.mikemccandless.com/</str>
<str>rect</str>
<str>6579597884362535238</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
</str>
<str>rect</str>
<str>http://jirasearch.mikemccandless.com</str>
<str>rect</str>
<str>
http://www.elasticsearch.org/guide/reference/api/search/facets/
</str>
<str>rect</str>
<str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
<str>rect</str>
<str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
<str>rect</str>
<str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
<str>rect</str>
<str>http://en.wikipedia.org/wiki/Interval_tree</str>
<str>rect</str>
<str>http://jirasearch.mikemccandless.com</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>bookmark</str>
<str>rect</str>
<str>
http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
</str>
<str>rect</str>
<str>
http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/search/label/Lucene</str>
<str>tag</str>
<str>rect</str>
<str>comments</str>
<str>rect</str>
<str>comment-form</str>
<str>rect</str>
<str>
http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
</str>
<str>rect</str>
<str>links</str>
<str>rect</str>
<str/>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>application/atom+xml</str>
<str>rect</str>
<str>
http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>
http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>
http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
<str>rect</str>
<str>
http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
http://affiliate.manning.com/idevaffiliate.php?id=1171_147
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_05_01_archive.html
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_07_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_06_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_07_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_06_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_07_01_archive.html
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
</str>
<str>rect</str>
<str>http://www.blogger.com</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
</str>
</arr>
<arr name="img">
<str/>
<str>13</str>
<str>http://img1.blogblog.com/img/icon18_email.gif</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img2.blogblog.com/img/icon18_edit_allbkg.gif
</str>
<str>18</str>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
</str>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
</str>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str>My Photo</str>
<str>80</str>
<str>
//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
</str>
<str>80</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>187</str>
<str>
http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
</str>
<str>150</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
</arr>
<arr name="iframe">
<str>0</str>
<str>auto</str>
<str>410</str>
<str>comment-editor</str>
<str/>
<str>100%</str>
</arr>
<str name="filename">docname</str>
<str name="mimetype">text/html; charset=UTF-8</str>
<arr name="source">
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
</arr>
<arr name="category">
<str>Lucene</str>
</arr>
<str name="id">
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<arr name="source_type">
<str>rss</str>
</arr>
<arr name="title">
<str>Dynamic faceting with Lucene</str>
</arr>
<arr name="title_search">
<str>Dynamic faceting with Lucene</str>
</arr>
<arr name="viewport">
<str>width=1100</str>
</arr>
<arr name="stream_source_info">
<str>docname</str>
</arr>
<arr name="stream_content_type">
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="stream_size">
<str>80779</str>
</arr>
<arr name="content_encoding">
<str>UTF-8</str>
</arr>
<arr name="stream_name">
<str>docname</str>
</arr>
<arr name="generator">
<str>blogger</str>
</arr>
<arr name="mssmarttagspreventparsing">
<str>true</str>
</arr>
<arr name="content_type">
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="resourcename">
<str>docname</str>
</arr>
<arr name="dc_title">
<str>Changing Bits: Dynamic faceting with Lucene</str>
</arr>
<arr name="content">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting
with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly
4X) speedups and new features like DrillSideways . The Jira issues search example showcases
a number of facet features. Here I'll describe two recently committed facet features: sorted-set
doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next
(4.4) release. To understand these features, and why they are important, we first need a little
background. Lucene's facet module does most of its work at indexing time: for each indexed
document, it examines every facet label, each of which may be hierarchical, and maps each
unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc
values field. A separate taxonomy index stores this mapping, and ensures that, even across
segments, the same label gets the same id. At search time, faceting cost is minimal: for each
matched document, we visit all integer ids and aggregate counts in an array, summarizing the
results in the end, for example as top N facet labels by count. This is in contrast to purely
dynamic faceting implementations like ElasticSearch 's and Solr 's, which do all work at search
time. Such approaches are more flexible: you need not do anything special during indexing,
and for every query you can pick and choose exactly which facets to compute. However, the
price for that flexibility is slower searching, as each search must do more work for every
matched document. Furthermore, the impact on near-real-time reopen latency can be horribly
costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every
reopen. The taxonomy index used by the facet module means no extra work needs to be done on
each near-real-time reopen. Enough background, now on to our two new features! Sorted-set
doc-values faceting These features bring two dynamic alternatives to the facet module, both
computing facet counts from previously indexed doc-values fields. The first feature, sorted-set
doc-values faceting (see LUCENE-4795 ), allows the application to index a normal sorted-set
doc-values field, for example: doc.add(new SortedSetDocValuesField("foo")); doc.add(new SortedSetDocValuesField("bar"));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
This feature does not use the taxonomy index, since all state is stored in the doc-values,
but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed
to map per-segment integer ordinals to global ordinals. The good news is this should be relatively
low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit
the documents (unlike UnInvertedField). At search time there is also a small performance hit
(~25%, depending on the query) since each per-segment ord must be re-mapped to the global
ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this
feature currently only works with non-hierarchical facet fields, though this should be fixable
(patches welcome!). Dynamic range faceting The second new feature, dynamic range faceting,
works on top of a numeric doc-values field (see LUCENE-4965 ), and implements dynamic faceting
over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels.
Each matched document is checked against all ranges and the count is incremented when there
is a match. The range-test is a naive simple linear search, which is probably OK since there
are usually only a few ranges, but we could eventually upgrade this to an interval tree to
get better performance (patches welcome!). Likewise, this new feature does not use the taxonomy
index, only a numeric doc-values field. This feature is especially useful with time-based
fields. You can see it in action in the Jira issues search example in the Updated field. Happy
faceting! Posted by Michael McCandless on 5/21/2013 Email This BlogThis! Share to Twitter
Share to Facebook Labels: Lucene No comments: Post a Comment Older Post Home Subscribe to:
Post Comments (Atom) Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
McCandless Michael loves building software; he's been building search engines for more than
a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise
search application, written primarily in Python and C. After IBM acquired iPhrase in 2005,
Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael
has remained an active committer, helping to push Lucene to new places in recent years. He's
co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his
own computers, writing software to control his house (mostly in Python), encoding videos and
tinkering with all sorts of other things. View my complete profile Blog Archive ▼  2013
(5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February
(1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September
(1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
(2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3)
►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010
(43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August
(4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February
(3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September
(4) ►  August (6) ►  July (5) Followers Follow by Email Simple template. Powered by
Blogger .
</str>
</arr>
<arr name="content_search">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting
with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly
4X) speedups and new features like DrillSideways . The Jira issues search example showcases
a number of facet features. Here I'll describe two recently committed facet features: sorted-set
doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next
(4.4) release. To understand these features, and why they are important, we first need a little
background. Lucene's facet module does most of its work at indexing time: for each indexed
document, it examines every facet label, each of which may be hierarchical, and maps each
unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc
values field. A separate taxonomy index stores this mapping, and ensures that, even across
segments, the same label gets the same id. At search time, faceting cost is minimal: for each
matched document, we visit all integer ids and aggregate counts in an array, summarizing the
results in the end, for example as top N facet labels by count. This is in contrast to purely
dynamic faceting implementations like ElasticSearch 's and Solr 's, which do all work at search
time. Such approaches are more flexible: you need not do anything special during indexing,
and for every query you can pick and choose exactly which facets to compute. However, the
price for that flexibility is slower searching, as each search must do more work for every
matched document. Furthermore, the impact on near-real-time reopen latency can be horribly
costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every
reopen. The taxonomy index used by the facet module means no extra work needs to be done on
each near-real-time reopen. Enough background, now on to our two new features! Sorted-set
doc-values faceting These features bring two dynamic alternatives to the facet module, both
computing facet counts from previously indexed doc-values fields. The first feature, sorted-set
doc-values faceting (see LUCENE-4795 ), allows the application to index a normal sorted-set
doc-values field, for example: doc.add(new SortedSetDocValuesField("foo")); doc.add(new SortedSetDocValuesField("bar"));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
This feature does not use the taxonomy index, since all state is stored in the doc-values,
but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed
to map per-segment integer ordinals to global ordinals. The good news is this should be relatively
low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit
the documents (unlike UnInvertedField). At search time there is also a small performance hit
(~25%, depending on the query) since each per-segment ord must be re-mapped to the global
ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this
feature currently only works with non-hierarchical facet fields, though this should be fixable
(patches welcome!). Dynamic range faceting The second new feature, dynamic range faceting,
works on top of a numeric doc-values field (see LUCENE-4965 ), and implements dynamic faceting
over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels.
Each matched document is checked against all ranges and the count is incremented when there
is a match. The range-test is a naive simple linear search, which is probably OK since there
are usually only a few ranges, but we could eventually upgrade this to an interval tree to
get better performance (patches welcome!). Likewise, this new feature does not use the taxonomy
index, only a numeric doc-values field. This feature is especially useful with time-based
fields. You can see it in action in the Jira issues search example in the Updated field. Happy
faceting! Posted by Michael McCandless on 5/21/2013 Email This BlogThis! Share to Twitter
Share to Facebook Labels: Lucene No comments: Post a Comment Older Post Home Subscribe to:
Post Comments (Atom) Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
McCandless Michael loves building software; he's been building search engines for more than
a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise
search application, written primarily in Python and C. After IBM acquired iPhrase in 2005,
Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael
has remained an active committer, helping to push Lucene to new places in recent years. He's
co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his
own computers, writing software to control his house (mostly in Python), encoding videos and
tinkering with all sorts of other things. View my complete profile Blog Archive ▼  2013
(5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February
(1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September
(1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
(2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3)
►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010
(43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August
(4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February
(3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September
(4) ►  August (6) ►  July (5) Followers Follow by Email Simple template. Powered by
Blogger .
</str>
</arr>
<arr name="language">
<str>en</str>
</arr>
<arr name="url">
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
</arr>
<arr name="snippet">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting
with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly
4X) speedups and new features like DrillSideways ....At search time, faceting cost is minimal:
for each matched document, we visit all integer ids and aggregate counts in an array, summarizing
the results in the end, for example as top N facet labels by count....The range-test is a
naive simple linear search, which is probably OK since there are usually only a few ranges,
but we could eventually upgrade this to an interval tree to get better performance (patches
welcome!)....Share to Twitter Share to Facebook Labels: Lucene No comments: Post a Comment
Older Post Home Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
Atom Comments About Me Michael McCandless Michael loves building software; he's been building
search engines for more than a decade....View my complete profile Blog Archive ▼  2013
(5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February
(1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September
(1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
(2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3)
►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010
(43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August
(4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February
(3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September
(4) ►  August (6) ►  July (5) Followers Follow by Email Simple template.
</str>
</arr>
<arr name="host">
<str>blog.mikemccandless.com</str>
</arr>
<arr name="path">
<str>/2013/05/dynamic-faceting-with-lucene.html</str>
</arr>
<long name="_version_">1436832383182569472</long>
</doc>
</result>
</response>




I can see that the feed contains published and updated elements, and yet neither of those
fields (pubDate or publications) is present in the Solr document.
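
To double-check what the feed itself carries, a small Python sketch like the one below can pull the dates and the alternate link out of an entry. This is only an illustration and says nothing about how the RSS connector parses feeds internally. The sample is a trimmed copy of the entry above, wrapped in a feed element with the standard Atom namespace declared (an assumption, since the excerpt does not show the root element), and entry_summary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace, assumed to be declared on the feed's root element.
ATOM = "{http://www.w3.org/2005/Atom}"

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
    <published>2013-05-21T18:23:00.000-04:00</published>
    <updated>2013-05-21T18:23:06.451-04:00</updated>
    <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
    <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"/>
  </entry>
</feed>"""


def entry_summary(entry):
    """Return the id, dates, and rel="alternate" link of one Atom entry."""
    alternate = None
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "alternate":
            alternate = link.get("href")
            break
    return {
        "id": entry.findtext(ATOM + "id"),
        "published": entry.findtext(ATOM + "published"),
        "updated": entry.findtext(ATOM + "updated"),
        "alternate": alternate,
    }


entries = [entry_summary(e) for e in ET.fromstring(SAMPLE).findall(ATOM + "entry")]
for item in entries:
    print(item)
```

The alternate link is the value that ends up as the Solr id, while published and updated are the dates I would expect to see carried over.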



On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
Hi Stephane,

(1) ManifoldCF always uses the URL of a document as the primary ID when it indexes it.  This
is the standard treatment and has been since Day 1.

(2) For the "creation date" attribute, the RSS connector uses the date in the feed, if there
is one.  This is a date in ISO format, and comes out as the metadata value "pubdateiso".
There is also an attribute called "pubdate", which is in milliseconds since epoch, and which is
EITHER the date in the feed (if present) or, if not, the date the document is fetched.
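
As an illustration of how the two representations relate, here is a minimal Python sketch that converts an ISO timestamp to milliseconds since epoch and back. The function names are hypothetical, and the exact string format the connector emits for "pubdateiso" is not shown in this thread, so treat the output format as an assumption. The input is the published value from the feed entry quoted in this thread.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(iso_text: str) -> int:
    """Parse an ISO 8601 timestamp carrying a UTC offset into ms since epoch."""
    parsed = datetime.fromisoformat(iso_text)
    # Integer division by a 1 ms timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def epoch_millis_to_iso(millis: int) -> str:
    """Render ms since epoch as an ISO 8601 timestamp in UTC."""
    return (EPOCH + timedelta(milliseconds=millis)).isoformat()


published = "2013-05-21T18:23:00.000-04:00"
millis = iso_to_epoch_millis(published)
print(millis)
print(epoch_millis_to_iso(millis))
```

A value in this range in the "pubdate" field would indicate that the feed date was picked up; a value close to the crawl time would suggest the fetch date was used instead.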

As for your other question, "chromed" data comes from the URLs referenced by the items in
the feed, and "dechromed" data comes from either the content or description field that's actually
in the feed, whichever you specify.

All of this is described in the end-user-documentation, although I do notice that "pubdateiso"
is missing from the metadata listed.

http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository

Karl



On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <stephane@gamard.net> wrote:

Hi all, 

I'm trying to use the RSS connector for the following feed: http://blog.mikemccandless.com/feeds/posts/default

After setting the job up and ingesting documents I have 2 pending questions: 
- why is the connector using the URL as ID instead of the atom ID tag?
- I have no creation and/or modified date in my Solr document; why is that?

Overall I am a bit confused as to where the crawler gets its information (chromed vs
dechromed). I've downloaded the feed and tried to find its entries in my index but
could not do so (I could only find pages which are linked from the RSS entries).

Sorry for the hassle, I'm reading over and over trying to piece it all together.

Cheers, 

_Stephane