lucene-solr-dev mailing list archives

From "Fergus McMenemie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed
Date Thu, 19 Mar 2009 17:28:50 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683532#action_12683532 ]

Fergus McMenemie commented on SOLR-1060:
----------------------------------------

I have applied the latest version of SOLR-1059 and I just *cannot* get delete to work!

{code}
<entity name="single-delete"
        dataSource="myURIreader"
        processor="XPathEntityProcessor"
        url="${dataimporter.request.single}"
        rootEntity="true"
        flatten="true"
        stream="false"
        forEach="/record | /record/mediaBlock"
        transformer="TemplateTransformer">

  <field column="$skipDoc"          template="true" />
  <field column="fileAbsolutePath"  template="${dataimporter.request.single}" />
  <field column="$deleteDocByQuery" template="fileAbsolutePath:${dataimporter.request.single}" />
  <field column="vdkvgwkey"         template="${dataimporter.request.single}" />

</entity>
{code}

Here is a section of the log file. The query for the file returns 3 hits before the delete
attempt and still returns 3 hits afterwards, so nothing was removed.

{code}
Mar 19, 2009 5:24:52 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=10

Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=0
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.URLDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: single-delete document : SolrInputDocument[{}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null
Processing Document # 1
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:112)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
Caused by: java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
	... 10 more
Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null
Processing Document # 1
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:112)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
Caused by: java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
	... 10 more
Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@281e7e main
Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@281e7e main
	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@281e7e main
	filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@281e7e main
	queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@281e7e main
	documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}
Mar 19, 2009 5:25:04 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@281e7e main
Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={rows=10&start=0&q=solr} hits=0 status=0 QTime=6

Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={rows=10&start=0&q=rocks} hits=90 status=0 QTime=34

Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={q=static+newSearcher+warming+query+from+solrconfig.xml} hits=12327 status=0 QTime=98
Mar 19, 2009 5:25:04 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@281e7e main
Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing Searcher@7740f6 main
	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}



Mar 19, 2009 5:25:12 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=11
{code}
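
The `no protocol` message in the log can be reproduced outside Solr. The following is a minimal sketch that uses only `java.net.URL`; it says nothing about how DIH builds the string internally. It only shows that a string shaped like the one in the exception (the text `null` followed by the `single` value with its `\:` escape) is rejected by `java.net.URL`, while the unescaped form parses. The class name, the helper `tryParse`, and the short `/tmp` path are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Stand-alone reproduction of the "no protocol" failure; no Solr classes are involved.
class NoProtocolRepro {

    // Returns the message java.net.URL reports for spec, or "ok" if spec parses.
    static String tryParse(String spec) {
        try {
            new URL(spec);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical short path standing in for the real file named in the log.
        String escaped = "file\\:///tmp/jni71796.xml"; // value as passed in single=...
        String plain = "file:///tmp/jni71796.xml";     // same value without the backslash

        System.out.println(tryParse("null" + escaped)); // same shape as the string in the log
        System.out.println(tryParse(escaped));
        System.out.println(tryParse(plain));
    }
}
```

The first two calls report a `no protocol` error, because the backslash before the colon prevents `file` from being recognised as a protocol; the third parses. This suggests, but does not prove, that the `\:` escape needed for the `fileAbsolutePath:` query is what breaks the `url` attribute in the same request.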

Any hints on what I should try next?

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea that whatever
daemon is used to maintain your content store is likely to drop a report or log file explaining
what has changed within that content store. I wish to use this report file to control the
indexing of new or changed content and the removal of old content. The report files, perhaps
from un-tar or un-zip, are likely to reference JPEGs and directory stubs, which need to be
ignored. I assumed a file-based content repository, but this should be expanded to handle URIs
as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be called
the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and this new EntityProcessor
should have the name FileListEntityProcessor. However, what is done is done. I then came up
with ManifestEntityProcessor, which I thought suited: manifest files are all over the content
sets I deal with, and the dictionary definition seemed close enough ("ship's manifest"). However,
how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>  
>    *manifestFileName* is the required location of the manifest file. If this value is
relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line which does
not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex and discards
any line which matches the regex.
>    *docAddRegex* is a required regex to identify lines which, when matched, should cause
docs to be added to the index. As well as matching the line, it should also return the portion
of the line which contains the filepath as group(1).
>    *docDeleteRegex* is an optional regex to identify documents which, when
matched, should be deleted from the index. As well as matching the line, it should also return
the portion of the line which contains the filepath as group(1). **PLANNED**
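
The filtering order described above can be sketched as follows. This is an illustration only, not the code in SOLR-1060.patch. Several details are assumptions: the class and method names, the use of `find()` rather than a full-line match, checking docAddRegex before docDeleteRegex, falling back to the whole match when a regex (such as `.*`) has no group 1, and the `A `/`D ` line prefixes used in the sample patterns and manifest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is not the code in SOLR-1060.patch.
class ManifestFilterSketch {

    // Returns group(1) when the regex defines one, otherwise the whole match
    // (an assumption, so that a pattern such as ".*" still yields a path).
    static String pathFrom(Matcher m) {
        return m.groupCount() >= 1 ? m.group(1) : m.group();
    }

    // Classifies one manifest line as "ADD <path>", "DELETE <path>", or null (ignored).
    // allow, block and docDelete may be null because they are optional attributes.
    static String classify(String line, Pattern allow, Pattern block,
                           Pattern docAdd, Pattern docDelete) {
        // allowRegex: discard lines that do not match.
        if (allow != null && !allow.matcher(line).find()) return null;
        // blockRegex: applied after allowRegex, discard lines that match.
        if (block != null && block.matcher(line).find()) return null;
        // docAddRegex: matching lines cause an add.
        Matcher add = docAdd.matcher(line);
        if (add.find()) return "ADD " + pathFrom(add);
        // docDeleteRegex: matching lines cause a delete.
        if (docDelete != null) {
            Matcher del = docDelete.matcher(line);
            if (del.find()) return "DELETE " + pathFrom(del);
        }
        return null;
    }

    public static void main(String[] args) {
        Pattern allow = Pattern.compile("^.*\\.xml$");      // from the example entity
        Pattern block = Pattern.compile("usc2009");         // from the example entity
        Pattern docAdd = Pattern.compile("^A\\s+(.*)$");    // hypothetical line format
        Pattern docDelete = Pattern.compile("^D\\s+(.*)$"); // hypothetical line format

        // Hypothetical manifest lines.
        String[] manifest = {
            "A data/news/jni71796.xml",
            "D data/old/jni00001.xml",
            "A data/usc2009/x.xml",
            "A img/photo.jpg"
        };
        for (String line : manifest) {
            System.out.println(line + " -> " + classify(line, allow, block, docAdd, docDelete));
        }
    }
}
```

With these sample patterns, the first line is classified as an add, the second as a delete, and the last two are ignored (one blocked by blockRegex, one rejected by allowRegex).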

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

