lucene-solr-dev mailing list archives

From "Fergus McMenemie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed
Date Mon, 23 Mar 2009 12:49:51 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688280#action_12688280 ]

Fergus McMenemie commented on SOLR-1060:
----------------------------------------

Downloaded your version of my patch. Thanks for taking a look at it and making the improvements.

However, I still can't get things to work. My solr-data.xml is now as follows:
{code}
     <entity name="single-delete"
		 dataSource="myURIreader"
		 processor="XPathEntityProcessor"
		 url="${dataimporter.request.single}"
		 rootEntity="true"
		 flatten="true"
		 stream="false"
		 forEach="/record | /record/mediaBlock"
		 transformer="TemplateTransformer">

      <field column="fileAbsolutePath"    template="${dataimporter.request.single}" />

      <field column="$deleteDocByQuery"   template="fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}" />
      <field column="vdkvgwkey"           template="${dataimporter.request.single}" />

      </entity>

{code}

But an attempt to delete a document produces the following:
{code}
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=1
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_3,version=1237809265075,generation=3,filenames=[_5.nrm, _5.tii, _5.tis, _5.fdx, _5.prx, _5.fdt, _5.fnm, segments_3, _5.frq]
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1237809265075
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DocBuilder commit
INFO: Full Import completed successfully
Mar 23, 2009 12:45:42 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true)
{code}
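
For anyone following along, the template in the config above is meant to yield a delete query of the form fileAbsolutePath:<escaped URI>. Here is a minimal sketch of that intended result in plain Java. It is not DataImportHandler code: the class name, the method names and the list of escaped characters are all assumptions made for illustration.

```java
// Hypothetical stand-in for query escaping; NOT the DataImportHandler implementation.
final class DeleteQuerySketch {

    // Characters treated as special in the query syntax (an assumed list, for illustration only).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Prefix every special or whitespace character with a backslash.
    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build a "field:value" query with the value escaped.
    static String deleteQuery(String field, String value) {
        return field + ":" + escapeQueryChars(value);
    }

    public static void main(String[] args) {
        // The URI is the "single" request parameter from the log above.
        String single = "file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml";
        System.out.println(deleteQuery("fileAbsolutePath", single));
    }
}
```

If the unresolved expression reaches the index as-is, the delete query contains the literal ${...} text rather than an escaped URI, which would explain why nothing is removed.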

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: regex-fix.patch, SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea that whatever daemon maintains your content store is likely to drop a report or log file explaining what has changed within that store. I wish to use this report file to control the indexing of new or changed content and the removal of old content. The report files, perhaps from un-tar or un-zip, are likely to reference JPEGs and directory stubs, which need to be ignored. I assumed a file-based content repository, but this should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be called the dirWalkEntityProcessor or dirCrawlEntityProcessor or some such, and this new EntityProcessor should have the name FileListEntityProcessor. However, what is done is done. I then came up with ManifestEntityProcessor, which I thought suited: manifest files are all over the content sets I deal with, and the dictionary definition seemed close enough ("ship's manifest"). However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>
>    *manifestFileName* is the required location of the manifest file. If this value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line which does not match the regExp.
>    *blockRegex* is an optional attribute that is applied after any allowRegex and discards any line which matches the regExp.
>    *docAddRegex* is a required regex to identify lines which, when matched, should cause docs to be added to the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1).
>    *docDeleteRegex* is an optional regex to identify documents which, when matched, should be deleted from the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1). **PLANNED**
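
The per-line filtering described in the field list above can be sketched roughly as follows. This is an illustration in plain Java, not the code from the attached patch: the class and method names are hypothetical, "match" is assumed to mean a partial match (Matcher.find), and a docAddRegex without a capturing group is assumed to yield no path. Note that the sample config uses docAddRegex=".*", which has no group(1); the sketch uses "^(.*)$" instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the manifest line filtering; NOT the code from the SOLR-1060 patch.
final class ManifestLineSketch {
    private final Pattern allow;   // optional: lines that do not match are discarded
    private final Pattern block;   // optional: lines that match are discarded
    private final Pattern docAdd;  // required: group(1) carries the file path

    ManifestLineSketch(String allowRegex, String blockRegex, String docAddRegex) {
        this.allow = allowRegex == null ? null : Pattern.compile(allowRegex);
        this.block = blockRegex == null ? null : Pattern.compile(blockRegex);
        this.docAdd = Pattern.compile(docAddRegex);
    }

    // Returns the file path to index, or null if the line is discarded.
    String pathToAdd(String line) {
        if (allow != null && !allow.matcher(line).find()) {
            return null;                       // rejected by allowRegex
        }
        if (block != null && block.matcher(line).find()) {
            return null;                       // rejected by blockRegex, applied after allowRegex
        }
        Matcher m = docAdd.matcher(line);
        if (!m.find() || m.groupCount() < 1) {
            return null;                       // no match, or no capturing group for the path
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // allowRegex and blockRegex are taken from the sample config; "^(.*)$" is an assumed docAddRegex.
        ManifestLineSketch filter = new ManifestLineSketch("^.*\\.xml$", "usc2009", "^(.*)$");
        String[] lines = { "data/news/a.xml", "data/news/a.jpg", "data/usc2009/b.xml" };
        for (String line : lines) {
            System.out.println(line + " -> " + filter.pathToAdd(line));
        }
    }
}
```

A docDeleteRegex, once implemented, would presumably follow the same pattern with a second Pattern whose group(1) names the document to delete.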

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

