From: "Fergus McMenemie (JIRA)"
To: solr-dev@lucene.apache.org
Reply-To: solr-dev@lucene.apache.org
Date: Wed, 1 Apr 2009 02:09:13 -0700 (PDT)
Subject: [jira] Updated: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed
Message-ID: <1216592295.1238576953174.JavaMail.jira@brutus>
In-Reply-To: <1808539901.1236699173760.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1060:
-----------------------------------
    Attachment: SOLR-1060.patch

A more complete version of the patch with docs and an expanded regex test case. Ready for submission?

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: regex-fix.patch, SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea that whatever daemon maintains your content store, it is likely to drop a report or log file explaining what has changed within that store. I wish to use this report file to control the indexing of new or changed content and the removal of old content. The report files, perhaps from un-tar or un-zip, are likely to reference JPEGs and directory stubs, which need to be ignored. I assumed a file-based content repository, but this should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be called dirWalkEntityProcessor or dirCrawlEntityProcessor or some such, and this new EntityProcessor should have the name FileListEntityProcessor. However, what is done is done. I then came up with ManifestEntityProcessor, which I thought suited: manifest files are all over the content sets I deal with, and the dictionary definition seemed close enough ("ship's manifest").
> However, how about ChangeListEntityProcessor?
> {code}
> processor="ManifestEntityProcessor"
> baseDir="/Volumes/Techmore/ts/aaa/schema/data"
> rootEntity="false"
> dataSource="null"
> allowRegex="^.*\.xml$"
> blockRegex="usc2009"
> manifestFileName="/Volumes/ts/man-find.txt"
> docAddRegex=".*"
>
> {code}
> The new entity fields are as follows.
>
> *manifestFileName* is the required location of the manifest file. If this value is relative, it is assumed to be relative to baseDir.
> *allowRegex* is an optional attribute that, if present, discards any line which does not match the regex.
> *blockRegex* is an optional attribute that is applied after any allowRegex and discards any line which matches the regex.
> *docAddRegex* is a required regex that identifies lines which, when matched, should cause docs to be added to the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1).
> *docDeleteRegex* is an optional regex that identifies lines which, when matched, should cause documents to be deleted from the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1). **PLANNED**

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
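
For illustration only, the per-line filtering that the field descriptions in the issue imply could be sketched roughly as below. This is not code from the SOLR-1060 patch. The class ManifestLineFilter, its Decision type, and the method names are hypothetical. The sketch also makes three assumptions that the issue text does not confirm: the regexes are applied with find() rather than a whole-line match, docAddRegex is tried before docDeleteRegex, and the whole match is used as the file path when a regex has no group(1) (as with the docAddRegex=".*" example).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the SOLR-1060 patch. */
final class ManifestLineFilter {
    enum Action { ADD, DELETE }

    /** A manifest line that survived filtering, plus the file path taken from it. */
    static final class Decision {
        final Action action;
        final String path;
        Decision(Action action, String path) { this.action = action; this.path = path; }
    }

    private final Pattern allow;      // optional, null when absent
    private final Pattern block;      // optional, null when absent
    private final Pattern docAdd;     // required
    private final Pattern docDelete;  // optional, null when absent

    ManifestLineFilter(String allowRegex, String blockRegex,
                       String docAddRegex, String docDeleteRegex) {
        if (docAddRegex == null) {
            throw new IllegalArgumentException("docAddRegex is required");
        }
        this.allow = compileOrNull(allowRegex);
        this.block = compileOrNull(blockRegex);
        this.docAdd = Pattern.compile(docAddRegex);
        this.docDelete = compileOrNull(docDeleteRegex);
    }

    private static Pattern compileOrNull(String regex) {
        return regex == null ? null : Pattern.compile(regex);
    }

    /** Returns empty when the line is discarded or matches neither add nor delete. */
    Optional<Decision> classify(String line) {
        // allowRegex: discard any line that does not match.
        if (allow != null && !allow.matcher(line).find()) return Optional.empty();
        // blockRegex: applied after allowRegex, discard any line that matches.
        if (block != null && block.matcher(line).find()) return Optional.empty();
        // Assumption: add is checked before delete.
        Matcher add = docAdd.matcher(line);
        if (add.find()) return Optional.of(new Decision(Action.ADD, pathFrom(add)));
        if (docDelete != null) {
            Matcher del = docDelete.matcher(line);
            if (del.find()) return Optional.of(new Decision(Action.DELETE, pathFrom(del)));
        }
        return Optional.empty();
    }

    /** Assumption: fall back to the whole match when the regex has no group(1). */
    private static String pathFrom(Matcher m) {
        return (m.groupCount() >= 1 && m.group(1) != null) ? m.group(1) : m.group();
    }

    public static void main(String[] args) {
        // The "A "/"D " line prefixes are an invented manifest format for the demo.
        ManifestLineFilter filter =
            new ManifestLineFilter("^.*\\.xml$", "usc2009", "^A (.*)$", "^D (.*)$");
        String[] lines = { "A data/doc1.xml", "D data/old.xml", "A data/pic.jpg" };
        for (String line : lines) {
            Optional<Decision> d = filter.classify(line);
            System.out.println(line + " -> "
                + (d.isPresent() ? d.get().action + " " + d.get().path : "ignored"));
        }
    }
}
```

Under these assumptions a docAddRegex of ".*" matches every surviving line, so a docDeleteRegex would never fire alongside it. A real manifest would need add and delete patterns that are mutually exclusive, or a different evaluation order.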