lucene-solr-dev mailing list archives

From "Ryan McKinley" <>
Subject Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Date Tue, 16 Jan 2007 08:14:33 GMT
> : In addition to RequestProcessors, maybe there should be a general
> : DocumentProcessor
> :
> : interface SolrDocumentParser
> : {
> :   Document parse(ContentStream content);
> : }
> :
> : solrconfig could register "text/html" -> HtmlDocumentParser, and
> : RequestProcessors could share the same parser.
> what else would the RequestProcessor do if it was delegating all of the
> parsing to something else?

Parsing is just one task that a RequestProcessor may do.  It is the
entry point for all kinds of stuff: searching, admin tasks, augmenting
search results with SQL queries, writing uploaded files to the file
system.  This is where people will do whatever suits their fancy.

RequestHandler is probably a better name than RequestProcessor, but I
think we should choose a name that can live peacefully with the
existing RequestHandler code.

I imagine there will be a standard 'Processor' that gets a list of
streams and processes them into Documents.  Since the way these
documents are parsed depends totally on the schema, we will need some
way to make this user configurable.
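
As a rough sketch of that standard 'Processor' (not a proposal for the
actual API): the only name taken from this thread is SolrDocumentParser.
The ContentStream and Document classes below are simplified stand-ins, and
StreamDocumentProcessor is a made-up name.  The parser is passed in, so
whatever the configuration selects decides how streams become Documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a request content stream (assumption, not the real type).
final class ContentStream {
    final String contentType;
    final String body;

    ContentStream(String contentType, String body) {
        this.contentType = contentType;
        this.body = body;
    }
}

// Simplified stand-in for an indexable document: just named string fields.
final class Document {
    final Map<String, String> fields = new LinkedHashMap<>();
}

// The interface suggested earlier in the thread.
interface SolrDocumentParser {
    Document parse(ContentStream content);
}

// Hypothetical standard processor: one Document per incoming stream,
// built by whichever parser was configured.
final class StreamDocumentProcessor {
    private final SolrDocumentParser parser;

    StreamDocumentProcessor(SolrDocumentParser parser) {
        this.parser = parser;
    }

    List<Document> process(List<ContentStream> streams) {
        List<Document> docs = new ArrayList<>();
        for (ContentStream stream : streams) {
            docs.add(parser.parse(stream));
        }
        return docs;
    }
}
```

Swapping the parser changes how fields are filled in without touching the
processor, which is the user-configurable part.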

In addition, consider the case where you want to index an SVN
repository.  Yes, this could be done in a SolrRequestParser that logs
in and returns the files as a stream iterator.  But this seems like
more 'work' than the RequestParser is supposed to do.  Not to mention
you would need to augment the Document with svn-specific attributes.

Parsing a PDF file from svn should (be able to) use the same parser as
if it were uploaded via HTTP POST.

I think a DocumentParser registry is a good way to isolate this top-level task.
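
Something along these lines, purely as an illustration.  Only
SolrDocumentParser and the "text/html" mapping come from this thread;
DocumentParserRegistry, the stand-in ContentStream and Document classes,
the tag-stripping lambda, and the "svn.revision" field are all made up for
the example.  It assumes an exact match on the content type string.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins (assumptions, not the real Solr/Lucene types).
final class ContentStream {
    final String contentType;
    final String body;

    ContentStream(String contentType, String body) {
        this.contentType = contentType;
        this.body = body;
    }
}

final class Document {
    final Map<String, String> fields = new LinkedHashMap<>();
}

interface SolrDocumentParser {
    Document parse(ContentStream content);
}

// Hypothetical registry: maps a content type to the parser that handles it.
final class DocumentParserRegistry {
    private final Map<String, SolrDocumentParser> parsers = new HashMap<>();

    void register(String contentType, SolrDocumentParser parser) {
        parsers.put(contentType, parser);
    }

    // Looks up the parser by exact content type; fails loudly if none is registered.
    Document parse(ContentStream content) {
        SolrDocumentParser parser = parsers.get(content.contentType);
        if (parser == null) {
            throw new IllegalArgumentException(
                "No parser registered for " + content.contentType);
        }
        return parser.parse(content);
    }
}

final class RegistryDemo {
    public static void main(String[] args) {
        DocumentParserRegistry registry = new DocumentParserRegistry();

        // Placeholder HTML "parser": strips tags with a crude regex.
        registry.register("text/html", content -> {
            Document doc = new Document();
            doc.fields.put("body", content.body.replaceAll("<[^>]*>", ""));
            return doc;
        });

        // The same parser serves both sources; only the caller differs.
        Document fromPost = registry.parse(new ContentStream("text/html", "<p>hello</p>"));
        Document fromSvn = registry.parse(new ContentStream("text/html", "<p>hello</p>"));

        // The svn-side caller adds its own attributes after parsing (made-up field).
        fromSvn.fields.put("svn.revision", "1234");

        System.out.println(fromPost.fields);
        System.out.println(fromSvn.fields);
    }
}
```

The point of the split is that the svn crawler and the HTTP upload handler
only differ in where the stream comes from and what extra attributes they
attach afterwards.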
