lucene-solr-dev mailing list archives

From Eivind Hasle Amundsen
Subject Re: Connectors, Parsers, Plugin architecture
Date Tue, 16 Jan 2007 15:39:25 GMT
> : Solr aims at being an answer to "enterprise needs", by indexing
> : structured data for different applications. However I think that many
> : enterprises would like to be able to structure information themselves.
> that's exactly what Solr is about: letting a schema creator define
> what the structure is, and letting them put data in whatever fields they
> want.

Could a future "parser plugin" architecture make sure that the outcome 
is in a well-defined format? In this case there could be a step for pure 
document processing.

Everything fed into the document processor stage should in other words 
be in a universal format - complete with source and which parser was 
used, of course. From this document, fields could be extracted and 
computed via simple programming to meet the requirements of the schema.
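
To make the idea a bit more concrete, here is a minimal sketch of what such a universal document and the field extraction step might look like. Everything in it is an assumption for illustration: `StandardDocument`, `to_schema_fields`, the rule names and the example schema (id, title, text) are hypothetical and not part of Solr.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class StandardDocument:
    """Hypothetical universal format emitted by every parser plugin."""
    source: str   # where the raw data came from
    parser: str   # which parser produced this document
    body: str     # extracted character data
    metadata: Dict[str, str] = field(default_factory=dict)


# A field rule computes one schema field from a standard document.
FieldRule = Callable[[StandardDocument], str]


def to_schema_fields(doc: StandardDocument,
                     rules: Dict[str, FieldRule]) -> Dict[str, str]:
    """Document processing step: apply user-supplied rules to fill schema fields."""
    return {name: rule(doc) for name, rule in rules.items()}


# Example rules for an assumed schema with id, title and text fields.
rules = {
    "id": lambda d: d.source,
    "title": lambda d: d.metadata.get(
        "title", d.body.splitlines()[0] if d.body else ""),
    "text": lambda d: d.body,
}

doc = StandardDocument(
    source="file:///docs/report.pdf",
    parser="pdf-parser",
    body="Quarterly report\nSales went up.",
    metadata={"charset": "utf-8"},
)
fields = to_schema_fields(doc, rules)
```

The rules are still supplied by the user, but they only ever have to deal with one input format, whatever the original source was.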

> the problem with providing support for unstructured data out of the box is
> that it's got no structure :) ... how would Solr know what to do with the
> binary data it finds? how would it know what charset to use when reading
> that data? ... assuming it gets character data, how does it know which
> strings should go in which fields? how does it know which analyzers to
> use?

With regard to the above, this could be handled by the parser, which 
creates the "standard document". This document would also contain 
metadata relevant to solving these tasks. The document processing stage 
would then know which conversion to use.
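
As a rough sketch of how the processing stage could pick a conversion from parser-supplied metadata: the metadata keys (`charset`, `content_type`), the converter table and the `process` function below are all hypothetical names, and the tag stripping is deliberately naive.

```python
import re
from typing import Callable, Dict


def plain_text(text: str) -> str:
    """Pass character data through, trimming surrounding whitespace."""
    return text.strip()


def html_to_text(text: str) -> str:
    """Naive tag stripping, for illustration only."""
    return re.sub(r"<[^>]+>", "", text).strip()


# Hypothetical converters keyed by the content type the parser recorded.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "text/plain": plain_text,
    "text/html": html_to_text,
}


def process(raw: bytes, metadata: Dict[str, str]) -> str:
    """Decode with the parser-reported charset, then convert by content type."""
    charset = metadata.get("charset", "utf-8")
    text = raw.decode(charset, errors="replace")
    convert = CONVERTERS.get(metadata.get("content_type", "text/plain"),
                             plain_text)
    return convert(text)


result = process(
    "<p>blåbær</p>".encode("latin-1"),
    {"charset": "latin-1", "content_type": "text/html"},
)
```

The same lookup could in principle be extended to choose an analyzer, for example from a language entry in the metadata.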

> some code somewhere has to make these decisions ... at the moment that
> code needs to be provided by the user and run outside of Solr ... i
> suspect it won't be long before much of that code can run inside of Solr
> as a plugin, but it will still need to be provided by the user to parse
> truly unstructured data.

Yep. But my idea of a "standard document" - wouldn't that help a bit? 
Don't look at me, I'm just a newbie :)

