Hi,

I'm looking for some direction on where I should focus my attention with regard to the Solr codebase and documentation.  Rather than write a ton of stuff no one wants to read, I'll just start with a use case.  For context, the data originates from Nutch crawls and is indexed into Solr.

Imagine a web page has the following content (the 4 occurrences of "Johnson" are bolded):

--content_--
Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
--_content--

First: I would like the entire "content" block to be indexed within Solr.  This is done and definitely not an issue.

Second: during the injection of crawl data into Solr, I would like to grab every occurrence of a specific word or phrase, with "Johnson" being my example for the above.  I want to take each such occurrence (without collision), along with its unique context, and inject that into its own, separate Solr index.  For example, the above "content" block, having been indexed in its entirety, would also be the source of 4 additional indexes.  In each index, "Johnson" would appear only once.  All of the text before and after "Johnson" would be BOUND BY any other occurrence of "Johnson."  E.g.:

--index1_--
Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum
--_index1-- --index2_--
sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
--_index2-- --index3_--
in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
--_index3-- --index4_--
sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
--_index4--
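To make the splitting rule above concrete: each window runs from just after the previous occurrence to just before the next one, so every window contains exactly one match.  A sketch in plain Java, independent of Solr -- the class name and the case-insensitive substring match (which also catches forms like "Johnsons") are my own assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Nutch.
public class ContextWindows {

    /**
     * Splits text into overlapping windows, one per occurrence of the
     * keyword.  Window i spans from just after occurrence i-1 (or the
     * start of the text) to just before occurrence i+1 (or the end),
     * so each window contains exactly one occurrence.
     */
    public static List<String> windows(String text, String keyword) {
        // Case-insensitive literal match; a substring match like
        // "Johnson" inside "Johnsons" also counts as an occurrence.
        Pattern p = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        List<int[]> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(new int[] { m.start(), m.end() });
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            int from = (i == 0) ? 0 : hits.get(i - 1)[1];            // just after previous hit
            int to = (i == hits.size() - 1) ? text.length()
                                            : hits.get(i + 1)[0];    // just before next hit
            result.add(text.substring(from, to).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "aa Johnson bb johnson cc johnSon dd";
        for (String w : windows(content, "Johnson")) {
            System.out.println("--window-- " + w);
        }
    }
}
```

Run against the "content" block above, this yields the four windows shown, each of which would become its own document/index.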

Q:
How much of this is feasible in "present-day Solr", and how much of it do I need to produce in a patch of my own?  Can anyone give me some direction on where I should look in approaching this problem (i.e., libs / classes / configs)?  I sincerely appreciate it.

Third: I would later like to go through the above child indexes and dismiss any that appear within a given context.  For example, I may deem "ipsum dolor Johnson sit amet" as not being useful, and I'd want to delete any indexes matching that particular phrase-context.  The deletion itself is trivial and, with the 2nd item resolved, this becomes largely a non-issue.
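For concreteness, what I mean by the deletion being trivial: build a phrase query on the context field and hand it to a delete-by-query (e.g. Solr's <delete><query>...</query></delete> update message).  A sketch -- the "context" field name and the helper are hypothetical, and the escaping only covers backslashes and double quotes inside the phrase:

```java
// Hypothetical helper for building a delete-by-query string.
public class DeleteQueryBuilder {

    /**
     * Builds a phrase query like: context:"ipsum dolor Johnson sit amet"
     * suitable for a delete-by-query.  Escapes only backslashes and
     * double quotes occurring inside the phrase.
     */
    public static String phraseDeleteQuery(String field, String phrase) {
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phraseDeleteQuery("context", "ipsum dolor Johnson sit amet"));
    }
}
```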

Q:
The question, more or less, comes from the fact that my source data comes from a web crawler.  When a site is recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant to me.  Where is the best place to perform this work?  I could easily perform queries after indexing my crawl, but that seems needlessly intensive.  I think the answer will be "wherever I implement #2", but assumptions can be painfully expensive.
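My current guess, for concreteness, is that both #2 and the recrawl filtering would hang off an update request processor chain in solrconfig.xml -- sketched below, where com.example.ContextSplitProcessorFactory stands in for a custom factory I would have to write -- but I don't know whether that's the right hook:

```xml
<!-- Sketch only; com.example.ContextSplitProcessorFactory is hypothetical. -->
<updateRequestProcessorChain name="context-split">
  <processor class="com.example.ContextSplitProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```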


Thank you for reading my bloated e-mail.  Again, I'm mostly just looking to be pointed at the relevant pieces of the Lucene / Solr codebase, and am trolling for any insight that people might share.

Scott Gonyea