A new document, yes.  I should watch my terminology more closely.
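
To make that concrete, here is a rough sketch of the splitting I have in
mind, in Python rather than anything Solr-specific.  It is only an
illustration under my own assumptions: the function names (split_contexts,
dismiss) are made up, matching is case-insensitive, and a trailing word
suffix (as in "Johnsons") is treated as part of the occurrence.  Each
returned snippet would become its own Solr document.

```python
import re


def split_contexts(text, term):
    """Return one snippet per occurrence of `term` in `text`.

    Each snippet runs from the end of the previous occurrence to the
    start of the next one, so it contains the term exactly once.
    Assumption: case-insensitive match, trailing word characters
    (e.g. the "s" in "Johnsons") count as part of the occurrence.
    """
    pattern = re.compile(re.escape(term) + r"\w*", re.IGNORECASE)
    matches = list(pattern.finditer(text))
    snippets = []
    for i, _ in enumerate(matches):
        # Left bound: end of the previous occurrence, or start of text.
        start = matches[i - 1].end() if i > 0 else 0
        # Right bound: start of the next occurrence, or end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line wrapping and surrounding whitespace.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


def dismiss(snippets, unwanted_context):
    """Drop snippets that contain an unwanted phrase-context.

    Comparison ignores case and whitespace differences in the phrase.
    """
    needle = " ".join(unwanted_context.split()).lower()
    return [s for s in snippets if needle not in s.lower()]
```

Run against the "content" block quoted below, split_contexts(content,
"Johnson") should give the four snippets shown as index1 through index4
(modulo line wrapping), and dismiss(snippets, "ipsum dolor Johnson sit amet")
drops the first of them.  Doing the dismissal at this stage, before anything
is sent to Solr, is the "wherever I implement #2" option from my original
mail; whether that belongs in Nutch, in a Solr update hook, or elsewhere is
still the open question.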

Scott

On Wed, Sep 1, 2010 at 11:53 PM, Lance Norskog <goksron@gmail.com> wrote:
Do you mean a new Solr/Lucene index, or a new document with only the snippet?

On Wed, Sep 1, 2010 at 5:29 PM, Scott Gonyea <scott@aitrus.org> wrote:
> Hi,
>
> I'm looking to get some direction on where I should focus my attention, with
> regards to the Solr codebase and documentation.  Rather than write a ton of
> stuff no one wants to read, I'll just start with a use-case.  For context,
> the data originates from Nutch crawls and is indexed into Solr.
>
> Imagine a web page has the following content (4 occurrences of "Johnson" are
> bolded):
>
> --content_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna,
> nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a
> arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula
> nisi. Ut fringilla ullamcorper sem.
> --_content--
>
> First; I would like to have the entire "content" block be indexed within
> Solr.  This is done and definitely not an issue.
>
> Second (+); during the injection of crawl data into Solr, I would like to
> grab every occurrence of a specific word, or phrase, with "Johnson" being my
> example for the above.  I want to take every such phrase (without
> collision), as well as its unique-context, and inject that into its own,
> separate Solr index.  For example, the above "content" example, having been
> indexed in its entirety, would also be the source of 4 additional indexes.
> In each index, "Johnson" would only appear once.  All of the text before and
> after "Johnson" would be BOUND BY any other occurrence of "Johnson."  e.g.:
>
> --index1_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> urna et justo fringilla dictum
> --_index1--
> --index2_--
> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> --_index2--
> --index3_--
> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
> non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> malesuada
> --_index3--
> --index4_--
> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus
> vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper
> sem.
> --_index4--
>
> Q:
> How much of this is feasible in "present-day Solr" and how much of it do I
> need to produce in a patch of my own?  Can anyone give me some direction on
> where I should look, in approaching this problem (i.e., libs / classes /
> confs)?  I sincerely appreciate it.
>
> Third; I would later like to go through the above child indexes and dismiss
> any that appear within a given context.  For example, I may deem "ipsum
> dolor Johnson sit amet" as not being useful and I'd want to delete any
> indexes matching that particular phrase-context.  The deletion is trivial
> and, with the 2nd item resolved, this becomes largely a non-issue.
>
> Q:
> The question, more or less, comes from the fact that my source data is from
> a web crawler.  When recrawled, I need to repeat the process of dismissing
> phrase-contexts that are not relevant to me.  Where is the best place to
> perform this work?  I could easily perform queries, after indexing my crawl,
> but that seems needlessly intensive.  I think the answer to that will be
> "wherever I implement #2", but assumptions can be painfully expensive.
>
>
> Thank you for reading my bloated e-mail.  Again, I'm mostly just looking to
> be pointed to various pieces of the Lucene / Solr code-base, and am trolling
> for any insight that people might share.
>
> Scott Gonyea
>



--
Lance Norskog
goksron@gmail.com
