lucene-dev mailing list archives

From Scott Gonyea ...@sgonyea.com>
Subject Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Date Thu, 02 Sep 2010 20:34:09 GMT
A new document, yes.  I should watch my terminology more closely.

Scott

On Wed, Sep 1, 2010 at 11:53 PM, Lance Norskog <goksron@gmail.com> wrote:

> Do you mean a new Solr/Lucene index, or a new document with only the
> snippet?
>
> On Wed, Sep 1, 2010 at 5:29 PM, Scott Gonyea <scott@aitrus.org> wrote:
> > Hi,
> >
> > I'm looking to get some direction on where I should focus my attention
> > with regard to the Solr codebase and documentation.  Rather than write a
> > ton of stuff no one wants to read, I'll just start with a use case.  For
> > context, the data originates from Nutch crawls and is indexed into Solr.
> >
> > Imagine a web page has the following content (4 occurrences of "Johnson"
> > are bolded):
> >
> > --content_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> > urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla magna,
> > nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum. Mauris a
> > arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget ligula
> > nisi. Ut fringilla ullamcorper sem.
> > --_content--
> >
> > First, I would like the entire "content" block to be indexed within
> > Solr.  This is done and is definitely not an issue.
> >
> > Second (+), during the injection of crawl data into Solr, I would like to
> > grab every occurrence of a specific word or phrase, with "Johnson" being my
> > example for the above.  I want to take every such phrase (without
> > collision), as well as its unique context, and inject that into its own,
> > separate Solr index.  For example, the above "content" example, having been
> > indexed in its entirety, would also be the source of 4 additional indexes.
> > In each index, "Johnson" would only appear once.  All of the text before and
> > after "Johnson" would be BOUND BY any other occurrence of "Johnson", e.g.:
> >
> > --index1_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id
> > urna et justo fringilla dictum
> > --_index1--
> > --index2_--
> > sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
> > dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> > --_index2--
> > --index3_--
> > in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
> > non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> > malesuada
> > --_index3--
> > --index4_--
> > sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus
> > vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper
> > sem.
> > --_index4--
> >
> > Q:
> > How much of this is feasible in "present-day Solr" and how much of it do I
> > need to produce in a patch of my own?  Can anyone give me some direction on
> > where I should look in approaching this problem (i.e., libs / classes /
> > confs)?  I would sincerely appreciate it.
> >
> > Third, I would later like to go through the above child indexes and dismiss
> > any that appear within a given context.  For example, I may deem "ipsum
> > dolor Johnson sit amet" as not being useful and I'd want to delete any
> > indexes matching that particular phrase-context.  The deletion is trivial
> > and, with the 2nd item resolved, this becomes largely a non-issue.
> >
> > Q:
> > The question, more or less, comes from the fact that my source data is from
> > a web crawler.  When pages are recrawled, I need to repeat the process of
> > dismissing phrase-contexts that are not relevant to me.  Where is the best
> > place to perform this work?  I could easily perform queries after indexing
> > my crawl, but that seems needlessly intensive.  I think the answer to that
> > will be "wherever I implement #2", but assumptions can be painfully
> > expensive.
> >
> >
> > Thank you for reading my bloated e-mail.  Again, I'm mostly just looking to
> > be pointed to various pieces of the Lucene / Solr code-base, and am trolling
> > for any insight that people might share.
> >
> > Scott Gonyea
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

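For reference, the bounded-context split described in item two of the quoted message can be sketched in plain Java, with no Solr or Lucene dependency. This is only a rough illustration under stated assumptions: the class name PhraseContexts and the methods split and dismiss are hypothetical and not part of any Solr API, and matching is assumed to be a case-insensitive literal match so that "johnSon" and "Johnsons" both count, as in the example. Each returned snippet would then be sent to Solr as its own document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Solr or Lucene.
class PhraseContexts {

    /**
     * Returns one snippet per occurrence of the phrase. Each snippet runs from
     * the end of the previous occurrence (or the start of the text) to the
     * start of the next occurrence (or the end of the text).
     */
    static List<String> split(String content, String phrase) {
        // Assumption: literal, case-insensitive match without word boundaries.
        Pattern p = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(content);

        // Record the [start, end) offsets of every occurrence.
        List<int[]> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(new int[] {m.start(), m.end()});
        }

        List<String> snippets = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            int from = (i == 0) ? 0 : hits.get(i - 1)[1];
            int to = (i == hits.size() - 1) ? content.length() : hits.get(i + 1)[0];
            snippets.add(content.substring(from, to).trim());
        }
        return snippets;
    }

    /** Drops every snippet that contains the dismissed phrase-context (case-insensitive). */
    static List<String> dismiss(List<String> snippets, String dismissedContext) {
        String needle = dismissedContext.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String s : snippets) {
            if (!s.toLowerCase(Locale.ROOT).contains(needle)) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

Calling PhraseContexts.split(content, "Johnson") on the sample content block should yield four snippets corresponding to index1 through index4, and PhraseContexts.dismiss(snippets, "ipsum dolor Johnson sit amet") would drop the first of them. Running both steps before the documents are sent to Solr, rather than querying and deleting afterwards, is one possible way to make the dismissal repeatable on each recrawl.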