lucene-solr-user mailing list archives

From Scott Gonyea <...@sgonyea.com>
Subject Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Date Thu, 02 Sep 2010 23:03:46 GMT
Hi Grant,

Thanks for replying--sorry for sticking this on dev; I had imagined that
development against the Solr codebase would be inevitable.

The application has to do with regulatory and legal compliance work by a
non-profit, and is "socially good," but I need to 'abstract' the
problem/goals--as it's not mine to disclose.

Crawl several websites (e.g., Slashdot, Engadget), inject them into Solr, and
search for a given word.

Issue 1: How many times did that word appear, on the URL returned by Solr?
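Issue 1 is the easy part and can even be sanity-checked outside Solr, against the raw fetched content. A minimal sketch (the whole-word boundary rule is my assumption; loosen it if variants should count too):

```python
import re

def count_occurrences(text, term):
    # Whole-word, case-insensitive count; relax the \b boundaries if
    # variants like "Linuxes" should also be counted.
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))
```

Solr itself can also report per-document term statistics (e.g. via term vectors), so the counting need not live in your own code at all.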

Suppose that word is "Linux" and you want to make sure that each occurrence
of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedantry gone
wild).  Now, suppose that "GNU Linux" is OK, and even "GNU Projects such as
Linux" is OK too.  So, now:

Issue 2: Suppose that your goal is now to separate the noise from the
signal.  You therefore "white list" occurrences in which "Linux" appears
without a "GNU/" prefix, yet which you've deemed acceptable within the given
context.  "GNU/Linux" would be a starting point for any of your
white-listing tasks.
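One way to read Issue 2: a white list is a set of patterns over the phrase-context, and an occurrence is noise only if no pattern matches. A sketch along those lines (the patterns are invented for the Linux example above, not a real rule set):

```python
import re

# Hypothetical white-list patterns for the "Linux" example.
WHITELIST = [
    r"GNU/Linux",
    r"GNU\s+Linux",
    r"GNU\s+Projects?\s+such\s+as\s+Linux",
]

def is_whitelisted(context):
    """True if the phrase-context matches any acceptable pattern."""
    return any(re.search(p, context, re.IGNORECASE) for p in WHITELIST)
```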

Simply iterating over what is--and is not--a "white list" just doesn't scale,
on a lot of levels.  So my approach is to maintain a separate datastore,
which contains a list of phrases worth someone's attention, as well as a
whole lot of "phrase-contexts"--that is, the context in which each phrase
appeared.
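The phrase-contexts themselves can be carved out the way the original message quoted below describes: each occurrence's window runs from just after the previous occurrence to just before the next, so every window contains the phrase exactly once. A sketch (substring matching is deliberate, so "Johnsons" counts as the quoted example expects):

```python
import re

def context_windows(text, term):
    """For each occurrence of `term`, return the span of text bounded by
    the adjacent occurrences, so each window holds the term exactly once.
    Case-insensitive substring matching, so "Johnsons" also counts."""
    matches = list(re.finditer(re.escape(term), text, re.IGNORECASE))
    windows = []
    for i, m in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i < len(matches) - 1 else len(text)
        windows.append(text[start:end].strip())
    return windows
```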

Suppose that one website lists "Linux" 20 times; the goal is to white-list
all 20 of those occurrences.  If "Linux" appears 20 times within the same
context, you might need only one "white list" entry to knock out all 20.
Further, that white-listing can generally be applied to other sites on which
the same context appears.
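That "one white list knocks out 20" idea suggests keying occurrences by a normalized context rather than reviewing them one by one. A sketch, assuming occurrences arrive as (url, context) pairs (that shape is my invention):

```python
from collections import defaultdict

def group_by_context(occurrences):
    """Group (url, context) pairs by whitespace/case-normalized context,
    so identical contexts need only one white-list decision."""
    groups = defaultdict(list)
    for url, context in occurrences:
        key = " ".join(context.lower().split())
        groups[key].append(url)
    return dict(groups)
```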

I'd love to get some thoughts on how to tackle this problem.  I think that
kicking off separate documents within Solr, for each specific occurrence,
would be the simplest path--but I'm open to other approaches, and to pointers
on where I should start my coding :)
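For the separate-documents path, the moving part is just building one small document per context window. A sketch with invented field names (`id`, `parent_url`, `phrase`, `context`--nothing here is a required Solr schema):

```python
import hashlib

def child_documents(url, term, windows):
    """Build one Solr document per occurrence window.  The deterministic
    id keeps recrawls of unchanged content from piling up duplicates."""
    docs = []
    for i, context in enumerate(windows):
        raw = "%s|%s|%d" % (url, context, i)
        docs.append({
            "id": hashlib.sha1(raw.encode("utf-8")).hexdigest(),
            "parent_url": url,
            "phrase": term,
            "context": context,
        })
    return docs
```

These dicts could then be pushed into a dedicated core/collection with any Solr client, and dismissing a phrase-context later becomes a delete-by-query against the context field.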

Thank you very much,
Scott Gonyea

On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Dropping dev@lucene.a.o.
>
> How about we step back and please explain the problem you are trying to
> solve, as opposed to the proposed solution to the problem below.  You can
> likely do what you want below in Solr/Lucene (modulo replacing the index
> with a new document), but the bigger question is "is that the best way to do
> it?"  I think if you give us that context, then perhaps we can brainstorm on
> solutions.
>
> Thanks,
> Grant
>
>
> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
>
> > Hi,
> >
> > I'm looking to get some direction on where I should focus my attention,
> with regards to the Solr codebase and documentation.  Rather than write a
> ton of stuff no one wants to read, I'll just start with a use-case.  For
> context, the data originates from Nutch crawls and is indexed into Solr.
> >
> > Imagine a web page has the following content (4 occurrences of "Johnson"
> are bolded):
> >
> > --content_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
> id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla
> magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum.
> Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget
> ligula nisi. Ut fringilla ullamcorper sem.
> > --_content--
> >
> > First; I would like to have the entire "content" block be indexed within
> Solr.  This is done and definitely not an issue.
> >
> > Second (+); during the injection of crawl data into Solr, I would like to
> grab every occurrence of a specific word, or phrase, with "Johnson" being my
> example for the above.  I want to take every such phrase (without
> collision), as well as its unique-context, and inject that into its own,
> separate Solr index.  For example, the above "content" example, having been
> indexed in its entirety, would also be the source of 4 additional indexes.
>  In each index, "Johnson" would only appear once.  All of the text before
> and after "Johnson" would be BOUND BY any other occurrence of "Johnson."
>  eg:
> >
> > --index1_--
> > Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
> id urna et justo fringilla dictum
> > --_index1-- --index2_--
> > sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> > --_index2-- --index3_--
> > in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
> non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
> malesuada
> > --_index3-- --index4_--
> > sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
> rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
> ullamcorper sem.
> > --_index4--
> >
> > Q:
> > How much of this is feasible in "present-day Solr" and how much of it do
> I need to produce in a patch of my own?  Can anyone give me some direction
> on where I should look, in approaching this problem (ie, libs / classes /
> confs)?  I sincerely appreciate it.
> >
> > Third; I would later like to go through the above child indexes and
> dismiss any that appear within a given context.  For example, I may deem
> "ipsum dolor Johnson sit amet" as not being useful, and I'd want to delete
> any indexes matching that particular phrase-context.  The deletion is
> trivial and, with the 2nd item resolved, this becomes largely a non-issue.
> >
> > Q:
> > The question, more or less, comes from the fact that my source data is
> from a web crawler.  When recrawled, I need to repeat the process of
> dismissing phrase-contexts that are not relevant to me.  Where is the best
> place to perform this work?  I could easily perform queries, after indexing
> my crawl, but that seems needlessly intensive.  I think the answer to that
> will be "wherever I implement #2", but assumptions can be painfully
> expensive.
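On the recrawl question above: if dismissals are keyed by normalized phrase-context, they can be applied at index time, so no post-indexing queries are needed. A sketch, assuming each candidate document carries a `context` field (that field name is an invention, not an existing schema):

```python
def filter_dismissed(docs, dismissed_contexts):
    """Drop child documents whose normalized context was previously
    dismissed, before they are ever (re)indexed."""
    norm = lambda s: " ".join(s.lower().split())
    dismissed = {norm(c) for c in dismissed_contexts}
    return [d for d in docs if norm(d["context"]) not in dismissed]
```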
> >
> >
> > Thank you for reading my bloated e-mail.  Again, I'm mostly just looking
> to be pointed to various pieces of the Lucene / Solr code-base, and am
> trolling for any insight that people might share.
> >
> > Scott Gonyea
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>
