lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Date Fri, 03 Sep 2010 08:19:39 GMT
Hi,

This smells like a job for Hadoop and perhaps Mahout, unless your use cases are totally ad-hoc
research.
After Nutch has fetched the sites, kick off some MapReduce jobs for each case you wish to
study:
1. Extract phrases/contexts
2. For each context, perform detection and whitelisting
3. In the reduce step, sum it all up, and write the results to some store
4. Now you may index a "report" per site into Solr, with links to the original pages for each
context
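The four steps above can be sketched in plain Python (standing in for the actual MapReduce jobs; the function names and the fixed context window are illustrative assumptions, not Hadoop or Mahout API):

```python
import re
from collections import defaultdict

def map_page(url, text, term="Linux"):
    """Map step: emit (site, context) for every occurrence of the term."""
    site = url.split("/")[2]
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        # A fixed-width window around the match stands in for "context".
        start, end = max(0, m.start() - 40), m.end() + 40
        yield site, text[start:end]

def reduce_site(site, contexts, whitelist):
    """Reduce step: sum occurrences per site, minus whitelisted contexts."""
    flagged = [c for c in contexts if c not in whitelist]
    return {"site": site, "total": len(contexts), "flagged": len(flagged)}

# Toy run over two "fetched" pages:
pages = [
    ("http://example.org/a", "Linux is great. GNU/Linux even more."),
    ("http://example.org/b", "Install Linux today."),
]
grouped = defaultdict(list)
for url, text in pages:
    for site, ctx in map_page(url, text):
        grouped[site].append(ctx)
reports = [reduce_site(s, cs, whitelist=set()) for s, cs in grouped.items()]
```

Each resulting report dict is the kind of per-site summary that step 4 would index into Solr, with links back to the original pages.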

You may be able to represent your grammar as textual rules instead of code. Your latency may
be minutes instead of milliseconds though...
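One way the "textual rules instead of code" idea could look; the rule-file format here is purely an assumption:

```python
import re

# Hypothetical rule file: one acceptable context pattern per line.
RULES = """\
GNU/Linux
GNU Linux
GNU Projects such as Linux
""".strip().splitlines()

def is_acceptable(context):
    """An occurrence passes if any textual rule matches its context."""
    return any(re.search(re.escape(rule), context, re.IGNORECASE)
               for rule in RULES)

is_acceptable("We ship GNU/Linux images")  # True
is_acceptable("a plain Linux mention")     # False
```

Non-programmers can then maintain the rule list without touching the job code.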

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 3. sep. 2010, at 01.03, Scott Gonyea wrote:

> Hi Grant,
> 
> Thanks for replying--sorry for sticking this on dev; I had imagined that
> development against the Solr codebase would be inevitable.
> 
> The application has to do with regulatory and legal compliance work by a
> non-profit, and is "socially good," but I need to 'abstract' the
> problem/goals--as it's not mine to disclose.
> 
> Crawl several websites (e.g. slashdot, engadget), inject them into Solr,
> and search for a given word.
> 
> Issue 1: How many times did that word appear, on the URL returned by Solr?
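Out of the box, Solr returns matching documents rather than per-document hit counts (its TermVectorComponent can expose term frequencies if term vectors are stored). As a plain post-processing sketch over the stored content:

```python
import re

def count_occurrences(text, word):
    """Whole-word, case-insensitive count within one page's content."""
    return len(re.findall(r"\b%s\b" % re.escape(word), text, re.IGNORECASE))

count_occurrences("Linux, linux, and GNU/Linux all count.", "Linux")  # → 3
```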
> 
> Suppose that word is "Linux" and you want to make sure that each occurrence
> of "Linux" also acknowledges that "Linux" is "GNU/Linux" (pedanticism gone
> wild).  Now, suppose that "GNU Linux" is ok.  And even "GNU Projects such as
> Linux" is OK too.  So, now:
> 
> Issue 2: Suppose that your goal is now to separate the noise from the
> signal.  You therefore "white list" occurrences in which "Linux" appears
> without a "GNU/" prefix, yet which you've deemed acceptable within the given
> context.  "GNU/Linux" would be a starting point for any of your
> white-listing tasks.
> 
> Simply iterating over what is--and is not--a "white list" just doesn't scale
> on a lot of levels.  So my approach is to maintain a separate datastore,
> which contains a list of phrases that are worthy of whomever's attention, as
> well as a whole lot of "phrase-contexts"... Or the context in which the
> phrase appeared.
> 
> Suppose that one website lists "Linux" 20 times; the goal is to white-list
> all 20 of those occurrences.  Or, if "Linux" appears 20 times within the
> same context, you might only need one "white list" entry to knock out all
> 20.  Further, that white-listing can generally be applied to other sites in
> which the same contexts appear.
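The context-keyed whitelist described here can be sketched as follows; the normalization rule (lower-case, collapsed whitespace) is an assumption about when two contexts count as "the same":

```python
def normalize(context):
    """Collapse case and whitespace so recurring contexts compare equal."""
    return " ".join(context.lower().split())

class Whitelist:
    """Stand-in for the separate datastore of approved phrase-contexts."""
    def __init__(self):
        self.contexts = set()

    def approve(self, context):
        self.contexts.add(normalize(context))

    def is_whitelisted(self, context):
        return normalize(context) in self.contexts

wl = Whitelist()
wl.approve("GNU Projects such as Linux")
# One approval now covers every page where the same context recurs:
wl.is_whitelisted("gnu projects  such as LINUX")  # True
```

Keying on the normalized context, rather than on individual occurrences, is what lets a single entry knock out all 20 recurrences across sites.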
> 
> I'd love to get some thoughts on how to tackle this problem, but I think
> that kicking off separate documents, within Solr, for each specific
> occurrence... would be the simplest path.  But again, I'd welcome any
> thoughts on how else I might do this, or where I should start coding :)
> 
> Thank you very much,
> Scott Gonyea
> 
> On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> 
>> Dropping dev@lucene.a.o.
>> 
>> How about we step back and please explain the problem you are trying to
>> solve, as opposed to the proposed solution to the problem below.  You can
>> likely do what you want below in Solr/Lucene (modulo replacing the index
>> with a new document), but the bigger question is "is that the best way to do
>> it?"  I think if you give us that context, then perhaps we can brainstorm on
>> solutions.
>> 
>> Thanks,
>> Grant
>> 
>> 
>> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
>> 
>>> Hi,
>>> 
>>> I'm looking to get some direction on where I should focus my attention,
>> with regards to the Solr codebase and documentation.  Rather than write a
>> ton of stuff no one wants to read, I'll just start with a use-case.  For
>> context, the data originates from Nutch crawls and is indexed into Solr.
>>> 
>>> Imagine a web page has the following content (4 occurrences of "Johnson"
>> are bolded):
>>> 
>>> --content_--
>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
>> id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla
>> magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum.
>> Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget
>> ligula nisi. Ut fringilla ullamcorper sem.
>>> --_content--
>>> 
>>> First; I would like to have the entire "content" block be indexed within
>> Solr.  This is done and definitely not an issue.
>>> 
>>> Second (+); during the injection of crawl data into Solr, I would like to
>> grab every occurrence of a specific word, or phrase, with "Johnson" being my
>> example for the above.  I want to take every such phrase (without
>> collision), as well as its unique-context, and inject that into its own,
>> separate Solr index.  For example, the above "content" example, having been
>> indexed in its entirety, would also be the source of 4 additional indexes.
>> In each index, "Johnson" would only appear once.  All of the text before
>> and after "Johnson" would be BOUND BY any other occurrence of "Johnson."
>> eg:
>>> 
>>> --index1_--
>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
>> id urna et justo fringilla dictum
>>> --_index1-- --index2_--
>>> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
>> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
>>> --_index2-- --index3_--
>>> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
>> non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
>> malesuada
>>> --_index3-- --index4_--
>>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
>> rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
>> ullamcorper sem.
>>> --_index4--
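The splitting illustrated above, each occurrence's context bounded by its neighboring occurrences, could be sketched as:

```python
import re

def occurrence_contexts(text, word):
    """Return one context per match, bounded by the neighboring matches
    (or by the start/end of the text for the first/last occurrence)."""
    matches = list(re.finditer(word, text, re.IGNORECASE))
    contexts = []
    for i, m in enumerate(matches):
        left = matches[i - 1].end() if i > 0 else 0
        right = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        contexts.append(text[left:right].strip())
    return contexts

sample = "Lorem Johnson ipsum johnson dolor johnSon sit"
contexts = occurrence_contexts(sample, "johnson")
# Each context contains its own occurrence exactly once.
```

Note that a bare substring match also catches "Johnsons", as in the fourth occurrence of the example; a whole-word pattern would exclude it.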
>>> 
>>> Q:
>>> How much of this is feasible in "present-day Solr" and how much of it do
>> I need to produce in a patch of my own?  Can anyone give me some direction
>> on where I should look, in approaching this problem (ie, libs / classes /
>> confs)?  I sincerely appreciate it.
>>> 
>>> Third; I would later like to go through the above child indexes and
>> dismiss any that appear within a given context.  For example, I may deem
>> "ipsum dolor Johnson sit amet" as not being useful and I'd want to delete
>> any indexes matching that particular phrase-context.  The deletion is
>> trivial and, with the 2nd item resolved, becomes largely a non-issue.
>>> 
>>> Q:
>>> The question, more or less, comes from the fact that my source data is
>> from a web crawler.  When recrawled, I need to repeat the process of
>> dismissing phrase-contexts that are not relevant to me.  Where is the best
>> place to perform this work?  I could easily perform queries, after indexing
>> my crawl, but that seems needlessly intensive.  I think the answer to that
>> will be "wherever I implement #2", but assumptions can be painfully
>> expensive.
>>> 
>>> 
>>> Thank you for reading my bloated e-mail.  Again, I'm mostly just looking
>> to be pointed to various pieces of the Lucene / Solr code-base, and am
>> trolling for any insight that people might share.
>>> 
>>> Scott Gonyea
>> 
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> 
>> 

