lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Date Mon, 06 Sep 2010 09:35:23 GMT
Hi,

Yes, the stemming and other features of Solr are nice. A search result from Solr gives you
each occurrence of X in Y through highlighting; the highlighter's regex fragmenter can be
configured to extract e.g. a sentence as context. You can also get the number of occurrences
(term frequency, TF) from the term vectors. TF also plays a role in scoring, as you point out.
It just sounds like a bit of overkill to me for your use case.
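
A rough sketch of what such a highlighting request could look like, written in Python. It only assembles the URL; nothing is sent to a server. The host, the `content` field and the `id` unique key are assumptions about a Nutch-style schema, and the regex is just one guess at what "a sentence" looks like. The `hl.*` parameter names should be checked against the highlighting documentation for your Solr version.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, term, doc_id, field="content"):
    """Build a Solr select URL asking for highlighted snippets of `term`
    in one document. Field names and values are illustrative assumptions."""
    params = {
        "q": f'{field}:"{term}"',      # match the term in the text field
        "fq": f'id:"{doc_id}"',        # restrict to a single document
        "fl": "id",                    # stored fields to return
        "hl": "true",                  # turn highlighting on
        "hl.fl": field,                # field to highlight
        "hl.snippets": 50,             # upper bound on snippets per document
        "hl.fragmenter": "regex",      # use the regex fragmenter
        "hl.regex.pattern": r"[^.!?]{10,300}[.!?]",  # assumed sentence shape
        "hl.regex.slop": 0.5,          # how far fragsize may be stretched
        "hl.fragsize": 120,            # target fragment size in characters
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_highlight_url(
    "http://localhost:8983/solr", "Linux", "http://example.com/page"
)
print(url)
```

Each snippet returned by such a request would correspond to one highlighted fragment, which is close to "each occurrence of X in document Y", although several occurrences can land in the same fragment.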

I don't have enough hands-on experience with Hadoop yet to tell you how to do it
in M/R. I would suppose there are good light-weight text extraction frameworks out there
which could do much of what you need. Did you know you can also embed Solr through
EmbeddedSolrServer to include it in a workflow (see SOLR-1301)? Also, I just found
http://sna-projects.com/azkaban/ which looks promising for controlling advanced Hadoop
workflows. Just some pointers.
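
For the light-weight route, the extraction itself does not need Solr or Hadoop. Below is a minimal Python sketch of the split described in the original message quoted further down: one context per occurrence, bounded by the neighbouring occurrences, followed by dismissal of whitelisted contexts. The function names, the shortened sample text and the plain substring matching are all assumptions made for illustration. Note that this does no stemming, so it would not replace Solr's analysis chain.

```python
import re


def split_contexts(text, term):
    """Return one context per case-insensitive occurrence of `term`.

    Each context runs from the end of the previous occurrence to the
    start of the next one, so it contains the term exactly once.
    """
    matches = list(re.finditer(re.escape(term), text, flags=re.IGNORECASE))
    contexts = []
    for i, match in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        contexts.append(text[start:end].strip())
    return contexts


def dismiss_whitelisted(contexts, whitelist):
    """Drop every context that contains a whitelisted phrase."""
    lowered = [phrase.lower() for phrase in whitelist]
    return [c for c in contexts if not any(p in c.lower() for p in lowered)]


# Shortened stand-in for the "content" block in the original message.
SAMPLE = (
    "Lorem ipsum dolor Johnson sit amet. Dictum johnson in at tortor. "
    "Sed johnSon sed elit. Malesuada Johnsons mi."
)

contexts = split_contexts(SAMPLE, "johnson")
remaining = dismiss_whitelisted(contexts, ["ipsum dolor Johnson sit amet"])

for context in remaining:
    print(context)
```

The same two functions could serve as the body of a map step, with the reduce step summing the surviving contexts per site.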

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 3. sep. 2010, at 19.53, Scott Gonyea wrote:

> I've been considering the use of Hadoop, since that's what Nutch uses.
> Unless I piggy-back onto Nutch's MR job, when creating a Solr index, I'm
> wondering if it's overkill.  I can see ways of working it into a MapReduce
> workflow, but it would involve dumping the database onto HDFS beforehand.
> I'm still debating that one, with myself.
> 
> One other thing that I want to take advantage of is Lucene/Solr's filter
> factories (?).  I'm not sure if I have the terminology right, but there are
> a lot of advanced text-parsing features.  IE, a search for "reality" would
> also turn up "reale."  It seems that I would want to perform my "find words,
> filter out any white-listed context, and re-inject" after Nutch stuffs Solr
> with all of its crawl data.
> 
> So, perhaps I can get help starting at #1 of your suggestion:
> 
> How would I best extract a phrase from Solr?  IE, can I tell Solr "give me
> each occurrence of X in document Y" or (and I'm guessing this is it) where
> would I look to perform that kind of a search, myself?
> 
> Thinking about it, I imagine that Solr might tend to "flatten" words in its
> index.  Ie, the string "reality" only really occurs once in a given page's
> index, and (maybe?) it'll have some boost reflecting the number of times it
> appeared.  Please excuse my obscene generalizations :(.
> 
> I'm going to do some more digging through the Solr.  I appreciate your
> help.  I am a bit of a beggar when it comes to seeking out help on where to
> start.  But, as I mentioned on the Nutch list, I will contribute all of my
> changes back to Solr.  I'll also look to improve documentation, which I
> still owe Nutch,  but that's queueing up for when there's a lull.
> 
> Thank you, - Scott
> 
> On Fri, Sep 3, 2010 at 1:19 AM, Jan Høydahl / Cominvent <
> jan.asf@cominvent.com> wrote:
> 
>> Hi,
>> 
>> This smells like a job for Hadoop and perhaps Mahout, unless your use cases
>> are totally ad-hoc research.
>> After Nutch has fetched the sites, kick off some MapReduce jobs for each
>> case you wish to study:
>> 1. Extract phrases/contexts
>> 2. For each context, perform detection and whitelisting
>> 3. In the reduce step, sum it all up, and write the results to some store
>> 4. Now you may index a "report" per site into Solr, with links to the
>> original pages for each context
>> 
>> You may be able to represent your grammar as textual rules instead of code.
>> Your latency may be minutes instead of milliseconds though...
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> On 3. sep. 2010, at 01.03, Scott Gonyea wrote:
>> 
>>> Hi Grant,
>>> 
>>> Thanks for replying--sorry for sticking this on dev; I had imagined that
>>> development against the Solr codebase would be inevitable.
>>> 
>>> The application has to do with regulatory and legal compliance work by a
>>> non-profit, and is "socially good," but I need to 'abstract' the
>>> problem/goals--as it's not mine to disclose.
>>> 
>>> Crawl several websites, ie: slashdot, engadget, etc., inject them into
>>> Solr, and search for a given word.
>>> 
>>> Issue 1: How many times did that word appear, on the URL returned by
>>> Solr?
>>> 
>>> Suppose that word is "Linux" and you want to make sure that each
>>> occurrence of "Linux" also acknowledges that "Linux" is "GNU/Linux"
>>> (pedanticism gone wild).  Now, suppose that "GNU Linux" is ok.  And even
>>> "GNU Projects such as Linux" is OK too.  So, now:
>>> 
>>> Issue 2: Suppose that your goal is now to separate the noise from the
>>> signal.  You therefore "white list" occurrences in which "Linux" appears
>>> without a "GNU/" prefix, yet which you've deemed acceptable within the
>>> given context.  "GNU/Linux" would be a starting point for any of your
>>> white-listing tasks.
>>> 
>>> Simply iterating over what is--and is not--a "white list" just doesn't
>>> scale on a lot of levels.  So my approach is to maintain a separate
>>> datastore, which contains a list of phrases that are worthy of whomever's
>>> attention, as well as a whole lot of "phrase-contexts"... Or the context
>>> in which the phrase appeared.
>>> 
>>> Suppose that one website lists "Linux" 20 times; the goal is to
>>> white-list all 20 of those occurrences.  Or perhaps "Linux" appears 20
>>> times, within the same context, then you might only need 1 "white list"
>>> to knock out all 20.  Further, the white-listing can generally be applied
>>> to other sites in which they appear.
>>> 
>>> I'd love to get some thoughts on how to tackle this problem, but I think
>>> that kicking off separate documents, within Solr, for each specific
>>> occurrence... would be the simplest path.  But again, I'd love for some
>>> thoughts on how else I might do this, or where I should start my coding :)
>>> 
>>> Thank you very much,
>>> Scott Gonyea
>>> 
>>> On Thu, Sep 2, 2010 at 2:12 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>> 
>>>> Dropping dev@lucene.a.o.
>>>> 
>>>> How about we step back and please explain the problem you are trying to
>>>> solve, as opposed to the proposed solution to the problem below.  You can
>>>> likely do what you want below in Solr/Lucene (modulo replacing the index
>>>> with a new document), but the bigger question is "is that the best way to
>>>> do it?"  I think if you give us that context, then perhaps we can
>>>> brainstorm on solutions.
>>>> 
>>>> Thanks,
>>>> Grant
>>>> 
>>>> 
>>>> On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I'm looking to get some direction on where I should focus my attention,
>>>>> with regards to the Solr codebase and documentation.  Rather than write a
>>>>> ton of stuff no one wants to read, I'll just start with a use-case.  For
>>>>> context, the data originates from Nutch crawls and is indexed into Solr.
>>>>> 
>>>>> Imagine a web page has the following content (4 occurrences of "Johnson"
>>>>> are bolded):
>>>>> 
>>>>> --content_--
>>>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
>>>>> id urna et justo fringilla dictum johnson in at tortor. Nulla eu nulla
>>>>> magna, nec sodales est. Sed johnSon sed elit non lorem sagittis fermentum.
>>>>> Mauris a arcu et sem sagittis rhoncus vel malesuada Johnsons mi. Morbi eget
>>>>> ligula nisi. Ut fringilla ullamcorper sem.
>>>>> --_content--
>>>>> 
>>>>> First; I would like to have the entire "content" block be indexed within
>>>>> Solr.  This is done and definitely not an issue.
>>>>> 
>>>>> Second (+); during the injection of crawl data into Solr, I would like to
>>>>> grab every occurrence of a specific word, or phrase, with "Johnson" being my
>>>>> example for the above.  I want to take every such phrase (without
>>>>> collision), as well as its unique-context, and inject that into its own,
>>>>> separate Solr index.  For example, the above "content" example, having been
>>>>> indexed in its entirety, would also be the source of 4 additional indexes.
>>>>> In each index, "Johnson" would only appear once.  All of the text before
>>>>> and after "Johnson" would be BOUND BY any other occurrence of "Johnson."
>>>>> eg:
>>>>> 
>>>>> --index1_--
>>>>> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean
>>>>> id urna et justo fringilla dictum
>>>>> --_index1--
>>>>> --index2_--
>>>>> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
>>>>> dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed
>>>>> --_index2--
>>>>> --index3_--
>>>>> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit
>>>>> non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
>>>>> malesuada
>>>>> --_index3--
>>>>> --index4_--
>>>>> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis
>>>>> rhoncus vel malesuada Johnsons mi. Morbi eget ligula nisi. Ut fringilla
>>>>> ullamcorper sem.
>>>>> --_index4--
>>>>> 
>>>>> Q:
>>>>> How much of this is feasible in "present-day Solr" and how much of it do
>>>>> I need to produce in a patch of my own?  Can anyone give me some direction
>>>>> on where I should look, in approaching this problem (ie, libs / classes /
>>>>> confs)?  I sincerely appreciate it.
>>>>> 
>>>>> Third; I would later like to go through the above child indexes and
>>>>> dismiss any that appear within a given context.  For example, I may deem
>>>>> "ipsum dolor Johnson sit amet" as not being useful and I'd want to delete
>>>>> any indexes matching that particular phrase-context.  The deletion is
>>>>> trivial and, with the 2nd item resolved--this becomes a fairly non-issue.
>>>>> 
>>>>> Q:
>>>>> The question, more or less, comes from the fact that my source data is
>>>>> from a web crawler.  When recrawled, I need to repeat the process of
>>>>> dismissing phrase-contexts that are not relevant to me.  Where is the best
>>>>> place to perform this work?  I could easily perform queries, after indexing
>>>>> my crawl, but that seems needlessly intensive.  I think the answer to that
>>>>> will be "wherever I implement #2", but assumptions can be painfully
>>>>> expensive.
>>>>> 
>>>>> 
>>>>> Thank you for reading my bloated e-mail.  Again, I'm mostly just looking
>>>>> to be pointed to various pieces of the Lucene / Solr code-base, and am
>>>>> trolling for any insight that people might share.
>>>>> 
>>>>> Scott Gonyea
>>>> 
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>>> 
>>>> 
>> 
>> 

