lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: In Need of Direction; Phrase-Context Tracking / Injection (Child Indexes) / Dismissal
Date Thu, 02 Sep 2010 21:12:19 GMT
Dropping dev@lucene.a.o.

How about we step back: could you please explain the problem you are trying to solve, as opposed
to the proposed solution below?  You can likely do what you want below in Solr/Lucene
(modulo replacing the index with a new document), but the bigger question is "is that the
best way to do it?"  If you give us that context, then perhaps we can brainstorm on
solutions.

Thanks,
Grant


On Sep 1, 2010, at 8:29 PM, Scott Gonyea wrote:

> Hi,
> 
> I'm looking to get some direction on where I should focus my attention, with regard
> to the Solr codebase and documentation.  Rather than write a ton of stuff no one wants to
> read, I'll just start with a use-case.  For context, the data originates from Nutch crawls
> and is indexed into Solr.
> 
> Imagine a web page has the following content (4 occurrences of "Johnson" are bolded):
> 
> --content_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo
> fringilla dictum johnson in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon
> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
> Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
> --_content--
> 
> First, I would like to have the entire "content" block be indexed within Solr.  This
> is done and definitely not an issue.
> 
> Second (+), during the injection of crawl data into Solr, I would like to grab every
> occurrence of a specific word, or phrase, with "Johnson" being my example for the above.  I
> want to take every such phrase (without collision), as well as its unique context, and inject
> that into its own, separate Solr index.  For example, the above "content" example, having
> been indexed in its entirety, would also be the source of 4 additional indexes.  In each index,
> "Johnson" would only appear once.  All of the text before and after "Johnson" would be BOUND
> BY any other occurrence of "Johnson", e.g.:
> 
> --index1_--
> Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. Aenean id urna et justo
> fringilla dictum
> --_index1--
> --index2_--
> sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum johnson
> in at tortor. Nulla eu nulla magna, nec sodales est. Sed
> --_index2--
> --index3_--
> in at tortor. Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem sagittis
> fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
> --_index3--
> --index4_--
> sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
> Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
> --_index4--
> 
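The splitting described above can be tried outside Solr as a plain pre-processing step. What follows is only a minimal Python sketch, not Solr or Nutch code: the function name `split_contexts` is hypothetical, and treating "johnSon" and "Johnsons" as occurrences of "Johnson" (case-insensitive, trailing word characters allowed) is an assumption read off the example.

```python
import re


def split_contexts(content, term):
    """Return one snippet per occurrence of `term`.

    Each snippet runs from the end of the previous occurrence (or the
    start of the text) to the start of the next occurrence (or the end
    of the text), so it contains exactly one occurrence.
    """
    # Assumption: match case-insensitively and allow a word suffix,
    # so "johnSon" and "Johnsons" both count as "Johnson".
    pattern = re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)
    matches = list(pattern.finditer(content))

    snippets = []
    for i, match in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        snippets.append(content[start:end].strip())
    return snippets


if __name__ == "__main__":
    for snippet in split_contexts("a Johnson b johnson c Johnsons d", "Johnson"):
        print(snippet)
```

Each returned snippet could then be sent to Solr as its own document; how that is wired into the Nutch-to-Solr step is left open here.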
> Q:
> How much of this is feasible in "present-day Solr" and how much of it do I need to produce
> in a patch of my own?  Can anyone give me some direction on where I should look, in approaching
> this problem (i.e., libs / classes / confs)?  I sincerely appreciate it.
> 
> Third, I would later like to go through the above child indexes and dismiss any that
> appear within a given context.  For example, I may deem "ipsum dolor Johnson sit amet" as
> not being useful and I'd want to delete any indexes matching that particular phrase-context.
> The deletion is trivial and, with the 2nd item resolved, this becomes largely a non-issue.
> 
> Q:
> The question, more or less, comes from the fact that my source data is from a web crawler.
> When recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant
> to me.  Where is the best place to perform this work?  I could easily perform queries, after
> indexing my crawl, but that seems needlessly intensive.  I think the answer to that will be
> "wherever I implement #2", but assumptions can be painfully expensive.
> 
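One option for the dismissal step, assuming the list of dismissed phrase-contexts is available when the crawl is indexed, is to filter the snippets before they ever reach Solr rather than querying and deleting afterwards. This is a minimal Python sketch under that assumption; `normalize` and `keep_snippets` are hypothetical helper names, and matching is a plain case-insensitive substring test.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def keep_snippets(snippets, dismissed_phrases):
    """Drop every snippet that contains any dismissed phrase-context."""
    dismissed = [normalize(phrase) for phrase in dismissed_phrases]
    return [
        snippet
        for snippet in snippets
        if not any(phrase in normalize(snippet) for phrase in dismissed)
    ]


if __name__ == "__main__":
    snippets = [
        "Lorem ipsum dolor Johnson sit amet, consectetur",
        "Sed johnSon sed elit",
    ]
    print(keep_snippets(snippets, ["ipsum dolor Johnson sit amet"]))
```

Because the filter only depends on the snippet text and the dismissal list, it can be rerun unchanged on every recrawl.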
> 
> Thank you for reading my bloated e-mail.  Again, I'm mostly just looking to be pointed
> to various pieces of the Lucene / Solr code-base, and am trolling for any insight that people
> might share.
> 
> Scott Gonyea

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8

