lucene-solr-user mailing list archives

From David Neubert <devmecr...@yahoo.com>
Subject Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Date Mon, 12 Nov 2007 19:20:49 GMT
Erick - thanks, I am considering this approach versus explicit redundant
indexing -- and am also considering Lucene -- the problem is, I am one week
into both technologies (though I have years in the search space) -- wish I
could go to Hong Kong -- any discounts available anywhere :)

Dave

----- Original Message ----
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be
useful....

For your line number, page number etc. perspective, it is possible to index
special guaranteed-to-not-match tokens and then use the TermDocs/TermEnum
data, along with SpanQueries, to figure this out at search time. For
instance, coincident with the last term in each line, index the token
"$$$$$". Coincident with the last token of every paragraph, index the token
"#####". If you get the positions of the matching terms, you can quickly
count the number of line and paragraph tokens using TermDocs/TermEnum and
correlate hits to lines and paragraphs. The trick is to index your special
tokens with a position increment of 0 (see SynonymAnalyzer in Lucene in
Action for more on this).
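The counting step can be sketched outside Lucene (the marker positions here
are hypothetical; in practice you would read them from the index via
TermDocs/TermPositions for the hit document):

```python
from bisect import bisect_left

# Hypothetical term positions for one document. With a position
# increment of 0, each marker token shares the position of the last
# real token on its line / in its paragraph.
line_marker_positions = [4, 9, 15, 22]   # positions of "$$$$$"
para_marker_positions = [9, 22]          # positions of "#####"

def line_of(hit_position):
    # 1-based line number: count line markers strictly before the hit
    # (a hit at a marker's own position is still on that line).
    return bisect_left(line_marker_positions, hit_position) + 1

def para_of(hit_position):
    return bisect_left(para_marker_positions, hit_position) + 1

# A SpanQuery hit at position 11 falls on line 3, in paragraph 2.
print(line_of(11), para_of(11))  # -> 3 2
```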


Another possibility is to add a special field to each document containing
the offsets of each end-of-sentence and end-of-paragraph (stored, not
indexed). Again, given the offsets of your hits, you can read in this field
and figure out what line/paragraph your hits are in.
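A sketch of that lookup, assuming the stored field holds sorted
end-of-sentence character offsets (field name and values hypothetical):

```python
from bisect import bisect_left

# Hypothetical contents of a stored (not indexed) field for one
# document: sorted character offsets of each end-of-sentence.
sentence_end_offsets = [57, 131, 210, 305]

def sentence_index(match_offset):
    # The hit belongs to the first sentence whose end offset is at or
    # after the match offset (0-based sentence index).
    return bisect_left(sentence_end_offsets, match_offset)

print(sentence_index(140))  # -> 2 (the third sentence)
```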

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high volume applications.

Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so
don't even *think* of asking me how to really make this work in SOLR <G>.

Best
Erick

On Nov 11, 2007 12:44 AM, David Neubert <devmecrazy@yahoo.com> wrote:

> Ryan (and others who need something to put them to sleep :) )
>
> Wow -- the light-bulb finally went off -- the Analyzer admin page is
> very cool -- I just was not at all thinking the SOLR/Lucene way.
>
> I need to rethink my whole approach now that I understand (from
> reviewing the schema.xml closer and playing with the Analyzer) how
> compatible index and query policies can be applied automatically on a
> field-by-field basis by SOLR at both index and query time.
>
> I still may have a stumper here, but I need to give it some thought,
> and may return again with another question:
>
> The problem is that my text is book text (fairly large) that looks
> very much like one would expect:
> <book>
> <chapter>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>...</sen></para>
> </chapter>
> </book>
>
> The search results need to return exact sentences or paragraphs with
> their exact page:line numbers (which are available in the embedded
> markup in the text).
>
> There were previous responses by others suggesting I look into
> payloads, but I did not fully understand that -- I may have to re-read
> those e-mails now that I am getting a clearer picture of SOLR/Lucene.
>
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single
> document, is that I was planning on pre-parsing the text myself
> (outside of SOLR) -- feeding separate <doc> elements to the <add> --
> because that way I could produce the page:line reference in the
> pre-parsing (again outside of SOLR) and feed it in as an explicit
> field in the <doc> elements of the <add> requests.  Therefore at
> query time, I will have the exact page:line corresponding to the
> start of the paragraph or sentence.
>
> But I am beginning to suspect I was planning to do a lot of work that
> SOLR can do for me.
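The pre-parsing step described above can be sketched as follows (element
names and the page:line attribute are hypothetical stand-ins for the
embedded markup in the books):

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter markup based on the sample structure above;
# the loc attribute stands in for the embedded page:line markup.
chapter = """
<chapter>
  <para><sen loc="12:4">First sentence.</sen><sen loc="12:6">Second.</sen></para>
  <para><sen loc="13:1">Third sentence.</sen></para>
</chapter>
"""

def sentence_docs(xml_text):
    # Pre-parse (outside SOLR): one <doc> per sentence, carrying its
    # page:line reference as an explicit field.
    root = ET.fromstring(xml_text)
    docs = []
    for p, para in enumerate(root.iter("para")):
        for sen in para.iter("sen"):
            docs.append({"text": sen.text, "pageline": sen.get("loc"), "para": p})
    return docs

for d in sentence_docs(chapter):
    print(d)
```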
>
> I will continue to study this and respond when I am a bit clearer,
> but the closer I could get to just submitting the books a chapter at
> a time -- and letting SOLR do the work -- the better (because I have
> all the books in well-formed XML at chapter level).  However, I don't
> see yet how I could get par/sen granular search result hits, along
> with their exact page:line coordinates, unless I approach it by
> explicitly indexing the pars and sens as single documents (not
> chapters), return the entire text of the sen or par, and highlight
> the keywords within (for the search result hit).  Once a search
> result hit is selected, it would then act as expected and position
> into the chapter, at the selected reference, highlighting the key
> words again, but this time in the context of an entire chapter (the
> whole document to the user's mind).
>
> Even with my new understanding you (and others) have given me, which
> I can certainly use to improve my approach -- it still seems to me
> that because multi-valued fields concatenate text -- even if you use
> the positionIncrementGap feature to prohibit unwanted phrase matches
> -- how do you produce a well-defined search result hit, bounded by
> the exact sen or par, unless you index them as single documents?
>
> Should I still read up on the payload discussion?
>
> Dave
>
>
>
>
> ----- Original Message ----
> From: Ryan McKinley <ryantxu@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November 10, 2007 5:00:43 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and
 case
> sensitivity)
>
>
> David Neubert wrote:
> > Ryan,
> >
> > Thanks for your response.  I infer from your response that you can
>  have a different analyzer for each field
>
> yes!  each field can have its own indexing strategy.
>
>
> > I believe that the Analyzer approach you suggested requires the use
> > of the same Analyzer at query time that was used during indexing.
>
> it does not require the *same* Analyzer - it just requires one that
> generates compatible tokens.  That is, you may want the indexing to
> split the input into sentences, but the query time analyzer keeps the
> input as a single token.
>
> check the example schema.xml file -- the 'text' field type applies
> synonyms at index time, but does not at query time.
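In the example schema.xml, that index/query split looks roughly like the
fragment below (trimmed; the exact tokenizer and filter lists vary by Solr
version, so treat the class names as illustrative):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```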
>
> re searching across multiple fields, don't worry, lucene handles this
> well.  You may want to do that explicitly or with the dismax handler.
>
> I'd suggest you play around with indexing some data.  check the
> analysis.jsp in the admin section.  It is a great tool to help figure
> out what analyzers do at index vs query time.
>
> ryan
>
>
>
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>




