lucene-solr-user mailing list archives

From David Neubert <>
Subject Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Date Mon, 12 Nov 2007 19:20:49 GMT
Erik - thanks, I am considering this approach, versus explicit redundant indexing -- and am
also considering Lucene -- problem is, I am one week into both technologies (though I have
years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)


----- Original Message ----
From: Erick Erickson <>
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this
may still be useful.

For your line number, page number etc. perspective, it is possible to index
special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum
data, along with SpanQueries, to figure this out at search time. For instance,
coincident with the last term in each line, index the token "$$$$$".
Coincident with the last token of every paragraph, index the token "#####". If you
get the offsets of the matching terms, you can quite quickly simply count the number
of line and paragraph tokens using TermDocs/TermEnums and correlate them
to lines and paragraphs. The trick is to index your special tokens with a position
increment of 0 (see SynonymAnalyzer in Lucene in Action for more on this).
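
As a rough sketch of that counting step (plain Python, not the Lucene API --
the name `line_of_position` and the sentinel list are made up for illustration),
assuming you have already collected the positions of the "$$$$$" tokens for the
document via TermPositions:

```python
def line_of_position(sentinel_positions, match_position):
    """Return the 1-based line number containing a matched term.

    sentinel_positions: sorted token positions at which the "$$$$$"
    end-of-line sentinel was indexed (position increment 0, so each
    sentinel shares the position of its line's last real token).
    match_position: token position of the matched term.
    """
    # Every sentinel strictly before the match marks a fully completed
    # line, so the match falls on the next line after those.
    return sum(1 for s in sentinel_positions if s < match_position) + 1
```

The same count over the "#####" positions gives the paragraph number.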

Another possibility is to add a special field with each document with a list
of each end-of-sentence and end-of-paragraph offsets (stored, not indexed).
Again, given the offsets, you can read in this field and figure out which
sentence or paragraph your hits are in.
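
A minimal sketch of this second approach (again plain Python with illustrative
names, not anything shipped with Lucene): read the stored end-of-sentence
offsets back and binary-search them at display time.

```python
from bisect import bisect_right

def sentence_of_offset(sentence_end_offsets, hit_offset):
    """Return the 1-based sentence number containing hit_offset.

    sentence_end_offsets: sorted character offsets one past the end of
    each sentence, as read back from the stored (not indexed) field.
    hit_offset: character offset of the matching term.
    """
    # Count how many sentences end at or before the hit; the hit is in
    # the next sentence after those.
    return bisect_right(sentence_end_offsets, hit_offset) + 1
```

The paragraph lookup is identical, just over the end-of-paragraph offsets.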

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for high-volume
applications.

Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't
even *think* of asking me how to really make this work in SOLR <G>.


On Nov 11, 2007 12:44 AM, David Neubert <> wrote:

> Ryan (and others who need something to put them to sleep :) )
> Wow -- the light bulb finally went off -- the Analyzer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
> I need to rethink my whole approach now that I understand (from reading
> the schema.xml closer and playing with the Analyzer) how compatible
> index and query policies can be applied automatically on a field-by-field
> basis by SOLR at both index and query time.
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
> The problem is that my text is book text (fairly large) that looks
> much like one would expect:
> <book>
> <chapter>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>...</sen></para>
> </chapter>
> </book>
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which are available in the embedded markup in the
> text).
> There were previous responses by others, suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those responses
> now that I am getting a clearer picture of SOLR/Lucene.
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single
> document, is because I was planning on pre-parsing the text myself
> (outside of SOLR) -- and feeding separate <doc> elements to the <add>
> because in that way I could produce the page:line reference in the
> pre-parsing (again outside of SOLR) and feed it in as an explicit field
> in the <doc> elements of the add requests.  Therefore at query time, I
> will have the exact page:line reference corresponding to the start of
> the paragraph or sentence.
> But I am beginning to suspect, I was planning to do a lot of work that
> SOLR can do for me.
> I will continue to study this and respond when I am a bit clearer, but
> the closer I could get to just submitting the books a chapter at a time
> -- and letting SOLR do the work, the better (cause I have all the books
> in well-formed xml at chapter levels).  However, I don't see yet how I
> could get par/sen granular search result hits, along with their exact
> page:line coordinates, unless I approach it by explicitly indexing the
> pars and sens as single documents, not chapter hits, and also return
> the entire text of the sen or par, and highlight the keywords within
> (for the search result list).  Once a search result hit is selected, it
> would then act as expected -- position into the chapter, at the
> selected reference, highlight again the key words, but this time in the
> context of an entire chapter (the whole document to the user's mind).
> Even with my new understanding you (and others) have given me, which I
> can use to certainly improve my approach -- it still seems to me that
> since multi-valued fields concatenate text -- even if you use the
> positionIncrementGap feature to prohibit unwanted phrase matches -- how
> do you produce a well-defined search result hit, bounded by the exact
> sen or par, unless you index them as single documents?
> Should I still read up on the payload discussion?
> Dave
> ----- Original Message ----
> From: Ryan McKinley <>
> To:
> Sent: Saturday, November 10, 2007 5:00:43 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and
> sensitivity)
> David Neubert wrote:
> > Ryan,
> >
> > Thanks for your response.  I infer from your response that you can
> > have a different analyzer for each field
> yes!  each field can have its own indexing strategy.
> > I believe that the Analyzer approach you suggested requires the use
> > of the same Analzyer at query time that was used during indexing.
> it does not require the *same* Analyzer - it just requires one that
> generates compatible tokens.  That is, you may want the indexing to
> split the input into sentences, but the query-time analyzer keeps the
> input as a single token.
> check the example schema.xml file -- the 'text' field type applies
> synonyms at index time, but does not at query time.
> re searching across multiple fields, don't worry, lucene handles this
> well.  You may want to do that explicitly or with the dismax handler.
> I'd suggest you play around with indexing some data.  check the
> analysis.jsp in the admin section.  It is a great tool to help figure
> out what analyzers do at index vs query time.
> ryan
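
To make the per-field index/query analyzer idea concrete, here is a hedged
sketch of what such a field type can look like in schema.xml (the field type
name `text_sen` and the exact filter lineup are illustrative, not the shipped
example schema; the factory classes are the stock Solr ones):

```xml
<fieldType name="text_sen" class="solr.TextField" positionIncrementGap="100">
  <!-- Index-time analysis: expand synonyms while indexing -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query-time analysis: compatible tokens, but no synonym expansion -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The positionIncrementGap attribute is what keeps phrase queries from matching
across the boundary between successive values of a multi-valued field -- the
concatenation worry David raises above.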