lucene-java-user mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Context specific summary with the search term
Date Tue, 23 Oct 2001 16:48:55 GMT
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> > 
> > How did the title ever get indexed as the title?  
> 
> I'm indexing HTML documents marked up with comments to indicate field
> boundaries. So I'd typically have:
> 
> <!--field:section_title-->
> blurb
> <!--field:text-->
> more blurb
> 
> and so on. The documents were indexed by looking for each field marker
> and then adding the subsequent lines to the relevant field.
> 
> In order to obtain a generic solution for context generation

If you're doing application-specific processing to extract fields from
documents, then a completely generic solution for extracting hit context
from documents is, by definition, impossible, since context extraction
requires field extraction.

> are you
> suggesting I write a method that takes plain text, (eg, text form of
> document) and a query, and assumes the plain text is in the query's
> default field?

I'm not exactly sure what you're proposing here, but, no, it doesn't sound
like something that I have suggested.

> This doesn't seem quite as useful as getContext(Hashset queryTerms,
> Reader originalDocument); which is what I was originally 
> aiming towards.

Such a method is easy to define if the Reader contains text from a single
field.  (Although you should probably pass in an Analyzer too.)  However, if
you're expecting such a method to automatically divide the text into fields,
then things will be harder, since Lucene's model is that applications divide
documents into fields.  So you could write an application-specific version
that divides fields automatically, or, to use more generic code, you could
call such a generic method once for each field of your document, leaving
field extraction in application-specific code.  Does that make sense?

Doug
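
The split described above, application-specific field extraction followed by one generic context call per field, might look roughly like the sketch below. It is only an illustration and not part of Lucene: the class name, splitFields, getContext, and the window parameter are all hypothetical, the marker syntax is the one quoted earlier in the thread, and a plain lower-casing step stands in for the Analyzer that a real version should probably accept. Query terms are assumed to be lower-cased already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldContext {
    // Marker syntax from the thread: <!--field:NAME-->
    private static final Pattern MARKER = Pattern.compile("<!--field:(\\w+)-->");

    /** Application-specific step: split a marked-up document into field name -> text. */
    static Map<String, String> splitFields(String document) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = MARKER.matcher(document);
        String name = null;
        int start = 0;
        while (m.find()) {
            if (name != null) {
                // Text between the previous marker and this one belongs to the previous field.
                fields.merge(name, document.substring(start, m.start()).trim(),
                        (a, b) -> a + "\n" + b);
            }
            name = m.group(1);
            start = m.end();
        }
        if (name != null) {
            // Everything after the last marker belongs to the last field.
            fields.merge(name, document.substring(start).trim(), (a, b) -> a + "\n" + b);
        }
        return fields;
    }

    /** Generic step: a window of words around each query term in ONE field's text. */
    static List<String> getContext(Set<String> queryTerms, String fieldText, int window) {
        String[] words = fieldText.trim().split("\\s+");
        List<String> snippets = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Stand-in for an Analyzer: lower-case and drop punctuation (an assumption).
            String term = words[i].toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
            if (queryTerms.contains(term)) {
                int from = Math.max(0, i - window);
                int to = Math.min(words.length, i + window + 1);
                snippets.add(String.join(" ", Arrays.copyOfRange(words, from, to)));
            }
        }
        return snippets;
    }

    public static void main(String[] args) {
        String doc = "<!--field:section_title-->\nblurb\n<!--field:text-->\nmore blurb\n";
        // The generic method is called once per field; field extraction stays in the application.
        for (Map.Entry<String, String> field : splitFields(doc).entrySet()) {
            System.out.println(field.getKey() + ": "
                    + getContext(Set.of("blurb"), field.getValue(), 2));
        }
    }
}
```

Keeping splitFields separate means a different markup scheme only requires replacing that one method, while getContext never needs to know how fields were delimited.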
