lucene-java-user mailing list archives

From Stephane Vaucher
Subject Re: How to store document meta information
Date Mon, 28 Apr 2003 21:09:04 GMT
On Mon, 28 Apr 2003, Joshua O'Madadhain wrote:

> On Mon, 28 Apr 2003, Stephane Vaucher wrote:
> > I've got a document that I run through an information extraction engine
> > that returns a list of concepts associated to a document with an
> > appropriate relevancy factor (for example, with a news article, it might
> > return sport=100%, literature=84% and politics=10%).
> It's unclear what the semantics of your relevancy measure are.  Is it
> something like a fuzzy set measure ('this article is 100% in the set of
> documents about sports, 84% in ... literature, and 10% in politics')?

My IR server extracts entities such as geographic locations and
people's names and assigns each one a weight, computed from some
combination of a statistical and a linguistic model, that should
reflect the importance of the term. It's been a while since my AI
course, so I hope I'm not talking nonsense, but it probably is fuzzy.

If an input is a news article, it can return for example:
"United States" type=geo location weight=98 frequency=4
"Bush"          type=person name  weight=78 frequency=2

> > I would like to index these concepts with an indication of their relevancy
> > levels. Is there a recommended way of doing this? Searching the FAQs, I
> > found none, but from my knowledge of lucene, I gather I could do it the
> > following ways:
> >
> > 1) If all concepts were to be stored in a single field (as I would
> > prefer), I don't think I can use field boosting, so I would have to
> > probably hold multiple instances of my concept (e.g. I could have 100
> > "sport", 84 "literature" and 10 "politics") in my field.
> >
> > 2) I could use multiple fields with varying boost factors. But I would be
> > forced to determine ahead of time how many concepts I'll have to perform
> > searches on all of the appropriate fields. This could probably affect the
> > performance of the app (I say this with no numbers, simple intuition, so
> > correct me if I'm wrong).
> How do you intend to use these concepts in the search process?  That is,
> how will these concepts be used by (a) the user in specifying a query, (b)
> the indexer in storing the associated documents, (c) the searcher in
> retrieving documents, and (d) the presentation of the results to the user?
> Without knowing these things, it's hard to answer your question (at least
> for me).

I hope I answer your questions correctly:

a) A query could include the extracted geographical locations. A user who
wants American news might specify "United States", and articles with a
high weight for the U.S. should show up first.
b) I'll have to store these concepts in one or more fields so I can
search on them. I still need to find a way to represent the weight, though...
c) I'll have fields with concepts. How I use them will depend on the way I
index the docs.
d) I would like to show the concepts found in the result set, so I can help
a user refine his search. E.g., a search for "manchester united" returns docs
with the concept 'sports'. I would allow the user to add the concept 'sports'
to his query.
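
As a rough illustration of point d), here is a minimal sketch of how the
concepts attached to the documents in a result set might be summed up and
offered as refinements. It is plain Java with no Lucene calls; the class name
ConceptSuggester, the method suggest, and the idea of ranking by summed
weight are assumptions made for this example, not something the extraction
engine or Lucene provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks concepts seen in a result set by total weight.
class ConceptSuggester {

    // hitConcepts holds one map per returned document: concept name -> weight (0-100).
    // Returns up to max concept names, highest summed weight first,
    // with ties broken alphabetically so the order is stable.
    static List<String> suggest(List<Map<String, Integer>> hitConcepts, int max) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> doc : hitConcepts) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> names = new ArrayList<>(totals.keySet());
        names.sort((a, b) -> {
            int byWeight = totals.get(b).compareTo(totals.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return new ArrayList<>(names.subList(0, Math.min(max, names.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("sports", 100);
        doc1.put("politics", 10);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("sports", 90);
        doc2.put("literature", 84);

        List<Map<String, Integer>> hits = new ArrayList<>();
        hits.add(doc1);
        hits.add(doc2);
        System.out.println(suggest(hits, 2));
    }
}
```

This assumes the concepts and weights of each hit can be read back from the
index (or from a side store) at display time, which is itself a design
decision that depends on how the fields end up being indexed.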

The two numbered options above describe the ways I thought I could index the concept fields.
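
The first of those options (repeating a concept inside a single field in
proportion to its weight) can be sketched roughly as follows. This is plain
Java that only builds the text to be indexed; the class name
ConceptFieldBuilder, the step parameter, and the use of '_' to keep
multi-word concepts as one token are all assumptions for illustration, and
they only work if the analyzer used on the field leaves such tokens intact.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns weighted concepts into a repeated-token field value.
class ConceptFieldBuilder {

    // Emits each concept once per `step` points of weight (at least once),
    // so a concept's term frequency in the field grows with its weight.
    static String buildFieldText(Map<String, Integer> weights, int step) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            // Join multi-word concepts with '_' so each one stays a single token.
            String token = e.getKey().trim().toLowerCase().replaceAll("\\s+", "_");
            int copies = Math.max(1, e.getValue() / step);
            for (int i = 0; i < copies; i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(token);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("United States", 98);
        weights.put("Bush", 78);
        // The returned string would be the value of the single concept field.
        System.out.println(buildFieldText(weights, 10));
    }
}
```

A step larger than 1 keeps the field short (9 copies instead of 98 for a
weight of 98 with step 10). Note that repetition is unlikely to affect the
score linearly: Lucene's scoring typically dampens raw term frequency and
normalizes by field length, so the mapping from weight to rank would need
to be checked against real queries.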


> Regards,
> Joshua O'Madadhain
> Per
>   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.