lucene-java-user mailing list archives

From "patrick o'leary" <>
Subject Re: similarity function
Date Thu, 05 Mar 2009 20:20:56 GMT
Sounds like your most difficult part will be the question parser using POS.

This is kind of old school, but you could use something like the AliceBot
AIML library, where the subject terms can be extracted from the questions
and indexed.

Or, as Grant and others suggest, use OpenNLP (which rocks) or LingPipe
(the LingPipe license is a little bit of a pain) for entity extraction.

An interesting way to look at the data would be to construct three fields:
Original_Question, Question_base, and Subject.

Original_Question: Who is the president of the UN
Question_base: Who is the president of
Question_base: Who is
Subject: the president of the UN
Subject: the president
Subject: the UN
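
A rough sketch of how the three fields above could be produced, in plain
Java. The class name QuestionFields, the split() method, and the small
copula and preposition word lists are all made up for illustration; they
stand in for the POS tagging or entity extraction step, which a real system
would do with something like OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class QuestionFields {
    // Hypothetical word lists standing in for real POS tags.
    private static final Set<String> COPULAS =
            new HashSet<>(Arrays.asList("is", "are", "was", "were"));
    private static final Set<String> PREPOSITIONS =
            new HashSet<>(Arrays.asList("of", "in", "for", "at", "on"));

    // Joins tokens[from, to) with single spaces.
    private static String join(String[] tokens, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(tokens, from, to));
    }

    public static Map<String, List<String>> split(String question) {
        String cleaned = question.trim().replaceAll("[?.!]+$", "");
        String[] t = cleaned.split("\\s+");
        List<String> bases = new ArrayList<>();
        List<String> subjects = new ArrayList<>();

        // First copula that still has at least one token after it.
        int c = -1;
        for (int i = 0; i < t.length - 1; i++) {
            if (COPULAS.contains(t[i].toLowerCase())) { c = i; break; }
        }
        if (c >= 0) {
            // Last preposition with tokens on both sides of it inside the subject.
            int p = -1;
            for (int i = t.length - 2; i > c + 1; i--) {
                if (PREPOSITIONS.contains(t[i].toLowerCase())) { p = i; break; }
            }
            if (p > 0) {
                bases.add(join(t, 0, p + 1));        // "Who is the president of"
            }
            bases.add(join(t, 0, c + 1));            // "Who is"
            subjects.add(join(t, c + 1, t.length));  // "the president of the UN"
            if (p > 0) {
                subjects.add(join(t, c + 1, p));         // "the president"
                subjects.add(join(t, p + 1, t.length));  // "the UN"
            }
        }

        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("Original_Question", Arrays.asList(cleaned));
        fields.put("Question_base", bases);
        fields.put("Subject", subjects);
        return fields;
    }

    public static void main(String[] args) {
        // Prints one "field: value" line per value, in the layout used above.
        split("Who is the president of the UN").forEach(
            (name, values) -> values.forEach(v -> System.out.println(name + ": " + v)));
    }
}
```

Each value would then go into the index as its own instance of the field,
so a document can carry several Question_base and Subject values.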

Similarity can then be somewhat easier to calculate by comparing question
bases, subjects, etc.
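
To make the idea concrete, here is a minimal sketch that compares two
questions field by field. It is not Lucene's scoring: it swaps in a simple
Jaccard overlap of lowercased tokens per field and mixes the two scores with
a weight. The class name FieldSimilarity, the method names, and the
baseWeight parameter are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

public class FieldSimilarity {
    // Lowercased whitespace tokens of a field value, as a set.
    private static Set<String> tokens(String text) {
        Set<String> set = new HashSet<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                set.add(tok);
            }
        }
        return set;
    }

    // Jaccard overlap: |A intersect B| / |A union B|; two empty values count as identical.
    public static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) common.size() / union.size();
    }

    // Weighted mix of question-base similarity and subject similarity.
    // baseWeight is a hypothetical tuning knob between 0.0 and 1.0.
    public static double similarity(String baseA, String subjectA,
                                    String baseB, String subjectB,
                                    double baseWeight) {
        return baseWeight * jaccard(baseA, baseB)
                + (1.0 - baseWeight) * jaccard(subjectA, subjectB);
    }

    public static void main(String[] args) {
        // Same question base, partly overlapping subjects.
        System.out.println(similarity(
                "Who is", "the president of the UN",
                "Who is", "the president of France",
                0.5));
    }
}
```

In practice you would more likely let Lucene score queries against the
separate fields and boost them, but the sketch shows why splitting base and
subject makes the comparison easier to reason about.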


On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll <> wrote:

> Hi Seid,
> Do you have a reference for the article?  I've done some QA in my day, but
> don't recall reading that one.
> At any rate, I do think it is possible to do what you are after.  See
> below.
> On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
>> For my work, I have read an article stating that "Answer type can be
>> automatically constructed by Indexing Different Questions and Answer
>> types. Later, when an unseen question appears, answer type for this
>> question will be found with the help of 'similarity function'
>> computation"
>> so I am clear with the argument above. My problem is,
>> 1. how can I index individual questions and Answer types as is (not
>> tokenized)
> I'm not sure you want this, but when constructing your Field, just use the
> NOT_ANALYZED option.
>> 2. how can I calculate the similarity between indexed questions
>> and unseen questions (question of any type that can be asked later)
> In line with #1, I think you might be better off actually tokenizing the
> question as one field, and the answer type as a second field.  Then, you
> can let Lucene calculate similarity via its normal query mechanisms.  In
> this case, I would likely try experimenting with things like: exact match,
> phrase queries with slop, etc.  That way, not only can you match "Who is the
> president of UN" but you might also match on things that are a bit fuzzier.
> To do this, you might need to have several fields per document with
> variations.  I could also see using Lucene's payload mechanism as well.
> But, as Vasu said, you will likely need other parts too, like OpenNLP.
> HTH,
> Grant