lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "patrick o'leary" <pj...@pjaol.com>
Subject Re: similarity function
Date Thu, 05 Mar 2009 20:20:56 GMT
Sounds like your most difficult part will be the question parser using POS.

This is kind of old school but use something like the AliceBot AIML library
http://en.wikipedia.org/wiki/AIML

Where the subjective terms can be extracted from the questions, and indexed
separately.

Or as Grant and others suggest use OpenNLP (which rocks) or LingPipe
(LingPipe license is a little bit of a pain)
for entity extraction.

An interesting way to look at the data would be to construct 3 fields,
Original_Question, Question_base, Subject

Doc:
Original_Question: Who is the president of the UN
Question_base: Who is the president of
Question_base: Who is
Subject: the president of the UN
Subject: the president
Subject: the UN
/Doc

And similarity can be somewhat easier to calculate with similar question
bases, subjects, etc

P


On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Hi Seid,
>
> Do you have a reference for the article?  I've done some QA in my day, but
> don't recall reading that one.
>
> At any rate, I do think it is possible to do what you are after.  See
> below.
>
> On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
>
>  For my work, I have read an article stating that " Answer type can be
>> automatically constructed by Indexing Different Questions and Answer
>> types. Later, when an unseen question apears, answer type for this
>> question will be found with the help of 'similarity function'
>> computation"
>>
>> so I am clear with the arguement above. my problem is,
>> 1. how can I index individual questions and Answer types as is ( not
>> tokenized
>>
>
> I'm not sure you want this, but when constructing your Field, just use the
> NOT_ANALYZED option.
>
>
>> 2. how can I calculate the similarity between indexed questions and
>> and unseen questions (question of any type that can be asked latter)
>>
>
> In line with #1, I think you might be better off to actually tokenize the
> question as one one field, and the answer type as a second field.  Then, you
> can let Lucene calculate similarity via it's normal query mechanisms.  In
> this case, I would like try experimenting with things like: exact match,
> phrase queries with slop, etc.  That way, not only can you match "Who is the
> president of UN" but you might also match on things that are a bit fuzzier.
>  To do this, you might need to have several fields per document with
> variations.  I could also see using Lucene's payload mechanism as well.
>
> But, as Vasu said, you will likely need other parts too, like OpenNLP.
>
> HTH,
> Grant
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message