lucene-java-user mailing list archives

From David Villarejo <villare...@gmail.com>
Subject Re: Part of speech search with lucene
Date Wed, 04 Mar 2015 18:21:47 GMT
Hi Mike,

Your solution works! I've been trying it with PhraseQuery and it works
pretty well.

Thank you so much.

David.

2015-03-03 23:00 GMT+01:00 Michael Sokolov <msokolov@safaribooksonline.com>:

> I believe you can accomplish what you are describing with PhraseQuery.
> Note that it has
>
> public void add(Term term, int position)
>
> which lets you search for multiple terms at the same position.
>
> You should be able to encode different kinds of attributes using text
> tricks like the one I suggested, or with payloads, though I'm less clear
> about how to use payloads in queries.
>
> -Mike
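
To make the "multiple terms at the same position" idea concrete, here is a minimal plain-Java sketch. It is a toy in-memory model, not Lucene code: the class `StackedPhraseMatcher`, its `Clause` type and the `pos:`/`lemma:` token prefixes are made up for illustration, and only the `add(Term term, int position)` signature comes from the message above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of phrase matching with several required terms per position.
// Hypothetical names throughout; this only imitates the idea, it is not Lucene.
class StackedPhraseMatcher {
    // term -> positions at which that term occurs in a single document
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index every token stacked at one position, e.g. "brown", "pos:JJ", "lemma:brown".
    void addPosition(int position, String... stackedTokens) {
        for (String token : stackedTokens) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(position);
        }
    }

    // One clause of the phrase: a term required at a position relative to the phrase start.
    static final class Clause {
        final String term;
        final int relativePosition;

        Clause(String term, int relativePosition) {
            this.term = term;
            this.relativePosition = relativePosition;
        }
    }

    // Returns the sorted start positions at which every clause is satisfied.
    List<Integer> match(List<Clause> clauses) {
        List<Integer> hits = new ArrayList<>();
        if (clauses.isEmpty()) {
            return hits;
        }
        Clause first = clauses.get(0);
        Set<Integer> candidates = postings.getOrDefault(first.term, new HashSet<>());
        for (int pos : candidates) {
            int start = pos - first.relativePosition;
            boolean allPresent = true;
            for (Clause c : clauses) {
                Set<Integer> positions = postings.get(c.term);
                if (positions == null || !positions.contains(start + c.relativePosition)) {
                    allPresent = false;
                    break;
                }
            }
            if (allPresent) {
                hits.add(start);
            }
        }
        Collections.sort(hits);
        return hits;
    }
}
```

Two clauses that share a relative position (say `pos:JJ` and `lemma:quick`, both at 0) must both be present on the same token, which is the "adj whose lemma is quick" constraint asked about earlier in the thread.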
>
>
> On 03/03/2015 04:41 PM, David Villarejo wrote:
>
>> What you propose is good if you want to index only the POS of a token. But
>> I want to index some extra info as well, such as the "lemma" of a token, its
>> phonetic encoding, etc. Sorry, I was not general enough in my previous post.
>> Imagine you want to ask this:
>>
>> an adj whose lemma is "quick" followed by "brown" followed by a noun whose
>> phonetic encoding is "fots".
>>
>> So the main problem is that you cannot ask whether several "synonyms" exist
>> at the same position.
>>
>> Thank you Michael for your answer.
>>
>> 2015-03-03 20:52 GMT+01:00 Michael Sokolov <msokolov@safaribooksonline.com>:
>>
>>> What if you indexed every word with two synonyms: the plain unadorned
>>> word and a token formed by concatenating the pos and the word with some
>>> unusual separator character?
>>>
>>> For example, "the quick brown fox" would be:
>>>
>>> { the | article:the } { quick | adj:quick } { brown | adj:brown } { fox | noun:fox }
>>>
>>> with punctuation to suggest the token graph
>>>
>>> -Mike
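
A rough sketch of that expansion step, in plain Java rather than as a real Lucene TokenFilter. The class `PosTokenExpander` and its `Tok` type are hypothetical; the position increment of 0 for the second token is an assumption modelled on how Lucene stacks synonyms at one position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the "two synonyms per word" idea; not Lucene code.
class PosTokenExpander {
    // A token plus a position increment: 1 starts a new position,
    // 0 stacks the token on the same position as the previous one.
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // For each word, emit the plain word and then "pos:word" at the same position.
    static List<Tok> expand(String[] words, String[] posTags) {
        if (words.length != posTags.length) {
            throw new IllegalArgumentException("words and posTags must have the same length");
        }
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], 1));                    // plain word, new position
            out.add(new Tok(posTags[i] + ":" + words[i], 0)); // tagged form, same position
        }
        return out;
    }
}
```

Calling `expand` with the words of "the quick brown fox" and the tags article, adj, adj, noun yields the eight tokens of the graph above, two per position.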
>>>
>>>
>>> On 03/03/2015 01:21 PM, David Villarejo wrote:
>>>
>>>> After many Google searches I decided to post my problem here, hoping that
>>>> someone can help me. What I want to achieve is to perform queries like the
>>>> following (don't worry about the query format):
>>>>
>>>> q1: (adjective) "jumps" (preposition) // any adj followed by "jumps"
>>>> followed by any prep.
>>>> q2: (adjective:brown) "jumps" (preposition) // brown as adj. followed by
>>>> "jumps" followed by any prep.
>>>> q3: (adjective:brown) (verb:jumps) (preposition) // brown as adj. followed
>>>> by jumps as verb followed by any preposition.
>>>>
>>>> In a more general form, what I want is
>>>> (POS[:specific_word]) (POS[:specific_word]) (POS[:specific_word])
>>>>
>>>> For that, I have the text tagged as follows:
>>>>
>>>> the|[pos:DT][lemma:the] quick|[pos:JJ][lemma:quick]
>>>> brown|[pos:JJ][lemma:brown] fox|[pos:NN][lemma:fox]
>>>> jumps|[pos:NNS][lemma:jump] over|[pos:IN][lemma:over]
>>>> the|[pos:DT][lemma:the] lazy|[pos:JJ][lemma:lazy]
>>>> dog|[pos:NN][lemma:dog]
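
For reference, a minimal sketch of reading that tagged format into per-token attribute maps before indexing. This assumes the `word|[key:value][key:value]` layout exactly as shown above; the class `TaggedTextParser` and the `"word"` map key are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the tagged-text layout used in this thread.
class TaggedTextParser {
    // Matches one [key:value] attribute group.
    private static final Pattern ATTR = Pattern.compile("\\[([^:\\]]+):([^\\]]*)\\]");

    // Returns one map per token; the surface form is stored under the key "word".
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input produces a single empty chunk
            }
            int bar = chunk.indexOf('|');
            Map<String, String> token = new LinkedHashMap<>();
            token.put("word", bar < 0 ? chunk : chunk.substring(0, bar));
            if (bar >= 0) {
                Matcher m = ATTR.matcher(chunk.substring(bar + 1));
                while (m.find()) {
                    token.put(m.group(1), m.group(2));
                }
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

Each resulting map could then be turned into one stack of tokens at a single position, for example the word itself plus `pos:JJ` and `lemma:quick`.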
>>>>
>>>> The first thing I thought of was to index the extra info of each term as a
>>>> payload and then use PayloadNearQuery to access the payload of each
>>>> span. The problem is that PayloadNearQuery matches the terms first and
>>>> then accesses their payloads, so none of the 3 queries above will work
>>>> (correct me if I'm wrong).
>>>>
>>>> The second thing I thought of was to index the extra info as synonyms of
>>>> the term but, this way, the second query won't work, since I can't ask
>>>> whether the first term is an adj. and the specific word "brown"
>>>> simultaneously.
>>>>
>>>> Any way to address this problem, suggestions, etc. will be appreciated.
>>>>
>>>>
>>>> David.
>>>>
>>>>
>>>>  ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
