lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Custom Tokenizer/Analyzer
Date Thu, 20 Feb 2014 12:30:51 GMT
If you already know the set of phrases you need to detect then you can
use Lucene's SynonymFilter to spot them and insert a new token.
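
To make the idea concrete, here is a minimal plain-Java sketch of what such a filter does conceptually: it scans the word stream, looks for the longest known phrase at each position, and emits one token for it. It does not use Lucene's API. The class PhraseMerger and its methods are hypothetical names for illustration, and this version replaces the matched words, whereas a real synonym filter can also keep the original tokens alongside the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: merges known phrases into single tokens.
final class PhraseMerger {
    // Normalized (lower-cased, single-spaced) phrase -> token to emit for it.
    private final Map<String, String> phrases = new HashMap<>();
    private int maxWords = 1;

    void add(String phrase, String token) {
        String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        phrases.put(String.join(" ", words), token);
        maxWords = Math.max(maxWords, words.length);
    }

    List<String> merge(String text) {
        List<String> out = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return out;
        }
        String[] words = text.trim().split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            String token = null;
            // Try the longest candidate first so longer phrases win over shorter ones.
            for (int n = Math.min(maxWords, words.length - i); n >= 1; n--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + n))
                        .toLowerCase(Locale.ROOT);
                String hit = phrases.get(candidate);
                if (hit != null) {
                    matched = n;
                    token = hit;
                    break;
                }
            }
            if (matched > 0) {
                out.add(token);      // one token for the whole phrase
                i += matched;
            } else {
                out.add(words[i]);   // ordinary word, passed through unchanged
                i++;
            }
        }
        return out;
    }
}
```

With "International Business Machines" and "IBM" both registered to emit the same token, the inputs "International Business Machines logo" and "IBM logo" each come out as two tokens: the company name and "logo".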

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies <benson@basistech.com> wrote:
> It sounds like you've been asked to implement Named Entity Recognition.
> OpenNLP has some capability here. There are also, um, commercial
> alternatives.
>
>
> On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio <ye.perio@gmail.com> wrote:
>
>> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar <geetgangwar@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> > My requirement is that it should be able to match multiple words as
>> > one token. For example, when the user passes the string "International
>> > Business Machines logo" or "IBM logo", it should return "International
>> > Business Machines" as one token and "logo" as one token.
>>
>> This is an interesting problem. I suppose that if the user enters
>> "International Business Machines", possibly with some misspelling, you
>> want to find all documents containing "IBM" - and that if he enters
>> the string "IBM", you want to find documents which contain the string
>> "International Business Machines", or even only parts of it. So this
>> means you need some kind of map relating some acronyms with their
>> content parts. There really are two directions here: acronym to
>> content and content to acronym.
>>
>> One cannot find what an acronym means without some kind of acronym
>> dictionary. This means that whatever approach you intend to use, there
>> should be an external dictionary involved, which maps each acronym to
>> a list of possible phrases. After retrieving all phrases matching
>> the entered acronym, you'd inject each part of each phrase as a token
>> (removing possible duplicates between phrase parts). That's basically
>> it for the direction "acronym to content".
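
A minimal sketch of the "acronym to content" direction, assuming the dictionary is an in-memory map. The class AcronymExpander and the sample entries are hypothetical and only illustrate the lookup and de-duplication steps; a real dictionary would be loaded from an external source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: expands an acronym into the distinct words of its phrases.
final class AcronymExpander {
    // Upper-cased acronym -> phrases it may stand for.
    private final Map<String, List<String>> dictionary = new HashMap<>();

    void add(String acronym, String... phrases) {
        dictionary.computeIfAbsent(acronym.toUpperCase(Locale.ROOT), k -> new ArrayList<>())
                  .addAll(Arrays.asList(phrases));
    }

    // Returns every phrase part once, lower-cased, in first-seen order.
    List<String> expand(String acronym) {
        Set<String> parts = new LinkedHashSet<>();
        List<String> phrases =
                dictionary.getOrDefault(acronym.toUpperCase(Locale.ROOT), Collections.emptyList());
        for (String phrase : phrases) {
            for (String word : phrase.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    parts.add(word.toLowerCase(Locale.ROOT)); // the set drops duplicates
                }
            }
        }
        return new ArrayList<>(parts);
    }
}
```

If "IBM" is registered with two phrases that share the word "International", that word is injected only once; an acronym missing from the dictionary yields no tokens at all.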
>>
>> The direction "content to acronym" is trickier, I believe. One way is
>> to generate a second (reversed) map, mapping each content part of an
>> acronym to the list of acronyms containing that part. You'd simply
>> inject acronyms (and possibly other things) if one part of their
>> content is matched (or more than one part, if you want to increase
>> relevance). This could however require the definition of a specific
>> hashing mechanism, if you want to find approximate (distanced) keys
>> (e.g. "intenational", with the missing "r", would still find "IBM"). A
>> second way (more coupled to the concept of acronym, so less generic)
>> could be to consider that every word starting with a capital letter is
>> part of an acronym, buffering sequences of words starting with a
>> capital letter, and eventually injecting the resulting acronym, if
>> found in the acronym dictionary. This might not be safe, though - the
>> user may not have the discipline to capitalize the words that are part
>> of an acronym (or may even misspell the first letter), or concatenated
>> first letters could match an irrelevant acronym (many word sequences
>> can give the acronym "IBM").
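
Both "content to acronym" ideas can be sketched in a few lines of plain Java. The class AcronymFinder and its method names are hypothetical, the reversed map does exact matching only (no approximate keys), and the capital-letter heuristic is deliberately naive so that the weaknesses mentioned above are easy to reproduce.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating the two "content to acronym" approaches.
final class AcronymFinder {
    // Way 1: reversed map, lower-cased content part -> acronyms whose phrases contain it.
    static Map<String, Set<String>> reverseIndex(Map<String, List<String>> dictionary) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : dictionary.entrySet()) {
            for (String phrase : entry.getValue()) {
                for (String word : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word, k -> new TreeSet<>()).add(entry.getKey());
                    }
                }
            }
        }
        return index;
    }

    // Way 2: buffer the initials of consecutive capitalized words and
    // emit the result whenever it is a known acronym.
    static List<String> fromCapitals(String text, Set<String> knownAcronyms) {
        List<String> found = new ArrayList<>();
        StringBuilder initials = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                initials.append(word.charAt(0));
            } else {
                flush(initials, knownAcronyms, found); // a non-capitalized word ends the run
            }
        }
        flush(initials, knownAcronyms, found);         // run that reaches the end of the text
        return found;
    }

    private static void flush(StringBuilder initials, Set<String> known, List<String> found) {
        if (initials.length() >= 2 && known.contains(initials.toString())) {
            found.add(initials.toString());
        }
        initials.setLength(0);
    }
}
```

With "IBM" as the only known acronym, fromCapitals finds it in "the International Business Machines logo", misses it in the all-lowercase "international business machines logo", and wrongly reports it for "I Bought Milk today", which are exactly the failure modes described above.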
>>
>> I do not know whether there already exists some Lucene module which
>> processes acronyms, or if someone is working on one. It's definitely
>> worth a search though, because writing a good one from scratch could
>> mean a few days of work, or more.
>>
>> HTH.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
