lucy-user mailing list archives

From Aleksandar Radovanovic <Aleksan...@Radovanovic.com>
Subject Re: [lucy-user] New feature suggestion
Date Sun, 30 Dec 2012 19:26:00 GMT
On 12/30/12 7:12 PM, Peter Karman wrote:
> Aleksandar Radovanovic wrote on 12/30/12 5:21 AM:
>
>> Thank you Marvin, I tried what you have suggested! It works fine, but my
>> main problem still remains: how to find and index *predefined* phrases.
>> In your example this boils down to the implementation of
>> extract_chem_names($content).
>>
>> I was hoping to use some Lucy functionality for this - indexing the
>> whole text, searching the index for predefined phrases and indexing them
>> separately. But this does not work correctly for biomedical documents, in
>> which text often looks like a random sequence of weird characters and
>> strange, no-language words which Lucy simply skips or stems incorrectly.
>>
>> So, the core of my idea is to have something opposite to stopwords. A
>> list of phrases which will be indexed without stemmer - exactly as they
>> appear in the user supplied list. I was wondering why such a simple and
>> obvious feature was not implemented - or am I missing something?
>>
> You're missing something. Stopword filtering happens *after* tokenizing in the
> analysis chain; so too would your Goword filter. It's the tokenizing that's
> problematic.
>
> The problem isn't the lack of a GoWordFilter, it's the lack of a ChemTokenizer:
> how to tokenize a block of text that contains *both* chemical strings and
> narrative strings. It's like trying to apply an English stemmer to a text that
> contains both English and French. The problem is: how to apply the rules for one
> grammar against a text that contains mixed grammars that use the same alphabet.
> Writing a single regex is practically impossible.
>
> If you just wanted to pull out the chemical strings from your text, and ignore
> everything else, that would be a fairly straightforward task. If you wanted to
> ignore all the chemical strings, that too would be straightforward (that's what
> basically happens by default). But you seem to want to combine them. That's not
> simple or straightforward.
>
> Marvin's suggestion tries to address the complexity you're after. If what you're
> missing is an implementation of extract_chem_names(), that seems like a suitable
> exercise for you to undertake, since that requires domain-specific knowledge. I
> might start with something naive like:
>
>  my @chem_names = (
>      'NH4+/H+K+/NH4+(H+)',
>      '[Hg(CN)2]',
>      'Ca(.-)',
>  );
>
>  sub extract_chem_names {
>     my $text = shift;
>     my @matches;
>     for my $n (@chem_names) {
>        my $esc = quotemeta($n);
>        if ($text =~ m/$esc/) {
>            push @matches, $n;
>        }
>     }
>     return \@matches;
>  }
>
>
>

I see it clearly now. To express it in Lucy syntax, I would need some
expanded polyanalyzer:

my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new (
    dictionaries => [ $chemicals, $genes, $human_anatomy ],
    language => 'en',
);

Since such magic does not (yet :-) exist, I'll follow your advice.
Marvin, Peter, thank you so much for all your help!
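
A minimal sketch of that dictionary approach, for anyone reading the
archive later. It is a variation on Peter's extract_chem_names(), not
anything Lucy provides: the phrase list, the extract_phrases() name and
the sample text are all made up for illustration. Instead of testing each
phrase separately, it compiles the whole list into one alternation,
longest phrase first, so a long name wins over a shorter name that is a
prefix of it, and it also reports where each match starts.

```perl
use strict;
use warnings;

# Hypothetical dictionary; a real one would probably be loaded from a file.
my @phrases = ('NH4+', '[Hg(CN)2]', 'Ca(.-)', 'NH4+/H+K+/NH4+(H+)');

# Build one alternation, longest phrase first. Perl tries alternatives
# left to right, so the longest phrase that matches at a position wins.
# quotemeta() keeps +, (, [ and friends from acting as regex operators.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @phrases;
my $phrase_re = qr/($alternation)/;

# Return an arrayref of [phrase, start_offset] pairs, one for every
# non-overlapping match, in the order they occur in the text.
sub extract_phrases {
    my ($text) = @_;
    my @hits;
    while ($text =~ /$phrase_re/g) {
        push @hits, [ $1, $-[1] ];
    }
    return \@hits;
}

my $hits = extract_phrases(
    'Adding [Hg(CN)2] changed NH4+/H+K+/NH4+(H+) transport; NH4+ alone did not.'
);
printf "%-20s at offset %d\n", @$_ for @$hits;
```

The returned phrases could then go into their own field, indexed without
a stemmer, while the narrative text goes through the usual analyzer. With
a very large dictionary a single alternation may become slow, so treat
this as a starting point only.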

Regards, Alex

