lucene-java-user mailing list archives

From "Will Martin" <wmartin...@gmail.com>
Subject RE: A really hairy token graph case
Date Fri, 24 Oct 2014 23:10:08 GMT
Benson: I'm in danger of trying to remember CPL's German decompounder and how we used it.
That would be a very unreliable memory.

However, at the link below David and Rupert have a resoundingly informative discussion about
making something similar work for synonyms. It might be worth reading through the kb info captured there.

https://github.com/OpenSextant/SolrTextTagger/issues/10




-----Original Message-----
From: Benson Margulies [mailto:benson@basistech.com] 
Sent: Friday, October 24, 2014 5:54 PM
To: java-user@lucene.apache.org; Richard Barnes
Subject: Re: A really hairy token graph case

I don't think so ... Let me be specific:

First, consider the case of one 'analysis': an input token maps to a lemma and a sequence
of components.

So, we produce

     surface form
     lemma    PI 0
     comp1    PI 0
     comp2    PI 1
     .....

with PL (positionLength) set appropriately to cover the pieces; PI is positionIncrement. All the information is there.
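
A minimal sketch of the single-analysis case, written as plain Python data rather than Lucene code. The names Token, single_analysis and arcs are hypothetical, and the surface form's PI of 1 and the choice PL = number of components are assumptions, since the listing above does not spell them out. It only shows how PI/PL turn the emitted tokens into arcs between positions:

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # positionIncrement (PI)
    pos_len: int  # positionLength (PL)


def single_analysis(surface: str, lemma: str, comps: List[str]) -> List[Token]:
    """Emit surface form, lemma, then components for ONE analysis."""
    n = len(comps)
    # Assumption: the surface form advances the position by 1 and, like the
    # lemma, spans all n component positions.
    tokens = [Token(surface, 1, n), Token(lemma, 0, n)]
    for i, comp in enumerate(comps):
        # First component stacks on the same position; later ones advance.
        tokens.append(Token(comp, 0 if i == 0 else 1, 1))
    return tokens


def arcs(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Resolve PI/PL into (term, start_position, end_position) arcs."""
    pos = -1  # position before the first token
    out = []
    for t in tokens:
        pos += t.pos_inc
        out.append((t.term, pos, pos + t.pos_len))
    return out


for arc in arcs(single_analysis("Handschuhe", "Handschuh", ["Hand", "Schuh"])):
    print(arc)
```

The surface form and lemma both cover positions 0 to 2, while the components split that span at position 1, so nothing is lost in this case.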

Now, if we have another analysis, we want to 'rewind' position, and deliver another lemma
and another set of components, but, of course, we can't do that.

The best we could do is something like:

    surface form
    lemma1   PI 0
    lemma2   PI 0
    ....
    lemmaN   PI 0

    comp0-1  PI 0
    comp1-1  PI 0
    ....
    ....
    comp0-N
    compM-N

That is, group all the first-components, and all the second-components.

But now the bits and pieces of the compounds are interspersed. Maybe that's OK.
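
To make the "interspersed" concern concrete, here is a rough sketch under one reading of the grouping above: all first components share one position, all second components share the next. It is plain Python, not Lucene code; grouped_analyses, component_paths and the kind field are hypothetical names, and the example word is just a commonly cited ambiguous German compound. Because positions only move forward, the component arcs of different analyses end up sharing nodes, and the graph then admits paths that no single analysis produced:

```python
from typing import Dict, List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # positionIncrement (PI), never negative
    pos_len: int  # positionLength (PL)
    kind: str     # "surface", "lemma" or "comp" (illustrative only)


Analysis = Tuple[str, List[str]]  # (lemma, components)


def grouped_analyses(surface: str, analyses: List[Analysis]) -> List[Token]:
    """Emit all lemmas first, then components grouped by slot."""
    width = max(len(comps) for _, comps in analyses)
    tokens = [Token(surface, 1, width, "surface")]
    for lemma, comps in analyses:
        tokens.append(Token(lemma, 0, len(comps), "lemma"))
    for slot in range(width):
        first_in_slot = True
        for _, comps in analyses:
            if slot < len(comps):
                # Only the first token of each later slot advances the position.
                inc = 1 if (first_in_slot and slot > 0) else 0
                tokens.append(Token(comps[slot], inc, 1, "comp"))
                first_in_slot = False
    return tokens


def component_paths(tokens: List[Token]) -> List[Tuple[str, ...]]:
    """Enumerate every path through the graph that uses only component arcs."""
    pos, end = -1, 0
    by_start: Dict[int, List[str]] = {}
    for t in tokens:
        pos += t.pos_inc
        end = max(end, pos + t.pos_len)
        if t.kind == "comp":
            by_start.setdefault(pos, []).append(t.term)
    paths: List[Tuple[str, ...]] = [()]
    for node in range(end):
        paths = [p + (term,) for p in paths for term in by_start.get(node, [])]
    return paths


tokens = grouped_analyses(
    "Wachstube",
    [("Wachstube", ["Wach", "Stube"]), ("Wachstube", ["Wachs", "Tube"])],
)
for path in component_paths(tokens):
    print(path)
```

With two analyses of two components each, the graph yields four component paths, including the cross-combinations Wach+Tube and Wachs+Stube. Whether that is acceptable depends on how the positions are queried afterwards.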


On Fri, Oct 24, 2014 at 5:44 PM, Will Martin <wmartinusa@gmail.com> wrote:

> Hi Benson:
>
> This is the case with n-gramming (though you have a more complicated 
> start chooser than most, I imagine). Does that help get your ideas unblocked?
>
> Will
>
> -----Original Message-----
> From: Benson Margulies [mailto:bimargulies@gmail.com]
> Sent: Friday, October 24, 2014 4:43 PM
> To: java-user@lucene.apache.org
> Subject: A really hairy token graph case
>
> Consider a case where we have a token which can be subdivided in 
> several ways. This can happen in German. We'd like to represent this 
> with positionIncrement/positionLength, but it does not seem possible.
>
> Once the position has moved out from one set of 'subtokens', we see no 
> way to move it back for the second set of alternatives.
>
> Is this something that was considered?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
