lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cedric Ho" <cedric...@gmail.com>
Subject Re: Chinese Segmentation with Phase Query
Date Sun, 11 Nov 2007 02:45:12 GMT
Hi Uwe,

I believe this is the segmentation method used by CJKAnalyzer in Lucene.
The problem is with this Analyzer, many incorrect hits will be
returned during search.

In fact, for pure chinese document (not containing English words), I
believe there's no difference between using a CJKAnalyzer or a simple
CharTokenizer.

Cheers,
Cedric


On Nov 10, 2007 8:08 PM, Uwe Goetzke <uwe.goetzke@healy-hudson.com> wrote:
> Hi Cedric,
>
> Although I have no idea how to use the Chinese language
> but I went a different route to overcome language specific problems.
>
> Instead of using a language specific segmentation we use now the statistical segmentation
with bigrams
>
> e.g.
> Given a your sentence XYZABCDEF
> suppose the segmentation is
> XY YZ ZA AB BC CD DE EF
>
> A SpanNearQuery of (XY, BC, DE, EF) with distance of 10 should work to
> match this document.
>
> I am not sure if this works in your case because we index product information and their
descriptions which are not language friendly anyway because of the abbreviations)
>
> Regards
>
> Uwe Goetzke
>
>
> -----Ursprüngliche Nachricht-----
> Von: Cedric Ho [mailto:cedric.ho@gmail.com]
> Gesendet: Samstag, 10. November 2007 02:28
> An: java-user@lucene.apache.org
> Betreff: - Re: Chinese Segmentation with Phase Query
>
>
> On Nov 10, 2007 2:08 AM, Steven A Rowe <sarowe@syr.edu> wrote:
> > Hi Cedric,
> >
> > On 11/08/2007, Cedric Ho wrote:
> > > a sentence containing characters ABC, it may be segmented into AB, C or A,
BC.
> > [snip]
> > > In this cases we would like to index both segmentation into the index:
> > >
> > > AB offset (0,1) position 0            A offset (0,0) position 0
> > > C offset (2,2) position 1             BC offset (1,2) position 1
> > >
> > > Now the problem is, when someone search using a PhraseQuery (AC) it
> > > will find this line ABC because it match A (position 0) and C
> > > (position 1).
> > >
> > > Are there any ways to search for exact match using the offset
> > > information instead of the position information ?
> >
> > Since you are writing the tokenizer (the Lucene term for the module that performs
the segmentation), you yourself can substitute the beginning offset for the position.  But
I think that without the end offset, it won't get you what you want.
> >
> > For example, if your above example were indexed with beginning offsets as positions,
a phrase query for "AB, C" will fail to match -- even though it should match -- because the
segments' beginning offsets (0 and 2) are not contiguous.
> >
> > The new Payloads feature could provide the basis for storing beginning and ending
offsets required to determine contiguity when matching phrases, but you would have to write
matching and scoring for this representation, and that may not be the quickest route available
to you.
> >
> > Solution #1: Create multiple fields, one for each full alternative segmentation,
and then query against all of them.
> >
> > Solution #2: Store the alternative segmentations in the same field, but instead
of interleaving the segments' positions, as in your example, make the position ranges of the
alternatives non-contiguous.  Recasting your example:
> >
> >         lternative #1   Alternative #2  Alternative #3
> >         -------------   --------------  --------------
> >         AB position 0   A position 100  A position 200
> >         C position 1    BC position 101 B position 201
> >                                                         C position 202
> >
> > There is a problem with both of the above-described solutions: in my limited experience
with Chinese segmentation, substantially less than half the text has alternative segmentations.
 As a result, the segments on which all of alternatives agree (call them "uncontested segments")
will have higher term frequencies than those segments which differ among the alternatives
("contested segments").  This means that document scores will be influenced by the variable
density of the contested segments they contain.
> >
> > However, if you were to use my above-described Solution #1 along with a DisjunctionMaxQuery[1]
as a wrapper around one query per alternative segmentation field, the term frequency problem
would no longer be an issue.  From the API doc for DisjunctionMaxQuery:
> >
> >     A query that generates the union of documents produced by its
> >     subqueries, and that scores each document with the maximum
> >     score for that document as produced by any subquery, plus a
> >     tie breaking increment for any additional matching subqueries.
> >     This is useful when searching for a word in multiple fields
> >     with different boost factors (so that the fields cannot be
> >     combined equivalently into a single search field).  We want
> >     the primary score to be the one associated with the highest
> >     boost, not the sum of the field scores (as BooleanQuery would
> >     give).
> >
> > Unlike the use-case mentioned above, where each field will be boosted differently,
you probably don't have any information about the relative probability of the alternative
segmentations, so you'll want to use the same boost for each sub-query.
> >
> > Steve
> >
> > [1] <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html>
> >
> > --
> > Steve Rowe
> > Center for Natural Language Processing
> > http://www.cnlp.org/tech/lucene.asp
> >
>
> Hi Steve,
>
> We have actually thought about solution #1, and in our case, sorting
> by scoring is not a very important factor as well. However this would
> double the index size. A full index of our documents now would take >
> 80G already, and it's expected to grow much faster in the near future.
> As you've mentioned, the ambiguities in segmentation are rare indeed.
> So we are not very willing to double the index size just for that.
>
> For solution #2, it is a good solution indeed, but you see, our
> problem is much deeper in fact. We are currently try to replace our
> old search engine with Lucene. Which means whatever features the old
> engine have, we need to simulate it in Lucene. One such feature is
> similar to Lucene's SpanNearQuery, which, unfortunately, doesn't work
> with solution #2 if the search terms contains the term BC in
> Alternative #2.
>
> e.g.
> Given a sentence XYZABCDEF
> suppose the segmentation is
> XY position 0
> Z position 1           Alternative #2
> AB position 2         A position 102
> C position 3          BC position 103
> DEF position 4
>
> A SpanNearQuery of (XY, BC, DEF) with distance of 10 would fail to
> match this document.
>
> So it seems we may have to go the more difficult route in exploring
> the Payload feature that you mentioned. I've seen it being announced
> during the release of 2.2.0, but the API says "experimental" and there
> ain't any example about how can it be used.
>
> But thanks very much for your good suggestions.
>
> Cheers,
> Cedric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> -----------------------------------------------------------------------
> Healy Hudson GmbH - D-55252 Mainz Kastel
> Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
>
> Diese Email ist vertraulich. Wenn Sie nicht der beabsichtigte Empfänger sind, dürfen
Sie die Informationen nicht offen legen oder benutzen. Wenn Sie diese Email durch einen Fehler
bekommen haben, teilen Sie uns dies bitte umgehend mit, indem Sie diese Email an den Absender
zurückschicken. Bitte löschen Sie danach diese Email.
> This email is confidential. If you are not the intended recipient, you must not disclose
or use this information contained in it. If you have received this email in error please tell
us immediately by return email and delete the document.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
Mime
View raw message