From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Date: Sat, 12 Jun 2010 10:16:14 -0400 (EDT)
Subject: [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878268#action_12878268 ]

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. You had mentioned wanting to make a Thai syllabification routine - I was thinking that either you or I would do this.

OK, this makes sense.

bq. You're arguing either for in-tokenizer shingling or passing non-tokenized data out of the tokenizer in addition to the tokens.

Hmm. Or attributes that mark sentence boundaries, or bumped position increments for sentence boundaries (which would also prevent phrase searches across sentences), or maybe other ideas.

bq. Interesting paper. With syllable n-grams (in Tibetan anyway), you trade off (quadrupled) index size for word segmentation, but otherwise, these work equally well.

Careful: the way they did the measurement only tells us that neither approach is terrible, but I don't think it's clear yet that they are equal. Either way, the argument in the paper is for bigrams (n=2)... how is this a quadrupled index size? It's just like CJKTokenizer...

{quote}
I don't understand how sentence segmentation could help? One other possibility is to return everything from the tokenizer, marking the non-tokens with an appropriate type, similar to how the ICU tokenizer works. This has the unfortunate side effect of requiring post-tokenization filtering to discard non-tokens.
{quote}

Right, but it could be attributes or position increments for sentence boundaries too. Then you just wouldn't shingle across missing position increments, and phrase queries wouldn't match across sentence boundaries either.

In my opinion, the patch here already solves a lot of problems on its own, and I suggest we explore these ideas later (including Thai etc.) in a separate issue. With the patch as it is now, people can use the ThaiWordFilter. If they need support for the other languages, they have ICUTokenizer as a workaround. We could then think about how to do the more complex stuff in more general ways (sentence segmentation, conditional branching, etc.).
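[Editor's note: a minimal sketch of the bigram point above. This is hypothetical, language-agnostic pseudocode over pre-segmented syllables, not Lucene code; CJKTokenizer works analogously over characters. With n=2, a run of k syllables yields k-1 overlapping bigram terms, the same order of magnitude as the k unigram terms, which is why a quadrupled index size seems surprising.]

```python
def syllable_bigrams(syllables):
    """Emit overlapping bigrams (n=2) over pre-segmented syllables,
    analogous to how CJKTokenizer emits overlapping character bigrams."""
    if len(syllables) < 2:
        # Too short to form a bigram: emit the lone syllable as-is.
        return list(syllables)
    return [syllables[i] + syllables[i + 1] for i in range(len(syllables) - 1)]

# A run of 4 syllables produces 3 bigram terms -- the same order of
# magnitude as 4 unigram terms, not 4x as many.
print(syllable_bigrams(["ka", "gyu", "pa", "s"]))  # → ['kagyu', 'gyupa', 'pas']
```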
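[Editor's note: the "mark non-tokens with a type, filter afterwards" approach from the quote reduces to a trivial post-tokenization stage. A hypothetical sketch (the type label and token tuples are invented for illustration, not ICUTokenizer's actual API):]

```python
NON_TOKEN = "<NON_TOKEN>"  # hypothetical type label for non-token text

def drop_non_tokens(stream):
    """Post-tokenization filter: discard entries the tokenizer marked
    as non-tokens, keeping everything else."""
    return [(term, ttype) for term, ttype in stream if ttype != NON_TOKEN]

stream = [("hello", "<ALPHANUM>"), (",,", NON_TOKEN), ("world", "<ALPHANUM>")]
print(drop_non_tokens(stream))  # → [('hello', '<ALPHANUM>'), ('world', '<ALPHANUM>')]
```

The downside named in the quote is exactly this extra mandatory filter stage: every consumer of the tokenizer must remember to run it.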
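[Editor's note: the position-increment idea can be illustrated with a toy model (not Lucene code; the term/increment pairs are invented for illustration). A sentence boundary is encoded as a position increment greater than 1, leaving a gap in the absolute positions; since a phrase match requires strictly consecutive positions, phrases cannot span the gap, and a shingle filter could likewise decline to shingle across it.]

```python
def to_positions(tokens):
    """Resolve (term, positionIncrement) pairs to absolute positions,
    the way Lucene accumulates position increments."""
    pos, placed = -1, []
    for term, inc in tokens:
        pos += inc
        placed.append((term, pos))
    return placed

def phrase_match(tokens, phrase):
    """True iff the phrase terms occur at strictly consecutive positions."""
    placed = to_positions(tokens)
    for i in range(len(placed)):
        if all(i + j < len(placed)
               and placed[i + j][0] == phrase[j]
               and placed[i + j][1] == placed[i][1] + j
               for j in range(len(phrase))):
            return True
    return False

# "the end. New start" -- the sentence boundary becomes a posInc of 2 on
# "new", leaving a gap between "end" (pos 1) and "new" (pos 3).
tokens = [("the", 1), ("end", 1), ("new", 2), ("start", 1)]
print(phrase_match(tokens, ["new", "start"]))  # → True  (within one sentence)
print(phrase_match(tokens, ["end", "new"]))    # → False (across the boundary)
```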
In general I'd like to think that UAX#29 sentence segmentation, implemented nicely, would be a cool feature that could help with some of these problems, and maybe other problems too. Perhaps it could be re-used by highlighting etc. as well.

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as closely as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say:
> bq. This should be a good tokenizer for most languages.
> All the English/euro-centric stuff like the acronym/company/apostrophe handling can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org