lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Philippe Laflamme" <plafla...@konova.com>
Subject RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Mon, 17 Nov 2003 14:38:59 GMT
There is already an implementation in the Java API for sentence boundary
detection. The BreakIterator in the java.text package has this to say about
sentence splitting:

"Sentence boundary analysis allows selection with correct interpretation of
periods within numbers and abbreviations, and trailing punctuation marks
such as quotation marks and parentheses."
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM:
http://oss.software.ibm.com/icu/index.html
It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn more about it's efficiency.

Regards,
Phil

> -----Original Message-----
> From: Chong, Herb [mailto:HChong3@bloomberg.com]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> i have a program written in Icon that does basic sentence
> splitting. with about 5 heuristics and one small lookup table, i
> can get well over 90% accuracy doing sentence boundary detection
> on email. for well edited English text, like newswires, i can
> manage closer to 99%. this is all that is needed for
> significantly improving a search engine's performance when the
> query engine respects sentence boundaries. incidentally, the GATE
> Information Extraction framework cites some references that
> indicate that for named entity feature extraction, their system
> can exceed the ability of trained humans to detect and classify
> named entities if only one person does the detection.
> collaborating humans are still better, but no-one has the time in
> practical applications.
>
> you probably know, since you know about Markov chains, that
> within sentence term correlation, and hence the language model,
> is different than across sentences. linguists have known this for
> a very long time. it isn't hard to put this capability into a
> search engine, but it absolutely breaks down unless there is
> sentence boundary information stored for use at query time.
>
> Herb....
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> Well ... Sure, nothing can replace a human mind. But believe it or not,
> there are studies which show that even human experts can significantly
> differ in their opinions on what are key-phrases for a given text. So,
> the results are never clear cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and key-phrase
> detection can go long ways. For example, the application I mentioned,
> uses quite a few heuristic rules (+ Markov chains as a heavier
> ammunition :-), and it comes up with the following phrases for your
> email discussion (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the
> text... which means I have a practical and tangible benefit from NLP.
> QED ;-)
>
> Best regards,
> Andrzej
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message