lucene-java-user mailing list archives

From Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
Subject Re: tokenize into sentences/sentence splitter
Date Wed, 23 Sep 2015 19:26:00 GMT
Thanks Steve.

It probably also makes sense to extract the sentences and then store them.
But along with each sentence I also need to store its start/end offsets,
and I'm not sure how to do that without creating a separate index that
stores each sentence as a document. Basically, the field for sentences and
the field for terms should live in the same index.
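
Something along these lines is roughly what I am picturing (an untested sketch: the class
and field names are just placeholders, and the JDK BreakIterator stands in here for
whatever sentence splitter, e.g. OpenNLP, I end up using):

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class SentenceFieldsExample {

    // Builds one Lucene Document holding an indexed, tokenized "content" field
    // plus stored-only fields: each sentence and its start/end offsets.
    // BreakIterator is used for brevity; an OpenNLP sentence detector could be
    // dropped in instead.
    public static Document buildDocument(String text) {
        Document doc = new Document();
        doc.add(new TextField("content", text, Field.Store.NO)); // searched, not stored

        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Parallel stored fields: the i-th "sentence" value lines up with
            // the i-th "sentence_start"/"sentence_end" values.
            doc.add(new StoredField("sentence", text.substring(start, end)));
            doc.add(new StoredField("sentence_start", start));
            doc.add(new StoredField("sentence_end", end));
        }
        return doc;
    }
}

At retrieval time the stored values could presumably be read back with
doc.getFields("sentence") and friends, so each sentence lines up with its offsets by
position, all within the same index and the same document.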

Thanks



On 23/09/2015 19:08, Steve Rowe wrote:
> Hi Ziqi,
>
> Lucene has support for sentence chunking - see SegmentingTokenizerBase, implemented
> in ThaiTokenizer and HMMChineseTokenizer. There is an example in that class’s tests that
> creates tokens out of individual sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer.
>
> However, it sounds like you only need to store the sentences, not search against them,
> so I don’t think you need sentence *tokenization*.
>
> Why not simply use the JDK’s BreakIterator (or, as you say, OpenNLP) to do the sentence
> splitting and add the sentences to the doc as stored fields?
>
> Steve
> www.lucidworks.com
>
>> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zhang@sheffield.ac.uk> wrote:
>>
>> Thanks that is understood.
>>
>> My application is a bit special in that I need both an indexed field with standard
>> tokenization and an unindexed but stored field of sentences. Both must be present
>> for each document.
>>
>> I could possibly make do with PatternTokenizer, but that is, of course, less accurate
>> than, e.g., wrapping the OpenNLP sentence splitter in a Lucene Tokenizer.
>>
>>
>>
>> On 23/09/2015 16:23, Doug Turnbull wrote:
>>> Sentence recognition is usually an NLP problem, probably best handled
>>> outside of Solr. For example, you probably want to train and run a sentence
>>> recognition algorithm, inject a sentence delimiter, and then use that delimiter
>>> as the basis for tokenization.
>>>
>>> More info on sentence recognition
>>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>>
>>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zhang@sheffield.ac.uk>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I need a special kind of 'token' which is a sentence, so I need a
>>>> tokenizer that splits texts into sentences.
>>>>
>>>> I wonder if such an implementation, or something similar, already exists?
>>>>
>>>> If I have to implement it myself, I suppose I need to implement a subclass
>>>> of Tokenizer. Having looked at a few existing implementations, it does not
>>>> look very straightforward how to do it. A few pointers would be highly
>>>> appreciated.
>>>>
>>>> Many thanks
>>>>
>>>>
>>
>> -- 
>> Ziqi Zhang
>> Research Associate
>> Department of Computer Science
>> University of Sheffield
>>
>>
>


-- 
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

