lucene-solr-user mailing list archives

From Upayavira <...@odoko.co.uk>
Subject Re: simple tokenizer question
Date Sun, 08 Dec 2013 15:29:43 GMT
If you want to just split on whitespace, then the WhitespaceTokenizer
will do the job.
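
For example, a minimal field type that only splits on whitespace might
look like this in schema.xml (the field type name text_ws is just an
example, not anything from your schema):

  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace only; no further splitting on digits or punctuation -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>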

However, this will mean that these two tokens aren't the same, and won't
match each other:

cat
cat.

A simple regex filter could handle those cases by removing a comma or dot
at the end of a token, although there are other similar situations
(quotes, colons, etc.) that you may want to handle eventually.
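
As a rough sketch (again with an illustrative field type name), a
PatternReplaceFilterFactory after the whitespace tokenizer could strip
trailing dots and commas from each token; the pattern below covers only
those two characters:

  <fieldType name="text_ws_clean" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- remove one or more trailing '.' or ',' characters from each token -->
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="[.,]+$" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

With something like this, "cat." and "cat" would both be indexed as "cat",
while tokens such as 12AA and 9(1)(vii) would be left intact.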

Upayavira

On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> Thanks for your email.
> 
> Great, I will look at the WordDelimiterFilterFactory. Just to be clear,
> I DON'T want any tokenizing on digits, special characters, punctuation,
> etc.; the only delimiting I want is on whitespace.
> 
> All I want for my first version is NO removal of punctuation/special
> characters at indexing time or at search time, i.e. index as-is and
> search as-is (like a simple SQL database?). I assumed this would be a
> trivial case with Solr and am not sure what I am missing here.
> 
> thanks
> Vulcanoid
> 
> 
> 
> On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <uv@odoko.co.uk> wrote:
> 
> > Have you tried a WhitespaceTokenizerFactory followed by the
> > WordDelimiterFilterFactory? The latter is perhaps more configurable in
> > what it does. Alternatively, you could use a regex filter (such as
> > PatternReplaceFilterFactory) to remove extraneous punctuation that the
> > whitespace tokenizer does not strip.
> >
> > Upayavira
> >
> > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > Hi,
> > >
> > > I am new to Solr, and I guess this is a basic tokenizer question, so
> > > please bear with me.
> > >
> > > I am trying to use Solr to index a few (Indian) legal judgments in text
> > > form and search against them. One of the key points with these documents
> > > is that the sections/provisions of law usually have punctuation/special
> > > characters in them. For example, search queries will TYPICALLY be
> > > "section 12AA", "section 80-IA", or "section 9(1)(vii)", and the text of
> > > the judgments themselves contains this sort of text, with section
> > > references all over the place.
> > >
> > > Now, using a default schema setup with the StandardTokenizer, which
> > > seems to delimit on whitespace AND punctuation, I get really bad
> > > results: it looks like 12AA is split, and results that merely contain
> > > 12 and AA turn up. It becomes worse with 9(1)(vii), with results
> > > containing 9 and 1 etc. being returned.
> > >
> > > What is the best solution here? I really just want to index the
> > > documents as-is and to do whitespace tokenizing at search time, nothing
> > > more.
> > >
> > > So in other words:
> > > a) I would like the text documents to be indexed as-is, with, say, 12AA
> > > and 9(1)(vii) stored exactly as they appear in the document.
> > > b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> > > proper full matches on them without any splitting up/munging etc.
> > >
> > > Any suggestions are appreciated.  Thank you for your time.
> > >
> > > Thanks
> > > Vulcanoid
> >
