lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Fri, 05 Jun 2009 14:06:07 GMT
KK, as for your first issue (WordDelimiterFilter being package-private
rather than public), one option is to make a copy of the code and change
the class declaration to public.
The other option is to put your entire analyzer in the
'org.apache.solr.analysis' package so that you can access the class.

For the second issue: yes, you need to supply some options to it. The
default options Solr applies to type 'text' seemed to work well for me
with Indic text:

{splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
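
To make those flags concrete, here is a toy model in plain Java of what they ask the filter to do, going by the option descriptions on the Solr wiki. It is a simplified sketch, not Solr's WordDelimiterFilter: the class WordDelimiterSketch and its methods are made up for illustration, it ignores token positions, offsets and protected words, and the order of its output is arbitrary. It assumes that combining marks count as word characters (so Devanagari vowel signs do not split a word) and that anything that is not a letter, digit or mark, including ZWJ/ZWNJ, acts as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the options above; it is NOT Solr's WordDelimiterFilter. */
public class WordDelimiterSketch {

    /** Letters, digits and combining marks stay inside a part; everything else delimits. */
    public static boolean isWordChar(char c) {
        int t = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || t == Character.NON_SPACING_MARK
                || t == Character.COMBINING_SPACING_MARK
                || t == Character.ENCLOSING_MARK;
    }

    /** Models generateWordParts=1, generateNumberParts=1, splitOnCaseChange=1. */
    public static List<String> parts(String token) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!isWordChar(c)) {              // punctuation, ZWJ/ZWNJ, ...
                flush(cur, out);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isDigit(prev) != Character.isDigit(c);
                if (caseChange || typeChange) flush(cur, out);
            }
            cur.append(c);
        }
        flush(cur, out);
        return out;
    }

    /** Adds catenateWords=1 and catenateNumbers=1 (catenateAll=0): joins each run of same-type parts. */
    public static List<String> analyze(String token) {
        List<String> parts = parts(token);
        List<String> out = new ArrayList<>(parts);
        int start = 0;
        for (int i = 1; i <= parts.size(); i++) {
            boolean runEnds = i == parts.size()
                    || isNumber(parts.get(i)) != isNumber(parts.get(start));
            if (runEnds) {
                if (i - start > 1) out.add(String.join("", parts.subList(start, i)));
                start = i;
            }
        }
        return out;
    }

    static boolean isNumber(String part) {
        return Character.isDigit(part.charAt(0));
    }

    static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(analyze("PowerShot"));
        System.out.println(analyze("wi-fi-4000"));
        // A Hindi word written with Unicode escapes; it should come back as one part.
        System.out.println(analyze("\u0939\u093F\u0928\u094D\u0926\u0940"));
    }
}
```

In this sketch "PowerShot" yields Power, Shot and PowerShot, "wi-fi-4000" yields wi, fi, 4000 and wifi, and the Hindi token passes through as a single part. In the real chain, LowerCaseFilter, StopFilter and PorterStemFilter would then run on whatever the delimiter filter emits.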

On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.software@gmail.com> wrote:
>
> Thanks Robert. There is one problem though. I'm able to plug in the word
> delimiter filter from the solr-nightly jar file, but when I tried to do
> something like,
>  TokenStream ts = new WhitespaceTokenizer(reader);
>   ts = new WordDelimiterFilter(ts);
>   ts = new PorterStemFilter(ts);
>   ...rest as in the last mail...
>
> It gave me an error saying that
>
> org.apache.solr.analysis.WordDelimiterFilter is not public in
> org.apache.solr.analysis; cannot be accessed from outside package
> import org.apache.solr.analysis.WordDelimiterFilter;
>                               ^
> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> symbol  : class WordDelimiterFilter
> location: class solrSearch.IndicAnalyzer
>    ts = new WordDelimiterFilter(ts);
>             ^
> 2 errors
>
> Then I looked at the code for WordDelimiterFilter in the solr-nightly source
> and found that there are many deprecated constructors, and they require a lot
> of parameters along with the TokenStream. I went through the Solr wiki entry
> for WordDelimiterFilterFactory here,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> and saw that there, too, we have to specify the parameters,
> and that they differ between indexing and querying.
> I'm kind of stuck here: how do I make use of WordDelimiterFilter in my
> custom analyzer? I have to use it anyway.
> In my code I have to use WordDelimiterFilter and not
> WordDelimiterFilterFactory, right? I don't know what the other
> one is for. Anyway, can you guide me in getting rid of the above error? And
> yes, I'll change the order of applying the filters as you said.
>
> Thanks,
> KK.
>
>
>
>
>
>
>
> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > KK, you got the right idea.
> >
> > though I think you might want to change the order, move the stopfilter
> > before the porter stem filter... otherwise it might not work correctly.
> >
> > On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com> wrote:
> >
> > > Thanks Robert. This is exactly what I did and it's working, but the
> > > delimiter is missing. I'm going to add that from solr-nightly.jar.
> > >
> > > /**
> > >  * Analyzer for Indian language.
> > >  */
> > > public class IndicAnalyzer extends Analyzer {
> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> > >    ts = new PorterStemFilter(ts);
> > >    ts = new LowerCaseFilter(ts);
> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >    return ts;
> > >  }
> > > }
> > >
> > > It's able to do stemming/case-folding and supports search for both English
> > > and Indic texts. Let me try out the delimiter. Will update you on that.
> > >
> > > Thanks a lot.
> > > KK
> > >
> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > > I think you are on the right track... once you build your analyzer, put
> > > > it in your classpath and play around with it in Luke and see if it does
> > > > what you want.
> > > >
> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.software@gmail.com> wrote:
> > > >
> > > > > Hi Robert,
> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > > > >
> > > > > public class ThaiAnalyzer extends Analyzer {
> > > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > > >    TokenStream ts = new StandardTokenizer(reader);
> > > > >    ts = new StandardFilter(ts);
> > > > >    ts = new ThaiWordFilter(ts);
> > > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > > > >    return ts;
> > > > >  }
> > > > > }
> > > > >
> > > > > Now as you said, I have to use WhitespaceTokenizer with
> > > > > WordDelimiterFilter [Solr nightly jar], stop word removal, the Porter
> > > > > stemmer etc., so it is something like this,
> > > > > public class IndicAnalyzer extends Analyzer {
> > > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > > >   TokenStream ts = new WhitespaceTokenizer(reader);
> > > > >   ts = new WordDelimiterFilter(ts);
> > > > >   ts = new LowerCaseFilter(ts);
> > > > >   // english stop filter, is this the default one?
> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > > > >   ts = new PorterStemFilter(ts);
> > > > >   return ts;
> > > > >  }
> > > > > }
> > > > >
> > > > > Does this sound OK? I think it will do the job... let me try it out.
> > > > > I don't need a custom filter as per my requirement, at least not for
> > > > > these basic things I'm doing? I think so...
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > >
> > > > > > KK, well, you can always get some good examples from the Lucene
> > > > > > contrib codebase.
> > > > > > For example, look at the DutchAnalyzer, especially:
> > > > > >
> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > > > >
> > > > > > See how it combines a specified tokenizer with various filters? This
> > > > > > is what you want to do, except of course you want to use a different
> > > > > > tokenizer and filters.
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Muir.
> > > > > > > Thanks for letting me know that I don't need language identifiers.
> > > > > > > I'll have a look and will try to write the analyzer. For my case I
> > > > > > > think it won't be that difficult.
> > > > > > > BTW, can you point me to some sample code/tutorials on writing
> > > > > > > custom analyzers? I could not find anything in LIA 2nd Edn. Is
> > > > > > > something there? Do let me know.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > KK.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > >
> > > > > > > > KK, for your case, you don't really need to go to the effort of
> > > > > > > > detecting whether fragments are English or not,
> > > > > > > > because the English stemmers in Lucene will not modify your Indic
> > > > > > > > text, and neither will the LowerCaseFilter.
> > > > > > > >
> > > > > > > > What you want to do is create a custom analyzer that works like
> > > > > > > > this:
> > > > > > > >
> > > > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly
> > > > > > > > jar], LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Robert
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thank you all.
> > > > > > > > > To be frank, I was using Solr in the beginning, half a month
> > > > > > > > > ago. The problem [rather, bug] with Solr was the creation of a
> > > > > > > > > new index on the fly. They have a RESTful method for this, but
> > > > > > > > > it was not working. If I remember properly, one of the Solr
> > > > > > > > > committers, "Noble Paul" [I don't know his real name], was
> > > > > > > > > trying to help me. I tried many nightly builds, and spending a
> > > > > > > > > couple of days stuck at that made me think of Lucene, so I
> > > > > > > > > switched to it.
> > > > > > > > > Now, after working with Lucene, which gives you full control of
> > > > > > > > > everything, I don't want to switch to Solr. [LOL, to me
> > > > > > > > > Solr:Lucene is similar to Window$:Linux, it's my view only,
> > > > > > > > > though]. Coming back to the point: as Uwe mentioned, we can do
> > > > > > > > > in Lucene whatever is available in Solr; Solr is based on
> > > > > > > > > Lucene only, right?
> > > > > > > > > I request Uwe to give me some more ideas on using the analyzers
> > > > > > > > > from Solr that will do the job for me, handling a mix of both
> > > > > > > > > English and non-English content.
> > > > > > > > > Muir, can you give me a more detailed description of how to use
> > > > > > > > > the WordDelimiterFilter to do my job?
> > > > > > > > > On a side note, I was thinking of writing a simple analyzer
> > > > > > > > > that will do the following,
> > > > > > > > > #. If the webpage fragment is non-English [for me it's some
> > > > > > > > > Indian language] then index it as such, with no stemming/stop
> > > > > > > > > word removal to begin with. As far as I know it's in UCN
> > > > > > > > > unicode, something like \u0021\u0012\u34ae\u0031 [just a
> > > > > > > > > sample]
> > > > > > > > > #. If the fragment is English then apply the standard analyzing
> > > > > > > > > process for English content. I've not thought about querying in
> > > > > > > > > the same way as of now, i.e. with a mix of non-English and
> > > > > > > > > English words.
> > > > > > > > > Now to get all this,
> > > > > > > > >  #1. I need some way to know whether the content is English or
> > > > > > > > > not. If it is not English, just add the tokens to the document.
> > > > > > > > > Do we really need language identifiers, given that I don't have
> > > > > > > > > any other content that uses the same script as English, other
> > > > > > > > > than those \u1234 things for my Indian language content? Any
> > > > > > > > > smart hack/trick for this?
> > > > > > > > >  #2. If it is English, apply all the normal processing and add
> > > > > > > > > the stemmed token to the document.
> > > > > > > > > For all this I was thinking of iterating over each word of the
> > > > > > > > > web page and applying the above procedure, and finally adding
> > > > > > > > > the newly created document to the index.
> > > > > > > > >
> > > > > > > > > I would like someone to guide me in this direction. I'm pretty
> > > > > > > > > sure people must have done something similar earlier; I request
> > > > > > > > > them to guide me or point me to some tutorials for this.
> > > > > > > > > Otherwise, help me out with writing a custom analyzer, but only
> > > > > > > > > if that's not going to be too complex. LOL, I'm a new user of
> > > > > > > > > Lucene and know only the basics of Java coding.
> > > > > > > > > Thank you very much.
> > > > > > > > >
> > > > > > > > > --KK.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Yes, this is true. For starters, KK, it might be good to
> > > > > > > > > > start up Solr and look at
> > > > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > > > >
> > > > > > > > > > If you want to stick with Lucene, the WordDelimiterFilter is
> > > > > > > > > > the piece you will want for your text, mainly for punctuation
> > > > > > > > > > but also for format characters such as ZWJ/ZWNJ.
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > > > > >
> > > > > > > > > > > You can also re-use the Solr analyzers, as far as I found
> > > > > > > > > > > out. There is an issue in JIRA/discussion on java-dev to
> > > > > > > > > > > merge them.
> > > > > > > > > > >
> > > > > > > > > > > -----
> > > > > > > > > > > Uwe Schindler
> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > > > http://www.thetaphi.de
> > > > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > > > Subject: Re: How to support stemming and case folding for
> > > > > > > > > > > > english content mixed with non-english content?
> > > > > > > > > > > >
> > > > > > > > > > > > KK, ok, so you only really want to stem the English. This
> > > > > > > > > > > > is good.
> > > > > > > > > > > >
> > > > > > > > > > > > Is it possible for you to consider using Solr? Solr's
> > > > > > > > > > > > default analyzer for type 'text' will be good for your
> > > > > > > > > > > > case. It will do the following:
> > > > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > > > 2. handle both Indian language and English punctuation
> > > > > > > > > > > > 3. lowercase the English.
> > > > > > > > > > > > 4. stem the English.
> > > > > > > > > > > >
> > > > > > > > > > > > Try a nightly build,
> > > > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > > > I'm indexing Indian language web pages which have a
> > > > > > > > > > > > > decent amount of English content mixed in. For the time
> > > > > > > > > > > > > being I'm not going to use any stemmers for the Indian
> > > > > > > > > > > > > languages, as we don't have standard stemmers for them.
> > > > > > > > > > > > > So what I want to do is this:
> > > > > > > > > > > > > Say I have a web page with Hindi content and 5% English
> > > > > > > > > > > > > content. For the Hindi I want to use the basic
> > > > > > > > > > > > > whitespace analyzer, as we don't have stemmers for it,
> > > > > > > > > > > > > as I mentioned earlier, and wherever English appears I
> > > > > > > > > > > > > want it to be tokenized, stemmed etc. [the standard
> > > > > > > > > > > > > process used for English content]. As of now I'm using
> > > > > > > > > > > > > the whitespace analyzer for the full content, which
> > > > > > > > > > > > > does not support case folding, stemming etc. So if
> > > > > > > > > > > > > there is an English word, say "Detection", indexed as
> > > > > > > > > > > > > such, then searching for detection or detect gives no
> > > > > > > > > > > > > results. That is the expected behavior, but I want
> > > > > > > > > > > > > this kind of query to give results.
> > > > > > > > > > > > > I hope I made it clear. Let me know any ideas on doing
> > > > > > > > > > > > > this. And one more thing: I'm storing the full webpage
> > > > > > > > > > > > > content under a single field. I hope this will not make
> > > > > > > > > > > > > any difference, right?
> > > > > > > > > > > > > It seems I have to use language identifiers, but do we
> > > > > > > > > > > > > really need that? We only have non-English content
> > > > > > > > > > > > > mixed with English [and not French or Russian etc.].
> > > > > > > > > > > > >
> > > > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > > > > > > > > > > thoughts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > KK.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > KK, is all of your Latin script text actually English?
> > > > > > > > > > > > > > Is there stuff like German or French mixed in?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And for your non-English content (your examples have
> > > > > > > > > > > > > > been Indian writing systems), is it generally true
> > > > > > > > > > > > > > that if you had Devanagari, you can assume it's
> > > > > > > > > > > > > > Hindi? Or is there stuff like Marathi mixed in?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The reason I ask is that to invoke the right stemmers
> > > > > > > > > > > > > > you really need some language detection, but perhaps
> > > > > > > > > > > > > > in your case you can cheat and detect this based on
> > > > > > > > > > > > > > scripts...
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Robert
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > I'm indexing some non-English content, but the page
> > > > > > > > > > > > > > > also contains English content. As of now I'm using
> > > > > > > > > > > > > > > WhitespaceAnalyzer for all content, and I'm storing
> > > > > > > > > > > > > > > the full webpage content under a single field. Now
> > > > > > > > > > > > > > > we need to support case folding and stemming for
> > > > > > > > > > > > > > > the English content intermingled with the
> > > > > > > > > > > > > > > non-English content. I must mention that we don't
> > > > > > > > > > > > > > > have stemming and case folding for the non-English
> > > > > > > > > > > > > > > content. I'm stuck with this. Could someone let me
> > > > > > > > > > > > > > > know how to proceed with fixing this issue?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > KK.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> > > > > java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > For additional commands, e-mail:
> > > > > > java-user-help@lucene.apache.org
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcmuir@gmail.com
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >



--
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
