lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Fri, 05 Jun 2009 13:12:53 GMT
Thanks Robert. There is one problem though: I'm unable to plug in the
WordDelimiterFilter from the solr-nightly jar file. When I try to do something
like,
 TokenStream ts = new WhitespaceTokenizer(reader);
 ts = new WordDelimiterFilter(ts);
 ts = new PorterStemFilter(ts);
   ...rest as in the last mail...

it fails to compile with the error:

org.apache.solr.analysis.WordDelimiterFilter is not public in
org.apache.solr.analysis; cannot be accessed from outside package
import org.apache.solr.analysis.WordDelimiterFilter;
                               ^
solrSearch/IndicAnalyzer.java:38: cannot find symbol
symbol  : class WordDelimiterFilter
location: class solrSearch.IndicAnalyzer
    ts = new WordDelimiterFilter(ts);
             ^
2 errors

Then I tried to read the WordDelimiterFilter code in the solr-nightly source
and found that there are many deprecated constructors, though they all require
a lot of parameters along with the TokenStream. I went through the Solr wiki
for WordDelimiterFilterFactory here,
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
and there too it's specified that we have to supply the parameters, which
differ between indexing and querying.
I'm kind of stuck here: how do I make use of WordDelimiterFilter in my custom
analyzer? I have to use it anyway.
In my code I have to use WordDelimiterFilter and not
WordDelimiterFilterFactory, right? I don't know what the other one is for.
Anyway, can you guide me in getting rid of the above error? And yes, I'll
change the order of applying the filters as you said.

Thanks,
KK.
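Since the thread is about getting the filter chain right, here is a dependency-free sketch of how such a chain composes. This is toy code, not Lucene's or Solr's API: the class name, the stop list, and the crude suffix-stripping "stemmer" are all hypothetical stand-ins, shown only to illustrate the tokenize, lowercase, stop-filter, stem order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration (NOT Lucene's API) of how an analyzer chains its stages.
// Stop-word removal runs before stemming, per the advice in this thread.
class ToyAnalyzer {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "was", "a", "an", "of"));

    // Crude suffix stripper standing in for PorterStemFilter.
    static String stem(String token) {
        if (token.endsWith("ion")) return token.substring(0, token.length() - 3);
        if (token.endsWith("ed"))  return token.substring(0, token.length() - 2);
        if (token.endsWith("s"))   return token.substring(0, token.length() - 1);
        return token;
    }

    // whitespace-tokenize -> lowercase -> stop filter -> stem
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String lower = token.toLowerCase();       // LowerCaseFilter stage
            if (STOP_WORDS.contains(lower)) continue; // StopFilter stage, before stemming
            out.add(stem(lower));                     // stemming stage, last
        }
        return out;
    }
}
```

For example, analyze("The Detection was detected") yields [detect, detect]: the stop words are dropped and both remaining tokens stem to the same form, which is the behavior being asked for with "Detection"/"detect" queries.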







On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com> wrote:

> KK, you got the right idea.
>
> Though I think you might want to change the order: move the StopFilter
> before the Porter stem filter... otherwise it might not work correctly.
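The ordering point can be seen with a toy example: a Porter-like rule stems "this" to "thi", which an English stop list no longer matches if stemming runs first. This is illustrative code only (hypothetical names, a one-rule stand-in stemmer), not Lucene's API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy, non-Lucene sketch of why StopFilter should run before the stemmer.
class StopOrderDemo {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("this", "the"));

    // Stands in for the Porter s-removal rule ("this" -> "thi").
    static String stem(String t) {
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // Stem first, then stop-filter: "this" becomes "thi" and slips past
    // the stop list, so a junk term gets indexed.
    static String stemThenStop(String t) {
        String s = stem(t);
        return STOP.contains(s) ? null : s;
    }

    // Stop-filter first, then stem: "this" is dropped as intended.
    static String stopThenStem(String t) {
        return STOP.contains(t) ? null : stem(t);
    }
}
```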
>
> On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com> wrote:
>
> > Thanks Robert. This is exactly what I did and it's working, but the
> > delimiter is missing; I'm going to add that from solr-nightly.jar
> >
> > /**
> >  * Analyzer for Indian language.
> >  */
> > public class IndicAnalyzer extends Analyzer {
> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >     TokenStream ts = new WhitespaceTokenizer(reader);
> >    ts = new PorterStemFilter(ts);
> >    ts = new LowerCaseFilter(ts);
> >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >    return ts;
> >  }
> > }
> >
> > It's able to do stemming/case folding and supports search for both English
> > and Indic text. Let me try out the delimiter. Will update you on that.
> >
> > Thanks a lot.
> > KK
> >
> > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > I think you are on the right track... once you build your analyzer, put
> > > it in your classpath and play around with it in Luke and see if it does
> > > what you want.
> > >
> > > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.software@gmail.com> wrote:
> > >
> > > > Hi Robert,
> > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > > >
> > > > public class ThaiAnalyzer extends Analyzer {
> > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > >      TokenStream ts = new StandardTokenizer(reader);
> > > >    ts = new StandardFilter(ts);
> > > >    ts = new ThaiWordFilter(ts);
> > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > > >    return ts;
> > > >  }
> > > > }
> > > >
> > > > Now as you said, I have to use WhitespaceTokenizer with
> > > > WordDelimiterFilter [solr-nightly.jar], stop-word removal, Porter
> > > > stemmer, etc., so it is something like this:
> > > > public class IndicAnalyzer extends Analyzer {
> > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > >   TokenStream ts = new WhitespaceTokenizer(reader);
> > > >   ts = new WordDelimiterFilter(ts);
> > > >   ts = new LowerCaseFilter(ts);
> > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS); // english stop filter, is this the default one?
> > > >   ts = new PorterStemFilter(ts);
> > > >   return ts;
> > > >  }
> > > > }
> > > >
> > > > Does this sound OK? I think it will do the job... let me try it out.
> > > > I don't need a custom filter for my requirements, at least not for
> > > > these basic things I'm doing. I think so...
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > >
> > > > > KK, well, you can always get some good examples from the Lucene
> > > > > contrib codebase.
> > > > > For example, look at the DutchAnalyzer, especially:
> > > > >
> > > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > > >
> > > > > See how it combines a specified tokenizer with various filters? This
> > > > > is what you want to do, except of course you want to use a different
> > > > > tokenizer and filters.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.software@gmail.com>
> > wrote:
> > > > >
> > > > > > Thanks Muir.
> > > > > > Thanks for letting me know that I don't need language identifiers.
> > > > > > I'll have a look and will try to write the analyzer. For my case I
> > > > > > think it won't be that difficult.
> > > > > > BTW, can you point me to some sample code/tutorials on writing
> > > > > > custom analyzers? I could not find anything in LIA 2nd Edn. Is
> > > > > > something there? Do let me know.
> > > > > >
> > > > > > Thanks,
> > > > > > KK.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > >
> > > > > > > KK, for your case, you don't really need to go to the effort of
> > > > > > > detecting whether fragments are English or not, because the
> > > > > > > English stemmers in Lucene will not modify your Indic text, and
> > > > > > > neither will the LowerCaseFilter.
> > > > > > >
> > > > > > > What you want to do is create a custom analyzer that works like
> > > > > > > this: WhitespaceTokenizer with WordDelimiterFilter [from Solr
> > > > > > > nightly jar], LowerCaseFilter, StopFilter, and PorterStemFilter.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Robert
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thank you all.
> > > > > > > > To be frank, I was using Solr in the beginning, half a month
> > > > > > > > ago. The problem [rather, bug] with Solr was creation of a new
> > > > > > > > index on the fly. Though they have a RESTful method for the
> > > > > > > > same, it was not working. If I remember properly, one of the
> > > > > > > > Solr committers, "Noble Paul" [I don't know his real name], was
> > > > > > > > trying to help me. I tried many nightly builds, and spending a
> > > > > > > > couple of days stuck at that made me think of Lucene, and I
> > > > > > > > switched to it. Now, after working with Lucene, which gives you
> > > > > > > > full control of everything, I don't want to switch back to
> > > > > > > > Solr. [LOL, to me Solr:Lucene is similar to Window$:Linux; it's
> > > > > > > > my view only, though.] Coming back to the point: as Uwe
> > > > > > > > mentioned, we can do the same thing in Lucene as well as what
> > > > > > > > is available in Solr; Solr is based on Lucene only, right?
> > > > > > > > I request Uwe to give me some more ideas on using the analyzers
> > > > > > > > from Solr that will do the job for me, handling a mix of both
> > > > > > > > English and non-English content.
> > > > > > > > Muir, can you give me a more detailed description of how to use
> > > > > > > > the WordDelimiterFilter to do my job?
> > > > > > > > On a side note, I was thinking of writing a simple analyzer
> > > > > > > > that will do the following:
> > > > > > > > #1. If the webpage fragment is non-English [for me it's some
> > > > > > > > Indian language], then index it as such; no stemming/stop-word
> > > > > > > > removal to begin with. As I know, it's in UCN Unicode,
> > > > > > > > something like \u0021\u0012\u34ae\u0031 [just a sample].
> > > > > > > > #2. If the fragment is English, then apply the standard
> > > > > > > > analyzing process for English content. I've not thought of
> > > > > > > > querying in the same way as of now, i.e. a mix of non-English
> > > > > > > > and English words.
> > > > > > > > Now to get all this:
> > > > > > > > #1. I need some sort of way which will let me know if the
> > > > > > > > content is English or not. If not English, just add the tokens
> > > > > > > > to the document. Do we really need language identifiers, as I
> > > > > > > > don't have any other content that uses the same script as
> > > > > > > > English other than those \u1234 things for my Indian-language
> > > > > > > > content? Any smart hack/trick for the same?
> > > > > > > > #2. If it's English, apply all the normal processing and add
> > > > > > > > the stemmed tokens to the document.
> > > > > > > > For all this I was thinking of iterating over each word of the
> > > > > > > > web page, applying the above procedure, and finally adding the
> > > > > > > > newly created document to the index.
> > > > > > > >
> > > > > > > > I would like someone to guide me in this direction. I'm pretty
> > > > > > > > sure people must have done the same or a similar thing earlier;
> > > > > > > > I request them to guide me / point me to some tutorials for the
> > > > > > > > same. Else, help me out writing a custom analyzer, but only if
> > > > > > > > that's not going to be too complex. LOL, I'm a new user to
> > > > > > > > Lucene and know the basics of Java coding.
> > > > > > > > Thank you very much.
> > > > > > > >
> > > > > > > > --KK.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Yes, this is true. For starters, KK, it might be good to
> > > > > > > > > start up Solr and look at
> > > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > > >
> > > > > > > > > If you want to stick with Lucene, the WordDelimiterFilter is
> > > > > > > > > the piece you will want for your text, mainly for punctuation
> > > > > > > > > but also for format characters such as ZWJ/ZWNJ.
> > > > > > > > >
> > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > > > >
> > > > > > > > > > You can also re-use the Solr analyzers, as far as I found
> > > > > > > > > > out. There is an issue in JIRA / a discussion on java-dev
> > > > > > > > > > to merge them.
> > > > > > > > > >
> > > > > > > > > > -----
> > > > > > > > > > Uwe Schindler
> > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > > http://www.thetaphi.de
> > > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > > Subject: Re: How to support stemming and case folding for
> > > > > > > > > > > english content mixed with non-english content?
> > > > > > > > > > >
> > > > > > > > > > > KK, OK, so you only really want to stem the English. This
> > > > > > > > > > > is good.
> > > > > > > > > > >
> > > > > > > > > > > Is it possible for you to consider using Solr? Solr's
> > > > > > > > > > > default analyzer for type 'text' will be good for your
> > > > > > > > > > > case. It will do the following:
> > > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > > 2. handle both Indian-language and English punctuation
> > > > > > > > > > > 3. lowercase the English
> > > > > > > > > > > 4. stem the English
> > > > > > > > > > >
> > > > > > > > > > > Try a nightly build:
> > > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
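The punctuation handling mentioned in step 2 above is essentially what WordDelimiterFilter provides. As a rough, non-Solr illustration (the real filter takes many configuration flags for catenation, case-change splits, etc.; this hypothetical toy class only shows the basic splitting idea):

```java
import java.util.ArrayList;
import java.util.List;

// Toy, non-Solr sketch of word-delimiter-style splitting:
// break a token into parts at punctuation and other non-letter/digit chars.
class ToyDelimiterSplit {
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(c);                 // still inside a word part
            } else if (cur.length() > 0) {     // delimiter ends the current part
                parts.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }
}
```

For example, split("Wi-Fi") gives [Wi, Fi] and split("can't") gives [can, t]; the real Solr filter additionally decides whether to also catenate the parts back together, split on case changes, and so on.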
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > > I'm indexing Indian-language web pages which have got a
> > > > > > > > > > > > decent amount of English content mixed in. For the time
> > > > > > > > > > > > being I'm not going to use any stemmers, as we don't
> > > > > > > > > > > > have standard stemmers for Indian languages. So what I
> > > > > > > > > > > > want to do is like this:
> > > > > > > > > > > > Say I've a web page having Hindi content with 5%
> > > > > > > > > > > > English content. Then for Hindi I want to use the basic
> > > > > > > > > > > > whitespace analyzer, as we don't have stemmers for this
> > > > > > > > > > > > as I mentioned earlier, and wherever English appears I
> > > > > > > > > > > > want it to be stemmed, tokenized, etc. [the standard
> > > > > > > > > > > > process used for English content]. As of now I'm using
> > > > > > > > > > > > the whitespace analyzer for the full content, which
> > > > > > > > > > > > doesn't support case folding, stemming, etc. So if
> > > > > > > > > > > > there is an English word, say "Detection", indexed as
> > > > > > > > > > > > such, then searching for "detection" or "detect" gives
> > > > > > > > > > > > no results, which is the expected behavior, but I want
> > > > > > > > > > > > this kind of query to give results.
> > > > > > > > > > > > I hope I made it clear. Let me know any ideas on doing
> > > > > > > > > > > > the same. And one more thing: I'm storing the full
> > > > > > > > > > > > webpage content under a single field; I hope this will
> > > > > > > > > > > > not make any difference, right?
> > > > > > > > > > > > It seems I've to use language identifiers, but do we
> > > > > > > > > > > > really need that? Because we've only non-English
> > > > > > > > > > > > content mixed with English [and not French or Russian
> > > > > > > > > > > > etc.].
> > > > > > > > > > > >
> > > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > > > > > > > > > thoughts!
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > KK.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > KK, is all of your Latin-script text actually
> > > > > > > > > > > > > English? Is there stuff like German or French mixed
> > > > > > > > > > > > > in?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And for your non-English content (your examples have
> > > > > > > > > > > > > been Indian writing systems), is it generally true
> > > > > > > > > > > > > that if you have Devanagari, you can assume it's
> > > > > > > > > > > > > Hindi? Or is there stuff like Marathi mixed in?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The reason I say this: to invoke the right stemmers
> > > > > > > > > > > > > you really need some language detection, but perhaps
> > > > > > > > > > > > > in your case you can cheat and detect this based on
> > > > > > > > > > > > > scripts...
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Robert
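The script-based "cheat" mentioned here can be sketched with plain Java Unicode-block checks. This is illustrative code (hypothetical class name), not a real language identifier; it only distinguishes Devanagari tokens from everything else:

```java
// Toy sketch of detecting "language" by script: classify a token by the
// Unicode block of its characters instead of running a real language
// identifier. Only Devanagari vs. non-Devanagari is handled here.
class ScriptDetect {
    static boolean isDevanagari(String token) {
        for (char c : token.toCharArray()) {
            if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.DEVANAGARI) {
                return false; // any non-Devanagari char disqualifies the token
            }
        }
        return !token.isEmpty();
    }
}
```

A per-token check like this is enough when, as in this thread, the only scripts in play are Latin (English) and one Indic script.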
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK
> > > > > > > > > > > > > <dioxide.software@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > I'm indexing some non-English content, but the page
> > > > > > > > > > > > > > also contains English content. As of now I'm using
> > > > > > > > > > > > > > WhitespaceAnalyzer for all content, and I'm storing
> > > > > > > > > > > > > > the full webpage content under a single field. Now
> > > > > > > > > > > > > > we require support for case folding and stemming
> > > > > > > > > > > > > > for the English content intermingled with the
> > > > > > > > > > > > > > non-English content. I must mention that we don't
> > > > > > > > > > > > > > have stemming and case folding for this non-English
> > > > > > > > > > > > > > content. I'm stuck with this. Someone do let me
> > > > > > > > > > > > > > know how to proceed to fix this issue.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > KK.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Robert Muir
> > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
