lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Analyzer
Date Wed, 26 Nov 2008 13:54:36 GMT
Why not index into different fields based upon the file type? That would be
MUCH easier and PerFieldAnalyzerWrapper would be your friend. For
example.....

Doc1
text_html: "some text here"
type: html

Doc2
text_jsp: "some jsp text here'
type: html

etc...

Now your PerFieldAnalyzerWrapper has your html extractor for the text_html
field and your jsp extractor for the text_jsp field.

Searching would require you to search across all your different types, but
that's do-able. Or you'd have to restrict your searching to one of the
text_* fields if you really only wanted to search that type.

But I still think you're not clear on the difference when searching. Your
search query analyzer will probably have no relation to your indexing
analyzer since the type of input is irrelevant to your query (I guess
anyway). There you're probably using one of the analyzers provided out of
the box.

Best
Erick


On Wed, Nov 26, 2008 at 7:30 AM, Allahbaksh Mohammedali Asadullah <
Allahbaksh_Asadullah@infosys.com> wrote:

> Hi Erik,
> Thanks a lot for your reply. You are right I want different analyzer on
> same field depending upon the other fields in the document.
>
> For Example
> Doc1
> Text:"Some text here"
> type:html
>
> Doc2
> Text: "Some jsp text here"
> type:jsp
>
> Now depending upon the type I wanted to use different analyzer for
> extracting and searching. Is there any way to do this?
>
> I am using the way which you suggested.
>
> Warm Regards,
> Allahbaksh
>
> ________________________________________
> From: Erick Erickson [erickerickson@gmail.com]
> Sent: Tuesday, November 25, 2008 9:38 PM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer
>
> I'm assuming that you want a different analyzer to handle extracting
> the relevant information to put into a "text" field of the Lucene document.
> I know of no way you can attach different analyzers to a single field.
> You can certainly attach different analyzers to *different* fields...
>
> The first thing I'd try would be to write your own analyzer that keeps
> track of what kind of file it's currently analyzing and knows how to
> "do the right thing" to extract the next token for the text field.
>
> A cruder way would be to detect the type of document yourself,
> extract the text into a string (or some such) then feed that into
> your document.
>
> Best
> Erick
>
>
> On Tue, Nov 25, 2008 at 10:40 AM, Allahbaksh Mohammedali Asadullah <
> Allahbaksh_Asadullah@infosys.com> wrote:
>
> > HI All,
> > I am indexing a set file type (html, js,jsp,xml etc). All the file type
> > have a common field called as text. This field contains all the file
> data.
> > Can I have different analyzer for depending upon file type.
> >
> > Note: I am indexing all file type with same indexer.
> >
> > Regards,
> > Allahbaksh
> >
> > Allahbaksh Mohammedali Asadullah
> >
> > http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>
> >
> >
> >
> >
> > **************** CAUTION - Disclaimer *****************
> > This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
> > solely
> > for the use of the addressee(s). If you are not the intended recipient,
> > please
> > notify the sender by e-mail and delete the original message. Further, you
> > are not
> > to copy, disclose, or distribute this e-mail or its contents to any other
> > person and
> > any such actions are unlawful. This e-mail may contain viruses. Infosys
> has
> > taken
> > every reasonable precaution to minimize this risk, but is not liable for
> > any damage
> > you may sustain as a result of any virus in this e-mail. You should carry
> > out your
> > own virus checks before opening the e-mail or attachment. Infosys
> reserves
> > the
> > right to monitor and review the content of all messages sent to or from
> > this e-mail
> > address. Messages sent to or from this e-mail address may be stored on
> the
> > Infosys e-mail system.
> > ***INFOSYS******** End of Disclaimer ********INFOSYS***
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message