lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Big Ducument Indexing Limit?
Date Fri, 15 Sep 2006 13:17:36 GMT
First, You really must undestand analyzers and what they do. If you haven't
seen the book Lucene in Action, I highly recommend it.

Second, get a copy of Luke (google luke and lucene). It is a graphical tool
that lets you examine an index and fire queries at it. It'll show you
exactly what was indexed with the particular analyzer you eventually use. It
will also show you exactly what the result of a query is after it has been
processed by the query parser (again, using different analyzers). I
guarantee that the hour or two you spend understanding Luke will be time
*very* well spent.

About the content you indicated... what do you want to search? Are you going
to search on "\A1" "Frank" "\Paul" "\A1;Frank\Paul"? That is, would you
expect searching on \A1 to produce a hit or not?

You have to decide what parts of this text are to be searchable (i.e. how to
tokenize the input) and then use an analyzer/tokenizer that breaks the
stream up as you wish (for both indexing and querying). If none of the
supplied analyzers do what you want, Lucene in Action presents an example of
making your own (under synonym injection) that you can use as a model for
making your own.

It would be much more helpful if you could give an example of what searches
you would expect to generate hits on a given example input.... Again, using
Luke to look at your index after indexing would help you a lot to understand
what's possible.

Hope this helps
Erick

On 9/15/06, aslam bari <iamaslamok@yahoo.co.in> wrote:
>
> Thanks for response,
>   I have again a small problem.
>   I have some text in a xml tag like
>   <Contents>\A1;Frank\PPaul</Contents>
>
>   Does lucene can not index it using SimpleAnalyzer or
> TextContentExtractor.
>   Thanks...
>
> Catalin Mititelu <catalinmititelu@yahoo.com> wrote:
>   One more hint for 2) and 3): use SimpleAnalyzer on your xml (give up at
> XmlContentExtractor). In this manner you can index all "words" from xml file
> at lower case (tag name, attribute name, attribute value and content).
> Of course, you should use the same analyzer for searching.
>
> Simon Willnauer wrote: On 9/15/06, aslam bari wrote:
> > Dear Mititelu,
> > Thanks for reply. Can you help me on some samll issue related to it.
> > 1) I am new to Lucene. Can you tell me where is this
> DEFAULT_MAX_FIELD_LENGTH variable available and how to set it and for my
> purpose like 6-10MB file, how much i should set.
> DEFAULT_MAX_FIELD_LENGTH is a constant of IndexWriter.
> to set your own value use IndexWriter.html#setMaxFieldLength(int)
>
> JavaDoc:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>
> > 2) how can i index all the words of XML file as Case Insensitive means
> either in lowercase or in Uppercase. So that i can search case insensitive.
> How your text is processed depends on the analyzer you use for
> indexing a certain field. A lot of the available analyzers do
> lowercase processing using a lowercasefilter like the
> (whitespaceanalyzer)
> see (Analysis) at:
> http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
>
>
> You should have a look at
> > 3) I am using XmlContentExtractor. Actually it extract only Content of
> the xml tag. If i want every thing (including XmlTag or contents or
> properties - attributes ) of xml file to be indexed, where should i do
> change the code.
>
> I'm not 100% sure but isn't XmlContentExtractor a part or jakarta
> slide? If so please contact the slide mailing list. But there are many
> xml apis available to extract every content of your document. If you
> really want to index the whole xml file just read the file using java
> io anyway I would not suggest doing that at all.
>
> best regards simon
> >
> > Thanks...
> > Catalin Mititelu wrote:
> > Yes. The default max limit for indexing tokens is 10,000.
> > Look here
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
> >
> > aslam bari wrote: Dear all,
> > I am trying to index a Xml file which has 6MB size. Does lucene support
> the big document size. What is the limit of lucene Max file size to index.
> > Because when i check and trying to search in the indexed file. I am not
> able to get all the results. It gives me some results but not others. I
> think Lucene has done partially indexed and left the remaining part which it
> could not processed. IS IT RIGHT?. If not , then what will be the problem.
> > Thanks...
> >
> >
> > ---------------------------------
> > Find out what India is talking about on - Yahoo! Answers India
> > Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8.
> Get it NOW
> >
> >
> > ---------------------------------
> > Stay in the know. Pulse on the new Yahoo.com. Check it out.
> >
> >
> > ---------------------------------
> > Find out what India is talking about on - Yahoo! Answers India
> > Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8.
> Get it NOW
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
> ---------------------------------
> How low will we go? Check out Yahoo! Messenger's low PC-to-Phone call
> rates.
>
>
> ---------------------------------
> Find out what India is talking about on  - Yahoo! Answers India
> Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. Get
> it NOW
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message