lucene-java-user mailing list archives

From "Rob Staveley (Tom)" <rstave...@seseit.com>
Subject RE: Problems indexing large documents
Date Sat, 10 Jun 2006 07:01:42 GMT
The answer was of course in the FAQ -
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Breaking large documents into manageable chunks isn't ideal. I need to index
e-mail with attachments, which are frequently large. Currently each message
part corresponds to a Lucene Document, but that means I am discarding terms
beyond maxFieldLength. Having to span a message part across multiple Lucene
Documents is ugly for various reasons - e.g. a search returns multiple
Documents with different relevance scores, but more than one of those
Documents refers to the same message part.

Two thoughts:

(1) If the sentence "XX YY XX ZZ XX" were indexed, would that count as 3
terms in this context or 5? If repeated terms are not counted, I can
probably cope by increasing the size of the heap and raising maxFieldLength
to deal with realistic vocabularies, and I ought to be able to handle most
large documents.
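
If anyone wants to check, a little experiment like this should settle it (a
minimal sketch against the 1.9/2.0 API; WhitespaceAnalyzer and the limit of 3
are just for the test):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.RAMDirectory;

    public class MaxFieldLengthTest {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
            writer.setMaxFieldLength(3);  // 3 of what - tokens or unique terms?

            Document doc = new Document();
            doc.add(new Field("body", "XX YY XX ZZ XX",
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Dump the term dictionary to see what actually got indexed
            IndexReader reader = IndexReader.open(dir);
            TermEnum terms = reader.terms();
            while (terms.next()) {
                System.out.println(terms.term());
            }
            terms.close();
            reader.close();
        }
    }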

(2) A Lucene wishlist thought... Would it be realistic to have an option for
Field indexing which isn't entirely in RAM? The client code knows when a
Field is going to be a big one, because it can look at the file size before
passing the Field its java.io.Reader. If we could have a flag in Field that
says "do this the slow way, because the calling code already knows it is a
big one", and Otis, Erik & Co could work their magic, we could perhaps have
large Lucene Documents without running out of heap space. Perhaps
maxFieldLength = -1 could denote what's needed?
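
To make the wish concrete, the calling side might look something like this
(purely hypothetical - setIndexOffHeap, or whatever it would be called, does
not exist; only the Field(String, Reader) constructor is real):

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    File attachment = new File("big-attachment.txt");
    Document doc = new Document();
    Field body = new Field("body", new FileReader(attachment));  // tokenized, unstored
    if (attachment.length() > 10L * 1024 * 1024) {
        // HYPOTHETICAL flag: no such method exists today; the wish is that
        // Lucene would buffer this field's postings on disk, not the heap
        body.setIndexOffHeap(true);
    }
    doc.add(body);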

-----Original Message-----
From: Rob Staveley (Tom) [mailto:rstaveley@seseit.com] 
Sent: 10 June 2006 07:22
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

I'm trying to come to terms with
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
too. I've been attempting to index large text files as single Lucene
Documents, passing them as a java.io.Reader to keep RAM usage down. I was
assuming (as, I suspect, was manu mohedano) that an unstored field could be
of any length and that maxFieldLength only applied to stored fields. Do we in
fact need to break the document into manageable parts?
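
For reference, this is roughly what I have been doing (a sketch; the paths
are placeholders), on the assumption that an unstored Reader field escapes
the limit:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexLargeFile {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/index",
                                                 new StandardAnalyzer(), true);
            Document doc = new Document();
            // A Reader-valued field is tokenized and unstored, so the file
            // streams through the analyzer rather than sitting on the heap as
            // one String; but maxFieldLength (default 10,000) still truncates it
            doc.add(new Field("contents",
                              new BufferedReader(new FileReader("/tmp/large.txt"))));
            writer.addDocument(doc);
            writer.close();
        }
    }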

-----Original Message-----
From: Pasha Bizhan [mailto:lucene-list@lucenedotnet.com] 
Sent: 09 June 2006 21:35
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

Hi, 

> From: manu mohedano [mailto:manumohedano@gmail.com] 

> Hi All! I have a problem... When I index text documents in
> English, there is no problem, but when I index Spanish text
> documents (and they're big), a lot of the information in the
> document doesn't get indexed (I suppose it is due to the
> Analyzer, but if the document is less than 400 KB it works
> perfectly). However, I want to index ALL the strings in the
> document with no stop words. Is this possible?

Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
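
In short (a minimal sketch, assuming Lucene 1.9/2.0): the likely culprit is
maxFieldLength, whose default (DEFAULT_MAX_FIELD_LENGTH) is 10,000 terms,
and stop words come from the analyzer, so pick one without a stop list:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // WhitespaceAnalyzer applies no stop-word list (StandardAnalyzer ships
    // with an English one), and raising maxFieldLength lifts the 10,000-term
    // default cap that silently drops the tail of big documents
    IndexWriter writer = new IndexWriter("/tmp/index",
                                         new WhitespaceAnalyzer(), true);
    writer.setMaxFieldLength(Integer.MAX_VALUE);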

Pasha Bizhan



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

