lucene-java-user mailing list archives

From "Rob Jose" <rjos...@comcast.net>
Subject Re: Index Size
Date Thu, 19 Aug 2004 16:44:35 GMT
Stephane

Thanks for your response.  I had the same question.  In fact, after
I went home last night, that is exactly what I thought I was doing.  But I
just used Luke to go through all of my documents, and I don't see any
duplicates.  I will check again just to make sure.

Rob
----- Original Message ----- 
From: "Stephane James Vaucher" <vauchers@cirano.qc.ca>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, August 19, 2004 9:34 AM
Subject: Re: Index Size


Stupid question:

Are you sure you have the right number of docs in your index? i.e. you're
not adding the same document twice into or via your tmp index.

sv
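
One way to double-check for duplicates, beyond paging through the documents by hand, is to collect one key per document and look for repeats. The sketch below is only an illustration and is not code from this thread: it assumes the keys (for example the "path" value joined with the "page" value) have already been read out of the index into a list, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports every document key that occurs more than once.
final class DuplicateKeyCheck {

    // Returns each repeated key once, in the order its first repeat was seen.
    static List<String> findDuplicates(List<String> keys) {
        Set<String> seen = new HashSet<>();
        Set<String> reported = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String key : keys) {
            // add() returns false when the key was already present
            if (!seen.add(key) && reported.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up keys in the assumed "path|page" form
        List<String> keys = Arrays.asList(
                "reports/a.txt|1", "reports/a.txt|2", "reports/a.txt|1");
        System.out.println(findDuplicates(keys));
    }
}
```

An empty result would suggest the index holds no repeated documents under that key, which would point the search for the extra space elsewhere.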

On Thu, 19 Aug 2004, Rob Jose wrote:

> Paul
> Thank you for your response.  I have appended to the bottom of this message
> the field structure that I am using.  I hope that this helps.  I am using
> the StandardAnalyzer.  I do not believe that I am changing any default
> values, but I have also appended the code that adds the temp index to the
> production index.
>
> Thanks for your help
> Rob
>
> Here is the code that describes the field structure.
> public static Document Document(String contents, String path, Date modified,
>         String runDate, String totalpages, String pagecount, String countycode,
>         String reportnum, String reportdescr)
> {
>     SimpleDateFormat showFormat = new SimpleDateFormat(
>             TurbineResources.getString("date.default.format"));
>     SimpleDateFormat searchFormat = new SimpleDateFormat("yyyyMMdd");
>
>     Document doc = new Document();
>     doc.add(Field.Keyword("path", path));
>     doc.add(Field.Keyword("modified", showFormat.format(modified)));
>     doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
>     doc.add(Field.Keyword("runDate", runDate==null?"":runDate));
>     doc.add(Field.UnStored("searchRunDate",
>             runDate==null?"":runDate.substring(6)+runDate.substring(0,2)+runDate.substring(3,5)));
>     doc.add(Field.Keyword("reportnum", reportnum));
>     doc.add(Field.Text("reportdescr", reportdescr));
>     doc.add(Field.UnStored("cntycode", countycode));
>     doc.add(Field.Keyword("totalpages", totalpages));
>     doc.add(Field.Keyword("page", pagecount));
>     doc.add(Field.UnStored("contents", contents));
>     return doc;
> }
>
>
>
> Here is the code that adds the temp index to the production index.
>
> File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
> tempReader = IndexReader.open(tempFile);
> try
> {
>     boolean createIndex = false;
>     File f = new File(sIndex + File.separatorChar + sCntyCode);
>     if (!f.exists())
>     {
>         createIndex = true;
>     }
>     prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
>             new StandardAnalyzer(), createIndex);
> }
> catch (Exception e)
> {
>     IndexReader.unlock(FSDirectory.getDirectory(
>             sIndex + File.separatorChar + sCntyCode, false));
>     CasesReports.log("Tried to Unlock " + sIndex);
>     prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
>     CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
> }
> prodWriter.setUseCompoundFile(true);
> prodWriter.addIndexes(new IndexReader[] { tempReader });
>
>
>
>
>
> ----- Original Message -----
> From: "Paul Elschot" <paul.elschot@xs4all.nl>
> To: <lucene-user@jakarta.apache.org>
> Sent: Thursday, August 19, 2004 12:16 AM
> Subject: Re: Index Size
>
>
> On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> > Hello
> > I have indexed several thousand (52 to be exact) text files and I keep
> > running out of disk space to store the indexes.  The size of the documents
> > I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
> > 287 GB.  Does this seem correct?  I am not storing the contents of the
>
> As noted, one would expect the index size to be about 35%
> of the original text, ie. about 2.5GB * 35% = 800MB.
> That is two orders of magnitude off from what you have.
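
The estimate above can be checked with a few lines of arithmetic. This sketch only restates the figures quoted in the thread (2.5 GB of text, a 35% rule of thumb, 287 GB on disk); the class and method names are made up for illustration. With those inputs the expected size comes to 0.875 GB, which the message rounds to about 800MB, and the reported index is roughly 330 times larger than expected.

```java
// Hypothetical helper that restates the size estimate from the thread.
final class IndexSizeEstimate {

    // Expected index size given the source size and an index-to-text ratio.
    static double expectedIndexGb(double sourceGb, double ratio) {
        return sourceGb * ratio;
    }

    // How many times larger the actual index is than the expected one.
    static double blowUpFactor(double actualGb, double expectedGb) {
        return actualGb / expectedGb;
    }

    public static void main(String[] args) {
        double expected = expectedIndexGb(2.5, 0.35); // figures from the thread
        double factor = blowUpFactor(287.0, expected);
        System.out.printf("expected %.3f GB, actual is %.0f times larger%n",
                expected, factor);
    }
}
```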
>
> Could you provide some more information about the field structure,
> i.e. how many fields, which fields are stored, which fields are indexed,
> any use of non-standard analyzers, and any non-standard
> Lucene settings?
>
> You might also try to change to non compound format to have a look
> at the sizes of the individual index files, see file formats on the lucene
> web site.
> You can then see the total disk size of for example the stored fields.
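
Once the index is written in the non-compound format, the per-file sizes can be totalled by file extension to see which part of the index dominates. The sketch below is a minimal illustration using only the JDK, not something posted in this thread; the class name is hypothetical and the directory is passed in by the caller. In Lucene's documented file formats, stored field data lives in files ending in .fdt, so a very large total for that extension would point at the stored fields.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: totals the bytes on disk per file extension.
final class IndexFileSizes {

    // Sums the sizes of regular files directly inside indexDir, keyed by extension.
    static Map<String, Long> bytesByExtension(Path indexDir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and other non-files
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (such as "segments") are keyed by full name
                String ext = dot >= 0 ? name.substring(dot + 1) : name;
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no index path is given
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        bytesByExtension(dir).forEach(
                (ext, bytes) -> System.out.println(ext + "\t" + bytes));
    }
}
```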
>
> Regards,
> Paul Elschot
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org