lucene-java-user mailing list archives

From jian chen <chenjian1...@gmail.com>
Subject Re: No.of Files in Directory
Date Thu, 30 Jun 2005 16:55:49 GMT
Hi,

My second suggestion is basically to store the user documents (word
docs) directly in the Lucene index.

1) If you are using Lucene 1.4.3, you can do something like this:

// suppose the word doc is now in a byte array
byte[] wordDoc = getUploadedWordDoc();

// add the byte array to lucene index
Document doc = new Document();
doc.add(Field.UnIndexed("originalDoc", getBase64(wordDoc)));

The getBase64 method basically transforms the bytes into ASCII text, as follows:
String getBase64(byte[] wordDoc) throws UnsupportedEncodingException
{
      // commons-codec's Base64 output is pure ASCII bytes
      byte[] chars = Base64.encodeBase64(wordDoc);
      String encodedStr = new String(chars, "US-ASCII");
      return encodedStr;
}

You can get the Base64 class from Jakarta Commons Codec:
http://jakarta.apache.org/commons/codec/apidocs/org/apache/commons/codec/binary/Base64.html
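
To retrieve the original document later, you just decode the stored
field back into bytes. A minimal sketch of the reverse step (assuming
the same commons-codec Base64 class; getOriginalDoc is just a name I
made up):

byte[] getOriginalDoc(Document doc) throws UnsupportedEncodingException
{
      // read the stored ASCII text back and undo the Base64 encoding
      String encodedStr = doc.get("originalDoc");
      return Base64.decodeBase64(encodedStr.getBytes("US-ASCII"));
}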

2) Correct me if I am wrong, but I think the latest Lucene dev
codebase can add binary content directly to the Lucene index.

Looking at
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup
it has:
/**
   * Create a stored field with binary value. Optionally the value may
be compressed.
   * 
   * @param name The name of the field
   * @param value The binary value
   * @param store How <code>value</code> should be stored (compressed or not.)
   */
  public Field(String name, byte[] value, Store store) {
.....

So, I guess if you use the latest Lucene dev codebase, you can do:
byte[] wordDoc = getUploadedWordDoc();
Document doc = new Document();
doc.add(new Field("originalDoc", wordDoc, Field.Store.YES));
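
Getting the bytes back out would be something like this (again just a
sketch; I am assuming from that Field.java that the trunk also exposes
a binaryValue() accessor on Field, so double-check the actual dev API):

IndexReader reader = IndexReader.open("/path/to/index");
Document stored = reader.document(docNum); // docNum from a search hit
byte[] wordDocBytes = stored.getField("originalDoc").binaryValue();
reader.close();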

I think a Lucene index is pretty good at storing millions of small
documents. However, there are two concerns you might want to address:

1) There is no transaction support for index manipulation. I am not
sure what happens if the machine shuts down while the program is
storing an original word document. Will the index be corrupted?

2) Since a Lucene index is basically files in a physical directory, an
index file could eventually hit a hard size limit, and then you need a
way around it. (Split the index into two indexes, or configure
Lucene's merge limit, which defaults to
IndexWriter.DEFAULT_MAX_MERGE_DOCS?)

For example, I think some versions of windoze (e.g., using the FAT
file system) have a file size limit of 2GB.
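
If you stay on 1.4.3, you can already cap how many documents get
merged into a single segment. A rough sketch (if I remember right,
maxMergeDocs is a public field on IndexWriter in 1.4.x, and the
100000 cap is just an arbitrary example):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.maxMergeDocs = 100000; // no merged segment grows past this many docs
// ... addDocument() calls ...
writer.close();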

Let me know if this helps.

Cheers,

Jian


On 6/29/05, bib_lucene bib <bib_lucene@yahoo.com> wrote:
> Thanks Jian
> 
> I need to retrieve the original document sometimes. I did not quite understand your second suggestion.
> Can you please help me understand better; a pointer to some web resource will also help.
> 
> jian chen <chenjian1227@gmail.com> wrote:
> Hi,
> 
> Depending on the operating system, there might be a hard limit on the
> number of files in one directory (some windoze versions). Even on
> operating systems that don't have a hard limit (e.g., linux), it is
> still better not to put too many files in one directory.
> 
> Typically, the file system won't be very efficient at file retrieval
> if there are more than a couple of thousand files in one directory.
> 
> There are some ways to tackle this issue.
> 
> 1) Use a hash function to distribute the files into different
> sub-directories based on the file name. For example, use the MD5 or a
> CRC algorithm in Java to hash the file name to a number, and use that
> number to construct the directory path. If the hashed number is
> 123456, you can make 123 the sub-dir name and 456 the sub-sub-dir
> name, and so forth.
> 
> I think the SQUID web proxy server uses this approach to do the file caching.
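> 
> A minimal sketch of that idea (my own illustration using CRC32, not
> SQUID's actual code):
> 
> import java.util.zip.CRC32;
> 
> String hashedPath(String fileName)
> {
>       CRC32 crc = new CRC32();
>       crc.update(fileName.getBytes());
>       long hash = crc.getValue() % 1000000; // e.g. 123456
>       long subDir = hash / 1000;            // 123
>       long subSubDir = hash % 1000;         // 456
>       return subDir + "/" + subSubDir + "/" + fileName;
> }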
> 
> 2) Why not use Lucene's indexing algorithm and store the binary files
> in the lucene index?! I love the indexing algorithm in that you don't
> need to manage free space the way a typical file system does, because
> the merge process takes care of reclaiming free space automatically.
> 
> I hope these two suggestions help.
> 
> Jian
> 
> On 6/29/05, bib_lucene bib wrote:
> > Hi All
> >
> > In my webapp I have people uploading their documents. My server is windows/tomcat.
> > I am thinking there will be a limit on the number of files in a directory.
> > Typically application users will upload 3-5 page word docs.
> >
> > 1. How does one design the system such that there will not be any problem as
> > users keep uploading their files, even if a million files are reached?
> > 2. Is there a sample application that does this?
> > 3. Should I have Lucene update the index after each upload, or should I do it
> > once a day?
> >
> > Thanks
> > Bib
> >
