lucene-java-user mailing list archives

From "Chris Sibert" <chrissib...@attbi.com>
Subject Re: Creating indexes
Date Wed, 19 Jun 2002 07:14:42 GMT
Thanks.

----- Original Message -----
From: "Nader S. Henein" <nsh@bayt.net>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, June 19, 2002 2:54 AM
Subject: RE: Creating indexes


> Just store the whole thing in the index. It'll make the index bigger, but
> it'll let you find method in the madness: manually parsing a forty-meg
> file every time you need to display search results is too intensive.
>
> Nader Henein
>
> -----Original Message-----
> From: Chris Sibert [mailto:chrissibert@attbi.com]
> Sent: Wednesday, June 19, 2002 10:47 AM
> To: Lucene Users List
> Subject: Re: Creating indexes
>
>
> The file that I have is big, about 40 MB. And it's got a whole lot of
> smaller documents in it - about 15 thousand - too many to separate into
> individual files. These individual documents are actually similar to emails
> stored in a large text file. The file is structured to an extent, with a
> number before each document (ex: __10001__, __10002__, etc.), along with
> the date, etc. Kind of like email headers.
>
> In the Lucene index, it seems like I'll have to use: 1) a DocumentNumbers
> field to index all of the document numbers, 2) a Dates field to index the
> document dates, and 3) a TextBody field to index all of the document text
> together. I'll have to write an InputStreamFilter or something to parse the
> data as it's coming in to the Lucene IndexWriter, create a new document
> every time I hit a new number, and parse out the numbers (like __10001__)
> so I can put them in the DocumentNumbers field, the dates in a Dates
> field, and the text in a TextBody field. It won't be pleasant writing
> that parser, but...
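
A minimal sketch of the parser described above, in plain Java with no Lucene dependency. It assumes each marker such as __10001__ sits alone on its own line. The names RecordSplitter, Record, and split are hypothetical, and date parsing is left out because the thread does not say what the date lines look like.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordSplitter {
    /** One sub-document: the number from its marker line and the text that follows. */
    public static final class Record {
        public final String number;
        public final String body;

        Record(String number, String body) {
            this.number = number;
            this.body = body;
        }
    }

    // Assumption: a marker is a line holding only "__<digits>__".
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__\\s*");

    /** Reads the big file line by line and starts a new record at every marker. */
    public static List<Record> split(Reader in) throws IOException {
        List<Record> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String number = null;                 // number of the record being collected
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = MARKER.matcher(line);
            if (m.matches()) {
                // A new marker closes the previous record, if there was one.
                if (number != null) {
                    records.add(new Record(number, body.toString()));
                }
                number = m.group(1);
                body.setLength(0);
            } else if (number != null) {
                // Lines before the first marker are ignored.
                body.append(line).append('\n');
            }
        }
        if (number != null) {
            records.add(new Record(number, body.toString()));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String sample = "__10001__\nfirst text\n__10002__\nsecond text\nmore\n";
        for (Record r : split(new StringReader(sample))) {
            System.out.println(r.number + " -> " + r.body.length() + " chars");
        }
    }
}
```

Each Record would then become one Lucene Document, with the number and the body going into separate fields along the lines of the DocumentNumbers and TextBody fields mentioned above.
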
>
> My other issue at this point is how to then display the documents that
> relate to the search hits. I have to be able to open that 40 MB file and go
> to the document(s) that correspond to the hits in the index, for display to
> the user. Does Lucene keep a location stored in the index of where each word
> is found in the original file? How do I know at what offset in the original
> data file to find the original document for display? Is this something that
> I have to store myself in each document object in the index? Is this why
> you create separate document objects in the Lucene index, so that each new
> document object in the index will contain the file offset into the
> original data file? And if Lucene doesn't put that file offset in there
> automagically, I would have to store that myself as I create the index, in
> something like a FileOffsetLocation field, for each document. Am I on the
> right track here?
>
> Whew.
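
On the offset question: as the earlier reply in this thread says, Lucene does not know where a sub-document sits inside the original file, so the offset has to be recorded by whoever builds the index. Below is a minimal sketch of that bookkeeping in plain Java, with no Lucene calls. The names OffsetIndex, Span, scan, and read are hypothetical. It assumes each marker such as __10001__ is alone on its line and that the file uses a single-byte encoding (ISO-8859-1 is used here as a stand-in).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetIndex {
    /** Byte range [start, end) of one sub-document, marker line included. */
    public static final class Span {
        public final long start;
        public final long end;

        Span(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Assumption: a marker is a line holding only "__<digits>__".
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__");

    private static String markerNumber(String line) {
        Matcher m = MARKER.matcher(line.trim());
        return m.matches() ? m.group(1) : null;
    }

    /** Scans the file once and maps each document number to its byte range. */
    public static Map<String, Span> scan(Path file) throws IOException {
        Map<String, Span> spans = new LinkedHashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;          // bytes consumed so far
            long lineStart = 0;    // offset where the current line began
            long docStart = -1;    // offset where the current document began
            String number = null;
            StringBuilder line = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b != '\n') {
                    line.append((char) b);   // only used to spot ASCII markers
                    continue;
                }
                String num = markerNumber(line.toString());
                if (num != null) {
                    if (number != null) spans.put(number, new Span(docStart, lineStart));
                    number = num;
                    docStart = lineStart;
                }
                line.setLength(0);
                lineStart = pos;
            }
            // Handle a final line with no trailing newline, then close the last document.
            String num = markerNumber(line.toString());
            if (num != null) {
                if (number != null) spans.put(number, new Span(docStart, lineStart));
                number = num;
                docStart = lineStart;
            }
            if (number != null) spans.put(number, new Span(docStart, pos));
        }
        return spans;
    }

    /** Seeks straight to one sub-document instead of re-parsing the whole file. */
    public static String read(Path file, Span span) throws IOException {
        byte[] buf = new byte[(int) (span.end - span.start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(span.start);
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("big", ".txt");
        Files.write(file, "__10001__\nfirst\n__10002__\nsecond\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        Map<String, Span> spans = scan(file);
        System.out.print(read(file, spans.get("10002")));
        Files.delete(file);
    }
}
```

At indexing time, the start and end values could be saved with each Lucene Document, in something like the FileOffsetLocation field suggested above, so that a search hit leads directly to one seek and one read in the 40 MB file.
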
>
> ----- Original Message -----
> From: "none none" <korfut@lycos.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Wednesday, June 12, 2002 11:56 AM
> Subject: Re: Creating indexes
>
>
> > Lucene doesn't know where a file starts or ends. Actually it does know,
> > but in your case one Document contains many small documents. If you want
> > to split your big file into small files, you must do that yourself. Take
> > a look at the Document class and you will see that Lucene uses a Reader
> > to index the body of a file, so maybe you should build a class that
> > returns a Reader for each sub-document you want.
> > But I think it is easier to split your main document into small
> > documents and index these small documents with a common "keyword" that is
> > the name of the actual big file, so when you search you can tell where
> > each "sub" document is located. After you index those files you can
> > delete them. What you need is a BigDocumentManager that:
> >
> > 1. splits your big file(s)
> > 2. indexes them (don't forget the keyword => big doc name)
> > 3. deletes those "sub" documents (they are like temp docs)
> >
> > Hope this helps.
> >
> >
> > --
> >
> > On Wed, 12 Jun 2002 02:26:58
> >  Chris Sibert wrote:
> > >I have a big (40 MB or so) file to index. The file contains a whole bunch
> > >of documents, which are each pretty small, about a few typewritten pages
> > >long. There's a title, date, and author for each document, in addition to
> > >the documents' actual text.
> > >
> > >I'm not quite sure how you index this in Lucene. For each document in the
> > >original file, I assume that I create a separate Lucene Document object in
> > >the index with author, date, title, and text fields. If so, my question is
> > >that when I'm reading in the original file for indexing, does Lucene know
> > >where each document begins and ends in the original file? Or do I have to
> > >write a parser or filter or something for the InputStream that's reading
> > >the file?
> > >
> > >Chris Sibert
> > >
> > >
> > >
> > >--
> > >To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > >For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> > >
> > >
> >
> >
>
>



