lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: storing the contents of a document in the lucene index
Date Wed, 30 Jul 2008 18:56:27 GMT
I thought of one more thing you should be aware of. The
the default field length for any field (no matter which of the
two forms you use) is 10,000 tokens.

This can be easily changed, see
IndexWriter.setMaxFieldLength().

Best
Erick

On Thu, Jul 24, 2008 at 9:25 AM, starz10de <farag_ahmed@yahoo.com> wrote:

>
> Dear Erick ,
>
>  Thnaks for your answer, I tryed other way ,  where I read the text files
> before i index them. I will try also your solution here.
>
> best regards
>
>
> Erick Erickson wrote:
> >
> > OK, I'm finally catching on. You have to change the demo code to
> > get the contents into something besides an input stream, so you
> > can use one of the alternate forms of the Field constructor. For
> > instance, you could read it all into a string and use the form:
> >
> > doc.add(new Field("content", <string with all the file contents in it>,
> >                Field.Store.YES, Field.Index.TOKENIZED))
> >
> >
> > Or, you can do something like this, which produces identical results
> > to the above
> >
> > while (more text to read) {
> >      String line = read a line of text from the file
> >      doc.add(new Field("content", line, Field.Store.YES,
> > Field.Index.TOKENIZED))
> > }
> >
> > You can add to the same field as often as you want and it just appends
> the
> > content of calls 2 to N to the same field.
> >
> >
> > Best
> > Erick
> >
> >
> > On Wed, Jul 23, 2008 at 3:42 AM, starz10de <farag_ahmed@yahoo.com>
> wrote:
> >
> >>
> >> Hi Erik,
> >>
> >>  I don't remove the stop words, as I index parallel corpora which is
> used
> >> for learning the translations between pair of languages. so every word
> is
> >> important. I even develop my own analyzer for Arabic which is just
> remove
> >> punctuations and special symbols and it return only Arabic text.
> >>
> >> I guess in the   FileDocument.java   the whole text is already stored
> >>
> >> doc.add(Field.Text("contents", IN));
> >>
> >> where IN is
> >>
> >> IN = new BufferedReader(new InputStreamReader(new FileInputStream(f))
> >>
> >> if this is not the case yould you please how to store the whole text
> >> inside
> >> the index ?
> >>
> >> I am new to lucene and I don't know how to use this "Field.Store.YES" to
> >> store whole text.
> >>
> >>
> >>
> >> Best regards
> >> Farag
> >>
> >>
> >>
> >> starz10de wrote:
> >> >
> >> >   Could any one tell me please how to print the content of the
> document
> >> > after reading the index.
> >> > for example if i like to print the  index terms then i do :
> >> >
> >> > IndexReader ir = IndexReader.open(index);
> >> > TermEnum termEnum = ir.terms();
> >> > while (termEnum.next()) {
> >> >                       TermDocs dok = ir.termDocs();
> >> >                       dok.seek(termEnum);
> >> >                       while (dok.next()) {
> >> > System.out.println(termEnum.term().text().trim());
> >> >                               }
> >> >
> >> > I can print the text files before indexing them, but because of
> >> encoding
> >> > issues i like to print them from the index.
> >> > As i know the content of the document(whole text) is also stored in
> the
> >> > index, my question how to print this content.
> >> >
> >> > so at the end i will print the path of the current document , index
> >> terms
> >> > and the content of the document
> >> >
> >> >
> >> > thanks in advance
> >> >
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18605547.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18631887.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message