lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ashwin kumar" <gv.ash...@gmail.com>
Subject Re: indexing pdfs
Date Fri, 09 Mar 2007 02:47:52 GMT
hi sachin the link wat u gave me only a zip file and an exe file for
downoad. and this zip file also contains no class files.but wouldn't we be
requiring a jar file or class file ???

On 3/8/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
>
> Hi,
>
> Here it is:
>
> http://www.seekafile.org/
>
> -----Original Message-----
> From: ashwin kumar [mailto:gv.ashwin@gmail.com]
> Sent: 08 March 2007 13:07
> To: java-user@lucene.apache.org
> Subject: Re: indexing pdfs
>
> hi again
> do we have to download any jar files to run this program if so can u
> give me the link pls
>
> ashwin
>
> On 3/8/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
> >
> > Well you don't need to actually save the text to disk and then index
> > the saved index file, you can directly index that text in-memory.
> >
> > The only other way I have heard of is to use Ifilters.  I believe
> > SeekAFile does indexing of pdfs.
> >
> > Sachin
> >
> > -----Original Message-----
> > From: ashwin kumar [mailto:gv.ashwin@gmail.com]
> > Sent: 08 March 2007 11:35
> > To: java-user@lucene.apache.org
> > Subject: Re: indexing pdfs
> >
> > Is the only way index pdfs is to convert it into a text and then only
> > index it ???
> >
> >
> >
> > On 3/8/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
> > >
> > > Hi Aswin,
> > >
> > > You can try pdfbox to convert the pdf documents to text and then use
>
> > > Lucene to index the text.  The code for turning a pdf to text is
> > > very
> > > simple:
> > >
> > > private static string parseUsingPDFBox(string filename)
> > >         {
> > >             // document reader
> > >             PDDocument doc = PDDocument.load(filename);
> > >             // create stripper (wish I had the power to do that -
> > > wouldn't leave the house)
> > >             PDFTextStripper stripper = new PDFTextStripper();
> > >             // get text from doc using stripper
> > >             return stripper.getText(doc);
> > >         }
> > >
> > > Sachin
> > >
> > > -----Original Message-----
> > > From: ashwin kumar [mailto:gv.ashwin@gmail.com]
> > > Sent: 08 March 2007 09:37
> > > To: java-user@lucene.apache.org
> > > Subject: indexing pdfs
> > >
> > > hi can some one help me by giving any sample programs for indexing
> > > pdfs and .doc files
> > >
> > > thanks
> > > regards
> > > ashwin
> > >
> > >
> > > This message has been scanned for viruses by MailControl - (see
> > > http://bluepages.wsatkins.co.uk/?6875772)
> > >
> > >
> > > This email and any attached files are confidential and copyright
> > > protected. If you are not the addressee, any dissemination of this
> > > communication is strictly prohibited. Unless otherwise expressly
> > > agreed in writing, nothing stated in this communication shall be
> > legally binding.
> > >
> > > The ultimate parent company of the Atkins Group is WS Atkins plc.
> > > Registered in England No. 1885586.  Registered Office Woodcote
> > > Grove, Ashley Road, Epsom, Surrey KT18 5BW.
> > >
> > > Consider the environment. Please don't print this e-mail unless you
> > > really need to.
> > >
> > > --------------------------------------------------------------------
> > > - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message