lucene-dev mailing list archives

From Bernhard Messer <bmes...@apache.org>
Subject Re: Queries Lucene 1.3
Date Thu, 18 Nov 2004 10:16:08 GMT

Check the Lucene users list. There are many threads there about how to
index PDF documents with Lucene.

Bernhard

PROYECTA.Fernandez Garcia, Ivan wrote:

>Good morning everybody,
>
>	Has anyone here indexed PDF files?
>	If so, could you tell us how you did it?
>
>Thank you for your attention.
>
>-----Original Message-----
>From: Xiaozheng Ma [mailto:Xiaozheng.Ma@redwood.com]
>Sent: Wednesday, November 17, 2004 4:52 PM
>To: Lucene Developers List
>Subject: RE: Queries Lucene 1.3
>
>
>You are right; of course the most important thing is to extract the text
>from the file and index it, with something like:
>
>	document.add(Field.Text("contents", ifile.getTextContents()));
>
>
>I do a search using:
>
>	public static Hits search(String queryString, String indexFilePath)
>			throws Exception {
>		IndexSearcher searcher = new IndexSearcher(indexFilePath);
>		Query query = QueryParser.parse(queryString, "contents",
>				new StandardAnalyzer());
>		return searcher.search(query);
>	}
>
>One comment: if you append "*" to the end of the search pattern, you will have
>problems with some advanced searches, for example:
>1. Phrase search:
>If you search on "a pretty cat", you may get an exception, because the query becomes "a pretty cat"*
>2. Grouped search on a field, for example:
>Body:(pet AND "pretty"), where you are actually searching on Body:(pet AND "pretty")*
>The parser will give you an error.
>3. In general, any search string that ends with ) or " etc.
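One way to avoid the three cases listed above is to append "*" only when the user input is a single plain term. The following is a minimal sketch under that assumption; `PrefixQueryHelper` and `withTrailingWildcard` are hypothetical names, not part of Lucene, and the definition of a "simple term" (letters and digits only) is a deliberately conservative choice.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: decides when a trailing "*" is safe to add. */
final class PrefixQueryHelper {
    // A "simple term" here means letters and digits only: no spaces, quotes,
    // parentheses, field prefixes (":") or other query-syntax characters.
    private static final Pattern SIMPLE_TERM = Pattern.compile("[\\p{L}\\p{N}]+");

    private PrefixQueryHelper() {}

    /** Returns the input with "*" appended only when it is a single simple term. */
    static String withTrailingWildcard(String userInput) {
        String trimmed = userInput.trim();
        if (SIMPLE_TERM.matcher(trimmed).matches()) {
            return trimmed + "*";
        }
        // Phrases, grouped clauses and fielded queries are passed through unchanged.
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(withTrailingWildcard("cat"));
        System.out.println(withTrailingWildcard("\"a pretty cat\""));
        System.out.println(withTrailingWildcard("Body:(pet AND \"pretty\")"));
    }
}
```

The returned string would then be handed to the query parser in place of `m_texto + "*"`.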
>
>If you index the pages one by one, you should not get an OOME.
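The page-by-page idea can be sketched without any Lucene dependency: hand each page to the index as soon as it is extracted, instead of collecting all pages first. `PageSink` and `PageByPageIndexer` are hypothetical names used only for illustration. In a real Lucene 1.3 program the sink would presumably build a Document with `document.add(Field.Text("contents", text))` and pass it to `IndexWriter.addDocument`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the code that adds one page to the index. */
interface PageSink {
    void addPage(int pageNumber, String text);
}

final class PageByPageIndexer {
    private PageByPageIndexer() {}

    /**
     * Feeds pages to the sink one at a time and returns how many were handed over.
     * Only the current page's text is referenced here, so earlier pages can be
     * garbage-collected once the sink is done with them.
     */
    static int indexPages(Iterator<String> pages, PageSink sink) {
        int pageNumber = 0;
        while (pages.hasNext()) {
            pageNumber++;
            sink.addPage(pageNumber, pages.next());
        }
        return pageNumber;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("first page", "second page", "third page");
        final List<Integer> seen = new ArrayList<Integer>();
        int count = indexPages(pages.iterator(), new PageSink() {
            public void addPage(int pageNumber, String text) {
                seen.add(pageNumber);
            }
        });
        System.out.println(count + " pages indexed: " + seen);
    }
}
```

Whether this alone avoids the OOME depends on how the extractor produces the pages; if it loads the whole PDF into memory first, the iterator does not help.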
>
>Hope this helps.
>
>--
>Xiaozheng
>
>-----Original Message-----
>From: PROYECTA.Fernandez Garcia, Ivan [mailto:proyecta.ifernandez@iberia.es]
>
>Sent: Wednesday, November 17, 2004 10:32 AM
>To: Lucene Developers List
>Subject: RE: Queries Lucene 1.3
>
>First of all, Xiaozheng, thanks for your attention.
>I have tested it but we get no results.
>
>Let me explain in detail:
>
>	We would like to search for text in a PDF file.
>	I think we must index the content of each page to search the text,
>mustn't we?
>	So we must use the statement document.add(Field.Text()), mustn't we?
>	We search the text using the following statements:
>
>		Query q = QueryParser.parse(m_texto + "*",
>				CValoresGlobales.M_CONTENIDO_PAGINA, analizador);
>		q = q.rewrite(indexReader);
>		hits = searcher.search(q);
>
>	Is that O.K.?
>
>Thanks for your help.
>
>-----Original Message-----
>From: Xiaozheng Ma [mailto:Xiaozheng.Ma@redwood.com]
>Sent: Wednesday, November 17, 2004 4:21 PM
>To: Lucene Developers List
>Subject: RE: Queries Lucene 1.3
>
>
>I used the following to index and it works fine.
>		document.add(Field.Text("author", ifile.getAuthor()));
>		document.add(Field.Text("title", ifile.getTitle()));
>		document.add(Field.Text("extension", ifile.getExtension()));
>
>-----Original Message-----
>From: PROYECTA.Fernandez Garcia, Ivan [mailto:proyecta.ifernandez@iberia.es]
>
>Sent: Wednesday, November 17, 2004 10:08 AM
>To: Lucene Developers List
>Subject: RE: Queries Lucene 1.3
>
>If we don't update the IndexWriter.minMergeDocs attribute, Lucene does not find
>anything (we don't know why).
>When we change the value of the IndexWriter.minMergeDocs attribute and the file
>has a lot of pages, an OutOfMemory exception occurs.
>
>
>-----Original Message-----
>From: Xiaozheng Ma [mailto:Xiaozheng.Ma@redwood.com]
>Sent: Wednesday, November 17, 2004 3:59 PM
>To: Lucene Developers List
>Subject: RE: Queries Lucene 1.3
>
>
>I am a bit confused about whether the first problem is solved (i.e. the break
>point at 10).
>For the out-of-memory exception (OOME), you need to increase the JVM maximum
>memory size. If you use Tomcat 5, run tomcat5w.exe to reset this value (or do it
>by editing the registry, or, if you wish, change JAVA_OPTS in catalina.bat
>or catalina.sh).
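After changing the setting, it can be worth checking that the new limit actually reached the JVM. A minimal sketch, assuming the limit was raised with the usual `-Xmx` option (for example through JAVA_OPTS); `HeapCheck` is a hypothetical class name. `Runtime.maxMemory()` is part of the standard Java API since 1.4.

```java
/** Prints the maximum heap the running JVM will try to use. */
final class HeapCheck {
    private HeapCheck() {}

    /** Converts a byte count to whole megabytes (1024 * 1024 bytes), rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() may return Long.MAX_VALUE when the JVM has no inherent limit.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMegabytes(maxBytes) + " MB");
    }
}
```

Running the same check inside the web application (for example from a test servlet) shows the value Tomcat itself was started with, which may differ from a command-line run.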
>
>Hope this works.
>
>Xiaozheng 
> 
>
>-----Original Message-----
>From: PROYECTA.Fernandez Garcia, Ivan [mailto:proyecta.ifernandez@iberia.es]
>
>Sent: Wednesday, November 17, 2004 9:49 AM
>To: lucene-dev@jakarta.apache.org
>Subject: Queries Lucene 1.3
>
>Good afternoon everybody,
>
>	First of all, thanks for your attention.
>
>	We are using the Lucene 1.3 API to index and search text in PDF files.
>	We have two environments to develop in: Windows, using Apache
>Tomcat 5.0, and Sun Solaris, using Oracle Application Server.
>	First we extract the text of each page from the PDF file using the
>Multivalent API (this process seems to run O.K.).
>	Then we search for text in the newly created index. At this point we
>have the following problem:
>		- If the PDF file has 10 pages, the text is found.
>		- If the PDF file has more than 10 pages, the text is not
>found.
>	We modified the IndexWriter.minMergeDocs attribute, trying two values:
>the total number of pages in the document, and 1.
>	In both cases:
>		- if the document is not big, indexing seems to run O.K. and
>text search seems to run O.K.
>		- if the document is big (600 pages), indexing fails,
>raising an OutOfMemory exception.
>
>	We are sending you our source files, in which we index a PDF file and
>search the text, in case you can spot an error.
>	We don't know what else to do about this problem.
>	Can you help us, please?
>
>Thank you for your help.
>
> <<search_text.txt>>  <<index_lucene.txt>> 
>
>
>  
>
>>Iván Fernández García
>>Proyecta Sistemas de Información
>---
>Outgoing mail is certified Virus Free.
>Checked by AVG anti-virus system (http://www.grisoft.com).
>Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004
> 
>
>----------------------------------------------
>Has decidido el mejor precio.  Has decidido IBERIA.com 
>You´ve chosen the best price. You´ve chosen  IBERIA.com 
>----------------------------------------------
>http://www.iberia.com 
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>---
>Incoming mail is certified Virus Free.
>Checked by AVG anti-virus system (http://www.grisoft.com).
>Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004
> 
>

