lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Words not found, large file indexing
Date Fri, 09 Mar 2007 17:25:47 GMT

Are you perhaps exceeding this limit? Terms beyond the maximum field length are not indexed.

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
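
To illustrate the effect, here is a minimal, Lucene-free sketch of what a per-field term limit does to a long document. It is not Lucene's implementation: the class MaxFieldLengthSketch and the helpers tokenize() and isIndexed() are hypothetical names, the tokenizer only approximates SimpleAnalyzer (split on non-letters, lowercase), and the limit of 10,000 is assumed to match IndexWriter's default in Lucene releases of this period.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
class MaxFieldLengthSketch {

    // Rough stand-in for SimpleAnalyzer: break on non-letters, lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True if `term` appears among the first maxFieldLength tokens,
    // i.e. if it would survive a truncating indexer.
    static boolean isIndexed(String text, String term, int maxFieldLength) {
        List<String> tokens = tokenize(text);
        int limit = Math.min(tokens.size(), maxFieldLength);
        return tokens.subList(0, limit).contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        // Build a document whose interesting word is token number 10,001.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("filler ");
        }
        sb.append("evaluators");
        String text = sb.toString();

        System.out.println(tokenize(text).size());                             // 10001
        System.out.println(isIndexed(text, "evaluators", 10000));              // false
        System.out.println(isIndexed(text, "evaluators", Integer.MAX_VALUE));  // true
    }
}
```

If this is the cause, the fix in the real code would be to raise the limit on the writer before adding the document, for example by calling writer.setMaxFieldLength(Integer.MAX_VALUE) ahead of writer.addDocument(doc). Whether that setter is available depends on the Lucene version in use, so check the javadoc linked above.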


: Date: Fri, 09 Mar 2007 12:14:38 -0500
: From: "Walker, Keith 1" <keith.1.walker@lmco.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Words not found, large file indexing
:
: I'm having problems with queries not returning a hit when a document
: does in fact have those terms.  (I'm not worried about the ranking, just
: whether or not it's a hit.)
:
: Is anything wrong with the query syntax? (see below)  Also, words in the
: document's index (not the Lucene index) seemed less likely to be
: recognized.   I'm also wondering if anyone's run into problems with
: large files, since the one I'm using is 161MB, but boils down to 472KB
: as text.  The smaller file had no problems.
:
: Thanks for any advice,
: Keith
:
: Here are some of my test results on 2 different documents, with the test
: code below.
: Query                                                    Location of words in document (src: Acrobat)   Test 2
:
: http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)
: +content:("Research-based")                              310 instances                                  positive
: +content:("Organize Information Clearly")                4 instances                                    positive
:
: +content:("partitioning")                                3 instances                                    negative
: +content:("distinguishing required")                     1 instance in index                            negative
:
: +content:("evaluators")                                  14 instances                                   negative
: +content:("distinguishing required" AND "evaluators")    (see above)                                    negative
: +content:("partitioning" AND "evaluators")               (see above)                                    negative
:
:
: automatic_format_identification.pdf (566KB, 53KB as text)  v. 1 (not the latest)
: +content:("tentative")                                   several instances                              positive
: +content:("tentative hits")                              several instances                              positive
: +content:("tentative" AND "hits")                        several instances                              positive
:
: +content:("tentative hits" AND "identification")         several instances                              positive
:
:
: public static void testLuceneIndexing() throws EraException, IOException, ParseException {
:     File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
:     String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
:     File file = new File(filename);
:     if (indexDir.exists()) {
:         deleteDirectory(indexDir);
:     }
:     IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
:     Document doc = new Document();
:     doc.add(Field.Text("content", new FileReader(file)));
:     doc.add(Field.Keyword("filename", file.getCanonicalPath()));
:     System.out.println("before addDocument()");
:     long start = System.currentTimeMillis();
:     writer.addDocument(doc);
:     System.out.println("# docs indexed: " + writer.docCount());
:     writer.optimize();
:     writer.close();
:     System.out.println("Done indexing.  Duration(ms): " + (System.currentTimeMillis() - start));
:
:     IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());
:
:     Query luceneQuery = null;
:
:     luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body", new SimpleAnalyzer());
:     System.out.println("Query= " + luceneQuery.toString("body"));
:
:     Hits hits = search.search(luceneQuery);
:     int resultLength = hits.length();
:     System.out.println("hit result = " + resultLength);
: }
:
:



-Hoss



