lucene-java-user mailing list archives

From "Walker, Keith 1" <keith.1.wal...@lmco.com>
Subject Words not found, large file indexing
Date Fri, 09 Mar 2007 17:14:38 GMT
I'm having problems with queries not returning a hit when a document
does in fact have those terms.  (I'm not worried about the ranking, just
whether or not it's a hit.)

Is anything wrong with the query syntax (see below)?  Also, words that
appear in the document's own index section (not the Lucene index) seemed
less likely to be found.  I'm also wondering whether anyone has run into
problems with large files: the one I'm using is a 161MB PDF that boils
down to 472KB of text.  The smaller file had no problems.

Thanks for any advice,
Keith

Here are some of my test results on 2 different documents, with the test
code below.
query                                                  location of words in document (src: Acrobat)  Test 2

http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)
+content:("Research-based")                            310 instances                                 positive
+content:("Organize Information Clearly")              4 instances                                   positive
+content:("partitioning")                              3 instances                                   negative
+content:("distinguishing required")                   1 instance in index                           negative
+content:("evaluators")                                14 instances                                  negative
+content:("distinguishing required" AND "evaluators")  (see above)                                   negative
+content:("partitioning" AND "evaluators")             (see above)                                   negative

automatic_format_identification.pdf (566KB, 53KB as text), v. 1 (not the latest)
+content:("tentative")                                 several instances                             positive
+content:("tentative hits")                            several instances                             positive
+content:("tentative" AND "hits")                      several instances                             positive
+content:("tentative hits" AND "identification")       several instances                             positive


public static void testLuceneIndexing() throws EraException, IOException, ParseException {
    File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
    String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
    File file = new File(filename);
    if (indexDir.exists()) {
        deleteDirectory(indexDir);
    }
    IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("content", new FileReader(file)));
    doc.add(Field.Keyword("filename", file.getCanonicalPath()));
    System.out.println("before addDocument()");
    long start = System.currentTimeMillis();
    writer.addDocument(doc);
    System.out.println("# docs indexed: " + writer.docCount());
    writer.optimize();
    writer.close();
    System.out.println("Done indexing.  Duration(ms): " + (System.currentTimeMillis() - start));

    IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());

    Query luceneQuery = null;

    luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body", new SimpleAnalyzer());
    System.out.println("Query= " + luceneQuery.toString("body"));

    Hits hits = search.search(luceneQuery);
    int resultLength = hits.length();
    System.out.println("hit result = " + resultLength);
}
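
A rough way to check whether position in the file matters is to count how
many tokens precede the first occurrence of each term that fails.  The
sketch below is not part of the test code above; it uses only the JDK and
approximates SimpleAnalyzer's tokenization (runs of letters, lowercased).
The class name TermPositions and the method firstPosition are hypothetical,
and it handles single-word terms only.  It may be relevant because
IndexWriter by default indexes only the first 10,000 terms of a field (its
maxFieldLength setting), so a term that first appears after that point
would not be searchable unless the limit is raised.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Lucene.
final class TermPositions {

    /**
     * Returns the 0-based token position of the first occurrence of a
     * single-word term, or -1 if the term never occurs.  Tokens are runs
     * of letters, lowercased (an approximation of SimpleAnalyzer).
     */
    static int firstPosition(String text, String term) {
        String wanted = term.toLowerCase();
        StringBuilder token = new StringBuilder();
        int position = 0;
        // Iterate one past the end so the final token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetter(c)) {
                token.append(Character.toLowerCase(c));
            } else if (token.length() > 0) {
                if (token.toString().equals(wanted)) {
                    return position;
                }
                position++;
                token.setLength(0);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java TermPositions <text file> <term>...");
            return;
        }
        // Platform default charset, matching what FileReader would use.
        String text = new String(Files.readAllBytes(Paths.get(args[0])));
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> first token position "
                    + firstPosition(text, args[i]));
        }
    }
}
```

Run against the extracted text (for example with the terms partitioning
and evaluators), positions at or above 10,000 for the failing terms would
point at the field-length limit rather than at the query syntax.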

