lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From puffm...@darksleep.com (Steven J. Owens)
Subject Re: excluding files / refining search
Date Fri, 15 Feb 2002 01:41:20 GMT
Brian Rook <brian.rook@xor.com> writes:
> The site I'm working on has a lot of small html files that are used for page
> construction (nav bars, footers, etc) and they're being returned high in the
> results because they contain the search term(s) I'm looking for and are
> small so they rank higher than larger documents.
> 
> I want to exclude them from the index and I've come up with two ideas:
> 
> 1) move them to a directory, which I will exclude from the index, but I'll
> have to change a bunch of links
> 
> 2) detect them with some sort of flag and exclude them from the index.  We
> were thinking that we could have a fake tag that lucene would detect and not
> index those pages.

    Why not just have an exclude list of some sort?  In the code you
wrote to select files for indexing, just have it check against a list
of files you want to exclude.  In the demo application, you would edit

     jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java


     The quick and dirty method would be to edit this section of code:

       public static void indexDocs(IndexWriter writer, File file)
            throws Exception {
         if (file.isDirectory()) {
           String[] files = file.list();
           for (int i = 0; i < files.length; i++)
             indexDocs(writer, new File(file, files[i]));
         } else {
           System.out.println("adding " + file);
           writer.addDocument(FileDocument.Document(file));
         }
       }
     
     
     To something like this:
     
     
       public static void indexDocs(IndexWriter writer, File file)
            throws Exception {
         if (file.isDirectory()) {
           String[] files = file.list();
           for (int i = 0; i < files.length; i++)
             indexDocs(writer, new File(file, files[i]));
         } else {
           if (checkFileName(file)) {
             System.out.println("skipping " + file) ;
           } else {
             System.out.println("adding " + file);
             writer.addDocument(FileDocument.Document(file));
           }
         }
       }
     
       public static boolean checkFileName(File file) {
         String name = file.getName() ;
         if (name == "footer.html" || 
             name == "header.html" || 
             name == "menu.html" || 
             name == "navbar.html") {
     	return false ;
         } 
         return true ;
       }
     
     
     A more realistic implementation would use an "exclude file" of
filenames to ignore, load them into a collection (probably a HashSet)
and keep that collection around as an instance variable.  Then
checkFileName() just returns !excludedSet.contains(name).

Steven J. Owens
puff@darksleep.com

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message