lucene-java-user mailing list archives

From "Erik Hatcher" <li...@ehatchersolutions.com>
Subject Re: excluding files / refining search
Date Fri, 15 Feb 2002 15:32:38 GMT
Or it could be handled by the Ant index task I submitted:

    <index index="${build.dir}/index">
      <fileset dir="docs">
        <exclude name="**/header.html"/>
      </fileset>
    </index>


----- Original Message -----
From: "Steven J. Owens" <puffmail@darksleep.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, February 14, 2002 8:41 PM
Subject: Re: excluding files / refining search


> Brian Rook <brian.rook@xor.com> writes:
> > The site I'm working on has a lot of small html files that are used
> > for page construction (nav bars, footers, etc) and they're being
> > returned high in the results because they contain the search term(s)
> > I'm looking for and are small so they rank higher than larger
> > documents.
> >
> > I want to exclude them from the index and I've come up with two ideas:
> >
> > 1) move them to a directory, which I will exclude from the index, but
> > I'll have to change a bunch of links
> >
> > 2) detect them with some sort of flag and exclude them from the index.
> > We were thinking that we could have a fake tag that lucene would
> > detect and not index those pages.
>
>     Why not just have an exclude list of some sort?  In the code you
> wrote to select files for indexing, just have it check against a list
> of files you want to exclude.  In the demo application, you would edit
>
>      jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java
>
>
>      The quick and dirty method would be to edit this section of code:
>
>        public static void indexDocs(IndexWriter writer, File file)
>             throws Exception {
>          if (file.isDirectory()) {
>            String[] files = file.list();
>            for (int i = 0; i < files.length; i++)
>              indexDocs(writer, new File(file, files[i]));
>          } else {
>            System.out.println("adding " + file);
>            writer.addDocument(FileDocument.Document(file));
>          }
>        }
>
>
>      To something like this:
>
>
>        public static void indexDocs(IndexWriter writer, File file)
>             throws Exception {
>          if (file.isDirectory()) {
>            String[] files = file.list();
>            for (int i = 0; i < files.length; i++)
>              indexDocs(writer, new File(file, files[i]));
>          } else {
>            if (!checkFileName(file)) {
>              System.out.println("skipping " + file) ;
>            } else {
>              System.out.println("adding " + file);
>              writer.addDocument(FileDocument.Document(file));
>            }
>          }
>        }
>
>        // Returns true if the file should be indexed, false if it is excluded.
>        public static boolean checkFileName(File file) {
>          String name = file.getName() ;
>          // Compare with equals(); == only tests object identity for Strings.
>          if (name.equals("footer.html") ||
>              name.equals("header.html") ||
>              name.equals("menu.html") ||
>              name.equals("navbar.html")) {
>            return false ;
>          }
>          return true ;
>        }
>
>
>      A more realistic implementation would use an "exclude file" of
> filenames to ignore, load them into a collection (probably a HashSet)
> and keep that collection around as an instance variable.  Then
> checkFileName() just returns !excludedSet.contains(name).
>
> Steven J. Owens
> puff@darksleep.com
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>
>
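
For reference, the exclude-file approach described at the end of the quoted message (load the names into a HashSet, keep it around, and have checkFileName() return !excludedSet.contains(name)) could be sketched roughly as below. This is only an illustration under assumptions: the class name ExcludeList, the load() method, and the one-name-per-line file format with '#' comments are hypothetical and do not come from the Lucene demo code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: holds the file names that should be left out of the index. */
class ExcludeList {
    private final Set<String> excludedSet = new HashSet<>();

    /**
     * Assumed file format: one file name per line.
     * Blank lines and lines starting with '#' are ignored.
     */
    static ExcludeList load(Path excludeFile) throws IOException {
        ExcludeList list = new ExcludeList();
        for (String line : Files.readAllLines(excludeFile)) {
            String name = line.trim();
            if (!name.isEmpty() && !name.startsWith("#")) {
                list.excludedSet.add(name);
            }
        }
        return list;
    }

    /** Returns true if the file is OK to index, false if its name is excluded. */
    boolean checkFileName(File file) {
        return !excludedSet.contains(file.getName());
    }
}
```

The indexing loop would then load the list once before walking the directory tree and skip any file for which checkFileName() returns false. Matching is on the bare file name only, so a header.html in any directory is skipped.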

