lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: skip document header while indexing
Date Fri, 29 Apr 2005 13:50:52 GMT
On Apr 29, 2005, at 8:30 AM, Pablo Gomes Ludermir wrote:
> Could you give me some pointers (example or website) to how I could do 
> that?

Lucene's own source code has several analyzers that are worth 
investigating.  We also include several in Lucene in Action that 
demonstrate additional features like incorporating synonym lookup with 
WordNet and metaphone (soundex-like) replacements.  Visit 
http://www.lucenebook.com to grab the source code download.

The trick would be to add a TokenFilter that drops tokens until the 
first N have been discarded, and then passes the rest through unchanged.
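
A minimal sketch of that skip-the-first-N idea follows. To keep it 
runnable without Lucene on the classpath it uses stand-in types: 
SimpleTokenStream, WordStream, SkipFirstTokensFilter and SkipDemo are 
all hypothetical names, and tokens are plain Strings. The only thing 
borrowed from Lucene is the shape of TokenStream.next(), which returns 
null at the end of the stream. In real code the same counting logic 
would presumably go into a subclass of 
org.apache.lucene.analysis.TokenFilter that wraps its input stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Lucene-style token stream:
// next() returns the next token, or null once the stream is exhausted.
interface SimpleTokenStream {
    String next();
}

// Stand-in source that simply emits the words it was given, in order.
class WordStream implements SimpleTokenStream {
    private final Iterator<String> words;

    WordStream(String... words) {
        this.words = Arrays.asList(words).iterator();
    }

    public String next() {
        return words.hasNext() ? words.next() : null;
    }
}

// Drops the first skipCount tokens of the wrapped stream, then passes
// every later token through untouched.
class SkipFirstTokensFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private int remainingToSkip;

    SkipFirstTokensFilter(SimpleTokenStream input, int skipCount) {
        this.input = input;
        this.remainingToSkip = Math.max(0, skipCount);
    }

    public String next() {
        // Discard tokens until the skip budget is used up.
        while (remainingToSkip > 0) {
            remainingToSkip--;
            if (input.next() == null) {
                // The stream ended inside the header: nothing left to emit.
                remainingToSkip = 0;
                return null;
            }
        }
        return input.next();
    }
}

class SkipDemo {
    // Collects every remaining token of a stream into a list.
    static List<String> drain(SimpleTokenStream stream) {
        List<String> tokens = new ArrayList<String>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        SimpleTokenStream source =
            new WordStream("From:", "a", "Subject:", "b", "body", "text");
        // Skip the four "header" tokens; prints [body, text]
        System.out.println(drain(new SkipFirstTokensFilter(source, 4)));
    }
}
```

The count is kept per filter instance, so a fresh filter has to be 
created for each document or field. An analyzer that builds its filter 
chain on every tokenStream() call gets that for free.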

For an example, here's the Analyzer I wrote for the lucenebook.com site:

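import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;

// LiaTokenizer, DashDashFilter, HyphenatedFilter, DashSplitterFilter and
// the LengthFilter(int, TokenStream) used below are not shown here; they
// appear to be custom classes from the lucenebook.com site code.
public class LiaAnalyzer extends Analyzer {
   private Set stopSet;
   boolean stem = true;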

   public LiaAnalyzer() {
     stopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

     // just a few words that would not be queried on
     stopSet.add("isn");
     stopSet.add("xyz");
     stopSet.add("bcd");
     stopSet.add("blt");
     stopSet.add("dhb");
     stopSet.add("ttc");
     stopSet.add("you");
     stopSet.add("our");
   }

   public LiaAnalyzer(boolean stem) {
     this();
     this.stem = stem;
   }

   public TokenStream tokenStream(String fieldName, Reader reader) {
     TokenFilter filter = new DashSplitterFilter(
               new HyphenatedFilter(
                 new DashDashFilter(
                   new LiaTokenizer(reader))));

     filter = new LengthFilter(3, filter);
     filter = new StopFilter(filter, stopSet);

     if (stem) {
       filter = new SnowballFilter(filter, "English");
     }

     return filter;
   }
}

	Erik


>
> On 4/29/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>>
>> On Apr 29, 2005, at 7:50 AM, Pablo Gomes Ludermir wrote:
>>
>>> Hello all,
>>>
>>> Is it possible to skip the first "xx" words while indexing a 
>>> document?
>>> For instance, in the code below, I would like to skip the first "xx"
>>> words of "file" in the "CONTENTS_FIELD". Is that possible?
>>>
>>> Document doc = new Document();
>>> FileInputStream is = new FileInputStream(file);
>>> Reader reader = new BufferedReader(new InputStreamReader(is));
>>> doc.add(Field.Text(PATH_FIELD, artifactModel));
>>> doc.add(Field.Text(CONTENTS_FIELD, reader, true));
>>
>> I believe your best bet will be to put in a custom Analyzer that does
>> this.  It wouldn't be too hard to code a wrapper around an analyzer
>> that did this.
>>
>>        Erik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> -- 
> Pablo Gomes Ludermir
> gomesp@gmail.com
>

