forrest-dev mailing list archives

From g4 <>
Subject Re: meta thoughts
Date Thu, 21 Aug 2003 12:07:22 GMT

On Thursday, Aug 21, 2003, at 10:45 Europe/London, <> wrote:

> Jason,
>> As you can see there are some words that shouldn't be there (these,
>> makes, etc...). So I think managing keywords words by frequency is not
>> really the way to go with something like this, a definitive list of
>> excluded words would be needed, this would also have the benefit of
>> being accessible and manageable.  I will continue with this anyway, at
>> least I'm getting to know awk ;)
>> Jason Lane
> What about if we get the keywords only from <section> ?

That sounds about right :) If you look at my previous posts regarding
the treatment of the header, you'll see I've been pulling the <meta
type="description"/> content from the page titles, although I have
created an additional <tagline>. This is cool because titles
tend to be descriptive anyway, no?
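
Restricting the keyword scan to <section> content could be sketched roughly as below. This is only an illustration: the function name section_text is made up, and it assumes each <section> and </section> tag sits whole on a single line of the source document. It prints every line that falls inside at least one section, so the result can be piped into a word counter.

```shell
#!/bin/sh
# Hypothetical helper: section_text FILE
# Prints only the lines of FILE that lie inside a <section> element.
section_text() {
    awk '
        {
            line = $0
            # gsub returns the number of matches; "&" puts each match back unchanged.
            opens  = gsub(/<section[ >]/, "&", line)
            closes = gsub(/<\/section>/, "&", line)
            depth += opens            # entering one or more (possibly nested) sections
            if (depth > 0) print      # this line touches section content
            depth -= closes           # leaving sections closed on this line
        }
    ' "$1"
}
```

Something like `section_text index.xml` would then feed the keyword script instead of the whole page (index.xml is just a placeholder name).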

Keywords would, to my mind, have to be treated differently. I haven't
looked yet, but I know I have to familiarise myself more with the inner
workings of the build system: if a gawk script was going to be run, where
would it hook into the build system?

The other point that I think is important: if we are to exclude all
common words from keyword generation (with, to, and, always,
happen, etc.), this also needs to be accessible to the user. So a
common exclusion list needs to be in a separate text file that can be
easily modified.
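
As a rough sketch of how such an exclusion file could be used (not what the build does today): the function name keywords and the one-word-per-line file layout are assumptions for illustration. The list is read once in BEGIN, and any word found in it is skipped while counting.

```shell
#!/bin/sh
# Hypothetical helper: keywords STOPFILE INPUT
# Prints "word<TAB>count", most frequent first, skipping words listed in STOPFILE.
keywords() {
    awk -v stopfile="$1" '
        BEGIN {
            # Load the user-editable exclusion list, one word per line.
            while ((getline line < stopfile) > 0)
                stop[tolower(line)] = 1
            close(stopfile)
        }
        {
            $0 = tolower($0)                 # remove case distinctions
            gsub(/[^a-z0-9_ \t]/, " ", $0)   # remove punctuation
            for (i = 1; i <= NF; i++)
                if (!($i in stop))
                    freq[$i]++
        }
        END {
            for (word in freq)
                printf "%s\t%d\n", word, freq[word]
        }
    ' "$2" | sort -k 2nr
}
```

It would be called as `keywords exclude.txt forrest.html`, where exclude.txt is a placeholder name for the user's list. Keeping the list outside the script means users can edit it without touching any awk.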

I'm trying to find out more about igawk for this, i.e. @include in
gawk. igawk is actually a shell script that comes as part of the gawk
distribution.

Here is a quick test I did using gawk-3.1.0 (I think gawk is common
enough on most systems; it should be part of Cygwin?)

I run this like so:

g4% sudo /sw/bin/gawk-3.1.0 -f strip forrest.html

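#! /bin/awk -f

{
     $0 = tolower($0) # remove case distinctions
     gsub(/<[^>]*>/, " ", $0) # Strip XML tags (single-line tags only)
     gsub(/[^a-z0-9_ \t]/, " ", $0) # remove punctuation
     for (i = 1; i <= NF; i++)
       freq[$i]++ # count each word
}

END {
     sort = "sort -k 2nr" # sort by count, highest first
     for (word in freq)
       if ((length(word) > 4) && (freq[word] > 0)) {
         printf "%s\t%d\n", word, freq[word] | sort
       }
     close(sort) # flush the sorted list before the timestamp
     # timestamp
     now = systime()
     mesg = strftime("Process ended at %a/%b/%Y %H:%M:%S", now)
     print mesg
}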

I'll try and devote more time to this ASAP, but I'd like to hear
your feedback.


P.S. I've only tested this on OS X at the moment.

> Cheers
> Cheche
Jason Lane

Root10 developments
