forrest-dev mailing list archives

From g4 <ja...@root10.net>
Subject Re: meta thoughts
Date Thu, 21 Aug 2003 12:07:22 GMT

On Thursday, Aug 21, 2003, at 10:45 Europe/London, <cheche@che-che.com>  
wrote:

> Jason,
>>
>> As you can see there are some words that shouldn't be there (these,
>> makes, etc...). So I think managing keywords by frequency is not
>> really the way to go with something like this, a definitive list of
>> excluded words would be needed, this would also have the benefit of
>> being accessible and manageable.  I will continue with this anyway, at
>> least I'm getting to know awk ;)
>>
>> Jason Lane
>>
>
> What about if we get the keywords only from <section> ?

That sounds about right :) If you look at my previous posts about the
treatment of the header, you'll see I've been pulling the <meta
type="description"/> content from the page titles, although I have also
created an additional <tagline>. This is cool because titles tend to be
descriptive anyway, no?

Keywords would, to my mind, have to be treated differently. I haven't
looked yet, but I know I have to familiarise myself more with the inner
workings of the build system: if a gawk script were going to be run,
where would it hook into the build system?

The other point that I think is important: if we are to exclude all
common words from keyword generation (with, to, and, always, happen,
etc.), the list also needs to be accessible to the user. So a common
exclusion list needs to be in a separate text file that can be easily
modified.
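
A minimal sketch of how such an external exclusion list could work,
assuming a hypothetical stop-word file with one word per line. The
function name count_keywords and the file names are illustrative only;
nothing here is part of Forrest. Note that the NR == FNR idiom assumes
the stop-word file is not empty.

```shell
#!/bin/sh
# Sketch only: count_keywords and its arguments are hypothetical names.
# $1 = stop-word file (one word per line), $2 = document to scan
count_keywords() {
    awk '
        # First file on the command line: load the exclusion list.
        NR == FNR { stop[tolower($1)] = 1; next }

        # Second file: normalise each line and count the remaining words.
        {
            $0 = tolower($0)                      # remove case distinctions
            gsub(/<[^>]*>|[^a-z0-9_ \t]/, " ")    # strip tags and punctuation
            for (i = 1; i <= NF; i++)
                if (length($i) > 4 && !($i in stop))
                    freq[$i]++
        }

        END {
            for (word in freq)
                printf "%s\t%d\n", word, freq[word]
        }
    ' "$1" "$2" | sort -k2,2nr -k1,1    # highest count first, then by word
}
```

It would be run as: count_keywords stopwords.txt forrest.html. Because
words are compared whole against the list, excluding "page" leaves
"pages" untouched, and users can edit the text file without going near
the awk code.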

I'm trying to find out more about igawk for this, i.e. @include
directives in gawk. igawk is actually a shell script that comes as part
of the gawk distro.

Here is a quick test I did using gawk-3.1.0 (I think gawk is common
enough on most systems; it should be part of Cygwin?)

I run this like so:

g4% sudo /sw/bin/gawk-3.1.0 -f strip forrest.html

#! /bin/awk -f
BEGIN {
     # Words to exclude, joined into a single alternation. The string is
     # split across lines only so that it survives mail wrapping.
     stop = "about|addition|which|almost|amongst|another|anything|" \
            "appreciate|available|aware|before|body|both|buget|bugs|built|" \
            "carefully|changes|changelogs|comparable|complete|considered|" \
            "constantly|contents|copy|copyright|cost|define|dirty|drop|edit|" \
            "effort|ensuring|execwb|expand|fairly|forms|function|generic|" \
            "goals|gone|gradually|hence|historical|however|ideal|important|" \
            "include|incompatible|index|intention|involved|irregularly|" \
            "issues|just|know|known|letting|license|login|manually|message|" \
            "mind|menus|much|name|navigate|need|needed|neutral|notably|note|" \
            "notes|observed|once|other|over|page|particularly|path|platform|" \
            "populate|possibilities|possible|prior|process|processes|" \
            "produces|prompting|quality|quick|quite|rapid|rapidly|rather|" \
            "read|related|release|releases|requires|reserved|resources|" \
            "retaining|review|rights|said|search|separating|series|setting|" \
            "shelved|shifted|should|simple|simply|some|something|soon|" \
            "source|start|startup"
     # Excluded words, XML tags and punctuation all become a space
     strip = stop "|<[^>]*>|[^a-z0-9_ \t]"
}

{
     $0 = tolower($0) # remove case distinctions
     gsub(strip, " ", $0) # Strip excluded words, XML & punctuation
     for (i = 1; i <= NF; i++)
         freq[$i]++
}

END {
     # Sort on the count column, highest first ("sort +1 -nr" is the
     # obsolete spelling and is rejected by newer versions of sort)
     sort = "sort -k2,2nr"
     for (word in freq)
       if ((length(word) > 4) && (freq[word] > 0)) {
         printf "%s\t%d\n", word, freq[word] | sort
       }
     close(sort)
     # timestamp
     now = systime()
     mesg = strftime("Process ended at %a/%b/%Y %H:%M:%S", now)
     print mesg
}


I'll try to devote more time to this ASAP, but I'd like to hear your
feedback.

Cheers

P.S. I've only tested this on OS X at the moment.

>
>
> Cheers
> Cheche
>
>
>
Jason Lane

Developer
Root10 developments

