perl-docs-dev mailing list archives

From Bill Moseley <mose...@hank.org>
Subject Re: [Fwd: Re: perl filters for swish-e]
Date Thu, 21 Feb 2002 13:36:31 GMT
At 06:14 PM 2/21/2002 +0800, Stas Bekman wrote:
>hmm, I read in swish docs that you cannot index ':'.

That might have been true once, but I never found a reason for that.  (And
it should not be in the 2.2 docs - is it?)

~/swish-e/src > cat c
wordcharacters $|abcdefghi:
begincharacters $|abcdefghi:
endcharacters $|abcdefghi:

~/swish-e/src > cat 1
a b c d $| abc:def

~/swish-e/src > ./swish-e -i 1 -T indexed_words -v0 -c c
Indexing Data Source: "File-System"
    Adding:[1:swishdefault(1)]   'a'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'b'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'c'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'd'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '$|'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'abc:def'   Pos:6  Stuct:0x1 ( FILE )
Indexing done!

~/swish-e/src > ./swish-e -w '$|' -H0
1000 1 "1" 19

So I can search for "$|".

>I guess I wasn't clear to myself and others about what I meant by 
>searching for Perl code. I don't care much about searching the code 
>sections per se. I care much more about Perl strings found in the text. So I 
>want Apache::Registry to be found and I want $| to be found.
>
>If I understand correctly if I search for a sub-pattern it'll be found, 
>right? So if I search for $|, I'll find $| and $|++, no?

No.  Swish generates a reverse (inverted) index, so it must tokenize the
text into words.  Swish does create two types of indexes, so that you can do
wildcard searches.  So:

    $|* (where * is a wildcard operator) will find $|, $|++ and so on.

That's not a substring search, though; it finds words that start with $|.
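
To make the word-versus-substring distinction concrete, here is a minimal Python sketch of a reverse index with a trailing-wildcard lookup. It is not swish-e's implementation; the names (WORD_CHARS, tokenize, build_index, search) and the sample documents are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a word is any run of these characters, loosely mirroring the
# "wordcharacters" setting in the config shown earlier.
WORD_CHARS = "$|+abcdefghijklmnopqrstuvwxyz:"
TOKEN_RE = re.compile("[" + re.escape(WORD_CHARS) + "]+")


def tokenize(text):
    """Split text into lowercase words made only of WORD_CHARS."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Exact word lookup, or prefix lookup when the query ends in '*'."""
    if query.endswith("*"):
        prefix = query[:-1]
        hits = set()
        for word, doc_ids in index.items():
            if word.startswith(prefix):
                hits |= doc_ids
        return hits
    return set(index.get(query, set()))


# Hypothetical sample documents.
docs = {
    1: "set $| to flush output",
    2: "or write $|++ at the top",
    3: "see apache::registry",
}
index = build_index(docs)
```

With these documents, search(index, "$|") matches only document 1, while search(index, "$|*") also picks up the word "$|++" in document 2. A query for "registry*" matches nothing, because the only indexed word containing it is "apache::registry", which does not start with "registry".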

That's why grep would work better in some situations.  But, as I mentioned
before, grep can also be less effective in some cases.

But if you want one search to find both text and Perl code then you run
into trouble: with text you want to remove punctuation to make searching
work correctly, but those same punctuation characters are part of the Perl
code that you want to index.

>Therefore we want most if not all chars to be indexed. Or at least 
>$%@:-> (search for '$r->args' should be successful).
>
>And we don't want to search for Apache AND Registry, nor Apache OR 
>Registry when I ask for Apache::Registry. I think that's what most 
>people will expect without knowing the internals of the search engine.

Apache::Registry is a bit easier than Perl code, because you can say that
":" is OK in the middle of a word, but not at the end.  Then

    "see Apache::Registry"  -- is indexed as a single word, but

    "rules for using foo:"  -- "foo" is indexed without ":"

But Perl code is not that simple.
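
The ":" rule above can be sketched in Python as follows. This is a simplified model of word, begin and end character sets, not swish-e's actual tokenizer; the constant and function names are assumptions for illustration only.

```python
import re
import string

LETTERS = string.ascii_lowercase
WORD_CHARS = LETTERS + ":"   # ':' may appear inside a word
BEGIN_CHARS = LETTERS        # but a word may not start with ':'
END_CHARS = LETTERS          # and may not end with ':'
TOKEN_RE = re.compile("[" + re.escape(WORD_CHARS) + "]+")


def tokenize(text):
    """Return lowercase words, trimming characters not allowed at the edges."""
    words = []
    for raw in TOKEN_RE.findall(text.lower()):
        start, end = 0, len(raw)
        # Drop leading characters that may not begin a word.
        while start < end and raw[start] not in BEGIN_CHARS:
            start += 1
        # Drop trailing characters that may not end a word.
        while end > start and raw[end - 1] not in END_CHARS:
            end -= 1
        if start < end:
            words.append(raw[start:end])
    return words
```

Under these assumptions, tokenize("see Apache::Registry") keeps "apache::registry" as one word, while tokenize("rules for using foo:") yields "foo" without the trailing colon.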

Now, the advantage of NOT including ":" in words is that you can then
search for "registry" and find places where it's "Apache::Registry"
(because that's indexed as two words), and you can still use a phrase
search and find only places where "registry" follows right after "apache",
which would typically find Apache::Registry.  That's more flexible.
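
As a rough illustration of that trade-off, the Python sketch below indexes word positions with ":" treated as a separator, then answers both a single-word search and a phrase search. It is only a sketch under those assumptions; the function names and sample documents are hypothetical, and swish-e's real index format is different.

```python
import re
from collections import defaultdict

# Assumption: only letters are word characters, so "::" splits words.
TOKEN_RE = re.compile(r"[a-z]+")


def build_positional_index(docs):
    """Map word -> doc id -> list of 1-based word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(TOKEN_RE.findall(text.lower()), start=1):
            index[word][doc_id].append(pos)
    return index


def word_search(index, word):
    """Return ids of documents containing the word."""
    return set(index.get(word.lower(), {}))


def phrase_search(index, phrase):
    """Return ids of documents where the phrase words appear consecutively."""
    words = TOKEN_RE.findall(phrase.lower())
    if not words:
        return set()
    hits = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for pos in positions:
            # Every later word must sit exactly i positions after the first.
            if all(pos + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], start=1)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {
    1: "Run the script under Apache::Registry.",
    2: "The registry for Apache modules is elsewhere.",
    3: "Registry scripts are cached.",
}
index = build_positional_index(docs)
```

Here word_search(index, "registry") finds all three documents, including the one that only mentions Apache::Registry, while phrase_search(index, "apache::registry") narrows the result to document 1.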

The problem is teaching people how to search.  Nobody would expect to
search for (with quotes)  "apache::registry".  I'm trying to modify swish
so that ranking is adjusted for how close together the words are, so a
multi-word search (such as [apache registry]) would rank the phrases
highest.
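
One way to picture proximity-adjusted ranking is the Python sketch below. The scoring formula is an assumption invented for illustration, not the change being made to swish: adjacent query words score 1.0, and the score falls as the words move apart.

```python
import re

TOKEN_RE = re.compile(r"[a-z]+")


def positions(text):
    """Map each word to the list of its 1-based positions in text."""
    found = {}
    for pos, word in enumerate(TOKEN_RE.findall(text.lower()), start=1):
        found.setdefault(word, []).append(pos)
    return found


def proximity_score(text, query):
    """Hypothetical score: (number of word pairs) / (sum of smallest gaps).

    Returns 0.0 if the query has fewer than two words or any word is missing.
    """
    words = TOKEN_RE.findall(query.lower())
    found = positions(text)
    if len(words) < 2 or any(w not in found for w in words):
        return 0.0
    total_gap = 0
    for left, right in zip(words, words[1:]):
        gap = min(abs(b - a) for a in found[left] for b in found[right])
        total_gap += max(1, gap)  # guard against a repeated query word
    return (len(words) - 1) / total_gap


# Hypothetical sample documents.
docs = {
    "a": "Use Apache::Registry for scripts",
    "b": "Apache keeps a module registry",
    "c": "Apache only",
}
ranked = sorted(docs, key=lambda d: proximity_score(docs[d], "apache registry"),
                reverse=True)
```

With this formula the document containing Apache::Registry ranks first for the unquoted query [apache registry], without the user having to know about phrase searches.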



Bill Moseley
mailto:moseley@hank.org
