How to write a search engine in 9 lines of Shell

The following CGI script is a fully working search engine for your web pages:


#!/bin/sh

echo "Content-type: text/html"
echo

echo '<html> <head> <title> Search results </title> </head> <body>'

argument=`echo "$QUERY_STRING" | sed "s|q=||"`

cd /users/homes/me/public_html

echo '<pre>'
grep -i -e "$argument" *html */*html | sed -e 's|<|\&lt;|g' -e 's|>|\&gt;|g'
echo '</pre> </body> </html>'

Notes:

  1. The "sed" command is there because the lines returned by grep contain HTML tags, which your browser would otherwise interpret. To print the tags rather than interpret them, the search engine pipes the results through sed, which converts every < character to &lt; and every > to &gt;.

  2. q= assumes that your input variable is called "q" in the HTML form that sends data to this CGI script.

  3. Your web directories need to be readable by the user that the CGI script runs as (often the web server's own user) for the wildcard to work.
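
To see what the sed filter from note 1 does on its own, here is a small sketch. The function name escape_html and the sample line are made up for this illustration; only the two substitutions come from the script above.

```shell
#!/bin/sh

# Same two substitutions as in the search script: turn < and > into
# entities so the browser prints the tags instead of interpreting them.
# (escape_html is a hypothetical name, not part of the original script.)
escape_html() {
    sed -e 's|<|\&lt;|g' -e 's|>|\&gt;|g'
}

# A line such as grep might return for a match in index.html:
echo 'index.html:<h1>My <b>home</b> page</h1>' | escape_html
# prints: index.html:&lt;h1&gt;My &lt;b&gt;home&lt;/b&gt; page&lt;/h1&gt;
```

A CGI script takes its input from the environment, so the search script itself can also be tried by hand without a web server, for example by setting QUERY_STRING to q=home on the command line before running it.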




Further enhancements you might make:

  1. Some extra security would be wise, e.g. process the argument with a C++ program before passing it to grep, check your PATH, etc.

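  2. Consider also what happens when there are spaces in the argument (multiple search words): a form sends each space as a + in QUERY_STRING, so the script has to turn it back into a space before calling grep.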

  3. If you have more than 2 levels of web pages you may write them out explicitly as */*/*html etc., or get a recursive grep, or use a recursive find first to build the filespec:
    cd /users/homes/me/public_html
    
    filespec=`find . -type f -name "*html" | tr '\n' ' '`
    
    grep -i "$argument" $filespec
    
    Since each search uses the same file list, it would be more efficient to build the list once, cache it in a file (filelist.txt here), and then read it back on each search:
    read filespec < filelist.txt
    
    grep -i "$argument" $filespec
    
    (I hope you realise that a heavy-duty search engine would go further and pre-index all the files in advance, rather than grep-ing them on the spot. But simple grep is alright for a personal website.)

  4. You might of course like to tidy up the output, in particular so that someone can actually click on the page(s) returned.

  5. The pages are not ranked in order of relevance, but only in the order in which grep finds them. How would you solve this?
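
One possible answer to the last two points, as a rough sketch rather than a finished design: count the matching lines in each file with grep -c, sort by that count, and print each file name as a link. The number of matching lines is only a crude stand-in for relevance. The function name rank_pages is invented for this example, and file names are assumed to contain no colons, spaces or HTML special characters.

```shell
#!/bin/sh

# rank_pages PATTERN FILE...
# Print one HTML link per matching file, most matching lines first.
# (rank_pages is a hypothetical helper, not part of the original script.)
rank_pages() {
    pattern=$1
    shift
    # The extra /dev/null makes grep prefix every count with "file:",
    # even when only one real file is given.
    grep -ic -e "$pattern" "$@" /dev/null |
        grep -v ':0$' |          # drop files with no matching lines
        sort -t: -k2,2nr |       # highest count first
        while IFS=: read -r file count
        do
            echo "<a href=\"$file\">$file</a> ($count)"
        done
}
```

In the search script, the grep line might then become something like rank_pages "$argument" *html */*html, without the sed filter, since this output is meant to be interpreted by the browser.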



But the principle is that in Shell you can rustle up a quick search engine for your personal pages, or any subset of them, in an afternoon. For example, my own search engine, in about 55 lines of Shell (with a C++ input pre-processor for security), has all of the above enhancements and searches only a carefully defined subsection of my site.
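
The C++ pre-processor is not shown here. As a rough idea of the kind of cleaning such a step might do, here is a pure Shell stand-in. It is an illustration under my own assumptions, not the program mentioned above. It keeps only the value of q, turns each + that the form sent in place of a space back into a space, and then throws away every character except letters, digits and spaces, so that nothing grep or the shell treats specially gets through. The name clean_query and the choice of allowed characters are invented for this example, and %XX escapes are not decoded (the % is simply dropped).

```shell
#!/bin/sh

# clean_query QUERY_STRING
# Print a harmless search string extracted from a query like "q=two+words".
# (clean_query is a hypothetical name; the whitelist is an assumption.)
clean_query() {
    printf '%s\n' "$1" |
        sed -e 's|^q=||' -e 's|&.*$||' |   # keep only the value of q
        tr '+' ' ' |                        # form encoding: + means space
        tr -cd 'A-Za-z0-9 \n'               # delete everything else
}
```

It could be used in the search script as argument=`clean_query "$QUERY_STRING"`. A query of several words is then searched for as one literal phrase.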


Exercise - Make this an ordinary offline script with args (not a CGI script) that constructs an offline output web page.