Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org

Missing
DCU student

CASE3 student Paul Bunbury is missing since Thur 2 Feb 2012.
See appeals on crime.ie and garda.ie and facebook.

He is a great coder. See DCU page and boards.ie page.
He won major coding contests in 2010 and 2011.
He is author of the brilliant "FloodItWorld".
DCU can confirm that in Jan 2012 he passed all 6 modules comfortably.


String searching



grep

grep defines a worldview on UNIX:
  1. Files in text format, not binary.
  2. Not forced to go through other people's applications.
  3. Can directly search the files yourself on the command line.
My search engine based on grep

find on DOS.

But how does grep (or other string-finding programs) work?



Introduction



Naive (brute force) algorithm

string[0] ... string[n-1]
search for:
pattern[0] ... pattern[m-1]


i = 0  	// start by trying to match from LHS of string
j = 0

repeat
{
 if ( string[i] == pattern[j] )
 {
  i++
  j++
 }
 else
 {
  i = i - j + 1   	// (*)
  j = 0    			// reset pattern to start
 }
}
until ( (j >= m) or (i >= n) )

if ( j == m ) return match  		// matches string[i-m] .. string[i-1]
 else return no match

(*) say i = 50 and we have been successfully matching pattern for a bit
started at i = 50, j = 0
now i = 57, j = 7, when suddenly it fails
go to i = 57 - 7 + 1 = 51 and try with j = 0 from there


Performance of naive algorithm

  1. Large alphabet (e.g. Alphanumeric)
    Imagine if pattern = "STING"
    string = "A STRING SEARCH CONSISTING OF SOME TEXT"
    S gets triggered 3 times before start of the match
    ST gets triggered 1 time before the match
    j++ executed 4 times before match
    in typical text search, the if condition above is true only rarely
    pointless j++'s (for aborted matches) are few
    running time is O(m+n)

  2. Small alphabet (e.g. Binary)
    However, if alphabet is small, this scales worse.
    e.g. pattern = 000000001
    string = 000000000000000000000000000000000000000000000000000000000000000100
    worst case is O(mn)



Knuth–Morris–Pratt string search algorithm




Boyer-Moore string search algorithm



What does grep use?


Boyer-Moore is used, for example, in programming language libraries for methods without regular expressions, things like:
substring(pattern,string) - return if pattern is found within string



Links



Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.