Mark Humphrys - Undergrad project ideas


Undergrad project ideas



The Internet


  1. Link fixer - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    This would be more than just a program to tell you that links on your web page were broken. It would go out and, using a variety of heuristics, attempt to actually fix the links.
    Input would be a web page. First it would find broken links. Then it would search to see if the missing page had moved somewhere, e.g. by trimming components from the right-hand side of the URL and then reconstructing it.
    If it could not reconstruct the URL, it would turn to search engines using linkto: and other techniques. If it still could not find the URL, it might suggest an alternative based on keywords.
    Finally, it should actually output a copy of the web page with the links fixed, so that the user may simply copy it over the old one, or selectively cut and paste.
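
    The "hacking off bits from the RHS" heuristic could be sketched roughly as follows. This is only an illustration, not a finished design: the function name candidate_urls is made up here, and the step that actually requests each candidate to see whether it exists is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(broken_url):
    """Return progressively shorter URLs, trimming path segments from the right."""
    parts = urlsplit(broken_url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop one trailing segment at a time, finishing at the site root.
    for keep in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:keep])
        if keep:
            path += "/"
        # Query string and fragment are discarded (an assumption of this sketch).
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates
```

    Each candidate would then be fetched in turn, and the first directory that still exists searched for a page resembling the missing one.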


    Possible enhancements:

    1. Take as input an entire website. Run overnight. Generate a report of the broken links, listed in order of the site with the most broken links to it, because often a change must be made globally across many of your pages, e.g. "http://akebono.stanford.edu/yahoo/" has moved to "http://www.yahoo.com/" - change needed throughout the following 53 of your web pages.
    2. Build a script that, when run, makes these changes.
    3. Nice user interface output, showing old page, new page, and selective buttons to press to "Fix link" on disk. User in control at all times.
    4. Use archive.org to find old page. Then search in Google to find where it has gone to.
    5. Suggest alternative links. e.g. Page on Shakespeare has gone. Suggest wiki/Shakespeare instead. "Click this to make change". User always in control.
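
    The report in enhancement 1 might be built along these lines. A minimal sketch, assuming the broken links have already been collected as (page, target) pairs; the function name broken_link_report is hypothetical, and grouping is by exact target URL rather than by site.

```python
from collections import defaultdict


def broken_link_report(broken_links):
    """Group (page, target) pairs by target, most widely referenced target first."""
    pages_by_target = defaultdict(set)
    for page, target in broken_links:
        pages_by_target[target].add(page)
    # Sort by number of affected pages (descending), then by URL for stable output.
    return sorted(
        ((target, sorted(pages)) for target, pages in pages_by_target.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )
```

    Each entry in the result names one dead target and every page that needs the same change, which is the information a fix-up script (enhancement 2) would need.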

    
    
    
    
  2. "Who links to me?" Web agent - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    Search engines have a linkto: facility, so you can see who links to you, but it takes forever to browse the list so you don't bother.
    This would be a standalone program to find all pages that link to the user or reference them, download all these pages (will take a long time), perhaps sort them by the topic or page referenced, and present them all in a nice readable output, with all the references highlighted (as in Google's cache).

    How to highlight a phrase using the bold tag (view source code to see how this is done).

    How to highlight a phrase using tables (view source code to see how this is done).
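
    Highlighting with the bold tag could be done roughly like this. It is a naive sketch: the function name highlight is made up, and it would also match text inside tags and attributes, which a real version would need to skip.

```python
import re


def highlight(html_text, phrase):
    """Wrap each case-insensitive occurrence of phrase in <b>...</b>."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # Keep the matched text as it appears in the page, only adding the tags.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", html_text)
```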

    
    
    
    
  3. "What is like this?"

    Extract keywords from page. (How? Need some measure of how common each word is in ordinary text, i.e. dictionary frequency.)
    Use search engines to find similar pages on Web.
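
    One possible way to use dictionary frequency is to score each word by how much more often it occurs on the page than in ordinary text. The sketch below assumes a hypothetical background_freq table (word to relative frequency in general English) is available from somewhere; the function name and the default for unknown words are also assumptions.

```python
import re
from collections import Counter


def extract_keywords(text, background_freq, top_n=5, unknown_freq=1e-6):
    """Rank words by (frequency on page) / (frequency in ordinary text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    # Words missing from the table are treated as rare, so they score highly.
    scores = {
        w: (c / total) / background_freq.get(w, unknown_freq)
        for w, c in counts.items()
    }
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

    The top few words could then be sent to a search engine as the query for similar pages.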

    Implement as CGI script so that I can automatically add a "What is like this?" link at the top of every page.

    
    
    
    
  4. Web page enhancer - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    
    


Computers and Genealogy


  1. Program to query a person in Brian Tompsett's Genealogy of the British Royal Family and automatically extract the most recent Royal Descent for them.

    Need multiple database queries / multiple remote CGI queries, to recurse upwards through the target's parentage.
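
    The upward recursion could be organised as a breadth-first search through the parents. In this sketch get_parents and is_monarch are hypothetical callables standing in for the remote queries, and "most recent" is interpreted, as an assumption, to mean the monarch reached in the fewest generations.

```python
from collections import deque


def nearest_royal_descent(person, get_parents, is_monarch):
    """Search upward through the parentage; return the line from monarch down to person."""
    queue = deque([[person]])
    seen = {person}
    while queue:
        line = queue.popleft()
        current = line[-1]
        if current != person and is_monarch(current):
            return list(reversed(line))
        # Each call here would be one remote query in the real program.
        for parent in get_parents(current):
            if parent not in seen:
                seen.add(parent)
                queue.append(line + [parent])
    return None  # No royal ancestor found in the available data.
```

    Since every get_parents call is a remote query, caching the results would probably matter in practice.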

    Maybe make it a CGI script.

    To be done in cooperation with Tompsett at Hull. Debugged offline on separate data. End product is a script Tompsett could add to his site.

    
    
    
    
  2. Tree matcher.

    Takes trees which are in a structured HTML format (e.g. GEDCOM 2 HTML), and tries to match up fragments of them with other trees in structured HTML format on the Web, looking for overlaps.

    Start with matching surname lists. Then look for overlaps round each individual.

    Similar to "What is like this?" above.
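
    The first step, matching surname lists, might score two trees by the proportion of surnames they share. A minimal sketch, assuming the surnames have already been extracted from each tree's HTML; the function name and the choice of a Jaccard-style score are illustrative only.

```python
def surname_overlap(surnames_a, surnames_b):
    """Return (score, shared): score is shared surnames / all surnames, from 0 to 1."""
    a = {s.strip().lower() for s in surnames_a if s.strip()}
    b = {s.strip().lower() for s in surnames_b if s.strip()}
    shared = a & b
    union = a | b
    score = len(shared) / len(union) if union else 0.0
    return score, sorted(shared)
```

    Pairs of trees with a high score would be the candidates for the slower comparison around each individual.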

    
    
    
    
  3. "GEDCOM 2 narrative" family tree converter - DONE BEFORE, BUT I HAVE SOME NEW IDEAS

    The standard format for computerised family trees is the GEDCOM format. Historically, the standard format for paper family trees has been the Burke's Peerage narrative format. The aim of the project is to write a converter between the two.
    The converter would take as input a family tree in GEDCOM format (there are many sample GEDCOM trees on the Web) and output the information in the Burke's narrative format in HTML (which is illustrated on my own Web pages).
    One of the main challenges for the software would be to automatically detect where to break the narrative and start a new narrative, something Burke's (and I) currently do by hand. The result would be a more flexible output than the databases provide.
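
    As a starting point, the GEDCOM side could be read with something like the sketch below. It handles only a small subset of the format (INDI records with NAME, FAM records with HUSB, WIFE and CHIL), and its output is a plain indented outline of descendants, which is far simpler than the Burke's narrative format and makes no attempt to decide where to break the narrative. The function names are made up for illustration.

```python
def parse_gedcom(text):
    """Return (names, families) from a minimal subset of GEDCOM."""
    names, families = {}, {}
    current_id = current_type = None
    for raw in text.splitlines():
        fields = raw.strip().split(" ", 2)
        if len(fields) < 2:
            continue
        level = fields[0]
        if level == "0":
            # Record header such as "0 @I1@ INDI"; HEAD, TRLR etc. are skipped.
            if len(fields) == 3 and fields[1].startswith("@"):
                current_id, current_type = fields[1], fields[2]
                if current_type == "FAM":
                    families[current_id] = {"HUSB": None, "WIFE": None, "CHIL": []}
            else:
                current_id = current_type = None
        elif level == "1" and current_id:
            tag = fields[1]
            value = fields[2] if len(fields) == 3 else ""
            if current_type == "INDI" and tag == "NAME":
                # GEDCOM marks the surname with slashes: "John /Smith/".
                names.setdefault(current_id, value.replace("/", "").strip())
            elif current_type == "FAM" and tag in ("HUSB", "WIFE"):
                families[current_id][tag] = value
            elif current_type == "FAM" and tag == "CHIL":
                families[current_id]["CHIL"].append(value)
    return names, families


def narrative(person, names, families, depth=0):
    """Return an indented outline of person and all descendants."""
    lines = ["  " * depth + names.get(person, person)]
    for fam in families.values():
        if person in (fam["HUSB"], fam["WIFE"]):
            for child in fam["CHIL"]:
                lines.extend(narrative(child, names, families, depth + 1))
    return lines
```

    The interesting part of the project would sit on top of this: choosing at which descendant to stop indenting and begin a new narrative section.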

    Perhaps to be done in cooperation with Tompsett at Hull. Debugged offline on separate data. End product is a script Tompsett could add to his site, so that we can see all of Tompsett's data in condensed Hypertext Narrative format.