The Web
Many archives existed on the Internet before the Web.
You accessed them as follows:
ftp
Run the ftp program.
c ftp.cs.ucla.edu
Connect to some ftp site that you know of.
There is no easy way of bookmarking
or linking to these sites.
People have to build and maintain their own lists
of sites and passwords.
enter userid "anonymous"
enter password (your full email address)
Typing or pasting all this in every time was VERY tedious.
ls
List files. Plain format, showing list of filenames.
Little or no idea what is in these files.
get index.txt
Get a master file that will explain
what is in the archive.
You have to read it offline and then find what you are
interested in - say a collection of Shakespeare plays.
cd Shakespeare
Go into a sub-category.
get index.txt
Find out what is in there.
get macbeth.txt
Finally get what you are looking for (possibly).
All these files you get end up in random places
on your disk. They are not stored in one managed place,
like the browser cache, that is periodically wiped.
Instead, you have to manage them all.
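For comparison, the same session can be scripted. This is a minimal sketch using Python's standard ftplib module, assuming the server still accepts anonymous logins. The helper names fetch_text and fetch_macbeth and the "downloads" directory are illustrative, not part of the original session.

```python
from ftplib import FTP
from pathlib import Path


def fetch_text(ftp, remote_name, local_dir):
    """Download one remote file into local_dir and return the local path.

    `ftp` is any object with ftplib.FTP's retrbinary() method.
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / remote_name
    with open(local_path, "wb") as out:
        # "RETR name" is the protocol command behind the interactive "get name".
        ftp.retrbinary("RETR " + remote_name, out.write)
    return local_path


def fetch_macbeth(host="ftp.cs.ucla.edu", local_dir="downloads"):
    """Replay the manual session: connect, log in, list, cd, get."""
    with FTP(host) as ftp:           # connect to the site
        ftp.login()                  # no arguments means anonymous login
        print(ftp.nlst())            # "ls": a bare list of filenames
        fetch_text(ftp, "index.txt", local_dir)
        ftp.cwd("Shakespeare")       # "cd Shakespeare"
        return fetch_text(ftp, "macbeth.txt", local_dir)
```

Even scripted, you still have to know the host, the directory layout and the filenames in advance, and you still have to decide where the files go on disk.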
With this user interface, it is no wonder that the Internet
never took off before the Web!
In fact, there were even worse interfaces.
Some archives were accessible only by commands
embedded in email messages!
There were plenty of resources and information online before the Web,
but none of it was "browsable".
You couldn't casually follow links, and move on.
You had to invest lots of effort in everything you looked at.
So it was only used by those who were basically interested
in the technology.
Mass adoption had to wait for a "browsable", "memory-free" user interface.
Mosaic
- Perhaps the most revolutionary program of all time.
Combines all of that complexity in a single address:
ftp://ftp.cs.ucla.edu/index.txt
(Berners-Lee's idea),
but crucially, Mosaic makes it mouse-driven.
The above is the address
of a file.
You can bookmark it privately,
or provide a link on your page for others to follow.
No passwords, no typing, just an address.
To view it, you click on this address.
It downloads into temporary file space (browser cache).
Browser maintains this space - you don't have to manage it.
And the final act of beauty:
This file contains within it a description
of what is in the archive, including a link to:
ftp://ftp.cs.ucla.edu/Shakespeare/index.txt
which contains a link to:
ftp://ftp.cs.ucla.edu/Shakespeare/macbeth.txt
You can browse, "graze", and move on,
with no clean-up to be done afterwards.
You can casually follow links with little effort,
no typing, just mouse clicks.
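To see how much the single address packs in, here is a small sketch using Python's urllib.parse. The describe helper is illustrative, and the relative link form ("Shakespeare/index.txt") is an assumption, since the addresses above are written out in full.

```python
from urllib.parse import urlsplit, urljoin


def describe(url):
    """Split one address into the pieces you used to type by hand."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


index = "ftp://ftp.cs.ucla.edu/index.txt"
print(describe(index))
# {'scheme': 'ftp', 'host': 'ftp.cs.ucla.edu', 'path': '/index.txt'}

# A link inside a page is resolved against the page's own address.
sub_index = urljoin(index, "Shakespeare/index.txt")
play = urljoin(sub_index, "macbeth.txt")
print(play)  # ftp://ftp.cs.ucla.edu/Shakespeare/macbeth.txt
```

The program (ftp), the site, the directory and the filename are all in the one string, which is what makes it bookmarkable and linkable.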
Note: Strictly speaking, Mosaic wasn't the first
mouse-driven web browser.
It was the first that was widely used.
This seems to be because it was the first that:
- ran on Windows, Mac and UNIX
(Berners-Lee's browser was for NeXT)
- was easy to install (a single file)
- was easy to use
(looked like a normal modern app)
- had inline images (Mosaic invented the IMG tag)
For instance, I had heard about mouse-driven (UNIX) web browsers before Mosaic,
but never got around to downloading them
because I didn't see the point of the Web
until I saw Mosaic.
- Hide addresses (hypertext).
- Share the work (people construct links for you to follow).
- Browsable (cache, no passwords).
Index files browsable and discardable.
- Clickable (there were text-based Web browsers before Mosaic,
but they involved
typing numbers to say which link to follow).
A mouse is perfect for once-off,
"discardable" selections like this.
- Readable texts - The text-based browsers
filled the text with
intrusive numbers, so many didn't see the point of the system.
Mouse-clicking on underlined words restores the readability of the text.
- Distributed hypertext - Hypertext had been around for years.
But when hypertext meant you could click
on words in a help system, many said "cute"
but didn't really see the point.
When the click could take you to points in new systems,
suddenly everyone saw the point
and hypertext finally became popular.
- Browsers allow handy organisation
and editing of private bookmarks.
- Not restricted to one interaction - Browsers allow you to spawn off
multiple simultaneous window sessions
while waiting for slow downloads.
(Even today, how many web users have yet to discover Ctrl-N?)
- Frames break this model (can't bookmark, have to go through a laborious
process of clicking to get to a page).
- Providing info through Flash or JavaScript pop-ups
breaks this model
(again, can't bookmark, have to follow a process to see data
rather than being able to go to it direct).
- In general, any page you can't bookmark or link to
breaks this model.
- Using
<FORM METHOD=POST>
in a CGI script
breaks this model
(you can't link to a filled-in version of the form,
instead you have to follow a process
of filling in the form each time to see the results).
Of course, you might do this deliberately,
e.g. if the form is meant to contain a password.
To allow someone to link to the form filled in with arguments, use:
<FORM METHOD=GET>
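With GET, the filled-in form is just an address. A minimal sketch of what the browser builds when the form is submitted. The script address and the field names here are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def linkable_url(action, fields):
    """Build the address a METHOD=GET form submits to: action + "?" + arguments."""
    return action + "?" + urlencode(fields)


url = linkable_url("http://example.com/cgi-bin/search",
                   {"author": "Shakespeare", "title": "Macbeth"})
print(url)
# http://example.com/cgi-bin/search?author=Shakespeare&title=Macbeth

# The CGI script recovers the arguments from the address itself.
print(parse_qs(urlsplit(url).query))
# {'author': ['Shakespeare'], 'title': ['Macbeth']}
```

Anyone can bookmark or link to that address and land directly on the results. With POST the arguments travel in the request body, so there is nothing to link to.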
- In general, temporary URLs
and changing URLs
break this model
(have to follow a process to find the data again).
- Referencing things online without providing a working hyperlink to them
breaks this model.
- Listing an email address without making it a mailto: link
breaks this model.
Unfortunately, leaving addresses unlinked is becoming increasingly necessary
because of spam.
- Sending email attachments
(instead of having the file online with a password)
also breaks this model
(have to start up a new application,
not so easily browsable,
can't link to it).
- Having to register to get into a site breaks this model
(again, back to old ftp anonymous passwords).
- Not linking to other sites, not linking creatively within your site,
and in short, just using hypertext to present a series of menu options,
breaks this model
(back to hypertext as it was used before the Web,
to just present a few menu options within a site).
If somebody who was unimpressed by hypertext back before the Web
could time travel forward to see most commercial sites today,
I think they would still be unimpressed by hypertext,
and not see the point of it.
- Many P2P file-sharing systems
break this model,
by having temporary "web sites",
that you can't link to.
p2p is an important model for distributing CPU load, bandwidth,
addressing and routing data,
and so on, as the Internet shows.
You could argue that the 1970s-80s applications
email, Usenet and DNS are all p2p.
What I am looking at here is p2p for publishing
(data or programs).
Sharing data with p2p just seems worse than the Web,
since there is (usually) no permanent address you can link to.
So does p2p have any function
other than sharing data that it is illegal
to set up
a website for (such as copyrighted files
or child porn)?
Some possible legal uses:
- An individual
sharing massive files (multimedia, movies)
that you find hard to find a host for anywhere,
so you just share these gigabyte-files direct from your PC when you are online.
- e.g. Sharing your digital camcorder home movies
with your extended family on the Internet worldwide.
Could put them on web server with password,
but too expensive to hire permanent web host for gigabytes of files.
Leave them on your PC.
Family can access them when you are online.
Q. What software would you use for this?
- A small site (e.g. a blog)
sharing large files (e.g. video)
that are the subject of massive topical interest (flash crowds).
P2P could be used to distribute the load.
- An organisation sharing
thousands of gigabyte-files,
e.g. the BBC archive.
- Using p2p to distribute multi-megabyte releases
of very popular downloads, e.g. a new Windows update
or a new Linux release.
To stop the main server getting overloaded.
- In this case you link, say, to a permanent .torrent file
that performs a collective download.
People distributing large binaries for free (e.g. Microsoft update)
don't use this, but perhaps will someday to save their bandwidth.
- One of the problems is trust: would you trust
the person you are downloading from?
What if they altered the data?
With the fixed client-server model you can trust
a download from microsoft.com.
p2p would need watertight
error detection and integrity checking so that no client can interfere with the data undetected.
- BitTorrent
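BitTorrent's answer to the trust problem is to publish a SHA-1 hash for every fixed-size piece of the file inside the .torrent file, which you fetch from a source you do trust. A minimal sketch of that idea follows. The function names and the tiny piece size are illustrative, and this is not the actual .torrent file format.

```python
import hashlib


def piece_hashes(data, piece_size):
    """Publisher side: hash every fixed-size piece of the file."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]


def verify_piece(index, piece, trusted_hashes):
    """Downloader side: accept a piece from an untrusted peer only if it matches."""
    return hashlib.sha1(piece).hexdigest() == trusted_hashes[index]


original = b"Double, double toil and trouble"
trusted = piece_hashes(original, piece_size=8)   # obtained from the trusted site

good_piece = original[8:16]
bad_piece = b"XXXXXXXX"                          # a peer altered the data
print(verify_piece(1, good_piece, trusted))  # True
print(verify_piece(1, bad_piece, trusted))   # False
```

The pieces can come from anyone. Only the small list of hashes has to come from the fixed, trusted server.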
Summary of legal uses?
- Small-to-medium bandwidth legal data: p2p no use, use website.
- Small-to-medium bandwidth legal software: p2p no use, use website.
- Massive bandwidth legal data: p2p could be useful -
doesn't matter so much if data is corrupted - few people will do that anyway.
- Massive bandwidth legal software: p2p could be useful,
but dangerous if software is corrupted,
and there is a lot of incentive to do so.
Of course, as it stands today,
over 90 percent of all actual use of p2p
is illegal.
p2p for other things
i.e. different problems to
the Web-like model of downloading a piece of info
- p2p used to negotiate a temporary comms session:
e.g. a p2p network that directly connects
2 players for an online game.
- Skype VoIP
uses p2p to distribute traffic load of the calls
(some of your bandwidth may be used to route other people's calls,
just as your Internet host may be asked to route other people's email).
HTTP client
Web browser
Uses MIME types.
(a) Plug-in - Runs inside browser process.
(b) Helper application - Separate process.
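A rough sketch of the decision the browser makes from the MIME type the server sends. The set of built-in types and the plug-in and helper tables are assumptions for illustration. Real browsers keep their own tables.

```python
# Types the browser can display itself (an assumed, illustrative set).
INLINE_TYPES = {"text/html", "text/plain", "image/gif", "image/jpeg"}


def how_to_handle(content_type, plugins, helpers):
    """Decide what to do with a download, given its Content-Type header."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";")[0].strip().lower()
    if mime in INLINE_TYPES:
        return "display in browser"
    if mime in plugins:
        return "plug-in (inside browser process): " + plugins[mime]
    if mime in helpers:
        return "helper application (separate process): " + helpers[mime]
    return "save to disk"


plugins = {"application/x-shockwave-flash": "Flash"}
helpers = {"application/pdf": "PDF viewer"}

print(how_to_handle("text/html; charset=utf-8", plugins, helpers))
print(how_to_handle("application/pdf", plugins, helpers))
```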
HTTP server
Doesn't make separate disk access for every file request - too slow.
Instead maintains cache in memory of frequently accessed files.
Multi-threaded.
Site spread over multiple disks
to help many reads going on at once.
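The in-memory cache can be sketched as follows. This is a minimal version that assumes a least-recently-used eviction policy (the notes do not say which policy real servers use). The class name FileCache and the disk_reads counter are illustrative.

```python
import threading
from collections import OrderedDict
from pathlib import Path


class FileCache:
    """Keep the most recently requested files in memory."""

    def __init__(self, max_files=100):
        self.max_files = max_files
        self.disk_reads = 0              # counts how often we really hit the disk
        self._files = OrderedDict()      # path -> bytes, oldest first
        self._lock = threading.Lock()    # the server is multi-threaded

    def get(self, path):
        key = str(path)
        with self._lock:
            if key in self._files:
                self._files.move_to_end(key)      # mark as recently used
                return self._files[key]
            data = Path(key).read_bytes()         # cache miss: go to disk
            self.disk_reads += 1
            self._files[key] = data
            if len(self._files) > self.max_files:
                self._files.popitem(last=False)   # evict least recently used
            return data
```

Holding the lock during the disk read keeps the sketch simple but serialises misses. A real server would be more careful about that.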
For high-demand sites:
Multiple copies of entire site -
"server farm"
- front end routes requests to different CPUs.
Problem: OK to have all (small size) requests come in through one front end
and get routed to searching nodes.
Not OK to have all (large size) replies go back through one front end - bottleneck.
Solution: TCP handoff
- trick to have the searching node reply directly
in a manner that is invisible to client.
The reply load is therefore distributed over all the nodes.
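A toy accounting model of why this matters. It does not implement TCP handoff itself (that happens at the network level, not in application code). It only counts bytes, and it assumes simple round-robin routing. All names are illustrative.

```python
from itertools import cycle


class Node:
    """One copy of the site in the server farm."""

    def __init__(self, name):
        self.name = name
        self.bytes_sent = 0

    def reply(self, size):
        # With handoff, the reply goes straight from this node to the client.
        self.bytes_sent += size


class FrontEnd:
    """Sees every (small) request, but none of the (large) replies."""

    def __init__(self, nodes):
        self._next_node = cycle(nodes)   # assumed round-robin policy
        self.bytes_in = 0

    def route(self, request_size, reply_size):
        self.bytes_in += request_size
        node = next(self._next_node)
        node.reply(reply_size)
        return node


nodes = [Node("n1"), Node("n2"), Node("n3")]
front = FrontEnd(nodes)
for _ in range(6):
    front.route(request_size=200, reply_size=1_000_000)

print(front.bytes_in)                  # 1200
print([n.bytes_sent for n in nodes])   # [2000000, 2000000, 2000000]
```

The front end handles about a kilobyte in total, while the six megabytes of replies are spread evenly over the nodes.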
URLs
Some URL formats.
- gopher (port 70) - not used any more - pre-web system
- ftp (no password) - hardly used any more - pre-web system
- ftp (with password) - still important
- file: - very useful
- mailto: - very useful - but spammers search for these
- telnet - useful - but rarely do it by clicking a link
- news: - not as important as it used to be:
  - read on Google
  - so many other places to talk now - discussion websites, blogs
Keeping state
Relating one stateless client-server request
to other requests from the same client.
Identify user (pay-to-view, register, personalisation).
Shopping carts.
- cookies
- Users may turn cookies off (for good reason).
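A minimal sketch of a cookie-based shopping cart, using Python's standard http.cookies module. The cookie name "session", the in-memory carts table and the handle_request function are all illustrative assumptions, not any particular site's design.

```python
import secrets
from http.cookies import SimpleCookie

carts = {}   # server-side state, keyed by session id


def handle_request(cookie_header, add_item=None):
    """Handle one stateless request.

    Returns (value for a Set-Cookie header or None, current cart contents).
    """
    cookie = SimpleCookie(cookie_header or "")
    set_cookie = None
    if "session" in cookie and cookie["session"].value in carts:
        sid = cookie["session"].value        # returning client: cookie links requests
    else:
        sid = secrets.token_hex(8)           # new client: issue a fresh id
        carts[sid] = []
        out = SimpleCookie()
        out["session"] = sid
        set_cookie = out["session"].OutputString()   # "session=<id>"
    if add_item:
        carts[sid].append(add_item)
    return set_cookie, list(carts[sid])


# First request carries no cookie. The server hands one out.
set_cookie, cart = handle_request(None, add_item="Macbeth")
# The browser sends the cookie back, so the second request finds the same cart.
_, cart = handle_request(set_cookie, add_item="Hamlet")
print(cart)   # ['Macbeth', 'Hamlet']
```

If the user turns cookies off, every request looks like a new client and gets an empty cart.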
Structuring content
- Binary v. Plain text
- XML
- describe content in structured way.
e.g. Write program to find the price of a book on its website.
<price> $29.99 </price>
- Displaying XML
- unlike HTML, here full separation of formatting from content.
- Semantic web
- RSS
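The book price example, sketched with Python's standard xml.etree.ElementTree. Only the price element comes from the notes. The surrounding book and title elements are assumed for illustration.

```python
import xml.etree.ElementTree as ET

page = """<book>
  <title>Macbeth</title>
  <price> $29.99 </price>
</book>"""


def book_price(xml_text):
    """Return the text of the <price> element, or None if there isn't one."""
    root = ET.fromstring(xml_text)
    text = root.findtext("price")
    return text.strip() if text is not None else None


print(book_price(page))   # $29.99
```

Compare this with trying to find the price in an HTML page, where it is just some text in some table cell or paragraph.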
XHTML
- XHTML
- Existing HTML is forgiving
- can skip end tags,
etc. and it will still display.
Tags can be in any case.
- XHTML
is trying to change this for the future
- make HTML unforgiving and case-sensitive.
The idea is to make it:
- Easier for programs to process content.
- Easier to process/display on small, low-memory devices (tiny browsers).
- I am sceptical of the XHTML vision:
- Yes, it is true that if we all migrated to XHTML,
it would make it easier for programs to process content.
But are you going to re-write 10 billion web pages?
Good luck with that.
- XHTML:
"XML requires user-agents to fail when encountering malformed XML".
Question - Would you use such a browser?
i.e. One that wouldn't allow you to view a favourite site
because it had "malformed" XHTML.
Or would you (as the whole history of the Web shows)
simply move quietly to a different, more tolerant browser?
Anyone selling an unforgiving browser will lose a lot of money.
- "The recommendation for browsers to post an error
rather than attempt to render malformed content
should help eliminate malformed content."
- Yeah, right.
Because authors have nothing better to do.
- As for the idea of making it easier to display on small devices:
Well, my PDA
displays malformed HTML beautifully with no problem,
and so does
even my WAP phone!
- Joel Spolsky
on HTML standards
- Maybe "the way the web "should have" been built would be to have very, very strict standards and every web browser should be positively obnoxious about pointing them all out to you and web developers that couldn't figure out how to be "conservative in what they emit" should not be allowed to author pages that appear anywhere until they get their act together.
But, of course, if that had happened, maybe the web would never have taken off like it did, and maybe instead, we'd all be using a gigantic Lotus Notes network operated by AT&T. Shudder."
- About the idea that old web pages need to "change" to conform to standards:
"Those websites are out of your control. Some of them were developed by people who are now dead.
...
The idealists don’t care: they want those pages changed.
Some of those pages can’t be changed. They might be burned onto CD-ROMs. Some of them were created by people who are now dead. Most of them created by people who have no frigging idea what’s going on and why their web page, which they paid a designer to create 4 years ago, is now not working properly."
- Again, if the browser doesn't display the old pages, what will most people do?
That's right.
Dump the browser.
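The forgiving v. unforgiving difference can be seen with two parsers from Python's standard library: html.parser (tolerant) and xml.etree.ElementTree (a strict XML parser that fails on malformed input). This is only a sketch of the two behaviours, not of how any real browser is built. The helper names are illustrative.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Forgiving: collects whatever text it finds, whatever the tags do."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def forgiving_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def strict_text(markup):
    """Unforgiving: returns None (i.e. shows nothing) if the markup is malformed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return " ".join(root.itertext())


sloppy = "<P>Fair is foul<BR>and foul is fair"           # no end tags, upper case
tidy = "<p>Fair is foul<br/>and foul is fair</p>"

print(forgiving_text(sloppy))   # Fair is foul and foul is fair
print(strict_text(sloppy))      # None
print(strict_text(tidy))        # Fair is foul and foul is fair
```

The sloppy page is the kind of thing that already exists in the billions. The strict parser shows the user nothing for it.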
Performance (client-side)
Caching
- Browser maintains cache.
- Site-wide (or ISP-wide) cache
via proxy server.
- On IE in DCU labs see:
Tools - Options - Connection - LAN - Proxy
- wwwproxy.computing.dcu.ie
- proxy.dcu.ie
- main proxy on port: 8080
- Squid proxy on port: 3128
- $ ftp proxy.dcu.ie
gives the message:
220-+-------------------------------------------+
220-| Welcome to the DCU FTP proxy. |
220-| |
220-| *NOTE* For anonymous FTP, please use |
220-| the Squid proxy (port 3128) to save |
220-| resources. All actions from this proxy |
220-| are logged and monitored. |
220-| |
220-| Please login as username@remote.host |
220-+-------------------------------------------+
To set proxy (precise menus may vary depending on version):
- Firefox - Tools - Options - Network - Settings
- IE - Tools - Options - Connections - LAN settings
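Programs (as opposed to browsers) need the proxy set too. A minimal sketch with Python's urllib.request, using the Squid host and port listed above. Whether that proxy is reachable from your machine is an assumption, and the function name proxy_opener is illustrative.

```python
import urllib.request


def proxy_opener(host, port):
    """Build a URL opener that sends http requests via the given proxy."""
    proxy_url = "http://%s:%d" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = proxy_opener("proxy.dcu.ie", 3128)
# opener.open("http://example.com/") would now fetch the page through the proxy,
# which can answer from its site-wide cache.
```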