Files


File - A named section of disk.
Normally neither user nor programmer deals with the disk directly, only with named files. In a performance-critical application, you may need to implement your own file system, but this is obviously very dangerous.


File Types




Binary v. ASCII (plain text)

Traditionally, program data would be in very efficient binary format:

 (2-byte-number)(4-byte-number)(1-byte-character)(2-byte-number)....
The program needs to know the structure of the file to display it. Otherwise it doesn't know where to put the boundaries - it might read the same bytes as:
 (4-byte-no)(1-byte-char)(2-byte-no)(4-byte-no)....
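The boundary problem can be sketched in Python with the standard struct module. The record values and the two layout strings below are made up for illustration. The wrong guess here is 4-1-2-2 bytes, so that it fits exactly one 9-byte record.

```python
import struct

# Hypothetical record: 2-byte number, 4-byte number, 1-byte character, 2-byte number.
# "<" means little-endian with no padding between fields.
RIGHT_LAYOUT = "<HIcH"
# A reader that guesses the boundaries wrongly: 4-byte, 1-byte char, 2-byte, 2-byte.
WRONG_LAYOUT = "<IcHH"

# Write one record using the real structure.
record = struct.pack(RIGHT_LAYOUT, 1963, 100000, b"K", 4)

# Same 9 bytes, two different interpretations.
right = struct.unpack(RIGHT_LAYOUT, record)
wrong = struct.unpack(WRONG_LAYOUT, record)

print("bytes on disk:", record.hex())
print("right layout :", right)
print("wrong layout :", wrong)
```

Both layouts consume the same 9 bytes without error. Nothing in the file itself says which one is correct.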
There has been a trend more recently towards program data that humans can read in a text editor:
 (1-byte-character)(1-byte-character)(1-byte-character)....
which displays characters that express the contents, e.g. HTML, XML, and (sort of) PostScript and TeX.

Example: To store a 2 byte short integer in a file:
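A minimal sketch of this in Python. The value 12345 and the file names are made up for illustration. It writes the number once as a 2-byte binary short and once as plain text digits, then compares the file sizes.

```python
import os
import struct
import tempfile

value = 12345  # fits in a signed 2-byte short (-32768..32767)

with tempfile.TemporaryDirectory() as tmp:
    binary_path = os.path.join(tmp, "value.bin")
    text_path = os.path.join(tmp, "value.txt")

    # Binary: exactly 2 bytes, unreadable in a text editor.
    with open(binary_path, "wb") as f:
        f.write(struct.pack("<h", value))

    # Text: one byte per digit, readable by a human.
    with open(text_path, "w", encoding="ascii") as f:
        f.write(str(value))

    binary_size = os.path.getsize(binary_path)
    text_size = os.path.getsize(text_path)

    # Reading the binary file back requires knowing it holds a 2-byte short.
    with open(binary_path, "rb") as f:
        (decoded,) = struct.unpack("<h", f.read())

print(binary_size, text_size, decoded)
```

Wrapping the digits in markup (tags, attributes) would make the text form larger still.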


Readable "binary"

With the readable format, an expert human can debug and tweak the data in a simple text editor, if they know what they are doing.

It is a much less efficient format - it might take 9 1-byte characters to display the 2-byte-number - and this is why binary was so popular in the past. But we can think about adopting such schemes now because machines are getting more powerful, disk space bigger, and bandwidth better than they used to be.

HTML showed it could be done. XML is now taking the idea one step further.



Secret-format binary

Especially hard to work with are totally (or partially) secret binary formats like MS Word doc, where you have to use somebody else's application to modify the data. No self-respecting computer scientist uses Microsoft Word, because you do not have access to the raw data of your document file. You cannot write scripts and small utilities to manipulate your documents yourself (e.g. search utilities). Instead you have to sit inside other people's menus like a baby.

  1. Q. Say you have 1,000 MS Word documents on disk. How do you search for a string in them? You can't do:
    $ grep string *doc
    
    The built-in Microsoft Windows File Explorer search function can search for strings in Microsoft *doc files, but only returns the list of files that match, not the detailed output that grep returns.
    Q. Can Windows Explorer search be called as part of a DOS script? (Probably not. Would need to return text output list of files.)

  2. Q. Say you have 1,000 MS Word documents on disk. How do you go through them, changing string S1 everywhere it occurs into string S2?
    This is easy (a few lines of code) if you have text formats, UNIX and shell scripts (grep and sed).
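The notes have grep and sed in mind. As a sketch of the same idea in Python (swapped in here, not the shell tools themselves), the function below rewrites every matching text file in a directory. The function name, file names and strings are hypothetical.

```python
from pathlib import Path
import tempfile

def replace_in_files(directory, pattern, s1, s2):
    """Replace s1 with s2 in every file matching pattern; return names of changed files."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        if s1 in text:
            path.write_text(text.replace(s1, s2), encoding="utf-8")
            changed.append(path.name)
    return changed

# Demo on two throwaway text files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("old name here\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("nothing to change\n", encoding="utf-8")
    changed = replace_in_files(tmp, "*.txt", "old name", "new name")
    result = Path(tmp, "a.txt").read_text(encoding="utf-8")

print(changed, repr(result))
```

This only works because the files are plain text. The same loop over a secret binary format would corrupt the files or find nothing.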



File system divisions

Windows file system can spread over multiple pieces of hardware. Each is given its own (single-letter) drive name. You can partition a single piece of hardware into multiple drives too:

 drive:\dir\file
UNIX file system can spread over multiple pieces of hardware too. But these all just appear as sub-directories of a single file hierarchy:
 /drive/dir/file
e.g. recall /floppy.
The hierarchy entirely hides how many pieces of hardware are involved.
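The difference shows up when a program takes the two kinds of path apart. A small sketch using Python's pathlib, with the placeholder paths from above:

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\dir\file")
unix = PurePosixPath("/drive/dir/file")

# A Windows path carries an explicit drive component.
print(win.drive, win.parts)

# A UNIX path has no drive: "drive" is just another directory name under "/".
print(repr(unix.drive), unix.parts)
```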




Hierarchical file system

Can organise files in separate dirs (Many web authors seem not to have discovered sub-dirs!).
Crucial to keep user files separate from system files (Why?). Hence the excellent invention of   C:\My Documents,   to match its UNIX equivalent,   $home
Can reuse same file names in different sub-dirs (like index.html).




Long file names

Currently files need names. Maybe this is a flaw - Real-world papers don't need names, because they look different visually, feel different, and are found in different places on your desk or on your shelves. Whereas all computer files (of same type) look the same in a File Manager. In future, maybe OS displays a representation of what is in the file (e.g. Windows File Manager preview is a step in the right direction - just not quick enough), or OS displays older files yellowed-with-age, etc., and lots of other visual clues, so it doesn't need a name. (But then how about programming?)

Anyway, currently files need names. Users and programmers are terrible at picking them. OS often doesn't help. UNIX and Mac allow long filenames (recall UNIX filenames). So does Windows post 95. Before that: You want to call a file:

 photos.kenya.apr.1963.html
but Windows pre-Win 95 forces you to use 8 char name, 3 char extension, so you have to call it:
 phka0463.htm
1 year later, you have no idea what this filename means.
Some people on the Web still use this type of filename!
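As a rough sketch of the old restriction, the hypothetical function below checks only the lengths (8 char name, 3 char extension, at most one dot). It ignores the other rules old Windows had, such as which characters were allowed.

```python
def fits_8_3(filename):
    """Return True if filename has a 1-8 char name and an optional extension of up to 3 chars."""
    parts = filename.split(".")
    if len(parts) > 2:
        return False  # more than one dot is not allowed
    base = parts[0]
    ext = parts[1] if len(parts) == 2 else ""
    return 1 <= len(base) <= 8 and len(ext) <= 3

for name in ("photos.kenya.apr.1963.html", "phka0463.htm"):
    print(name, fits_8_3(name))
```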

Still, at least old Windows had sub-directories. For years I used VM/CMS which had 8 char filenames and no sub-directories!

Short file names are good for ..

Short file names are good, though, for:

  1. Stuff you must type.
    Utility names at the command-line (i.e. the program you call has a short filename): sed, grep, ls, cut, etc.
    Some people say also URLs? I would say: You should never type URLs. (I never do.) At most you type the host name that you saw on an ad. For everything else you cut and paste, or click.

  2. Backward compatibility: Perhaps you want to allow your website to be downloaded and browsed offline on a Windows 3.1 machine. My website cannot be. (You can still browse it online on a Windows 3.1 machine though.)




Symbolic link (cross-link, breaking the hierarchy, "shortcut") in UNIX

Can selectively break the hierarchy with shortcuts.

 ln -s dir shortcut
or in Windows see "Create Shortcut"

e.g.

$ ls -l /bin

lrwxrwxrwx   1 root     root           9 Apr 14  1997 /bin -> ./usr/bin
Can also just give a file multiple names:
 ln -s file secondname
e.g. (StarOffice is a Windows-compatibility suite on UNIX):
lrwxrwxrwx   1 humphrys staff       10 Dec 13 21:03 excel -> staroffice
-rwxr-xr-x   1 humphrys staff       10 May  4  1999 staroffice
lrwxrwxrwx   1 humphrys staff       10 Dec 13 21:03 win -> staroffice
lrwxrwxrwx   1 humphrys staff       10 Dec 13 21:03 word -> staroffice
"staroffice" itself contains:
soffice &
Can do this on Windows as well (have multiple shortcuts to a data file or program).
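The same multiple-names setup can be built from a program. A minimal sketch in Python, recreating the staroffice example in a temporary directory (assumes a system where symbolic links are permitted):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # The real file.
    target = os.path.join(tmp, "staroffice")
    with open(target, "w") as f:
        f.write("soffice &\n")

    # Three extra names for it, like: ln -s staroffice excel
    for name in ("excel", "win", "word"):
        os.symlink("staroffice", os.path.join(tmp, name))

    link = os.path.join(tmp, "word")
    is_link = os.path.islink(link)    # is it a symbolic link?
    points_to = os.readlink(link)     # what the link says, without following it
    with open(link) as f:             # opening the link follows it to the real file
        content = f.read()

print(is_link, points_to, repr(content))
```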




Problems with cross-links

With shortcuts, a recursive search of the disk can get into infinite loops, or at least duplication. e.g. List all files on disk: if you follow symbolic links you may list files twice.
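One common way around this is to remember the real path of every directory already visited. A sketch in Python, where the function name and the test directory layout are made up for illustration:

```python
import os
import tempfile

def list_files_once(root):
    """List files under root, following symlinks but visiting each real directory only once."""
    seen_dirs = set()
    seen_files = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real_dir = os.path.realpath(dirpath)
        if real_dir in seen_dirs:
            dirnames[:] = []  # already visited via another name: do not descend again
            continue
        seen_dirs.add(real_dir)
        for name in filenames:
            # Record real paths so a file reached by two names is listed once.
            seen_files.add(os.path.realpath(os.path.join(dirpath, name)))
    return sorted(seen_files)

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "dir"))
    with open(os.path.join(tmp, "dir", "file.txt"), "w") as f:
        f.write("data\n")
    # A link from inside the tree back to the top: a loop.
    os.symlink(tmp, os.path.join(tmp, "dir", "loop"))

    real_root = os.path.realpath(tmp)
    found = [os.path.relpath(p, real_root) for p in list_files_once(tmp)]

print(found)
```

Without the seen_dirs check, following the loop link would keep descending through dir/loop/dir/loop/... until the path grew too long.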

Q. Also, if you delete a file, do you delete the symbolic links to it? If so, how do you find them - do you keep a reverse directory of them? Also, say I make a symbolic link to another user's file. They delete the file. They can't delete my link.
A. If the link doesn't work, so what. Might even leave it dangling as a reminder.
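A dangling link is easy to detect after the fact, even if nothing keeps a reverse directory. A small sketch in Python, with a hypothetical helper and made-up file names:

```python
import os
import tempfile

def is_dangling(path):
    """True if path is a symbolic link whose target no longer exists."""
    return os.path.islink(path) and not os.path.exists(path)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "file")
    link = os.path.join(tmp, "shortcut")
    open(target, "w").close()
    os.symlink(target, link)

    before = is_dangling(link)            # target exists
    os.remove(target)                     # delete the file, not the link
    after = is_dangling(link)             # the link is now dangling
    still_there = os.path.lexists(link)   # the link itself survives

print(before, after, still_there)
```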

Security

If your directory is readable by others on your local machine, someone on your machine can make it readable by the world on the Web (either maliciously or accidentally):

cd     /homes/your-userid/public_html
ln -s  /homes/other-userid/dir          shortcut
The world can then read the other user's directory (provided the web server is configured to follow symbolic links) through:
http://host/~your-userid/shortcut/
Has valid uses too. Might want to make one of your own dirs visible without having to have it under public_html, e.g. public_html disk is full, dir is on another disk.
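To see whether one of your own directories is exposed in this way, you can look at its "other" permission bits. A rough sketch in Python for a UNIX system. The function name is hypothetical, and it checks only the directory itself, not its parent directories (which must also be enterable).

```python
import os
import stat
import tempfile

def readable_by_others(directory):
    """True if the 'other' permission bits allow listing (r) and entering (x) the directory."""
    mode = os.stat(directory).st_mode
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)

with tempfile.TemporaryDirectory() as tmp:
    os.chmod(tmp, 0o755)   # rwxr-xr-x: anyone on the machine can read it
    open_dir = readable_by_others(tmp)
    os.chmod(tmp, 0o700)   # rwx------: owner only
    private_dir = readable_by_others(tmp)

print(open_dir, private_dir)
```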

Another example - SAMBA or read-write ftp may only drop you in your home directory rather than the root directory, and you may not be able to go upwards. What you do is put symbolic links in your home directory, and you can access any directory through them:

  ln -s /var/mail  email
  ln -s /htdocs    ht




"Hierarchy with some cross-links" a very powerful model

General conclusion is that a basic hierarchy, with some cross-links for difficult points, is an excellent way to structure complex data (e.g. Yahoo directory, Google directory) - rather than a total cross-link free-for-all on the one hand (e.g. the Web with just search engines and no directories), or a rigid hierarchy on the other (e.g. Dewey library system).

Interestingly, family trees are also basically hierarchical, with arbitrary cross-links, rather than strictly hierarchical as many people seem to think.




Backup

If it's data (1's and 0's), there's no real excuse for losing it. You can make a million copies and store them all over the world. Disk space is big and cheap. Machines are often idle. The network is always on. Backups can be automated across comms. links at night.

e.g. I currently have backups of different things on 4 machines in 3-4 sites in 2 countries, plus some old backups on floppy and some partial backups in 3 different sites. And that's not even including wherever the UNIX machine here is backed up to, nor in fact does it include where the machine in the other country is backed up to. Possibly 5-6 different sites in 2-3 countries in total. And there are also electronic copies of some of my public data in electronic archives (at least 3 more sites in other countries) and search engine databases (e.g. I can recover my page from Google's cache). And there are also electronic copies of my data in the Internet Archive.

OK I think I'm finished. But this is the modern world. If it's 1's and 0's, everyone can have their own copy. And you can keep your backups in foreign countries.

In future, backup and long-term storage will be an essential part of "ISP" or "Network Computer" service.

  1. Removable media - diskettes (real nuisance), Zip drives, tapes, removable hard disk.
    v.
  2. 2 different machines permanently linked by comms. - serial/parallel cable, local network, distributed file system, Internet read-write ftp, automated scripts, mirrors.




Backup policy

  1. Periodically dump entire file system to backup.
    v.
  2. Keep a running "mirror", and only back up things that have changed since the last time they were synch-ed.
Perhaps only back up user files.
OS, system and application files can be recovered from install CDs / disks / tapes.
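The "mirror" policy can be sketched as follows. This is a simplified illustration in Python, not a real backup tool. The function name and file names are made up. It decides what has changed by comparing modification times, and it never removes anything from the backup.

```python
import os
import shutil
import tempfile

def mirror(source, backup):
    """Copy files from source to backup only if missing or older there. Return paths copied."""
    copied = []
    for dirpath, dirnames, filenames in os.walk(source):
        rel_dir = os.path.relpath(dirpath, source)
        dest_dir = os.path.normpath(os.path.join(backup, rel_dir))
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
                shutil.copy2(src, dst)  # copy2 also copies the modification time
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)

with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as backup:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(source, name), "w") as f:
            f.write("version 1\n")

    first = mirror(source, backup)    # everything is new
    second = mirror(source, backup)   # nothing has changed

    # Change one file and push its timestamp forward so the change is visible.
    changed = os.path.join(source, "a.txt")
    with open(changed, "w") as f:
        f.write("version 2\n")
    later = os.path.getmtime(os.path.join(backup, "a.txt")) + 10
    os.utime(changed, (later, later))

    third = mirror(source, backup)    # only the changed file is copied

print(first, second, third)
```

Note that a mirror like this holds only the latest version of each file. Once a damaged file has been synch-ed, the good copy in the mirror is gone.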

Which of these is the most dangerous:

  1. Keep 1 synchronised copy of your files. Backup the changes every night.
  2. Keep 1 synchronised copy of your files. Backup the changes every hour.
  3. Take a copy of all of your files once a week. Keep all these old copies. Do no backups at all during the week.
  4. Take a copy of all of your files once a month. Keep all these old copies. Do no backups at all during the month.
Remember - it may take days or even months before an intrusion and destruction, or accidental damage, is noticed.
User may realise 2 years later that he has deleted some file and needs it back.