Files
File - A named section of disk.
Not necessarily a contiguous section of disk (but that fact may be hidden from users and programs).
Normally neither user nor programmer deals with the disk directly; both work only through named files.
In some high-performance applications (e.g. writing a high-speed search engine), you may need to implement your own file system, but this is difficult and full of dangers.
- List of file formats
- Alphabetical list of file extensions
- Programs (machine readable)
- Program source code (human readable)
  - java, c, cxx, h, hxx, js, pas, asm, etc.
- Programs (human readable) - interpreted scripts
- Program data (machine readable). Often strictly formatted.
  Precise length of each field pre-defined (for ease of machine reading, and so data can be read into pre-defined fixed-size program variables).
  - Database files.
  - Documents for display. e.g. Word docs (doc), ps, pdf, rtf, tex, dvi, etc.
  - Multimedia files - images, audio, video.
    - gif, jpg, jpeg, mpg, mpeg, ram, avi, qt, au.
- Program data (human readable). Often variable size, free-form text.
  - Preferences files, rc.
  - Documents for display. e.g. HTML docs (htm, html, shtml), xml, txt, latex, etc.
  - Log files.
- Archive files - tar
- Compressed files - zip, arc, gz, Z.
Binary v. ASCII (plain text)
Traditionally, program data would be in very efficient binary format:
(2-byte-number)(4-byte-number)(1-byte-character)(2-byte-number)....
Program needs to know structure of the file to display it.
Otherwise it doesn't know where to put boundaries
- might display:
(4-byte-no)(1-byte-char)(2-byte-no)(4-byte-no)....
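The boundary problem can be sketched in Python with the standard `struct` module. The field values (1, 2, 'A', 3) and the names `RIGHT_LAYOUT` and `WRONG_LAYOUT` are made up for illustration, and big-endian byte order is an assumption; the point is only that the same 9 bytes give different values when the reader guesses the layout wrongly.

```python
import struct

# Assumed layout from the text: 2-byte number, 4-byte number,
# 1-byte character, 2-byte number. ">" means big-endian, no padding.
RIGHT_LAYOUT = ">HIcH"
# A wrong guess of the same total size: 4-byte number, 1-byte character,
# 2-byte number, 2-byte number.
WRONG_LAYOUT = ">IcHH"

# Illustrative values only.
record = struct.pack(RIGHT_LAYOUT, 1, 2, b"A", 3)

right = struct.unpack(RIGHT_LAYOUT, record)
wrong = struct.unpack(WRONG_LAYOUT, record)

print(len(record))  # 9 bytes on disk
print(right)        # the values that were written
print(wrong)        # same bytes, boundaries in the wrong places
```

Nothing in the bytes themselves says which reading is correct; the program has to know the layout in advance.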
There has been a trend more recently towards program data
that humans can read in a text editor:
(1-byte-character)(1-byte-character)(1-byte-character)....
which displays characters that express the contents.
- html, xml, and (sort of) ps, tex
Display whether things are binary or text:
file *
See "man file".
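A very rough idea of how a program might guess "binary or text" can be sketched as follows. This is not how the `file` command works (it uses a much richer set of tests); the function name `looks_like_text` and the two rules used here (no NUL bytes, valid UTF-8) are assumptions chosen for illustration.

```python
def looks_like_text(data: bytes) -> bool:
    """Crude guess: text has no NUL bytes and decodes cleanly as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b"num=36992\n"))       # a line of plain text
print(looks_like_text(bytes([0x90, 0x80])))  # two raw binary bytes
```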
Example
To store a 2 byte short integer in a file:
- Binary format: First 2 bytes of the file are:
10010000 10000000
You just have to "know" that the first 2 bytes are to be read together
as defining an integer.
If we interpret them that way,
they define the number:
36992
- Readable text format: First 9 bytes of the file are:
01101110 01110101 01101101 00111101 00110011 00110110 00111001 00111001 00110010
which translates byte-by-byte as the characters:
110 117 109 61 51 54 57 57 50
that is, read as ASCII codes,
the characters:
'n' 'u' 'm' '=' '3' '6' '9' '9' '2'
i.e. when displayed in a text editor this file will read:
num=36992
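In fact, the value may not always occupy the first 9 bytes of the file.
A 2-byte number has between 1 and 5 decimal digits, so in this case it could be the first 5 to 9 bytes of the file.
In general, it is the first line of the file.
What does that mean? It means read up until the end-of-line character (line feed on UNIX; carriage return followed by line feed on Windows).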
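The example above can be reproduced in a few lines of Python. This is a sketch that assumes big-endian byte order and an unsigned 2-byte integer (the assumptions under which 10010000 10000000 reads as 36992); the helper name `parse_text_line` is made up for illustration.

```python
import struct

value = 36992

# Binary format: exactly 2 bytes, meaningless unless you know the layout.
binary_form = struct.pack(">H", value)
bits = [format(byte, "08b") for byte in binary_form]

# The same 2 bytes read as a *signed* short give a different number.
as_unsigned = struct.unpack(">H", binary_form)[0]
as_signed = struct.unpack(">h", binary_form)[0]

# Readable text format: one line, terminated by a line feed.
text_form = f"num={value}\n".encode("ascii")


def parse_text_line(data: bytes) -> int:
    """Read the first line (up to the line feed) and return the number after '='."""
    first_line = data.split(b"\n", 1)[0].rstrip(b"\r")
    _name, _sep, digits = first_line.partition(b"=")
    return int(digits.decode("ascii"))


print(bits, as_unsigned, as_signed)
print(text_form, parse_text_line(text_form))
```

The text reader does not need to know in advance how many bytes the number takes; it just reads to the end of the line.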
Readable "binary"
With the readable format, an expert human can debug and tweak the data in a simple text editor, if they know what they are doing.
It is a much less efficient format - it might take 9 1-byte characters to store the 2-byte number - and this is why binary was so popular in the past.
But we can think about adopting such schemes now because machines are more powerful, disks are bigger, and bandwidth is better than it used to be.
HTML showed it could be done.
XML is now taking the idea one step further.
"Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises."
Secret-format binary
Especially hard to work with are totally (or partially) secret binary formats like MS Word doc, where you have to use somebody else's application to modify the data.
If you use Microsoft Word, you do not have access to the raw data of your document file.
You cannot write scripts and small utilities to manipulate your documents yourself (e.g. search utilities).
Instead you have to point and click inside other people's menus.
- Q. Say you have 1,000 MS Word documents on disk.
How do you search for a string in them?
You can't do:
$ grep string *doc
- The default search in Windows Explorer can search for strings in Word files, but only returns the list of files that match, not the detailed output that grep returns.
Q. Can Windows Explorer search be called from a DOS script?
(Would need to return text output list of files.)
- Desktop search programs will pre-index your files and search them with a GUI or Web interface.
Some can search Word files.
Some can be called through a programming API.
Q. Can any of these be called from a DOS script like grep?
- Google Desktop
- Windows Search
- Q. Say you have 1,000 MS Word documents on disk.
How do you go through them, changing string S1 everywhere it occurs into string S2?
This is easy (a few lines of code) if you have text formats, UNIX and shell scripts (grep and sed).
- What's So Bad About Microsoft?
- Alternative View of the Microsoft Monopoly
- Argument that the government should force Microsoft to open their formats,
though some people point out that of course these are really bad formats:
-
"The file formats of MS Office were designed by Microsoft to be difficult to reverse
engineer and to be as closely tied as possible to the Microsoft platform. This does not translate to a good standard. If a standard is to be decided for Word
Processing it should be human readable, easily understandable, cross platform, and leave room for upgrades with bidirectional compatibility.
The Office formats have no concept of expandability and are neither forward nor backward compatible because Microsoft always intends to replace the
format with something incompatible in the next release to force users to upgrade. The Office file formats have no concept of interoperability because
Microsoft's primary concern is forcing people to use Microsoft Office on Microsoft Windows. The Office file formats are not easy to implement or
understand because part of their purpose is to delay competitors from reverse engineering them."
- Plaintext was an essential reason why HTML succeeded - people could see how it was structured, and do it themselves.
- Microsoft-compatible open formats:
- Microsoft offering open (readable) XML file formats in Microsoft Office.
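To back up the earlier claim that searching, and changing S1 into S2, is only a few lines of code when the files are plain text, here is a sketch in Python rather than grep and sed. The function names `grep_files` and `replace_in_files` are made up for illustration, and the files are assumed to be UTF-8 text.

```python
from pathlib import Path


def grep_files(directory, glob_pattern, needle):
    """Return (file name, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(Path(directory).glob(glob_pattern)):
        # newline="" keeps the original line endings untouched.
        with open(path, encoding="utf-8", newline="") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path.name, number, line.rstrip("\r\n")))
    return hits


def replace_in_files(directory, glob_pattern, s1, s2):
    """Change s1 into s2 in every matching file; return the names of changed files."""
    changed = []
    for path in sorted(Path(directory).glob(glob_pattern)):
        with open(path, encoding="utf-8", newline="") as handle:
            text = handle.read()
        if s1 in text:
            with open(path, "w", encoding="utf-8", newline="") as handle:
                handle.write(text.replace(s1, s2))
            changed.append(path.name)
    return changed
```

For example, `replace_in_files(".", "*.html", "S1", "S2")` would rewrite every HTML file in the current directory that contains S1. None of this works on a secret binary format, because the string is not stored as plain characters that the script can find.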
File system divisions
Windows file system can spread over multiple pieces of hardware.
Each given its own (single-letter) drive:
drive:\dir\file
Can also partition a single piece of hardware into multiple drives.
UNIX file system can spread over multiple pieces of hardware too.
But everything appears as sub-directories of a single file hierarchy.
Path may indicate hardware, something equivalent to:
/drive/dir/file
or may hide hardware entirely:
/dir/file
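The difference between the two path styles can be seen with Python's `pathlib`, whose "pure" path classes parse paths without touching the disk (so this runs on any OS). The drive letter `C:` and the names `dir` and `file` are just the placeholders from the text.

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\dir\file")
unix = PurePosixPath("/dir/file")

# Windows: the drive is a separate, visible part of the path.
print(win.drive, win.parts)

# UNIX: there is no drive; everything hangs off the single root "/".
print(repr(unix.drive), unix.parts)
```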
Hierarchical file system
Can organise files in separate dirs
(Many web authors seem not to have discovered sub-dirs!).
Crucial to keep user files separate from system files (Why?).
Hence the excellent invention of C:\My Documents, to match its UNIX equivalent, $HOME.
Can reuse same file names in different sub-dirs (like index.html).
All modern OS's allow long filenames:
photos.kenya.apr.1963.html
Legacy systems:
Short file names are good for ..
Short file names are good, though, for:
- Stuff you must type.
e.g. If you are typing file names at command-line.
All-lower-case is easiest to type.
- Program names at the command-line (i.e. the program you call has a short filename).
sed, grep, ls, cut, etc.
All-lower-case is easiest to type.
- Some people say also URLs?
Maybe you should never type URLs.
At most you type the host name that you saw somewhere.
For everything else you cut and paste, or click.
However maybe short URLs make the web a more pleasant experience than long URLs.
It is nice to have short, "guessable" URLs.
See "URL as UI"
Symbolic link (cross-link, breaking the hierarchy, "shortcut") in UNIX
Directory
Can selectively break the hierarchy with shortcuts.
ln -s dir shortcut
or in Windows see "Create Shortcut"
e.g. on one system I used:
$ ls -l /bin
lrwxrwxrwx 1 root root 9 Apr 14 1997 /bin -> ./usr/bin
File
Can also just give a file multiple names:
ln -s file secondname
e.g. on DCU Linux:
$ ls -l /usr/bin/grep
lrwxrwxrwx 1 root root 9 2007-07-31 12:59 /usr/bin/grep -> /bin/grep
e.g.
StarOffice
Windows-compatibility suite on Solaris:
lrwxrwxrwx 1 humphrys staff 10 Dec 13 21:03 excel -> staroffice
-rwxr-xr-x 1 humphrys staff 10 May 4 1999 staroffice
lrwxrwxrwx 1 humphrys staff 10 Dec 13 21:03 win -> staroffice
lrwxrwxrwx 1 humphrys staff 10 Dec 13 21:03 word -> staroffice
"staroffice" itself contains:
soffice &
Can do this on Windows as well (have multiple shortcuts to a data file
or program).
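The same thing can be done from a program. Below is a sketch using Python's `os.symlink` and `os.readlink`, assuming a UNIX-like system (on Windows, creating symbolic links may need extra privileges). The helper names `make_aliases` and `describe` are made up for illustration.

```python
import os


def make_aliases(directory, target_name, alias_names):
    """Create symbolic links in directory, each pointing at target_name."""
    for alias in alias_names:
        # Equivalent of: ln -s target_name alias
        os.symlink(target_name, os.path.join(directory, alias))


def describe(directory):
    """Map each name to the target it links to, or None if it is not a link."""
    result = {}
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        result[name] = os.readlink(full) if os.path.islink(full) else None
    return result
```

Calling `make_aliases(d, "staroffice", ["excel", "win", "word"])` in a directory `d` that contains a file `staroffice` gives the layout in the listing above. On typical UNIX file systems the size reported for a link is the length of the target path, which would explain why each link to "staroffice" shows a size of 10.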
Problems with cross-links
With shortcuts, if doing a recursive search of disk,
can get infinite loop problems,
or at least duplication.
e.g. List all files on disk. If follow symbolic links
may list files twice.
Q. Also, if delete file, do you delete symbolic link?
If so, how do you find them - do you have reverse directory of them?
Also, I make symbolic link to other user's file.
They delete file. They can't delete my link.
A. If link doesn't work, so what.
Might even leave it dangling as reminder.
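Both problems can be handled in code. The sketch below is one possible approach, not the only one: `walk_once` follows links but remembers the real location of everything it has visited, so loops end and nothing is listed twice, and `dangling_links` finds links whose target no longer exists. Both function names are made up for illustration.

```python
import os


def walk_once(top):
    """List files under top, following links, visiting each real location once."""
    seen_dirs = set()
    seen_files = set()
    files = []
    stack = [top]
    while stack:
        directory = stack.pop()
        real_dir = os.path.realpath(directory)
        if real_dir in seen_dirs:
            continue  # already visited via another name: avoids infinite loops
        seen_dirs.add(real_dir)
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            if os.path.isdir(path):
                stack.append(path)
            elif os.path.isfile(path):
                real_file = os.path.realpath(path)
                if real_file not in seen_files:  # avoids duplicates
                    seen_files.add(real_file)
                    files.append(path)
    return files


def dangling_links(top):
    """Return links under top whose target is missing (links are not followed)."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and not os.path.exists(path):
                found.append(path)
    return sorted(found)
```

Note there is still no "reverse directory": finding every link that points at a given file means scanning the whole disk, as `dangling_links` does.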
Security
If your directory is readable by others on your local
machine, someone on your machine can make it readable by
the world on the Web (either maliciously or accidentally):
cd /homes/your-userid/public_html
ln -s /homes/other-userid/dir shortcut
The world can then read other user's directory through:
http://host/~your-userid/shortcut/
Has valid uses too.
Might want to make one of your own dirs
visible without having to have
it under public_html,
e.g. public_html disk is full,
dir is on another disk.
Another example -
SAMBA or read-write ftp
may only drop you in home directory
rather than root directory
and you may not be able to go upwards.
What you do is put symbolic links in your home directory
and you can access any directory through them:
ln -s /var/mail email
ln -s /htdocs ht
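A server (or a cautious user) can check whether a path really stays inside a directory such as public_html once all symbolic links are resolved. The following is a sketch of that check only, not a description of how any particular web server does it; the function name `escapes_root` is made up for illustration.

```python
import os


def escapes_root(root, path):
    """True if path, after resolving symbolic links, lies outside root."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(path)
    # If the longest common directory is not root itself, path is outside.
    return os.path.commonpath([real_root, real_path]) != real_root
```

With the shortcut from the security example above, `escapes_root("/homes/your-userid/public_html", "/homes/your-userid/public_html/shortcut")` would return True, because the link resolves to a directory under /homes/other-userid.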
"Hierarchy with some cross-links" a very powerful model
General conclusion is that a basic hierarchy, with some cross-links for difficult points, is an excellent way to structure complex data (e.g. Yahoo directory, Google directory) - rather than a total cross-link free-for-all on one hand (e.g. the Web with just search engines and no directories), or a rigid hierarchy on the other (e.g. Dewey library system).
Interestingly, family trees are also basically hierarchical, with arbitrary cross-links, rather than strictly hierarchical as many people seem to think.
Backup
If it's data (1's and 0's), there's no real excuse for losing it.
You can make automated copies and store them all over the world.
Disk space is big and cheap. Machines are often idle.
The network is always on.
Backups can be automated across comms. links at night.
In future, backup and long-term storage will be an essential part of "ISP" or "Network Computer" service.
- Removable media - DVDs, CDs, tapes, USB keys, external hard disk.
v.
- 2 different machines permanently linked by comms.
Distributed file system.
Internet read-write ftp, automated scripts, mirrors.
Other people back you up
Even if you back up nothing, your web pages are being backed up by other people:
- Google cache (click on "Cached")
- Yahoo cache (click on "cache")
- Microsoft cache (click on "Cached page")
- Internet Archive
Backup policy
- Periodically dump entire file system to backup.
v.
- Keep a running "mirror", and only backup things that have changed
since last time they were synch-ed.
Perhaps only backup user files.
OS, system and application files
can be recovered from install CDs / tapes.
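The "running mirror" policy can be sketched as follows: walk the source tree and copy only files that are missing from the mirror or have changed since the last synch. This is a minimal sketch under stated assumptions: "changed" is taken to mean a different size or a newer modification time, the tree contains no symbolic links, and files deleted from the source are never removed from the mirror. The names `needs_copy` and `sync_changed` are made up for illustration.

```python
import os
import shutil


def needs_copy(src, dst):
    """True if dst is missing, or differs from src in size, or is older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or s.st_mtime > d.st_mtime


def sync_changed(source, mirror):
    """Copy new or changed files from source to mirror; return what was copied."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        target_dir = os.path.normpath(os.path.join(mirror, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves the timestamp
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

Note that a mirror like this holds only the latest version of each file: if a file is damaged and the damage is synch-ed, the good copy is gone.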
Which of these is the most dangerous:
- Keep 1 synchronised copy of your files.
Backup the changes every night.
- Keep 1 synchronised copy of your files.
Backup the changes every hour.
- Take a copy of all of your files once a week.
Keep all these old copies.
Do no backups at all during the week.
- Take a copy of all of your files once a month.
Keep all these old copies.
Do no backups at all during the month.
Remember - it may take days or even months before an intrusion
and destruction, or accidental damage, is noticed.
User may realise 2 years later that he has deleted some file
and needs it back.
- VAX/VMS (DEC) could be set to keep all drafts of a file since it was created.
  - The equivalent of ls would hide all except the latest one by default, unless explicitly asked otherwise.
  - Programming with DCL would work with the latest one by default, unless explicitly asked otherwise.
  - There is a lot to be said for such an approach, now that disk space is cheap.
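The keep-every-draft idea can be imitated on an ordinary file system. The sketch below is only an imitation, not how VMS implements it: each save writes a new file with a version number appended after a semicolon (mimicking VMS-style names such as `notes.txt;3`), and reading returns the highest version by default. The function names are made up for illustration.

```python
import os
import re


def latest_version(path):
    """Return the highest version number saved for path, or 0 if there is none."""
    directory, base = os.path.split(path)
    pattern = re.compile(re.escape(base) + r";(\d+)")
    numbers = []
    for name in os.listdir(directory or "."):
        match = pattern.fullmatch(name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def save_version(path, data):
    """Write data (bytes) as the next numbered version; older versions are kept."""
    versioned = f"{path};{latest_version(path) + 1}"
    with open(versioned, "wb") as handle:
        handle.write(data)
    return versioned


def read_latest(path):
    """Read the most recent version, as the default behaviour would."""
    with open(f"{path};{latest_version(path)}", "rb") as handle:
        return handle.read()
```

A user who realises much later that they need an old draft can still open an earlier numbered version by name.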