CHAPTER 5 Addendum

Data Processing in the '70s and Later

Analysing size-density data, when I conducted the international study in the early 1970s, was very different from what it would be today. It was much more labor-intensive in terms of data retrieval and analysis. It usually involved many more people (volunteer student assistants, staff in the campus Computer Center, perhaps staff at the campus Bureau for Faculty Research). And it meant a lot of running around from the library to my office to the computer center.

Here are the data for Algeria (Fig. 5-3), from the Britannica World Atlas, a bulky 11-by-16-inch volume of maps and statistical tables.

5-3.AlgeriaEB.gif (8k)

The standard procedure was to write out the needed data on a data entry form, a ruled sheet which indicated the fields needed in the keypunching process (e.g., a row for each territorial unit, with columns 1-20 for the unit's name, 21-30 for its area, 31-40 for its population). I included the name of each division, both as an aid in possible later error correction and in order to readily identify any peculiar "outliers" or other oddities. The only problem here was to choose between the populations shown for 1954 and 1960; I invariably chose the most recent data, so I picked the 1960 census. Writing it all out was simply tedious.
first lines of a typical hand-written data sheet, to be keypunched later

The filled-in data entry forms were then taken to the campus Computer Center for keypunching. If you had a grant you could pay people to do such work; if not (my case), you punched the data yourself on a machine like the one shown here.

IBM 026 Keypunch Machine
IBM 360 Mainframe Computer

Blank cards were stacked in a hopper on the right of the keyboard and fed, one at a time, into the "punching station" just behind the keyboard. Data were keyed in from the appropriate row (territorial division) on the data entry sheet. When the card for a given territorial division was finished, it was shoved through a "reading station" and into the card hopper on the left.

IBM card for Algeria's territorial division, Alger, with area and 1960 population
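Today, reading such a fixed-column card image takes only a few lines of code. Here is a minimal sketch in Python (standing in for the FORTRAN or BASIC of the day), assuming the column layout described above; the sample card text is a reconstruction, not a transcription of the actual card.

def read_card(card):
    """Slice one 80-column card image into (name, area, population),
    using the layout from the data entry form: columns 1-20 name,
    21-30 area, 31-40 population."""
    card = card.ljust(80)              # pad a short line out to a full card
    name = card[0:20].strip()
    area = int(card[20:30])
    population = int(card[30:40])
    return name, area, population

# hypothetical card image for Alger, laid out in the same columns
sample = "ALGER".ljust(20) + "3393".rjust(10) + "1114362".rjust(10)
print(read_card(sample))               # ('ALGER', 3393, 1114362)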

Ordinarily with such data, you punched the whole set twice, then ran both sets through a verifier to check for punching errors. Once the entire set of data cards (1764 territorial units, 98 nations) was ready, you wrote the program -- a series of statements in FORTRAN, PL/1 or BASIC -- and punched those onto cards as well, one card for each line of the program. A couple of "job entry" cards (identifying you, your project, etc.), the program deck and the data deck were then turned in at the receiving desk of the Computer Center. The staff of the center would then process your deck on the IBM 360 mainframe computer.

Turn-around time might be several hours or perhaps a whole day. Often what you got back was a list of errors (project identification errors, programming errors, unreadable-card errors), any of which would terminate your job. This meant going back through everything, discovering the problem, then going back to the Computer Center to repunch the offending cards and resubmit. If you were lucky, sooner or later you could pick up results such as these, printed on "greenbar" paper (the alternating green bars helped the eye follow lines across). The printer was really a glorified typewriter - no font variations, no pictures, just text like this:

 

ALGERIA

B  =         -0.73383                  MEAN Y  =     4.46488
A  =          5.38013                  MEAN X  =     1.24723
R  =         -0.91830                  VAR Y  =      0.47198
R-SQUARE  =   0.84328                  VAR X  =      0.73912
T  =         -8.36370

DIVISION       AREA   POPULATION      DENSITY      LOG A      LOG D
ALGER          3393      1114362    328.42971    3.53058    2.51644
ANNABA        25367       735966     29.01273    4.40427    1.46259
BATNA         38494       478165     12.42181    4.58539    1.09418
CONSTANTIN    19899      1218410     61.22971    4.29883    1.78696
EL ASMAN      12258       631109     51.48548    4.08842    1.71168
MEDEA         50331       630573     12.52852    4.70184    1.09790
MOSTAGANAM   117432       613748      5.22641    5.06979    0.71820
ORAN          16438       872314     53.06692    4.21585    1.72482
SAIDA         60114       191963      3.19332    4.77898    0.50424
SETIF         17405      1005789     57.78736    4.24067    1.76183
TAGDERMPT     25997       267409     10.28615    4.41492    1.01225
TIZI-OUZOU     5806       803693    138.42456    3.76388    2.14121
TIEMCEN        8100       375531     46.36185    3.90849    1.66616
LASAOURA     789660       156664      0.19839    5.89744   -0.70247
OASIS       1297050       333830      0.25738    6.11296   -0.58943
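For anyone curious what the program itself had to compute, the quantities in that printout (slope B, intercept A, R, R-squared, T, and the means and variances of X and Y) all come from an ordinary least-squares fit. The sketch below is a modern reconstruction in Python, not the original FORTRAN; the printout appears to treat log density as X and log area as Y, so that is how the example calls it, and since the details of the original run are uncertain the output need not match the printout to the last digit.

import math

def size_density_stats(x, y):
    """Least-squares fit of y on x, returning the quantities the old
    printout reported: B (slope), A (intercept), R, R-squared, T (for
    the hypothesis B = 0), plus the means and variances of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    b = cov / var_x
    a = mean_y - b * mean_x
    r = cov / math.sqrt(var_x * var_y)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return {"B": b, "A": a, "R": r, "R-SQUARE": r * r, "T": t,
            "MEAN X": mean_x, "MEAN Y": mean_y,
            "VAR X": var_x, "VAR Y": var_y}

# The AREA and POPULATION columns from the printout above.
areas = [3393, 25367, 38494, 19899, 12258, 50331, 117432, 16438,
         60114, 17405, 25997, 5806, 8100, 789660, 1297050]
pops = [1114362, 735966, 478165, 1218410, 631109, 630573, 613748, 872314,
        191963, 1005789, 267409, 803693, 375531, 156664, 333830]
log_a = [math.log10(a) for a in areas]                      # LOG A column
log_d = [math.log10(p / a) for p, a in zip(pops, areas)]    # LOG D column
print(size_density_stats(log_d, log_a))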

While the computer's printer couldn't produce actual graphics (I later did have access to a graphics plotter), it could be programmed to approximate a scatter diagram. You had it determine the maximum and minimum values of X and Y (log density, log area), then assign these to the rows and columns at the top, left, bottom and right of the printing that would appear on a single page of greenbar paper. Next you had it convert the calculated X and Y values for each territorial unit (a simple ratio problem) into the appropriate cell, determined by row and column, of the printing surface. If two or more units occupied the same cell, you had to show that through incrementation (like the "3" in the case of Algeria):

   

With a pencil, a ruler and a little further effort you could estimate the approximate locations on the axes of integral unit values (marked here in blue). By locating two other points, the point of means (MEAN X, MEAN Y), through which the least-squares line always passes, and the intercept (X = 0, Y = A), you could also draw the regression line (shown in red here). These niceties were simply drawn on with a felt-tip or ballpoint pen.
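In modern terms, the printer-plot procedure described above amounts to something like the following sketch (again Python rather than FORTRAN; the grid dimensions and plotting symbols are guesses, not the original routine): scale each unit's X and Y values to a character cell, and when two or more units land in the same cell print a count instead of a single mark.

def text_scatter(xs, ys, width=60, height=25):
    """Approximate a scatter diagram on a character grid, the way the
    line printer did: each (x, y) pair is scaled to a row/column cell,
    and cells holding more than one point show a count instead of '*'."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # the "simple ratio problem": position within the range -> cell index
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))  # top row = max Y
        cell = grid[row][col]
        if cell == " ":
            grid[row][col] = "*"
        elif cell == "*":
            grid[row][col] = "2"
        else:
            grid[row][col] = str(int(cell) + 1)   # incrementation, like the "3"
    return "\n".join("".join(line) for line in grid)

# e.g. print(text_scatter(log_d, log_a)) with the lists from the earlier sketch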

   

Finally, you could refer the computed value of T (-8.36370 in the printout above) to a table like this one, from the CRC Standard Mathematical Tables, 27th ed., p. 546:

The "degrees of freedom" for 15 cases is 13 (N - 2), and the t-value (ignoring the minus sign) exceeds the largest value shown, so for Algeria we reject the null hypothesis (no relation between size and density) at p < .0005; essentially, this means Algeria strongly supports the size-density hypothesis, which is obvious from the scatter diagram.

If I wanted a fancy version of my Algeria scattergram, say for publication, I took my sketch (the greenbar printout with the added blue and red ink) to the campus Bureau for Faculty Research. There a professional graphic artist (Joy Dabney) would place it on a light table and trace a chart in India ink on vellum. The result looked something like this:

Beyond the '70s

In the late 1970s I got a $195 Hewlett-Packard HP-25 scientific programmable calculator (shown here at half size). It enabled me to do a lot of my size-density work independently of the keypunch and mainframe at the campus Computer Center, in my office or while working on a tan in the backyard.

You can read about the joys of programming this wonderful toy here. And there is a very clever Java simulation of the device, by Larry Leinweber, here.

The main drawback to using this handy little (6 oz) machine was that you had to enter all the area and population figures for a country's territorial divisions each time you ran the analysis. There was no way to store entered data for correction or further use. Another problem was that there was no printer. The program would compute the logs, accumulate the sums of squares and cross-products, compute the means and variances, then store these in various statistical registers. You had to write down these results or lose them.

With such results you could compute the t-test (also programmed) for the null (b = 0) or the theoretical (b = -2/3) hypothesis. This freed me from having to look up computed t-values in tables like the one above. The program, which worked by evaluating the definite integral of the t distribution for the specified t-value and degrees of freedom, produced an exact probability -- i.e., not just "p < .0005" for Algeria, but "p = .0000014" (I liked this, but it threw many reviewers for sociological journals, who knew only the ".05" way of reporting results; most editors made me report results the old way).
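The calculator program itself is long gone, but the calculation it performed is easy to sketch today. The example below (Python with scipy standing in for the HP-25's numerical integration of the t density) tests a slope against either hypothesized value and returns an exact two-tailed probability; the standard error here is backed out from the printout's R and T rather than recomputed from raw data.

import math
from scipy import stats

def slope_t_test(b, se_b, n, b0=0.0):
    """t test of a regression slope b against a hypothesized value b0
    (b0 = 0 for the null hypothesis, b0 = -2/3 for the size-density
    theory), returning the t statistic and an exact two-tailed p."""
    t = (b - b0) / se_b
    df = n - 2
    p = 2 * stats.t.sf(abs(t), df)      # exact tail area, not just "p < .0005"
    return t, p

# Algeria: back the slope's standard error out of the printout's R and T,
# then test both hypotheses. The null-hypothesis case reproduces T = -8.36
# and an exact p near the .0000014 mentioned above.
b, r, n = -0.73383, -0.91830, 15
se_b = b / (r * math.sqrt((n - 2) / (1 - r * r)))
print(slope_t_test(b, se_b, n, b0=0.0))
print(slope_t_test(b, se_b, n, b0=-2/3))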

The next major technological change came when I obtained, with the help of the National Science Foundation, a pair of the very first model IBM PCs in 1982. With these I could store data once and for all. I could easily write and correct BASIC programs and store these. I could transfer information from the machine in my office to the one at home, and back, with floppy disks. It was wonderful.

Initially I had no graphics display, so I still had to pretend, in text mode only, that I was creating scatter diagrams as on the Computer Center's printer, scaling computed XY-values to the number of lines and columns of the screen display. Later, with a graphics card and monitor, I could write programs which would do virtually anything I could think of. The only tedious thing at this point was initial data entry. Fortunately, as my poor vision proceeded to get worse, I had more and more student help.

In 1984 the fellow who succeeded me as department chair offered me a new computer, the very first Macintosh. He loved it (no pesky DOS commands to learn) and thought I would, too. It was awful. All the memory (128k!) was taken up by the graphic display of text. I couldn't run any data analyses because it simply couldn't handle the data sets and programs, which were not very demanding. He later showed up with a 512k model which would do the job, for the most part, so I accepted his (the department's) generosity. Thus began a long-term fondness for Macs, reflected in this display of the machines I've used. The last one the department supplied was the IIfx which, ten years later, is still functioning as my office machine (though I don't ask much of it).
1984 - Mac 512k
1987 - SE
1990 - Mac IIfx
1995 - PowerMac 7200
1998 - PowerMac G3
2000 - PowerMac G4 Cube

Most of the programming I've done since the days of the IBM 360 has been in BASIC. It was quite adequate for all my size-density research and for most other projects as well. But with each advance in the Macintosh computer, the BASIC application has become less and less stable (I presume because no one was writing new versions). Now it's as likely to crash the system as not.

In contrast, graphics programs have improved dramatically since I started using Macs. I have gone through successive versions of Deneba's "Canvas" - a bit-map and object-based application which enabled me to take graphics generated from BASIC programs or Excel spreadsheet charts (and, more recently, scanners) and fix them up a great deal. GraphicConverter, another Mac (shareware) application, has also been very useful.

The arrival of the internet has dramatically changed methods of obtaining data. You can get lots of area/population figures now, and maps, directly from their sources, without any of the tedious data entry of the old keypunch era. The internet also permits you to disseminate results easily, as I am doing here, using whatever graphics you wish (even animations) and without going through the endless and usually pointless process of submitting your work to editors who often know less about your subject and techniques than you and your readers do.