ICL 2966 restoration during 2009
Each TNMOC project has either a working group or project team assigned to do the work. Working groups are either managed in association with the CCS (Computer Conservation Society) or solely within the Museum.
Below are updates on the 2966 restoration project during 2009.
Progress made during 2010.
20/12/2009 update from Delwyn Holroyd
After the excitement of booting earlier in the week, we immediately hit a setback on Saturday. The one working EDS80 drive has developed spindle bearing problems, and this makes it too risky to load the bootable packs. If the spindle were to develop any play it would result in a catastrophic head crash, destroying both the drive heads and the pack. We will now need to remove a spindle assembly and do some investigation.
Also an appeal - if anybody out there has any test and alignment equipment relating to this type of drive please get in touch with the museum. The drive is a Control Data BK5 SMD type, also sold by DEC as an RM03. Specifically we are looking for a head alignment card, spindle alignment tool, TB303/304 or later FTU (field testing unit), and a CE head alignment pack.
17/12/2009 update from Delwyn Holroyd
The 2966 booted from disk this afternoon for the first time in 15 years!
Most of the day was spent testing the 2900 PI to PC interface I talked about in the last update. Once a bug in the header block I was returning to the DCU had been corrected, I was able to load a short test program that did nothing more than display a code on the DCU display. Later I loaded a program based closely on the disk bootstrap code from DCU ROM, but with some additional dumping of failure status to the DCU display. This information suggested the problem reading data from the scratch pack I had been using for testing could be down to the formatting, or lack of it, rather than a fault.
The most direct way to prove this was to try a pack known to be bootable. Since we have very few of these, I have so far avoided loading them for obvious reasons. To my surprise the DCU was able to load its bootstrap code before stopping with an error when trying to contact the SCP (the system control processor which drives the operator's screen and keyboard). This immediately confirmed the DCU, disk controller and disk drive were all working correctly! And of course it confirmed that the boot disk still contained readable data, not exactly a given after 15 years in storage.
To persuade the DCU to talk to the SCP, it was necessary to replace the interface boards at both ends with boards borrowed from other units. In the next stage of the boot process the DCU reads the initial control programs for the SCP and the microcode bootstrap for the OCP from disk and sends them to the SCP. The SCP then takes over the process and starts initialising the other parts of the system via dedicated diagnostic links. The screenshot below shows where it got to - a failure initialising the main store. This occurs before the OCP is taken out of reset, so we still don't know what problems may lie in wait there. Since the SCP is now running control software, it can be switched into engineer's mode and various low-level diagnostic commands can be entered (which proves the keyboard works).
The next job for the PC interface will be to load a program to read and dump the contents of important diskpacks at the lowest possible level. This will provide us with the means to recreate these packs in the future, or indeed to emulate the disk hardware. It will also be able to function as a virtual card reader, line printer or magtape deck, providing a useful means to get data in and out of the system.
See ICL 2966 console showing the boot messages
14/12/2009 update from Delwyn Holroyd
Over the last couple of weeks I've been developing a 2900 PI (peripheral interface) to PC parallel port interface, and Saturday saw the first testing on the machine. The 2900 interface is quite straightforward: electrically it is differential TTL, and logically consists of in and out 9-bit data buses plus some control signals. The handshaking is asynchronous, so there are no special timing requirements. This makes it very easy to interface to a PC parallel port. I had done only very limited testing prior to Saturday using my DCU board testing rig and a P247 board from the DCU to drive the 2900 end of the interface. This revealed a data bit stuck at 1, caused by a failed receiver IC on the P247 board. Luckily a working spare was located, and the failed one has been added to the large pile of faulty boards awaiting attention.
Several different types of peripheral use the 2900 PI interface, but for booting purposes the DCU expects an MT (magnetic tape) deck with a bootable tape loaded, so this is what the software running on the PC emulates. Testing at full speed revealed a problem with the interface board: glitches on the strobe line were causing stray characters to be sent to the DCU. However, prior to hitting this problem the expected "autoread" command was received, a "tape mark" returned to the DCU and then another "autoread" received, so the behaviour was generally as expected. Hopefully this issue with the interface board has now been resolved, and it will be possible to load a test program into the DCU next time.
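The exchange described above (an "autoread" answered by a "tape mark", then a second "autoread") can be pictured with a toy model. This is only a sketch under stated assumptions: the class name, the string commands and the reply tuples are all hypothetical, and neither the real command encoding nor the 9-bit bus handshaking is modelled.

```python
# Toy model of a tape deck emulator answering successive "autoread" commands.
# All names and reply formats here are hypothetical, for illustration only.
TAPE_MARK = object()  # sentinel standing in for a tape mark on the emulated tape


class EmulatedTapeDeck:
    """Serves the records of an in-memory 'tape' one per autoread command."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0

    def handle(self, command):
        # Anything other than "autoread" is outside this sketch.
        if command != "autoread":
            return ("error", None)
        if self._position >= len(self._records):
            return ("end_of_tape", None)
        record = self._records[self._position]
        self._position += 1
        if record is TAPE_MARK:
            return ("tape_mark", None)
        return ("data", record)


# Hypothetical tape: a tape mark followed by one block of program data.
deck = EmulatedTapeDeck([TAPE_MARK, b"test program bytes"])
first = deck.handle("autoread")
second = deck.handle("autoread")
```

Here the first "autoread" yields a tape mark and the second yields the data block, which mirrors the order of events seen during testing; what the real tape holds after the tape mark is an assumption.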
The first test program will be used to diagnose why it is still not possible to boot from disk. The entire disk controller was swapped out for two different spares. Both of these get further into the boot process, but both fail in the same way during the data transfer phase. There are various status words that might reveal the cause of the problem: the test program will read these and pass the results back over the 2900 PI interface to the PC.
01/12/2009 update from Delwyn Holroyd
After a successful test of two more repaired DCU boards, this week was all about disks.
There are several types of disk drive used with the system: the two removable types are EDS80 and EDS200 (holding 80 MB and 200 MB respectively). We have useful system and engineering software on both types of pack, although at this stage the engineering software will be more useful to assist in fault-finding the other components of the system, and this is on EDS200. Sadly all the EDS200 drives are in a much worse state than the EDS80s!
Most have worn bearings, but the most serious problem is rust and decomposed foam from the drive panels, both of which have found their way into the forced air path inside the drive. In operation the drive heads fly above the surface of the disk on a cushion of air, so it's vitally important this air is clean. The first job with these will be to remove the foam debris and any flaking rust and paint in the air path.
By contrast the EDS80 drives are in a much better mechanical state. At some stage the foam in these will need replacing too, but for now it isn't causing a problem. One of these drives in particular is known to be in a good state, and after a precautionary period of air scrubbing (allowing air to circulate through the internal filters to catch any foreign particles) we loaded a spare pack.
Although this spare pack didn't contain a bootstrap, the machine should have been able to read whatever was on the first few sectors before stopping to say it didn't find anything valid to run. We were not entirely surprised to find it didn't do that! Instead we still had error code 20 on the DCU display panel - meaning the machine couldn't communicate with the disk drive. So progress of a sort: instead of suspecting there would be a problem we now know there is one!
Led by the intrepid Adam, an excursion was made into our other main storage building. We were hoping to locate a missing tester unit for the EDS80 drives, which would help to prove the drive functionality independently of the rest of the system. Nobody at the museum recalls ever seeing it, but our system is thought to have had one. We didn't find it this time, but there are still a couple more storage areas to search so we remain hopeful.
Luckily we do have a tester for the EDS200 drives, and the cables for this have now been repaired.
22/11/2009 update from Delwyn Holroyd
A lot of progress has been made today. The first job was to test two repaired P270 boards in the machine. This is the type of board responsible for the '4C' fault code I talked about in the last update. Prior to repair these two boards had been causing different faults, but in each case dead chips were found and replaced, and both boards now check out on the test rig so I was fairly confident both would work in the machine.
Surprisingly a lot of the parts used in the DCU are held in stock at major electronics suppliers. These are actually the original parts and not a more modern equivalent. One of the parts required for the repair is no longer manufactured, but was easily located on the internet. Sometimes when restoring electronic equipment the view is taken that only exactly matching parts should be used, even down to matching the manufacturing date codes on chips. We rejected this approach for a number of reasons, primarily because it would make repair of a machine on this scale impossible, but also because boards were routinely repaired at the factory and re-used throughout the working lives of these machines, so it isn't unusual to find boards with newer replaced components.
The machine was powered up with the first repaired board and straight away we advanced to a new fault code - 18! This was expected, because we'd actually been here a month ago. The P270 previously in use had come from spares and worked for a short while before developing a fault. Disappointingly the other repaired board still doesn't work in the machine, so further diagnostics will be required on the test rig.
I had developed a theory that another board type, P245, would prove to be the cause of fault 18, but once again we only had one "working" board. All the others tested from spares had other faults that stopped the self-test earlier in the sequence. However another search was made and an apparently unused P245 was found! This was fitted in the machine and now finally the DCU display stops at 20, which is the expected behaviour and means that the boot device is unavailable.
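For reference, the display codes mentioned in these updates can be collected into a small lookup. This is a sketch only: the table holds just the three codes reported here with the meanings given in the text, it is not a complete ICL code list, and the names `DCU_DISPLAY_CODES` and `describe` are made up for illustration.

```python
# Codes and meanings as reported in these updates; not a complete ICL list.
DCU_DISPLAY_CODES = {
    0x4C: "self-test failure traced to a P270 board",
    0x18: "self-test failure cleared by fitting a different P245 board",
    0x20: "boot device unavailable",
}


def describe(display_text):
    """Interpret the two-digit hexadecimal display, e.g. '4C' or '20'."""
    code = int(display_text, 16)
    return DCU_DISPLAY_CODES.get(code, "not covered in these updates")


status = describe("20")
```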
We have a lot of work to do before we can make a disk drive available for boot. There is only one engineer's disk pack, and no way of copying it other than on the machine itself. There will probably be additional faults to locate in the disk controller and the disk drives - any serious fault will prevent the machine booting, and we need two working drives to copy the pack.
To minimise the risk of damage to the engineer's pack, we are developing a strategy to properly test the disk controller and a disk drive using a spare disk which can be safely overwritten or, in the case of a head crash, destroyed. A head crash occurs if a disk head comes into contact with the surface of the rapidly spinning disk, and usually ruins both the head and the disk. In order to do this, we'll need to load a test program into the DCU instead of the normal bootstrap code. I'll go into how this will be achieved in a future update.
Also today the third processor cabinet was recommissioned. One of its power supplies had been borrowed to keep the DCU running, but in the meantime Phil H has repaired several, so these needed to be tested. First of all we pulled all the logic boards from the racks just in case! This cabinet hasn't been run for some time, and the fan motor, which had been in a poor condition, decided not to run at all. It has several failed windings, which means the motor will only start if it previously stopped in the correct place. Rebuilding a spare fan tray took a couple of hours and was incredibly fiddly. A number of museum visitors witnessed the frustration at first hand!
Once this was installed work could continue on testing and adjusting the power supplies. All the repaired units proved to work and the output voltage could be adjusted - this was something Phil H was unable to fully test outside of the machine, because the sensing and control logic is in the cabinet and not in the power supply itself. With all the logic boards back in place and repaired supplies installed, the cabinet was powered up with fingers crossed. The 5V power supply consists of several units connected in parallel onto thick metal busbars, providing in total up to 375A. These need to be balanced out so that each supply unit shares the load. The procedure involves tweaking the voltage adjustment on each unit until the currents are roughly equal - the current is estimated from the voltage drop across the short link from the supply to the busbar - at 100A or so this voltage drop is easily measurable!
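The current estimate used in the balancing procedure is just Ohm's law applied to each link. A minimal sketch, assuming a link resistance of 0.5 milliohm and three made-up voltage readings (neither the real link resistance nor any measured values are given here):

```python
# Estimate per-supply current from the voltage drop across each busbar link.
# LINK_RESISTANCE_OHMS is an assumed value for illustration only.
LINK_RESISTANCE_OHMS = 0.0005  # 0.5 milliohm (assumption)


def link_current(voltage_drop_volts, resistance_ohms=LINK_RESISTANCE_OHMS):
    """Ohm's law: I = V / R."""
    return voltage_drop_volts / resistance_ohms


def share_report(drops_mv):
    """Return the per-supply currents in amps and the highest-to-lowest spread."""
    currents = [link_current(mv / 1000.0) for mv in drops_mv]
    return currents, max(currents) - min(currents)


# Hypothetical readings in millivolts, one per supply unit.
currents, spread = share_report([50.0, 45.0, 55.0])
```

With these assumed figures a 50 mV drop corresponds to about 100 A, and the spread between the busiest and idlest unit is about 20 A; the voltage adjustments would be tweaked until that spread is small.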
20/11/2009 update from Delwyn Holroyd
Slow but steady progress has been made on the 2966 over the past several months. Four of the five main processor cabinets have been cleaned out and can now be powered up. Pete H has made a start on replacing the bearings in the fan motors, as these are all in a sorry state. Phil H has recently fixed several of the large 5V power supplies. A few of these have failed since we started working on the machine, and we had completely run out of spare ones. Pete H has built an air duct to direct as much heat as possible from the processor cabinets out of the window. It has a flap at one end that can be opened to allow heat into the room on cold days!
Work has been started on a number of the disk drives. The foam used to line the panels has rotted with age, and all needs to be cleaned out. Particles of foam and disk heads do not make a happy combination!
One of the two System Control Processors (SCP) has now been repaired. These are the pedestal units with a video monitor and keyboard on top. The SCP is a 16-bit minicomputer in its own right, running at about 10MHz. It functions both as an operator's console when the system is running, and as a diagnostic tool when it isn't. The SCP is intimately involved in the IPL process for the machine so it's vital to have it working before anything else can be done (IPL = Initial Program Load, ICL's term for booting up). Only one is required for operation, so the repair of the other is a lower priority.
The main processor cabinets hold several different functional units: OCP (Order Code Processor = CPU), SCU (Store Control Unit), several DCUs (Device Control Units), and SM64 (Store Module, or RAM). Of these, only one DCU is required during the first stage of the boot-up process. The DCU contains another 16-bit minicomputer and other specialised hardware to handle peripheral transfers between the main store and disks, magnetic tape, card readers, line printers and so on. The first stage of the boot process is handled by ROM code in the DCU, and it simply loads a 2K bootstrap from one of the disk drives into the program memory of the DCU. Later on it loads a program into the SCP, which then in turn initialises the OCP and other units. Our machine has 8MB of Main Store (RAM); up to 16MB could be fitted. 'SM64' means it is built from 64kbit chips. The OCP is microcoded and can behave either as a 2900 machine, or as one of the older 1900 series machines. In fact it can do both at once and run two completely separate operating systems, sharing access to peripherals. In this respect it is not dissimilar to today's trend for partitioning servers into several virtual machines using a hypervisor.
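To give a feel for the scale of the main store, here is a back-of-envelope count of 64kbit chips. It assumes 1MB = 1024 x 1024 bytes and counts data bits only; any extra chips for parity or error checking are not described here, so they are left out.

```python
# Rough count of 64kbit memory chips needed for a given store size.
# Counts data bits only; check bits, if any, are not included (assumption).
BITS_PER_BYTE = 8
CHIP_BITS = 64 * 1024  # one 64kbit chip


def data_chips_needed(megabytes):
    total_bits = megabytes * 1024 * 1024 * BITS_PER_BYTE
    return total_bits // CHIP_BITS


fitted = data_chips_needed(8)    # store size of this machine
maximum = data_chips_needed(16)  # largest store that could be fitted
```

On these assumptions the 8MB store needs 1024 chips for data alone, and a fully populated 16MB store would need 2048.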
The current goal is to get through the self-tests in the DCU, which verify the functionality of most of the DCU hardware. These must all be passed before the DCU attempts the bootstrap. The progress is indicated on a two digit hexadecimal display on the DCU control panel - currently we are stuck at 4C, which is the last but one test. We are lucky that we have more DCU boards than anything else, since there are three in the machine. Unfortunately, most spare boards we have tried also have faults! We are now at the stage of needing to repair individual boards. I have built a rig to enable a board to be powered up and have signals driven to and from it via an FPGA (a type of modern programmable logic chip). The logic diagrams for all the boards in the DCU have been located in the ICL documentation archive at the museum, along with a number of low-level technical descriptions. Without this documentation it wouldn't have been feasible to attempt this kind of repair.
An extra complication is ICL's practice of referring to off-the-shelf parts using an internal part code. The DCU is built using standard 7400 series TTL and STTL logic, a very common technology throughout the 1970s and 1980s, and still easily available today. However, on a logic diagram a 74S74 D-type flip-flop is referred to as a 'DZ', and in many cases this is stamped on the chip instead of the normal code. Luckily some of the boards contain conventionally marked parts here and there, and so with some detective work an equivalence list has been built up that now covers almost everything.
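An equivalence list of this kind is essentially a lookup table from ICL code to standard part number. A minimal sketch, containing only the one pairing given above; the names `ICL_TO_STANDARD` and `standard_part` are made up, and the real list is of course far longer:

```python
# ICL internal part code -> standard part number.
# Only the single pairing mentioned in the text is included here.
ICL_TO_STANDARD = {
    "DZ": "74S74",  # dual D-type flip-flop
}


def standard_part(icl_code):
    """Return the standard part number, or None if the code is not yet identified."""
    return ICL_TO_STANDARD.get(icl_code.strip().upper())


part = standard_part("dz")
```

Returning None for an unknown code keeps unidentified parts visible rather than guessing at an equivalent.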
Most faulty boards that have been diagnosed so far have proven to have one or two faults, either faulty chips or dry joints (where a soldered joint is not making reliable electrical contact). Most of these are probably due to natural ageing: the boards date from around 1977 to 1984 and each one typically contains around 100 chips.
As the discussion above probably makes clear, there is still a long way to go. The operational state of most of the other processor units is currently unknown: the OCP itself will do nothing other than generate a lot of heat until we get further into the boot process. Similarly the disk controller and disk drives have not yet been tested: this area could prove to be problematic.
Once we have a working DCU, SCP and disk drive, fault-finding on the remainder of the system should become a lot easier. ICL designed the hardware with extensive diagnostic facilities, and we have a suite of engineering test software on disk that can in theory pin-point faults to board level or lower.
Work has now started on the long and difficult task of restoring an ICL 2966 mainframe from the early 1980s to its former glory in the large systems room. This involves stripping down all the control cabinets, disk drives, consoles, mag tape units and punch card readers to assess what work is required. Once all the units have been assessed, work will begin on getting some of the core systems working until we eventually have enough to start running diagnostic software on it.
The project involves several museum volunteers, with invaluable help from three ex-ICL/Fujitsu engineers, one of whom, we recently found out, actually decommissioned the system 15 years ago when it was being used by Tarmac Plc. This is very much a long-term project which could easily take several years to complete.
The most recent work was the installation of a separate three-phase supply and several 16A and 32A mains outlets, which the main cabinets require. It has been estimated that a running system would require between 10 and 15kW of power!