Below you will see updates on the 2966 restoration project during 2010. To see the progress made in 2009 click here.
23/08/2010 update from Delwyn Holroyd
Having proved the route for capturing data from a diskpack last week, it was now time to transfer the diagnostic pack. This contains all the diagnostic test software for the system and is one of the few bootable diskpacks we have. Securing the data on this pack was one of the main reasons for building the interface to the drive.
Thankfully the transfer went without incident, and with no CRC errors, so we now have a perfect copy of the data. To understand the format it's been necessary once again to piece together snippets of information from various sources. The main section of interest is the SCF (System Control File), containing not only the usual components required for IPL (Initial Program Load - or booting) but also the CTS (Customer Test Software) suite. The format appears straightforward and it should be easy to extract all the various components.
The next task is to produce a virtual magnetic tape containing the CTS components, similar to the virtual tape we have been using to boot the system from a laptop. For a normal IPL, only the level one bootstrap is specific to tape or disk, and subsequently loaded components contain all the necessary code to continue the process from either medium. We know that it was possible to run CTS from tape, so the hope is that the CTS software is structured in a similar way and that we have all the necessary components from the copy on disk.
Once the diagnostics can be run from a laptop, we can get started on fault-finding the OCP.
19/08/2010 update from Delwyn Holroyd
During the week I implemented a memory buffer and UART interface in the FPGA of the EDS80 interface. I was able to partially test this without access to the real drive. Once at the museum, after clearing a few simple bugs I was able to issue seek and read track commands to the drive. A download command is then used to read the data for an entire track to the PC, very slowly due to the 115kbps UART! It took several hours to capture the whole of one of the scratch packs.
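As a rough illustration of why the download is so slow, the sketch below estimates the transfer time over a UART. The figures are assumptions rather than details from this log: a line rate of 115,200 baud, 8N1 framing (10 bits on the wire per byte), no protocol overhead, and a nominal pack size of 80 MB.

```python
# Back-of-envelope UART transfer time.
# Assumptions (not taken from the log): 115,200 baud, 8N1 framing
# (start bit + 8 data bits + stop bit = 10 bits per byte), no protocol
# overhead, and a nominal 80 MB payload.

def transfer_seconds(payload_bytes: int, baud: int = 115_200,
                     bits_per_byte: int = 10) -> float:
    """Time needed to push payload_bytes through the serial link."""
    return payload_bytes * bits_per_byte / baud

pack_bytes = 80 * 1_000_000  # assumed nominal capacity
hours = transfer_seconds(pack_bytes) / 3600
print(f"{hours:.1f} hours")
```

Under these assumptions the result is just under two hours, and that is a lower bound. Raw track data includes headers and gaps, and any encoding used on the serial link adds further overhead.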
I then post-processed the captured data and used the CRC codes to check that it had been read correctly. I had no idea what data we would find on this scratch pack (the term means it contains no useful data and can be overwritten), but a binary dump clearly showed some non-random looking data. Security erasure of disks, in other words overwriting with zeroes or random data, was not common in the commercial mainframe world at the time.
Looking at the data in ASCII revealed nothing that looked like a string - this was expected since 2900 machines use the EBCDIC character set natively, and 1900 code when emulating a 1900 machine. No strings turned up in an EBCDIC dump either, and since the machine ran in 1900 emulation mode at Tarmac I strongly suspected the pack would be in 1900 format. However we don't have a full specification for the format, which is different to a native 1900 format disk pack. On a 2900 there is a 'wrapper' around the 1900 data in the form of reserved cylinders used for IPL (booting the machine) and flaw management (allocating alternate tracks if some tracks go bad).
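The search for strings in both character sets can be sketched as below. This is only an illustration, not the tool actually used: it relies on Python's built-in `cp037` codec as a stand-in for EBCDIC (the exact EBCDIC variant used on 2900 machines is an assumption), and `find_strings` is a hypothetical helper name.

```python
import re

def find_strings(data: bytes, codec: str, min_len: int = 4) -> list[str]:
    """Decode data with the given codec and return runs of letters,
    digits and spaces at least min_len characters long."""
    text = data.decode(codec, errors="replace")
    return re.findall(r"[A-Za-z0-9 ]{%d,}" % min_len, text)

# A short sample: some binary noise around an EBCDIC-encoded message.
sample = bytes([0x00, 0xFF, 0x01]) + "HELLO WORLD".encode("cp037") + bytes([0x00])

print(find_strings(sample, "ascii"))  # nothing readable as ASCII
print(find_strings(sample, "cp037"))  # the message appears as EBCDIC
```

If neither dump shows any strings, as happened here, the data is probably in some other encoding.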
I pieced together enough details from various documents to figure out which cylinders contain the 1900 format data, and this seemed to correspond to the data I was looking at. I then started trying various 24-bit unpacking methods to extract 6-bit characters in 1900 code. Nothing worked! No sensible strings emerged no matter what I tried.
I had almost given up when I found an #XPJK dump of an EDS80 system disk from the 2966 machine in Russia. XPJK is a system utility program that prints information on the geometry of a disk pack and shows where files are physically allocated. The printout revealed the number of 1900 blocks per cylinder used on an EDS80, and by comparing this with the amount of data I had captured per cylinder, I realised that the data on the disk pack was not actually packed at all. Instead, each byte contains a 6-bit 1900 character with random values in the top two bits (which is why this wasn't immediately obvious!).
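The difference between the two layouts can be sketched as follows. The first function is one plausible 24-bit unpacking (four 6-bit characters per 3-byte word, most significant first, which is an assumed ordering). The second is the 6-in-8 layout described above, where the top two bits of each byte are simply discarded. Mapping the resulting 6-bit values to printable 1900 characters needs a character table, which is not reproduced here.

```python
def unpack_packed24(data: bytes) -> list[int]:
    """Split each 3-byte (24-bit) word into four 6-bit values,
    most significant first. Trailing bytes that do not fill a
    whole word are ignored."""
    out = []
    for i in range(0, len(data) - len(data) % 3, 3):
        word = int.from_bytes(data[i:i + 3], "big")
        out.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return out

def unpack_6in8(data: bytes) -> list[int]:
    """One 6-bit character per byte; the top two bits are masked off
    because they carry no information."""
    return [b & 0x3F for b in data]

# Four bytes with different junk in the top two bits, same character value.
print(unpack_6in8(bytes([0xC1, 0x41, 0x81, 0x01])))
# The same four characters, packed into a single 24-bit word.
print(unpack_packed24(bytes([0x04, 0x10, 0x41])))
```

The packed form holds the same four characters in three bytes instead of four, so the 6-in-8 layout wastes a quarter of the space.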
When I decoded the data again using this information, I could immediately see the file area table, which contains some recognisable keywords, and scrolling further down sensible strings emerged!
Considering the cost of disk storage at the time, the amount of space wasted in this format is astonishing! The only explanation is that ICL intended emulation of 1900 on 2900 to be a short term solution allowing customers time to migrate their software onto 2900 natively, and so efficiency wasn't high priority. The 6-in-8 bit format is also used in the main store of the OCP when emulating 1900 code.
All the original files on the diskpack had been deleted (but not erased). The data left appears to consist mainly of transaction logs, including messages which would have been output to the operators' screens, complete with the VDU formatting codes. We didn't know a great deal about how this machine was used at Tarmac, but the logs reveal a financial application with screens for entering expenses, checking supplier accounts and so on. It may prove possible to reconstruct what some of these screens would have looked like.
08/08/2010 update from Delwyn Holroyd
During the week I further developed the firmware for the EDS80 disk interface board, armed with the information gained from testing last weekend. It now has functions similar to the FTU (field testing unit) which would originally have been used to exercise these drives. First I tested the seek functionality. The options are track-to-track seeks between defined cylinders, seeks between cylinder 0 and a fixed cylinder, and random seeks. Seeking between the first and last cylinders repeatedly causes the drive to make a tremendous noise, and it becomes clear why the power amplifier driving the voice coil which moves the heads is so large!
Systems programmers have been known to write programs which would deliberately cause drives like these to 'walk' across the machine room floor by carefully chosen seek operations.
Going beyond the FTU functionality, I've also implemented a full track format decode for ICL format disk packs. The format uses variable length sectors, unlike almost all modern hard drives which use fixed 512 byte sectors. The track decoder reads all the sector headers and checks that the cylinder and head numbers within match the expected values. It also checks the CRC to make sure the data has been read correctly. This can be done repeatedly on one cylinder/head or in sequence based on the selected seek mode.
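The kind of check the track decoder performs can be sketched in software as below. Both the header layout and the CRC are placeholders: the real ICL sector header format and CRC polynomial are not given in this log. The sketch assumes a hypothetical 5-byte header (16-bit cylinder, 8-bit head, 16-bit CRC) and uses the CRC-CCITT routine from Python's `binascii` module purely for illustration.

```python
import binascii
import struct

# Hypothetical header: cylinder (2 bytes), head (1 byte), CRC-16 (2 bytes),
# big-endian. This is NOT the real ICL format.
HEADER = struct.Struct(">HBH")

def make_header(cyl: int, head: int) -> bytes:
    """Build a header with a valid (placeholder) CRC, for testing."""
    body = struct.pack(">HB", cyl, head)
    return body + struct.pack(">H", binascii.crc_hqx(body, 0))

def check_header(header: bytes, expected_cyl: int, expected_head: int) -> list[str]:
    """Return a list of problems found; an empty list means the header is good."""
    cyl, head, stored_crc = HEADER.unpack(header)
    problems = []
    if binascii.crc_hqx(header[:3], 0) != stored_crc:
        problems.append("crc")
    if cyl != expected_cyl:
        problems.append("cylinder")
    if head != expected_head:
        problems.append("head")
    return problems

good = make_header(100, 3)
print(check_header(good, 100, 3))   # no problems
print(check_header(good, 101, 3))   # positioned on the wrong cylinder
```

Keeping the CRC check separate from the cylinder/head check distinguishes a misread header from a correctly read header on the wrong track.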
The problem of stepping over write splice points mentioned last week was quite easy to overcome as they appear to be positioned consistently. By the end of the weekend I was able to read every track on one of the scratch disk packs with consistent results. There appear to be two CRC errors on this pack, both on the same cylinder but on different heads. The drive has a couple of features to assist in error recovery: the heads can be offset a very small amount from the track centreline, and the read strobe can be set earlier or later than normal. However in this case none of these measures resulted in an error-free read. It's possible that the tracks are actually marked as flawed on the disk pack, but I will need to capture all the data and decode it to find out.
The next step is to add a memory buffer to store the data from each track, and an RS-232 UART interface to allow it to be read out to an attached laptop.
01/08/2010 update from Delwyn Holroyd
I constructed a comms loop-back plug for the 7501 terminal with the aid of the recently found and scanned schematic diagrams, and the ROM based loop-back test now works. This gives reasonable confidence that the synchronous comms interface, when built, will be able to talk to the terminal. Now armed with all the necessary information I am also developing an emulator for the terminal. This will enable the teleload process that loads the control program into the terminal from a mainframe to be investigated, using my existing George 3 emulator. Longer term the emulation of the Minicom processor will be useful as a diagnostic aid when repairing processor boards and as a base for emulations of the SCP or DCU if these become necessary.
A lot of progress has been made with the EDS80. Having cleaned the heads, the drive and the (correct) scratch disk pack it was time to try loading the heads on this drive for the first time in eleven years. Thankfully this went without incident and the drive became ready. Once the interface board was hooked up it became clear quite quickly that some faults existed on the drive: one of two clock signals was not present, and when I attempted to make the drive seek more than one track at a time the carriage moved all the way to the end stop (with a tremendous thunk that caused some momentary panic!) and the drive returned a seek error.
The clock fault seemed like it would be straightforward to diagnose so I looked at that first. Referring to the diagrams and wire lists in the manuals it seemed the signal was present on the wire-wrap backplane at the source, but not at the other end, and there was no continuity between the two pins, indicating a backplane fault. At this point I decided it might be easiest to swap the entire logic chassis from the previously working drive. However there was no continuity between these two pins on the other backplane either! Closer examination showed there was no wire between the pins and the signal was actually sourced from a different pin on the clock generation board. The schematic for the clock board did indeed show that pin as being connected to something, but when I traced the circuit back on the board it didn't correspond to the schematic. This is a real pain because it means we can rely on neither the backplane wiring diagrams nor the board schematics to correspond to our drives! Further research will probably be necessary.
Anyway, it was now clear the clock fault lay on the clock board and simply swapping that board over fixed the issue. Handily this also cured the seek fault and the drive now seeks forwards and backwards to any cylinder.
Using the primitive capture facility implemented so far for the interface board I was able to verify correct looking data coming off all the heads of the drive. The first data written on each track is the 'home address' which contains the cylinder and head number, so I was also able to check that the drive had in fact seeked to the correct cylinder too.
During testing one difficulty came to light which will make the data capture process more complicated than I had hoped. When a track is newly formatted the data is written all at once with no discontinuities, but once sectors have been modified discontinuities exist where the drive starts and stops writing to the track. These discontinuities can cause the decoding logic in the drive to lose sync so it isn't possible to simply read the entire contents of a track at once and analyse it later as I had planned; the firmware on the interface board will need to decode the sector format and turn the read gate off and on around possible write splice points. This is not too difficult to implement but the documentation omits to mention where the nominal splice points are. The actual splice points as read back will in any case vary based on head position tolerances and factors in the controller so some experimentation might be necessary to discover the safe window for reading.
12/07/2010 update from Delwyn Holroyd
Following the head crash last week, I examined the affected head in more detail with a USB microscope, and concluded that it will need to be replaced. However, since we don't currently have the necessary re-alignment tools and special disk pack required, this will have to wait. (We are hoping to acquire these tools.)
In the meantime, I transferred the reconditioned spindle from this drive into one of the others. I also replaced the drive motor bearings in the new drive and gave it a thorough clean. I checked the heads with the microscope and found they were very dirty - they will need cleaning before use. The drive was run for several hours without heads loaded to run in the new bearings.
Whilst that was happening, I turned my attention once again to the 7501 terminal. After checking the documentation in more detail it turns out it can't support UART style communications on the modem port after all, which means it can't be directly interfaced to a standard PC serial port. The buffer chips on the interface boards do support async operation but the board is strapped for synchronous operation only, without start or stop bits. Instead the SYN character (16h) is used to achieve byte alignment at the beginning of each message. The next step will be to wire up a loopback plug to check that the comms is working, and then construction of a suitable interface board.
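Byte alignment on a synchronous link can be sketched as follows: with no start or stop bits, the receiver slides along the bit stream until it sees the SYN pattern, then cuts bytes from that point. The bit order (least significant bit first) and the use of two consecutive SYN characters are assumptions made for this sketch, not details of the 7501 protocol.

```python
from typing import List, Optional

SYN = 0x16  # synchronisation character, as stated in the text

def byte_to_bits(value: int) -> List[int]:
    """Serialise one byte, least significant bit first (assumed order)."""
    return [(value >> i) & 1 for i in range(8)]

def find_sync(bits: List[int], syn_count: int = 2) -> Optional[int]:
    """Return the bit offset of the first run of syn_count SYN characters,
    or None if the pattern never appears."""
    pattern = byte_to_bits(SYN) * syn_count
    for offset in range(len(bits) - len(pattern) + 1):
        if bits[offset:offset + len(pattern)] == pattern:
            return offset
    return None

def bits_to_bytes(bits: List[int], offset: int) -> bytes:
    """Cut whole bytes from the stream starting at the given bit offset."""
    out = bytearray()
    for i in range(offset, len(bits) - 7, 8):
        out.append(sum(bit << n for n, bit in enumerate(bits[i:i + 8])))
    return bytes(out)

# Five idle bits, then SYN SYN 'H' 'I'.
stream = [1] * 5
for ch in (SYN, SYN, 0x48, 0x49):
    stream += byte_to_bits(ch)

start = find_sync(stream)
print(start, bits_to_bytes(stream, start))
```

A standard PC serial port expects start and stop bits around every character and has no equivalent of this search, which is why it can't be connected directly.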
05/07/2010 update from Delwyn Holroyd
It's been a while since the last update because of VCF, but nonetheless some progress has been made, and some steps backward...
On the Saturday of VCF the machine decided not to play nicely: the store cabinet indicated a +5V fault (although there wasn't one, luckily), it refused to boot from the laptop interface, and later in the afternoon the OCP overheat warning came on, although once again I am not convinced - it didn't appear to be any warmer than normal. We switched off anyway to avoid any risk of damage.
Things were better on Sunday, with the machine deciding to boot again after I cleaned the contacts on the off-card connector linking the DCU to the laptop. I suspect the real problem here is marginal signal quality due to the construction technique of the interface board (point-to-point mod wire with no ground plane). There was no OCP overheat today, but the +5V fault warning was still present. This will be a fault in the monitoring board.
Just before VCF the power supply for the 7501 terminal was re-tested by Phil H and found to be working - taking it apart seems to have fixed it so possibly just reseating the PCB connectors was all that was necessary. The week after VCF the power connectors onto the backplane were cleaned and the unit re-assembled. To my surprise it appears to be working! It can't do much without having a control program loaded into it, but the ROM code does some self-tests and has store dump and alter functions and these were used to dump the ROM contents to screen. The next step is to set up the interface board for standard async RS232 comms (in a mainframe application it uses synchronous comms). Once this is done it should be possible to interface the terminal directly to a George 3 emulator running on a PC, and George will download the control program which turns it into a functional terminal.
I have now finished building the EDS80 interface board - this is properly constructed on a PCB and even uses surface mount technology: slightly incongruous but it's much easier to obtain 3V3 logic level differential transceivers in surface mount. On Sunday the board was hooked up to the working drive ready for initial testing. After fixing a problem with one of the ribbon cables I was able to issue a 'select' command to the drive, and the drive responded with status information and its 'selected' signal. The data clocks from the drive were present but free running at around 14MHz since no diskpack was loaded. It was a great relief to find that I hadn't made any errors in the pinout on the cables to the drive.
Here comes the bad news - the next step was to load the scratch pack and see if data could be read. The heads loaded ok and the data clock signals went down to around 9.6MHz, the frequency expected when the PLL is locked to the servo track on the diskpack. Before I could do anything else, I noticed a high-pitched noise from the drive followed immediately by a burning smell - the drive was spun down within seconds but I had just witnessed a head crash, something I've been incredibly paranoid about avoiding at all costs.
As the disk slowed down the cause was immediately apparent - the bottom guard platter was bent, and there was some dust evident on the disk surfaces. At this point I realised it was not the normal scratch diskpack - during VCF they had all been moved around and without thinking I had picked up the wrong one. I expect you can guess how annoyed I am with myself about this!
Examination showed the crash was on the bottom head, closest to the bent guard platter. It's possible this generated enough disturbance in the airflow to cause the problem, or it could have been simply down to the dust on the pack. It was time to follow the procedures in the drive manual for head crash recovery, and the drive and heads are now clean again, but there is a slight mark on the affected head. I will be seeking further advice on this before using this drive again.
The greatest irony was when I noticed the number on the diskpack casing: 666 - truly the devil's diskpack!
31/05/2010 update from Delwyn Holroyd
This week I made a concerted effort to find the missing OCP board that was indicated as possibly faulty in the diagnostic run of several weeks ago. Whilst comparing the contents of the spares box with the actual board numbers that should be in the machine, I discovered one of the board numbers I had written down is not actually part of the machine! Sure enough the bag was mislabelled, and it contained the board we have been looking for. However, there are still only 29 boards in the box out of 30 in the OCP upper platter, so one of the set is missing and nowhere to be found amongst the other spares.
Swapping this board didn't make any difference to the error messages reported on a normal boot, not too surprising as we know there were a number of faults reported in the diagnostic run.
I also spent some time trying to diagnose the store block fault, which is still present - but I was unable to make it fail using the store self-tester. This might indicate an addressing error - the store self-test writes the same data to each location, so would not pick up on this.
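The reasoning about addressing errors can be illustrated with a toy model. The fault model here (one address line stuck at zero) and all the names are hypothetical; the point is only that a test writing identical data everywhere cannot detect two addresses selecting the same cell, while a test that writes each location's own address can.

```python
class FaultyStore:
    """Toy store in which one address bit is stuck at 0, so pairs of
    addresses select the same cell (hypothetical fault model)."""

    def __init__(self, size: int, stuck_bit: int):
        self.cells = [0] * size
        self.mask = ~(1 << stuck_bit)

    def write(self, addr: int, value: int) -> None:
        self.cells[addr & self.mask] = value

    def read(self, addr: int) -> int:
        return self.cells[addr & self.mask]

def same_data_test(store: FaultyStore, size: int, pattern: int = 0x55) -> bool:
    """Write one pattern everywhere, then read it all back."""
    for addr in range(size):
        store.write(addr, pattern)
    return all(store.read(addr) == pattern for addr in range(size))

def address_in_data_test(store: FaultyStore, size: int) -> bool:
    """Write each location's own address, then read it all back."""
    for addr in range(size):
        store.write(addr, addr)
    return all(store.read(addr) == addr for addr in range(size))

store = FaultyStore(size=16, stuck_bit=2)
print(same_data_test(store, 16))        # passes despite the fault
print(address_in_data_test(store, 16))  # fails, exposing the fault
```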
Since we acquired the ICL 7501 terminal a month or so back I've been searching in the ICL archive at the museum for schematics, without any success. I had discovered that the related 7502 terminal processor has all its diagrams grouped into a 'machine logic set' under one document number, which I found referenced in a technical description. I suspected the same thing would apply to the 7501, but how to find it? As luck would have it, today I stumbled across the technical description for one of the 7501 boards (document number one higher than one I had already found), and this proved to reference the elusive machine logic set document, which contains the schematics for everything in the unit except for the Farnell-made power supply. This will greatly assist any fault-finding that might be necessary.
22/05/2010 update from Delwyn Holroyd
The ICL 7501 power supply has been returned to Phil H for more detailed examination, but we are somewhat hampered by not having any schematics or other information for it since it's a Farnell made unit. I've made contact with a company that specializes in old Farnell power supplies in the hope they can turn up some information on it.
I re-assembled the EDS80 drive motor I took apart last week with new bearings and tried it in a drive. As this seemed to work ok I then removed the noisy motor from the 'good' drive to replace its bearings. This one was extremely difficult to get apart - after some work with the rubber mallet the brake assembly finally came off the drive shaft (it just lifted free on the first motor...) and after quite a lot more persuasion the other parts were finally separated. Re-assembly was much more straightforward, and the drive now runs like new.
Work is also progressing on the design of a drive interface rig. This will allow the EDS80 drive to be controlled directly to read disk contents at the lowest possible level, in order to secure the data.
16/05/2010 update from Delwyn Holroyd
The working EDS80 drive was run once again for a number of hours without heads loaded to continue the spindle bearing run-in. The bearings in the drive motor are very noisy, and this is the case with most of the other motors too. I've disassembled one and ordered new bearings. Once this job is done the drive should run as quietly as when it was new! This is important because it will enable us to hear any unusual noises coming from the drive which might indicate an impending head crash.
The power supply for the ICL 7501 terminal was refitted to the logic chassis having been checked on the bench, and cables made to power the logic chassis and fans independently from the rest of the monitor. Under normal load the main +5V supply was fine but the ±12V and -5V supplies went out of spec, so it was quickly turned off again. This will now require a more thorough examination with all voltage rails under a representative load.
10/05/2010 update from Delwyn Holroyd
Some major progress to report... a serviceable EDS80 disk drive and the first run of the engineer's diagnostic software on the machine.
The bearings have been replaced in one of the most seized up spindles, chosen because there was a risk that trying to dismantle it could have caused damage. This has now been fitted to the drive that the system was booted from just before its spindle started to make unpleasant rattling noises. At the time we didn't know the construction of the spindle (there are no diagrams because it wasn't intended to be a field serviceable part) and it was unclear exactly what these noises might indicate.
As it turns out the spindles contain two sealed ball-bearings of a standard type which are easy to source. The main difficulty lies in the amount of force required to remove the pulley from the shaft. The lower bearing and pulley resist the pressure of a spring that pre-loads both bearings and eliminates any play in the assembly, and therefore have to be a very tight fit. The rattling noises are due to the bearings starting to break up because the 'for life' lubrication has degraded. If ignored this could lead to a catastrophic bearing failure and probably a head crash.
Another issue with replacing the drive spindle is getting the alignment correct. There was a special alignment tool for this, but we don't have one. The ICL engineer who maintained the system at Tarmac told us the alignment wasn't as critical as the maintenance instructions imply, and this has proven to be the case. I aligned the spindle essentially by eye to score marks on the drive casting marking the original position. The track positions are located by pre-recorded servo information on one of the surfaces, so it is only necessary to ensure that the heads move on a path passing through the centre of the spindle, such that the tracks run perpendicular to the heads.
After a run in of the new bearings without heads loaded, the drive was run with a scratch pack and heads loaded for some time without issue.
The next step was to load the engineer's disk pack, which had not been done before. It proved to be in good condition and loaded ok. The machine booted from it happily and started to run the diagnostic test suite. The first part of this does detailed tests on the DCU, which all passed. Further tests identified faults in SCU couplers and in the OCP. One of the store blocks is also failing - a new fault. The fault codes can be checked against a listing which identifies the most likely board responsible. Unfortunately the first OCP fault indicates a board that is mysteriously absent from the box which contains an otherwise full set of spare boards - further searching will be required!
It's too risky to repeatedly load the engineer's pack for diagnostics until the data on it has been secured, and this is now the most urgent task. The fact that this pack is readable is very good news indeed for the restoration.
26/04/2010 update from Delwyn Holroyd
The museum has recently acquired an ICL 7501 terminal on loan from the Jim Austin Computer Collection. Once restored, we intend to connect it to the 2966 as a user terminal. It's typical of the type of end user equipment used on ICL mainframe systems in the early 1980s.
ICL mainframes required terminals implementing proprietary communications protocols such as ICLC01 and ICLC03. Unlike Unix systems which use relatively dumb character based terminals, on an ICL system a complete message is constructed by the user on the terminal and then sent to the mainframe. This means the terminal needs to directly support cursor movement and message editing. It also has facilities for dividing the screen into protected and unprotected fields, typically used to display a form with areas for the user to fill in. Messages could even be validated by the terminal prior to sending to the mainframe, for instance checking that only numeric characters are entered in a particular field.
The 7501 is an integrated version of the earlier 7502 comms controller and a 7561 video terminal (the type used on the 2966 SCP operating station): instead of having the separate 7502 cabinet containing the controller logic it's built into the base of the terminal itself, resulting in a somewhat taller unit than the 7561 with a row of switches and LEDs below the screen.
Much of the controller logic is also shared with the SCP, with the familiar Minicom processor also found in the DCU, the modem board and memory boards in common. The main difference is the video display board which supports an 80-column display rather than the 40-column deemed more appropriate for system operators.
The 7500 series terminal controllers required 'teleloading' to obtain their control programs. The built-in ROM code has just enough intelligence to request a teleload from the mainframe, which then downloads the required program. As a consequence these systems do a good impression of being completely non-functional until this has happened. We'll be able to test this procedure under George 3 emulation on a PC: readers of this page will realise the 2966 is not quite up to the job yet! Luckily the required teleload utilities and control programs have survived in a dump of a George 3 filestore.
Very little 7500 series terminal equipment seems to have survived, so we are always on the look out. If you know where there are any of these distinctive orange terminals, or even the older blue and grey 7181 terminals, please get in touch with the museum.
07/03/2010 update from Delwyn Holroyd
The failed DCU power supply gave an opportunity to do some spring cleaning around the 2966 area last week, but this week I was able to resume work on the machine. Many thanks to our resident power supply expert Phil H for examining and testing the spare -5V supply: although it looked bad on the outside thankfully it was clean on the inside and proved to work. This has now been fitted in the machine. Meanwhile armed with some new LM311s Phil was able to repair the other unit and this will now be the spare.
I first of all checked that we hadn't suffered any more regressions: the machine still boots to the same extent it did before, and the store is still working.
The main task of the day was cleaning all the board edge connectors in the OCP (or CPU in today's terminology). It's not clear when this was last done, and the maintenance logs for the system show it was a fairly routine operation which frequently 'cured' faults (although whether this was down to the cleaning or the physical movement of the boards is open to debate). This revealed that the clock distribution board in the scheduler wasn't actually plugged in, which clearly wouldn't have been helping matters! I also confirmed that all the boards were in the correct slots.
Unfortunately, none of this changed the fault condition at all, so no easy short-cut in the diagnostic process!
The OCP is by far the most complex part of the system. Unlike the rest of the system it's built using ECL (emitter coupled logic) technology, and consists of sixty individual boards mounted on two backplanes. ECL is much faster than TTL, but consumes a great deal more power. The OCP doesn't obey 2900 target level instructions directly, instead it has a microcoded instruction set known as MICOS II, aided by the scheduler which breaks down the target level instructions into one or more microcode 'tasks'. This makes it fairly easy to emulate other instruction sets: 1900 and System 4 were supported (our machine has a 1900 decoder board). The basic clock beat is 80ns (12.5MHz) although some steps occur at 40ns. It has a pipelined architecture which allows one microcode instruction to be completed every clock beat. Target level instructions take a variable number of clocks depending on how complex they are. Most data paths are 32-bit, with 36-bit extensions in some places, and also support for efficiently converting to and from the 24-bit 1900 architecture.
Given the current completely non-functional state of the OCP, and without the aid of the diagnostic software it's difficult to know where to start. Over the last couple of months I've been scanning and studying all the detailed reference documentation from aperture cards in the archive. Armed with this knowledge the diagnostic registers are starting to make sense, but there's still a lot to learn!
21/02/2010 update from Delwyn Holroyd
Early in the day there was some difficulty in booting via the laptop interface, with the system indicating parity errors on the interface. This has happened before when the system is cold, but normally clears after a few attempts. Today after a few dozen attempts it was clear it wasn't going to. Cleaning the contacts on the off-card connector for the interface cable eventually cured it, and afterwards it worked reliably.
There were no problems with the store, and all blocks passed self-test again.
Unfortunately the -5V power supply in the DCU cabinet then chose to die, and the only spare looked in a very sorry state, so no more work can be done on the main cabinets until these have been looked at by our PSU expert Phil H.
15/02/2010 update from Delwyn Holroyd
The objective for this week was to get the store working, and I'm happy to say this was achieved.
Following the board replacements last time the store self-tester worked as expected, and soon showed that two of the four sub-stores had different stuck bits when reading back from any memory location. To narrow down the fault I swapped two of the sub-store control boards to see if the fault followed - however the fault actually disappeared! Reseating the control board for the other faulty sub-store also cured that fault. Presumably these are dry joints and I expect we haven't seen the last of them, but at least the cause should be clear if/when they do re-occur.
With a working store, the next task was to start replacing the boards swapped out last time to identify which actually had faults. It became clear that at least one fault was associated with the cabling between the second and third cabinets - the interface between the store and the coupler in the SCU. It now seems likely that some of the symptoms of the mysterious faults last time were cured by reseating the cabling. The cables are very solidly made woven ribbon with a sealed termination onto a standard header, and look to be in good condition. Hopefully contact cleaner will resolve any lingering reliability issues with these, although it's possible the terminations have degraded.
During the board replacement process, another of the sub-stores started to fail intermittently, and then permanently. This time when the control board was exchanged with another the fault followed, and the faulty control board was replaced with a spare.
Although the day ended with a full 8MB of working store, it's likely some of the intermittent faults will return. Hopefully they will become permanent, which makes them a lot easier to find!
03/02/2010 update from Delwyn Holroyd
After swapping many boards in the SCU, the cause of last week's fault was traced to the SM64 control module in the store cabinet. The behaviour during store initialisation is now different in several respects: not only does it take much longer, but the SCP configures the store into 'non-interleaved' mode.
The store module consists of four sub-stores, each divided into logical blocks. The sub-stores are normally operated in parallel, or 'interleaved' to speed up accesses to store - modern server motherboards use a similar scheme to increase memory bandwidth. If there are faults in one or more sub-stores the system can instead fall back to non-interleaved mode. Individual logical store blocks can also be marked bad and the system will avoid using them. A diagnostic status register indicates which store blocks are good.
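As a rough illustration of the difference between the two modes, here is a minimal sketch of how an address might map onto the four sub-stores. The word-granular rotation, the function names and the `words_per_substore` parameter are all assumptions for illustration; the actual mapping used by the 2966 store is not described here.

```python
# Hypothetical model only: four sub-stores, interleaved one word at a time.
NUM_SUBSTORES = 4


def interleaved(addr, words_per_substore):
    """Consecutive word addresses rotate around the sub-stores,
    so sequential accesses can overlap in different sub-stores."""
    return addr % NUM_SUBSTORES, addr // NUM_SUBSTORES


def non_interleaved(addr, words_per_substore):
    """Each sub-store holds one contiguous range of addresses,
    so a faulty sub-store only affects one region of store."""
    return addr // words_per_substore, addr % words_per_substore


if __name__ == "__main__":
    size = 1024  # assumed words per sub-store, for illustration
    # Show which sub-store each of the first eight addresses lands in.
    print([interleaved(a, size)[0] for a in range(8)])
    print([non_interleaved(a, size)[0] for a in range(8)])
```

In the interleaved case every sub-store is touched by any run of four consecutive addresses, which is why a fault in one sub-store makes falling back to the non-interleaved layout attractive.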
Prior to the most recent fault, the 'good block' register had a consistent but unexpected collection of bits set. The machine only has a half-populated store module (8MB, maximum is 16MB) so all the bits should be set in one half of the register, but this was not the case. This behaviour together with some other oddities had made me suspect the store wasn't previously functioning properly at all.
With the replacement boards, the set bits are now all in one half as expected, and it appears that two of the four sub-stores are not functioning. This is of course entirely believable!
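The check described above can be sketched in a few lines. This is only an illustration: the 16-bit register width, the one-bit-per-block layout and the function name are assumptions, not taken from the machine's documentation.

```python
def summarise_good_blocks(value, width=16):
    """Summarise a 'good block' register value.

    Assumes one bit per logical store block and a register of
    `width` bits; both are hypothetical for this sketch.
    """
    half = width // 2
    low = value & ((1 << half) - 1)   # bits for the lower half
    high = value >> half              # bits for the upper half
    return {
        # Block numbers whose 'good' bit is set.
        "good_blocks": [i for i in range(width) if (value >> i) & 1],
        # True when no set bits stray into the second half.
        "confined_to_one_half": low == 0 or high == 0,
        # How many blocks in the populated half are not marked good.
        "missing_in_half": half - bin(low or high).count("1"),
    }


if __name__ == "__main__":
    print(summarise_good_blocks(0x0F0F))  # bits scattered across both halves
    print(summarise_good_blocks(0x000F))  # one half only, some blocks missing
```

A value with bits in both halves corresponds to the unexpected pattern seen before the board swap, while a value confined to one half with some bits clear matches a half-populated store with failing sub-stores.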
We don't have nearly as many spares for the SCU and Store modules as for the DCU, so repairs on failed boards will be necessary. Away from the museum work is continuing on building an expanded board test rig. The new test rig will also be compatible with DCU boards, but will have a larger number of I/O channels to interface with the SCU boards. Work is also continuing on scanning the relevant technical descriptions and logic diagrams from aperture cards.
25/01/2010 update from Delwyn Holroyd
Last week the boot process started to fail with an error message 'Invalid SCU coupler type', which I thought probably referred to an incorrect entry in the configuration file that had just been sent to the SCP at that stage - the file hadn't changed but possibly it was being corrupted on the way. In view of this and other strange behaviour seen last week I swapped the processor board in the SCP (system control processor), but to no avail. I also swapped the serial interface boards at each end of the link between the SCP and the DCU to eliminate that as a source of corruption (somewhat unlikely, given that the SCP's control program had just been loaded successfully via the same route).
A bit more digging using diagnostic commands from the SCP showed that it wasn't possible to access any of the coupler registers in the SCU. This is done via the DCM (Diagnostic Control Module) which is attached to another serial interface on the SCP. In addition to their normal operation, all the registers in the SCU and its couplers are connected together serially to form a number of loops. To read a register, the relevant loop is 'spun' so that the required bits are loaded serially into a buffer in the DCM. To write a register, bits are loaded from the buffer into the loop.
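The loop mechanism can be modelled in software as a circular shift register. The sketch below is a simplified illustration only: the class, the register names and the widths are hypothetical, and it makes no claim about how the DCM actually sequences the loops.

```python
from collections import deque


class ScanLoop:
    """Toy model of registers chained serially into one loop."""

    def __init__(self, layout):
        # layout: list of (register_name, width_in_bits) in loop order.
        self.layout = layout
        self.bits = deque([0] * sum(width for _, width in layout))

    def _locate(self, name):
        """Return (start position, width) of a register in the loop."""
        pos = 0
        for reg, width in self.layout:
            if reg == name:
                return pos, width
            pos += width
        raise KeyError(name)

    def read(self, name):
        """Spin the loop one full turn, copying the wanted bits into a buffer."""
        pos, width = self._locate(name)
        buffer = []
        for i in range(len(self.bits)):
            bit = self.bits[0]
            self.bits.rotate(-1)  # bit leaves the head and re-enters at the tail
            if pos <= i < pos + width:
                buffer.append(bit)
        return buffer

    def write(self, name, new_bits):
        """Spin the loop one full turn, substituting buffer bits as they pass."""
        pos, width = self._locate(name)
        if len(new_bits) != width:
            raise ValueError("buffer length must match register width")
        for i in range(len(self.bits)):
            bit = self.bits.popleft()
            if pos <= i < pos + width:
                bit = new_bits[i - pos]
            self.bits.append(bit)


if __name__ == "__main__":
    loop = ScanLoop([("status", 4), ("property_code", 8)])  # hypothetical names
    loop.write("property_code", [1, 0, 1, 1, 0, 0, 1, 0])
    print(loop.read("property_code"))
```

In this model a full turn leaves every other register in the loop unchanged, which is what allows a single register to be read or written through one serial path.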
After some fruitless attempts to read from coupler registers, it spontaneously started to work again! The boot now progressed beyond the invalid coupler type error, so this was clearly referring to a failed attempt to read the property code from a coupler (which identifies the coupler type). After a power cycle of the SCU cabinet, the registers were no longer accessible and we were back where we started. I suspect the cause of this fault probably lies within the DCM.
Even whilst the coupler registers were accessible, the boot process still failed at the store initialisation stage, and this time the 'SCU Reset' trick from last week didn't help. Unfortunately I have to conclude this was a red herring, and there is probably another intermittent fault that just happened to be taking a break last week.
I refitted a repaired 5V/150A power supply module in the DCU cabinet, testing it first with no load and then with a partial load (only some of the logic boards plugged in). Finally with all boards plugged in I balanced the three 5V supply modules so that they were sharing the load equally. Thanks to Phil H for the repair, which involved replacing a failed IC in the switching circuit.
18/01/2010 update from Delwyn Holroyd
I arrived at the museum fully prepared for a day of debugging the store, this being the area where the boot process has been failing. Before getting into that, I realised that following power-up the 'store running' indicator light on the SCU control panel was not illuminated, and a manual activation of the 'SCU Reset' control was necessary. When I attempted to boot again the store initialized successfully! The following steps loaded the initial OCP microprogram and started it running. At this point it stopped with another error, saying the OCP is faulty... unfortunately activating the OCP Reset control didn't help. Not all problems are so easily cured!
Not having prepared for conducting OCP diagnostics, I decided to verify diagnostic access to the store from the SCP console. Even before the system has fully booted, diagnostic commands can be entered on the SCP in engineers mode. These allow access to the internal state of the SCU and its couplers, main store and OCP. Enough of the registers were behaving as expected to convince me that the diagnostic interface was working, but a number of things did not behave as documented in the fault-finding and reference guides. It could be that the documentation doesn't match the hardware, or it could be a side-effect of a more subtle fault.
The store is divided into blocks, and part of the initialization procedure runs a self-test to validate each block and mark it as valid in a diagnostic register. From this register it appears that several store blocks are failing. A lot of further investigation will be required.
Towards the end of the afternoon, the boot process stopped working altogether, with a consistent error occurring at a much earlier stage. I suspect the problem is once again the serial interface between the DCU and the SCP: the original boards at both ends of the link had faults and were swapped out just before Christmas.
10/01/2010 update from Delwyn Holroyd
Over the Christmas break I spent some time examining a CME installation tape which I have in virtual form as a file on a PC. CME (Concurrent Machine Environment) allows the system to host VME (the native 2900 operating system) and a 1900 operating system at the same time. The installation tape contains a package of microcode for the machine, and it proved possible to extract all the necessary IPL elements from this to construct a virtual IPL tape. Most of the effort was spent in analyzing the first and second level bootstrap programs for the DCU to figure out what format they expect to find on the tape, and the commands being sent to the tape deck. I then adapted the software I wrote at the end of last year for the PC interface to make it emulate a tape deck in the required fashion. I tested it this week and successfully booted the system to the same point we had reached when booting from disk.
This is great news as it allows fault finding to proceed on the SCU and OCP without needing a working disk drive, and without any risk to our valuable bootable disk packs.
I also removed the spindle from one of the EDS80 drives for further examination.