ICL 2966 restoration during 2010
Each TNMOC project has either a working group or project team assigned to do the work. Working groups are either managed in association with the CCS (Computer Conservation Society) or solely within the Museum.
Below are updates on the 2966 restoration project during 2010.
21/12/2010 update from Delwyn Holroyd
The intermittent problems reported in the last update have continued to hamper progress during the last month.
The DCU is now operational again - the intermittent problem reported last time came back as predicted, and has been isolated to a board, now swapped for a spare. In addition the Minicom processor board developed a fault causing a spurious interrupt to be reported. This board has also been swapped and the original joins the queue of boards awaiting repair. Two of the DCU power supplies have also failed during the last month - these were repaired by our power supply expert Phil H.
The situation with the OCP has been even worse. A variety of temporary faults have appeared and disappeared, either of their own accord or following reseating of boards. The good news is that every new fault seen has at some point gone away, which at least suggests they are dry joint or connector related rather than component failures. Of course the possibility remains that some of these faults may be inside IC packaging or within printed circuit board traces.
Last weekend we were once again in the same position as reported at the end of the last update, with the same remaining fault. The CUTS advice for this fault lists no fewer than 7 possible boards. Almost all of these have now been exchanged to no avail, although for a week or so the fault did disappear without any action on our part! This leaves the probability of a connector or backplane issue. The next step will be to locate and scan the necessary schematics from aperture cards. This will allow the problem signal to be physically traced and probed on its journey - this particular signal is part of the interface between Engine and Scheduler and so crosses between the two backplanes in the OCP.
Meanwhile the 7501 terminal restoration has made great progress. I have built a USB to synchronous serial interface board to connect the terminal to a laptop and handle the necessary ICL proprietary communications protocol. However the terminal requires a control program to operate, which is normally downloaded or "teleloaded" from a host mainframe. In this case the "mainframe" was a George 3 emulator running on a laptop. George 3 is one of the operating systems our 2966 will be able to run once it's operational. Much more information is available here: http://www.icl1900.co.uk
The picture (below) shows the operational terminal displaying the George 3 login prompt. The filestore containing the necessary control program for the terminal came from the George 3 service operated on an ICL 1902T at Manchester Grammar School until 1986. It's been some 24 years since this program last saw active service, and although we don't know for sure, probably a similar amount of time since the terminal itself was last operational.
21/11/2010 update from Delwyn Holroyd
Following a period of rapid progress, over the past several weeks the machine has clearly decided it's not going to make life easy for us.
Good progress has been made with the store problems. No fewer than three store boards were exhibiting similar faults causing data to be written into either the wrong or multiple addresses, or writes to occur when they shouldn't. These faults were only discovered with the aid of the comprehensive store testing program MHR7: none were causing store self-test to fail. At least one further store fault remains to be resolved, but for the moment we have exhausted our supply of working store boards. Some board level repair will therefore be required, but the addressing faults should be easy enough to diagnose to the chip level. The technology on these boards is 7400 S series TTL, which is still available.
Following resolution of the third store fault, I decided to re-run the whole MHR series of tests as a quick confidence test, and this is when the problems started. At the beginning of the day all these tests passed, but now almost none would run, most of them failing with an unexpected OCP 'Help' condition. I replaced a couple of likely boards to no avail. In fact, following the second replacement we had an even more unexpected test failure, with no fewer than three 'Help' conditions at once. On replacing the original board the new, even worse fault condition remained! At this stage we gave up for the day to consider our next move.
The next fault-finding session started off with a full re-run of the diagnostic suite, where one of the early 'static' tests now failed during a test of the Scheduler to Engine interface, with a bit stuck on in a register. This was clearly related to the initial problem seen with the MHR tests the previous week. The signal in question passes through several boards, two of which had already been swapped with the results mentioned above. Since this new test is lower level and gave a definite fault symptom to work on, we repeated the first of last week's board swaps, in case the spare board was also faulty. A re-run did indeed produce a different test failure, but this implicated a completely different set of boards. As you will imagine, we were somewhat perplexed by this point. It appeared that every time we swapped a board we ended up with a new and unrelated fault!
Not giving up, we analysed the possible cause of the new fault and realised that two identical boards in the OCP could be exchanged to see if the fault moved to a different byte within the failing register, which is physically split across four boards (one byte per board). Doing so appeared to remove the fault, so clearly this one is intermittent - and may or may not be related to the boards we moved. A re-run of the MHR tests showed we were now back to the first of last week's faults.
To cap it all off, the DCU, which has worked flawlessly for the previous year, was clearly feeling left out of the intermittent fault fun, and promptly failed during its ROM self-test. This of course prevents loading of the diagnostic suite. Thankfully it spontaneously started working again a quarter of an hour later, but somehow I suspect this will happen again.
It's quite likely that the cold and damp weather is behind this new crop of intermittent issues. They are notoriously difficult to track down, but work will continue.
31/10/2010 update from Delwyn Holroyd
The main task this week was to track down the suspected store problem causing the OCP test program MHR2 to fail. The failure was during a byte write to store, which is implemented in the store module as a read-modify-write operation on a 64-bit word. Afterwards the entire 64-bit word was corrupted in store. After checking the state of various diagnostic registers I was positive the issue lay within the store and probably the board controlling the 'part write' sequence. This board is connected to the store coupler in the SCU by ribbon cables that have already proven problematic as discussed in the previous update. Therefore when disturbing this board it's hard to be sure that any change in fault symptoms is not simply due to moving the cables. However in this case from the contents of the diagnostic registers it did seem that the fault could not be attributed to the cabling.
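The byte write described above can be pictured with a short sketch. This is only an illustration of a read-modify-write on a 64-bit word, not the store module's actual logic: the function name `part_write`, the dictionary standing in for the store, and the byte numbering (byte 0 taken as the most significant) are all assumptions.

```python
WORD_BITS = 64
BYTES_PER_WORD = 8

def part_write(store, word_addr, byte_index, value):
    """Write one byte into a 64-bit word using read-modify-write."""
    if not 0 <= byte_index < BYTES_PER_WORD:
        raise ValueError("byte_index must be 0..7")
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    # Assumed numbering: byte 0 is the most significant byte of the word.
    shift = (BYTES_PER_WORD - 1 - byte_index) * 8
    old = store.get(word_addr, 0)                      # read the whole word
    new = (old & ~(0xFF << shift)) | (value << shift)  # modify one byte
    new &= (1 << WORD_BITS) - 1                        # keep to 64 bits
    store[word_addr] = new                             # write the whole word back
    return new

store = {0: 0x1122334455667788}
part_write(store, 0, 2, 0xAB)
print(hex(store[0]))  # prints 0x1122ab4455667788
```

Because the whole word is rewritten on every byte write, a fault in the sequencing can corrupt all 64 bits rather than just the byte being changed, which is consistent with the symptom seen here.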
Swapping this board for the spare did indeed cure the part write problem and the MHR2 test program ran to completion. However, I had to set the store into 'non-interleaved' mode (a simpler but slower configuration), so it appears the spare board has a different fault (or it may be ribbon cable related).
The system now also passes the remainder of the MHR series tests with the exception of the full store test - an initial run showed a problem associated with one of the four sub-stores and a second problem that appears common to all four. Unfortunately there are only two common boards in the store, and we are already using the spares for both of those, so it seems likely some tricky board level diagnostics and repair will shortly be necessary.
As an unexpected bonus, MHR6, the scheduler/decoder test program, causes interesting noises to come from the speaker in the OCP (listen to the ICL 2966 here). This is the first time the machine has made any sound other than the roar of the fans, and it took the volunteers who were present somewhat by surprise, especially as the speaker was on full volume at the time!
The first part is the sound made during the test of the 2900 decoder, and the second part is the test of the 1900 decoder. These two decoders interpret native 2900 and 'emulated' 1900 machine code instructions respectively, and in conjunction with the scheduler cause the OCP engine to jump to the relevant microcode routines to execute them. In reality both instruction sets are effectively emulated by the microcode running in the OCP engine.
The sound from the speaker is related to the instruction sequence being executed - in this case an artificial sequence designed to test the decoders. During normal operation an experienced engineer or operator would be able to tell from the sound whether the machine was behaving normally - this feature was very common on older mainframes.
We still have another few series of tests to get through, and the remaining store problems to resolve, but we are gradually moving closer to having a working system.
25/10/2010 update from Delwyn Holroyd
The photo below shows the output from the diagnostic test suite (CTS or CUTS) for the new fault we arrived at last week. MHR2 is the failing test program name and 000 093 6 is the fail code: an index on microfiche points to a list of boards that may be faulty for each error code. These lists are sometimes useful but not completely reliable, especially when multiple faults exist.
The test in question is accessing the main store of the machine from the OCP at the time of failure - the first time this path has been exercised. As such a lot of the components of the machine are involved, spread across all three main logic cabinets: the OCP, SAU (store access unit), SCU (store control unit), store coupler and finally the store module itself. The fail code above means incorrect data was received from store, but further testing has shown that sometimes the main store raises a "multi-bit" error instead which rather suggests the fault may lie within the store coupler or store module.
Data in main store is protected by a parity-based system known as a Hamming code, which allows any single-bit error within a 64-bit data word to be detected and corrected, and most multi-bit errors to be detected. When the store was first commissioned some months ago a number of faults disappeared after boards or connectors were reseated. More recently we have been seeing the store intermittently failing one of the CUTS diagnostic tests, added to which one of the sixteen 512K store blocks has been failing normal store initialisation for the last few months. It therefore seemed like a good time to investigate.
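For readers unfamiliar with Hamming codes, the following is a minimal textbook sketch of single-error correction with double-error detection. It is not taken from the 2966 documentation: the bit layout, the extra overall parity bit and the function names are assumptions, and the real store hardware may arrange its check bits quite differently.

```python
def is_data_position(pos):
    """Power-of-two positions hold check bits; all others hold data."""
    return (pos & (pos - 1)) != 0

def hamming_encode(data_bits):
    """Return a codeword: index 0 is overall parity, 1..n are Hamming positions."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # number of check bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if is_data_position(pos):
            code[pos] = next(bits)
    syndrome = 0
    for pos in range(1, n + 1):      # XOR of the positions of all 1 bits
        if code[pos]:
            syndrome ^= pos
    for i in range(r):               # choose check bits so the XOR becomes zero
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code[1:]) & 1      # overall parity over the rest of the word
    return code

def hamming_decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'multi-bit'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) & 1       # 1 means an odd number of bits flipped
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd and syndrome < len(code):
        code[syndrome] ^= 1          # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        status = "multi-bit"         # detected but not correctable
    data = [code[p] for p in range(1, len(code)) if is_data_position(p)]
    return status, data

data = [(i // 3) & 1 for i in range(64)]
code = hamming_encode(data)

one_error = list(code)
one_error[37] ^= 1                   # flip a single bit
two_errors = list(code)
two_errors[5] ^= 1                   # flip two bits
two_errors[60] ^= 1

print(len(code), hamming_decode(one_error)[0], hamming_decode(two_errors)[0])
```

With 64 data bits this construction needs seven check bits plus the overall parity bit, giving a 72-bit codeword. A single flipped bit is located by the syndrome and corrected, while two flipped bits leave the overall parity even with a non-zero syndrome, which is reported as a multi-bit error.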
I first transposed two of the sub-store control boards (the store is divided into four sub-stores) to see if the failing block followed - however both of these sub-stores promptly stopped working altogether! So I removed and cleaned all the ribbon cable connections to the store and this has restored all sixteen blocks to normal operation. I then started the CUTS store test program in continuous run mode and set about the store module with the rubber mallet (gently of course!) in an attempt to tease out any other bad connections. Tapping one of the four sub-stores causes the test program to fail: this will require further work to track down. We know that board tapping was used by ICL engineers to help reveal intermittent faults in these machines so the restoration project is following correct field service procedures!
None of this has altered the symptoms when running MHR2, but different logic paths are used when accessing store from the OCP rather than the store self-test logic utilized by the store test program, so I think it's still likely this fault is store related.
17/10/2010 update from Delwyn Holroyd
This week's update, with pictures!
The first one shows the type of fail generated by the mis-connected cable, and the second shows the errant cable form.
Before I could get started on fault-finding this week one of the DCU 5V/150A power supplies went down. There is enough capacity to run on the remaining 225A of 5V from the other two supplies in the DCU so luckily this didn't halt operations.
I was quickly able to confirm that the Help fault from last time (see picture) really was originating from the fast multiply unit, and by diagnostically setting its registers from the SCP console, also ruled out a fault on the fast multiply unit itself. It appeared that something was causing incorrect parity to be loaded into the registers on the fast multiply unit at the start of every test program, and this was causing a Help to be generated as soon as (or if) the test program enabled parity checking. Clearly something was seriously wrong!
I then ran the test program for the fast multiply unit, and this stopped saying there was no such unit fitted to the machine! The engine microcode communicates with the unit via the OCP's "Local Register" mechanism, and I had already observed that the board was connected via off-card connectors to two other OCP boards - the "Local Register Data 2" and "Local Register Control" boards (see picture). I now suspected a problem with these cables, especially as there was an empty socket on the "Local Register Data 1" board right next to the fast multiply unit. I wasn't able to locate a diagram confirming the correct connection of the cables, and the labels on the cabling have long since faded to the point of being unreadable. By carefully peeling back the tape covering the label, I was able to see an image on the tape showing what the label had originally said - the cable should have been connected to the "Data 1" board!
Having moved the cable to the correct position, all the engine tests including the fast multiply test now run without any failures! It's a mystery how the cable came to be mis-connected, but luckily it hasn't caused any damage.
The CUTS diagnostic suite now gets onto testing the Store Access Unit part of the processor, where it stops with a brand new fault for investigation.
03/10/2010 update from Delwyn Holroyd
Over the last couple of weeks some serious OCP fault-finding has been taking place. Visitors to the museum on Saturday would have seen manuals and microfiche listings spread over every available flat surface: the group of FDS640 disk cabinets makes quite a useful table for this purpose!
In total three faults have now been cleared, all of which had been causing diagnostic tests to fail. There are two types of diagnostic test for the OCP: the static logic tests and the OCP resident tests. The static logic tests run in the SCP and attempt to verify as much logic as possible using the serial diagnostic interface to the OCP. The system is now passing all these tests, bar two SCU couplers not required for a minimal system which have been temporarily unplugged.
The OCP resident tests are written in the Micos 2 microcode which is the native language of the OCP. They are designed to progressively validate the microcode instruction set and OCP facilities. The first set of tests exercises the Engine, which is the part of the OCP that executes the microcode. For test purposes the rest of the system remains inactive, so there is no main store or peripheral access, slaving (caching) or ability to execute target level instructions (i.e. the 1900 or 2900 instruction set).
The majority of the Engine tests now work, but several are affected by a tricky remaining problem which causes the 'Help' line to be consistently raised at certain points. I've been able to cause most of the affected tests to continue and thus verify the logic they are intended to test, but the root cause of the spurious Help remains a mystery.
Normally when Help is raised there will be a bit set in one of several registers, usually indicating a parity check failure somewhere in the OCP logic. However in the present case there is only ever one bit set, and it refers to parity in the optional fast multiply unit, a part of the system that isn't tested or activated by any of the tests in question! Sometimes there is also a scheduler help raised at the same time, but I am now fairly sure this is a red herring as none of the scheduler logic has been initialised or is being used at this point (and the help doesn't occur if a normal IPL has been attempted beforehand, thus initialising it).
Since the Help is not a type expected by any of the tests, they stop with indeterminate error codes and the CUTS dictionary doesn't identify any possible cause. I swapped several boards involved in Help generation and the fast multiply unit, but without any change to symptoms. So further drilling down into the source code of the tests that stop will be required to figure out whether there's a common factor, and from there consultation of the OCP logic diagrams should hopefully provide some possible board swaps.
Despite this problem, the tests that are now running exercise all the microcode instructions and operand forms, so I would tentatively say that we're now able to run microcode, even though we're still some way from being able to run target level code.
19/09/2010 update from Delwyn Holroyd
I finally received replacement switches for the control panel of the 7501 terminal, and have refurbished the panel and refitted it. Unfortunately, the monitor has now developed a fault! It should hopefully be straightforward to fix - the normal scan pattern can be seen by turning up the contrast but the actual video signal is not getting through. We have the schematics for the scan board, and it uses commodity parts that should still be available.
Since the last update I've constructed a virtual tape containing the various libraries forming the SCF (System Control File). I discovered the required logical structure from the detailed descriptions we have available on microfiche, in other words the required header records, expected order of the libraries (which is not the same as they were arranged on disk), block sizes and positions of tape marks. It was also clear that no special components were required to load from magtape, other than the magtape bootstrap which I already had from a CME installation tape (luckily as this is not included on the disk copy of the SCF).
However the exact contents of the record and file headers weren't precisely specified as these are standard and presumably documented elsewhere. Not having a copy of that documentation to hand, I pieced together the required format variously from a copy of a VME installation tape, and Inland Revenue document CA70 which describes the format of data tapes delivered to private pension companies, presumably now historical! The Inland Revenue were one of ICL's largest customers so their tapes were written in ICL 2900 format.
The next task was to improve the magtape emulator program running on the laptop, since loading and running the diagnostic software was clearly going to involve a lot of going back and forth on the tape. I had previously reverse engineered parts of the DCU ROM and the magtape bootstrap to deduce what the commands being sent over the peripheral interface meant. When I started to look at the section involved in loading CUTS I couldn't work out what the code was doing - a sequence of three commands was being sent, one of which required data to be returned from the tape deck. Luckily ICL supplied customer sites with microfiche containing listings of most of these low-level programs. Having found the relevant document it became clear this was the sequence to skip forward to a tapemark, which is handled autonomously by the deck with an attention interrupt being raised when a tapemark is found. The other commands read and cleared the attention status.
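The behaviour described above might be modelled along the following lines in an emulator. This is a hypothetical sketch rather than the actual emulator code: the class and method names, the use of `None` as an in-memory tape mark, and the choice to leave the tape positioned just after the mark are all assumptions.

```python
TAPE_MARK = None  # assumed in-memory stand-in for a tape mark

class VirtualTape:
    """Toy model of a tape deck holding data blocks and tape marks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.position = 0
        self.attention = False

    def skip_to_tape_mark(self):
        """Move forward past the next tape mark, raising attention when found."""
        while self.position < len(self.blocks):
            block = self.blocks[self.position]
            self.position += 1
            if block is TAPE_MARK:
                self.attention = True   # stands in for the attention interrupt
                return True
        return False                    # reached end of tape without a mark

    def read_attention(self):
        return self.attention

    def clear_attention(self):
        self.attention = False

    def read_block(self):
        block = self.blocks[self.position]
        self.position += 1
        return block

tape = VirtualTape([b"HDR", b"boot", TAPE_MARK, b"test1", b"test2", TAPE_MARK])
found = tape.skip_to_tape_mark()        # skip over the first file
status = tape.read_attention()          # host reads the attention status
tape.clear_attention()                  # and then clears it
print(found, status, tape.read_block())
```

The point of the sketch is the division of labour: the deck does the skipping on its own, and the host only learns the outcome by reading and then clearing the attention status.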
Last week, prior to completing this work, I thought it would be wise to check the machine was still working to the extent it had been previously. With all recent work concentrated on the EDS80 it had been some months since the main cabinets were powered up. After some warming up and cajoling of a few power supply modules everything seemed to be running satisfactorily. We diagnosed a fault in the store cabinet supply monitoring causing the 5V fault indicator to light (this had first started to happen during VCF), and Johan is replacing a failed IC on the board.
One of the store blocks has been reporting as failed for a while, but this appears to be less straightforward. The store has a self-test function but I can't persuade the block in question to fail when using it. Checking the source code for the system initialisation program on microfiche I found that it uses the 'Init' function of the store rather than the self-test, so clearly Init thinks the block is bad for some reason. Further investigation will be required, but the system can run without the block so this is not top priority at the moment.
This weekend I set about testing the new SCF virtual tape on the machine. After fixing a few bugs I was able to perform a normal IPL as before. Following a reset and selection of CUTS mode (which runs the diagnostics in sequence) it started loading and executing tests faster than I was able to look up the DCU progress indication codes to figure out what it was doing! However it did soon stop with an error code indicating something likely wrong with my virtual tape emulation rather than a fault in the machine. After a few tweaks to the behaviour of the new skip forward / backward commands it progressed into the main part of the test sequence which runs on the SCP and the failure reports agreed with those from the CUTS run performed from the diagnostic pack some months back.
There wasn't time to look into the faults in any more detail, but the path is now clear for serious fault-finding on the OCP to begin.
23/08/2010 update from Delwyn Holroyd
Having proved the route for capturing data from a diskpack last week, it was now time to transfer the diagnostic pack. This contains all the diagnostic test software for the system and is one of the few bootable diskpacks we have. Securing the data on this pack was one of the main reasons for building the interface to the drive.
Thankfully the transfer went without incident, and with no CRC errors, so we now have a perfect copy of the data. To understand the format it's been necessary once again to piece together snippets of information from various sources. The main section of interest is the SCF (System Control File), containing not only the usual components required for IPL (Initial Program Load - or booting) but also the CTS (Customer Test Software) suite. The format appears straightforward and it should be easy to extract all the various components.
The next task is to produce a virtual magnetic tape containing the CTS components, similar to the virtual tape we have been using to boot the system from a laptop. For a normal IPL, only the level one bootstrap is specific to tape or disk, and subsequently loaded components contain all the necessary code to continue the process from either medium. We know that it was possible to run CTS from tape, so the hope is that the CTS software is structured in a similar way and that we have all the necessary components from the copy on disk.
Once the diagnostics can be run from a laptop, we can get started on fault-finding the OCP.
19/08/2010 update from Delwyn Holroyd
During the week I implemented a memory buffer and UART interface in the FPGA of the EDS80 interface. I was able to partially test this without access to the real drive. Once at the museum, after clearing a few simple bugs I was able to issue seek and read track commands to the drive. A download command is then used to read the data for an entire track to the PC, very slowly due to the 115kbps UART! It took several hours to capture the whole of one of the scratch packs.
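A rough calculation shows why the capture takes so long. The figures here are assumptions for illustration: a nominal pack capacity of about 80 MB (as the drive's name suggests), a line rate of 115,200 baud, and 10 bit times per byte (8 data bits plus start and stop bits).

```python
def transfer_hours(capacity_bytes, baud, bits_per_byte=10):
    """Best-case time to move capacity_bytes over an asynchronous serial link."""
    bytes_per_second = baud / bits_per_byte
    return capacity_bytes / bytes_per_second / 3600

hours = transfer_hours(80_000_000, 115_200)
print(f"{hours:.1f} hours")  # prints 1.9 hours
```

This is a lower bound for the raw data alone. Seek time, per-track command overhead and any framing added by the download protocol all push the real figure higher, in line with the several hours observed.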
I then post-processed the captured data to make sure it was correct, checked using the CRC code. I had no idea what data we would find on this scratch pack (the term meaning it contains no useful data and can be overwritten), but looking at a binary dump it clearly contained some non-random looking data. Security erasure of disks, in other words overwriting with zeroes or random data, was not common in the commercial mainframe world at the time.
Looking at the data in ASCII revealed nothing that looked like a string - this was expected since 2900 machines use the EBCDIC character set natively, and 1900 code when emulating a 1900 machine. No strings turned up in an EBCDIC dump either, and since the machine ran in 1900 emulation mode at Tarmac I strongly suspected the pack would be in 1900 format. However we don't have a full specification for the format, which is different to a native 1900 format disk pack. On a 2900 there is a 'wrapper' around the 1900 data in the form of reserved cylinders used for IPL (booting the machine) and flaw management (allocating alternate tracks if some tracks go bad).
I pieced together enough details from various documents to figure out which cylinders contain the 1900 format data, and this seemed to correspond to the data I was looking at. I then started trying various 24-bit unpacking methods to extract 6-bit characters in 1900 code. Nothing worked! No sensible strings emerged no matter what I tried.
I had almost given up when I found an #XPJK dump of an EDS80 system disk from the 2966 machine in Russia. XPJK is a system utility program that prints information on the geometry of a disk pack and shows where files are physically allocated. The printout revealed the number of 1900 blocks per cylinder used on an EDS80, and by comparing this with the amount of data I had captured per cylinder, I realised that the data on the disk pack was not actually packed at all. Instead, each byte contains a 6-bit 1900 character with random values in the top two bits (which is why this wasn't immediately obvious!).
When I decoded the data again using this information, I could immediately see the file area table, which contains some recognisable keywords, and scrolling further down sensible strings emerged!
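The difference between the packed layout I had been trying and the layout actually found on the pack can be sketched as follows. The function names are my own for illustration, the bit order within a packed 24-bit word is an assumption, and the table mapping 6-bit codes to printable 1900 characters is deliberately left out.

```python
def unpack_packed(data):
    """Packed layout: four 6-bit characters in each 24-bit word (3 bytes)."""
    codes = []
    usable = len(data) - len(data) % 3
    for i in range(0, usable, 3):
        word = int.from_bytes(data[i:i + 3], "big")
        # Assumed order: most significant character first.
        codes.extend((word >> shift) & 0o77 for shift in (18, 12, 6, 0))
    return codes

def unpack_six_in_eight(data):
    """6-in-8 layout: one character per byte, top two bits ignored."""
    return [byte & 0o77 for byte in data]

chars = [0o41, 0o42, 0o43, 0o20]

# The same four characters stored one per byte with junk in the top two bits.
on_disk = bytes([0o41 | 0xC0, 0o42 | 0x40, 0o43 | 0x80, 0o20])
# The same four characters packed into a single 24-bit word.
packed = (0o41424320).to_bytes(3, "big")

print(unpack_six_in_eight(on_disk) == chars, unpack_packed(packed) == chars)
```

Applying the packed decoder to 6-in-8 data yields nonsense, which matches the experience described above. In the 6-in-8 layout only 6 of every 8 bits carry information, so a quarter of the space is unused.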
Considering the cost of disk storage at the time, the amount of space wasted in this format is astonishing! The only explanation is that ICL intended emulation of 1900 on 2900 to be a short term solution allowing customers time to migrate their software onto 2900 natively, and so efficiency wasn't high priority. The 6-in-8 bit format is also used in the main store of the OCP when emulating 1900 code.
All the original files on the diskpack had been deleted (but not erased). The data left appears to consist mainly of transaction logs, including messages which would have been output to the operators' screens including the VDU formatting codes. We didn't know a great deal about how this machine was used at Tarmac, but the logs reveal a financial application with screens for entering expenses, checking supplier accounts and so on. It may prove possible to reconstruct what some of these screens would have looked like.
08/08/2010 update from Delwyn Holroyd
During the week I further developed the firmware for the EDS80 disk interface board armed with the information gained from testing last weekend. It now has functions similar to the FTU (field testing unit) which would originally have been used to exercise these drives. Firstly I tested the seek functionality - track to track between defined cylinders, between cylinder 0 and a fixed cylinder and random seek are the options. Seeking between the first and last cylinders repeatedly causes the drive to make a tremendous noise, and it becomes clear why the power amplifier driving the voice coil which moves the heads is so large!
Systems programmers have been known to write programs which would deliberately cause drives like these to 'walk' across the machine room floor by carefully chosen seek operations.
Going beyond the FTU functionality, I've also implemented a full track format decode for ICL format disk packs. The format uses variable length sectors, unlike almost all modern hard drives which use fixed 512 byte sectors. The track decoder reads all the sector headers and checks that the cylinder and head numbers within match the expected values. It also checks the CRC to make sure the data has been read correctly. This can be done repeatedly on one cylinder/head or in sequence based on the selected seek mode.
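The two checks made by the track decoder can be illustrated in software. This sketch does not reproduce the ICL track format or the FPGA firmware: the header layout (16-bit cylinder, 8-bit head, 8-bit record number), the trailing 16-bit CRC and the choice of CRC-CCITT via Python's `binascii.crc_hqx` are placeholders chosen purely for illustration.

```python
import binascii
import struct

def make_sector(cylinder, head, record, payload):
    """Build a sector in the assumed layout: header, payload, 16-bit CRC."""
    body = struct.pack(">HBB", cylinder, head, record) + payload
    return body + struct.pack(">H", binascii.crc_hqx(body, 0xFFFF))

def check_sector(raw, expected_cylinder, expected_head):
    """Return a list of problems found in one sector (empty if it is good)."""
    body = raw[:-2]
    (stored_crc,) = struct.unpack(">H", raw[-2:])
    cylinder, head, _record = struct.unpack(">HBB", body[:4])
    problems = []
    if (cylinder, head) != (expected_cylinder, expected_head):
        problems.append("address mismatch")   # the drive is on the wrong track
    if binascii.crc_hqx(body, 0xFFFF) != stored_crc:
        problems.append("CRC error")          # the data was not read correctly
    return problems

good = make_sector(123, 4, 0, b"variable length payload")
damaged = bytearray(good)
damaged[10] ^= 0x01                           # simulate a single-bit read error

print(check_sector(good, 123, 4))             # prints []
print(check_sector(bytes(damaged), 123, 4))   # prints ['CRC error']
print(check_sector(good, 123, 5))             # prints ['address mismatch']
```

Keeping the address check separate from the CRC check is useful when fault-finding: an address mismatch points at seeking or head selection, while a CRC error points at the read path or the media.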
The problem of stepping over write splice points mentioned last week was quite easy to overcome as they appear to be positioned consistently. By the end of the weekend I was able to read every track on one of the scratch disk packs with consistent results. There appear to be two CRC errors on this pack, both on the same cylinder but on different heads. The drive has a couple of features to assist in error recovery: the heads can be offset a very small amount from the track centreline, and the read strobe can be set earlier or later than normal. However in this case none of these measures resulted in an error-free read. It's possible that the tracks are actually marked as flawed on the disk pack, but I will need to capture all the data and decode it to find out.
The next step is to add a memory buffer to store the data from each track, and an RS-232 UART interface to allow it to be read out to an attached laptop.
01/08/2010 update from Delwyn Holroyd
I constructed a comms loop-back plug for the 7501 terminal with the aid of the recently found and scanned schematic diagrams, and the ROM based loop-back test now works. This gives reasonable confidence that the synchronous comms interface, when built, will be able to talk to the terminal. Now armed with all the necessary information I am also developing an emulator for the terminal. This will enable the teleload process that loads the control program into the terminal from a mainframe to be investigated, using my existing George 3 emulator. Longer term the emulation of the Minicom processor will be useful as a diagnostic aid when repairing processor boards and as a base for emulations of the SCP or DCU if these become necessary.
A lot of progress has been made with the EDS80. Having cleaned the heads, the drive and the (correct) scratch disk pack it was time to try loading the heads on this drive for the first time in eleven years. Thankfully this went without incident and the drive became ready. Once the interface board was hooked up it became clear quite quickly that some faults existed on the drive: one of two clock signals was not present, and when I attempted to make the drive seek more than one track at a time the carriage moved all the way to the end stop (with a tremendous thunk that caused some momentary panic!) and the drive returned a seek error.
The clock fault seemed like it would be straightforward to diagnose so I looked at that first. Referring to the diagrams and wire lists in the manuals it seemed the signal was present on the wire-wrap backplane at the source, but not at the other end, and there was no continuity between the two pins, indicating a backplane fault. At this point I decided it might be easiest to swap the entire logic chassis from the previously working drive. However there was no continuity between these two pins on the other backplane either! Closer examination showed there was no wire between the pins and the signal was actually sourced from a different pin on the clock generation board. The schematic for the clock board did indeed show that pin as being connected to something, but when I traced the circuit back on the board it didn't correspond to the schematic. This is a real pain because it means we can rely on neither the backplane wiring diagrams nor the board schematics to correspond to our drives! Further research will probably be necessary.
Anyway, it was now clear the clock fault lay on the clock board and simply swapping that board over fixed the issue. Handily this also cured the seek fault and the drive now seeks forwards and backwards to any cylinder.
Using the primitive capture facility implemented so far for the interface board I was able to verify correct looking data coming off all the heads of the drive. The first data written on each track is the 'home address' which contains the cylinder and head number, so I was also able to check that the drive had in fact seeked to the correct cylinder too.
During testing one difficulty came to light which will make the data capture process more complicated than I had hoped. When a track is newly formatted the data is written all at once with no discontinuities, but once sectors have been modified discontinuities exist where the drive starts and stops writing to the track. These discontinuities can cause the decoding logic in the drive to lose sync so it isn't possible to simply read the entire contents of a track at once and analyse it later as I had planned; the firmware on the interface board will need to decode the sector format and turn the read gate off and on around possible write splice points. This is not too difficult to implement but the documentation omits to mention where the nominal splice points are. The actual splice points as read back will in any case vary based on head position tolerances and factors in the controller so some experimentation might be necessary to discover the safe window for reading.
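As a rough illustration of the read gate idea, the sketch below works out which stretches of a track could be read with the gate on, given a list of nominal splice points and a guard margin either side of each. The positions, track length and margin are made-up numbers for illustration only; the real splice points are not documented and would have to be found by experiment.

```python
def read_windows(splice_points, track_len, guard):
    """Return (start, end) spans where the read gate can stay on.

    splice_points, track_len and guard are all in the same arbitrary
    units (for example bit cells from index). All values are hypothetical.
    """
    windows = []
    pos = 0
    for splice in sorted(splice_points):
        end = splice - guard          # turn the gate off before the splice
        if end > pos:
            windows.append((pos, end))
        pos = max(pos, splice + guard)  # turn it back on after the splice
    if pos < track_len:
        windows.append((pos, track_len))
    return windows
```

Widening the guard makes the read safer but risks clipping the sync field that follows each splice, which is why the safe window needs to be measured on the real drive.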
12/07/2010 update from Delwyn Holroyd
Following the head crash last week, I examined the affected head in more detail with a USB microscope, and concluded that it will need to be replaced. However, since we don't currently have the necessary re-alignment tools and the special disk pack required, this will have to wait. (We are hoping to acquire these tools.)
In the meantime, I transferred the reconditioned spindle from this drive into one of the others. I also replaced the drive motor bearings in the new drive and gave it a thorough clean. I checked the heads with the microscope and found they were very dirty - they will need cleaning before use. The drive was run for several hours without heads loaded to run in the new bearings.
Whilst that was happening, I turned my attention once again to the 7501 terminal. After checking the documentation in more detail it turns out it can't support UART style communications on the modem port after all, which means it can't be directly interfaced to a standard PC serial port. The buffer chips on the interface boards do support async operation but the board is strapped for synchronous operation only, without start or stop bits. Instead the SYN character (16h) is used to achieve byte alignment at the beginning of each message. The next step will be to wire up a loopback plug to check that the comms is working, and then construction of a suitable interface board.
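To show what byte alignment on SYN means in practice, here is a minimal sketch of a receiver hunting through a raw bit stream for SYN characters and then cutting the following bits into bytes. The bit order (least significant bit first) and the use of two consecutive SYNs are assumptions made for the example, not details taken from the 7501 documentation.

```python
SYN = 0x16  # synchronisation character used for byte alignment

def to_bits(data):
    """Serialise bytes to a list of bits, LSB first (assumed bit order)."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def find_byte_alignment(bits, syn_count=2):
    """Return the index of the first bit after syn_count SYNs, or None.

    syn_count=2 is an assumption; it reduces false matches on line noise.
    """
    pattern = to_bits(bytes([SYN] * syn_count))
    n = len(pattern)
    for offset in range(len(bits) - n + 1):
        if bits[offset:offset + n] == pattern:
            return offset + n
    return None

def bytes_from_bits(bits, start):
    """Assemble whole bytes from the bit stream, starting at 'start'."""
    out = []
    for i in range(start, len(bits) - 7, 8):
        out.append(sum(bits[i + k] << k for k in range(8)))
    return bytes(out)
```

A standard PC UART cannot do this because it relies on start and stop bits around every character to find the byte boundaries.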
05/07/2010 update from Delwyn Holroyd
It's been a while since the last update because of VCF, but nonetheless some progress has been made, and some steps backward....
On the Saturday of VCF the machine decided not to play nicely: the store cabinet indicated a +5V fault (although there wasn't one, luckily), it refused to boot from the laptop interface, and later in the afternoon the OCP overheat warning came on, although once again I am not convinced - it didn't appear to be any warmer than normal. We switched off anyway to avoid any risk of damage.
Things were better on Sunday, with the machine deciding to boot again after I cleaned the contacts on the off-card connector linking the DCU to the laptop. I suspect the real problem here is marginal signal quality due to the construction technique of the interface board (point-to-point mod wire with no ground plane). There was no OCP overheat today, but the +5V fault warning was still present. This will be a fault in the monitoring board.
Just before VCF the power supply for the 7501 terminal was re-tested by Phil H and found to be working - taking it apart seems to have fixed it so possibly just reseating the PCB connectors was all that was necessary. The week after VCF the power connectors onto the backplane were cleaned and the unit re-assembled. To my surprise it appears to be working! It can't do much without having a control program loaded into it, but the ROM code does some self-tests and has store dump and alter functions and these were used to dump the ROM contents to screen. The next step is to set up the interface board for standard async RS232 comms (in a mainframe application it uses synchronous comms). Once this is done it should be possible to interface the terminal directly to a George 3 emulator running on a PC, and George will download the control program which turns it into a functional terminal.
I have now finished building the EDS80 interface board - this is properly constructed on a PCB and even uses surface mount technology: slightly incongruous but it's much easier to obtain 3V3 logic level differential transceivers in surface mount. On Sunday the board was hooked up to the working drive ready for initial testing. After fixing a problem with one of the ribbon cables I was able to issue a 'select' command to the drive, and the drive responded with status information and its selected signal. The data clocks from the drive were present but free running at around 14MHz since no diskpack was loaded. It was a great relief to find that I hadn't made any errors in the pinout on the cables to the drive.
Here comes the bad news - the next step was to load the scratch pack and see if data could be read. The heads loaded ok and the data clock signals went down to around 9.6MHz, the frequency expected when the PLL is locked to the servo track on the diskpack. Before I could do anything else, I noticed a high-pitched noise from the drive followed immediately by a burning smell - the drive was spun down within seconds but I had just witnessed a head crash, something I've been incredibly paranoid about avoiding at all costs.
As the disk slowed down the cause was immediately apparent - the bottom guard platter was bent, and there was some dust evident on the disk surfaces. At this point I realised it was not the normal scratch diskpack - during VCF they had all been moved around and without thinking I had picked up the wrong one. I expect you can guess how annoyed I am with myself about this!
Examination showed the crash was on the bottom head, closest to the bent guard platter. It's possible this generated enough disturbance in the airflow to cause the problem, or it could have been simply down to the dust on the pack. It was time to follow the procedures in the drive manual for head crash recovery, and the drive and heads are now clean again, but there is a slight mark on the affected head. I will be seeking further advice on this before using this drive again.
The greatest irony was when I noticed the number on the diskpack casing: 666 - truly the devil's diskpack!
31/05/2010 update from Delwyn Holroyd
This week I made a concerted effort to find the missing OCP board that was indicated as possibly faulty in the diagnostic run of several weeks ago. Whilst comparing the contents of the spares box with the actual board numbers that should be in the machine, I discovered one of the board numbers I had written down is not actually part of the machine! Sure enough the bag was mislabelled, and it contained the board we have been looking for. However, there are still only 29 boards in the box out of 30 in the OCP upper platter, so one of the set is missing and nowhere to be found amongst the other spares.
Swapping this board didn't make any difference to the error messages reported on a normal boot, not too surprising as we know there were a number of faults reported in the diagnostic run.
I also spent some time trying to diagnose the store block fault, which is still present - but I was unable to make it fail using the store self-tester. This might indicate an addressing error - the store self-test writes the same data to each location, so would not pick up on this.
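The reason a same-data test can miss an addressing error is easy to show with a toy model. The sketch below simulates a store with one address line stuck at zero (a hypothetical fault chosen for illustration, not a diagnosis of our machine) and compares a test that writes one pattern everywhere with one that writes each location's own address.

```python
class FaultyStore:
    """Toy store in which one address line is stuck at 0 (hypothetical)."""

    def __init__(self, size, stuck_bit):
        self.cells = [0] * size
        self.mask = ~(1 << stuck_bit)  # clears the stuck address bit

    def write(self, addr, value):
        self.cells[addr & self.mask] = value

    def read(self, addr):
        return self.cells[addr & self.mask]

def same_data_test(store, size, pattern=0x55AA55AA):
    """Write one pattern to every location, then read it all back."""
    for addr in range(size):
        store.write(addr, pattern)
    return all(store.read(addr) == pattern for addr in range(size))

def address_in_data_test(store, size):
    """Write each location's own address, then read it all back."""
    for addr in range(size):
        store.write(addr, addr)
    return all(store.read(addr) == addr for addr in range(size))
```

With the stuck address line, pairs of addresses alias to the same cell. The same-data test still passes because every cell holds the same value anyway, while the address-in-data test fails as soon as one location reads back its alias's value.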
Since we acquired the ICL 7501 terminal a month or so back I've been searching in the ICL archive at the museum for schematics, without any success. I had discovered that the related 7502 terminal processor has all its diagrams grouped into a 'machine logic set' under one document number, which I found referenced in a technical description. I suspected the same thing would apply to the 7501, but how to find it? As luck would have it today I stumbled across the technical description for one of the 7501 boards (document number one higher than one I had already found), and this proved to reference the elusive machine logic set document, which contains the schematics for everything in the unit except for the Farnell made power supply. This will greatly assist any fault-finding that might be necessary.
22/05/2010 update from Delwyn Holroyd
The ICL 7501 power supply has been returned to Phil H for more detailed examination, but we are somewhat hampered by not having any schematics or other information for it since it's a Farnell made unit. I've made contact with a company that specializes in old Farnell power supplies in the hope they can turn up some information on it.
I re-assembled the EDS80 drive motor I took apart last week with new bearings and tried it in a drive. As this seemed to work ok I then removed the noisy motor from the 'good' drive to replace its bearings. This one was extremely difficult to get apart - after some work with the rubber mallet the brake assembly finally came off the drive shaft (it just lifted free on the first motor...) and after quite a lot more persuasion the other parts were finally separated. Re-assembly was much more straightforward, and the drive now runs like new.
Work is also progressing on the design of a drive interface rig. This will allow the EDS80 drive to be controlled directly to read disk contents at the lowest possible level, in order to secure the data.
16/05/2010 update from Delwyn Holroyd
The working EDS80 drive was run once again for a number of hours without heads loaded to continue the spindle bearing run-in. The bearings in the drive motor are very noisy, and this is the case with most of the other motors too. I've disassembled one and ordered new bearings. Once this job is done the drive should run as quietly as when it was new! This is important because it will enable us to hear any unusual noises coming from the drive which might indicate an impending head crash.
The power supply for the ICL 7501 terminal was refitted to the logic chassis having been checked on the bench, and cables made to power the logic chassis and fans independently from the rest of the monitor. Under normal load the main +5V supply was fine but the +-12V and -5V supplies went out of spec, so it was quickly turned off again. This will now require a more thorough examination with all voltage rails under a representative load.
10/05/2010 update from Delwyn Holroyd
Some major progress to report... a serviceable EDS80 disk drive and the first run of the engineers' diagnostic software on the machine.
The bearings have been replaced in one of the most seized up spindles, chosen because there was a risk that trying to dismantle it could have caused damage. This has now been fitted to the drive that the system was booted from just before its spindle started to make unpleasant rattling noises. At the time we didn't know the construction of the spindle (there are no diagrams because it wasn't intended to be a field serviceable part) and it was unclear exactly what these noises might indicate.
As it turns out the spindles contain two sealed ball-bearings of a standard type which are easy to source. The main difficulty lies in the amount of force required to remove the pulley from the shaft. The lower bearing and pulley resist the pressure of a spring that pre-loads both bearings and eliminates any play in the assembly, and therefore have to be a very tight fit. The rattling noises are due to the bearings starting to break up because the 'for life' lubrication has degraded. If ignored this could lead to a catastrophic bearing failure and probably a head crash.
Another issue with replacing the drive spindle is getting the alignment correct. There was a special alignment tool for this, but we don't have one. The ICL engineer who maintained the system at Tarmac told us the alignment wasn't as critical as the maintenance instructions imply, and this has proven to be the case. I aligned the spindle essentially by eye to score marks on the drive casting marking the original position. The track positions are located by pre-recorded servo information on one of the surfaces, so it is only necessary to ensure that the heads move on a path passing through the centre of the spindle, such that the tracks run perpendicular to the heads.
After a run in of the new bearings without heads loaded, the drive was run with a scratch pack and heads loaded for some time without issue.
The next step was to load the engineers disk pack, which had not been done before. It proved to be in good condition and loaded ok. The machine booted from it happily and started to run the diagnostic test suite. The first part of this does detailed tests on the DCU, which all passed. Further tests identified faults in SCU couplers and in the OCP. One of the store blocks is also failing - a new fault. The fault codes can be checked against a listing which identifies the most likely board responsible. Unfortunately the first OCP fault indicates a board that is mysteriously absent from the box which contains an otherwise full set of spare boards - further searching will be required!
It's too risky to repeatedly load the engineers pack for diagnostics until the data on it has been secured, and this is now the most urgent task. The fact that this pack is readable is very good news indeed for the restoration.
26/04/2010 update from Delwyn Holroyd
The museum has recently acquired an ICL 7501 terminal on loan from the Jim Austin Computer Collection. Once restored, we intend to connect it to the 2966 as a user terminal. It's typical of the type of end user equipment used on ICL mainframe systems in the early 1980s.
ICL mainframes required terminals implementing proprietary communications protocols such as ICLC01 and ICLC03. Unlike Unix systems which use relatively dumb character based terminals, on an ICL system a complete message is constructed by the user on the terminal and then sent to the mainframe. This means the terminal needs to directly support cursor movement and message editing. It also has facilities for dividing the screen into protected and unprotected fields, typically used to display a form with areas for the user to fill in. Messages could even be validated by the terminal prior to sending to the mainframe, for instance checking that only numeric characters are entered in a particular field.
The 7501 is an integrated version of the earlier 7502 comms controller and a 7561 video terminal (the type used on the 2966 SCP operating station): instead of having the separate 7502 cabinet containing the controller logic it's built into the base of the terminal itself, resulting in a somewhat taller unit than the 7561 with a row of switches and LEDs below the screen.
Much of the controller logic is also shared with the SCP, with the familiar Minicom processor also found in the DCU, the modem board and memory boards in common. The main difference is the video display board which supports an 80-column display rather than the 40-column deemed more appropriate for system operators.
The 7500 series terminal controllers required 'teleloading' to obtain their control programs. The built-in ROM code has just enough intelligence to request a teleload from the mainframe, which then downloads the required program. As a consequence these systems do a good impression of being completely non-functional until this has happened. We'll be able to test this procedure under George 3 emulation on a PC: readers of this page will realise the 2966 is not quite up to the job yet! Luckily the required teleload utilities and control programs have survived in a dump of a George 3 filestore.
Very little 7500 series terminal equipment seems to have survived, so we are always on the look out. If you know where there are any of these distinctive orange terminals, or even the older blue and grey 7181 terminals, please get in touch with the museum.
07/03/2010 update from Delwyn Holroyd
The failed DCU power supply gave an opportunity to do some spring cleaning around the 2966 area last week, but this week I was able to resume work on the machine. Many thanks to our resident power supply expert Phil H for examining and testing the spare -5V supply: although it looked bad on the outside thankfully it was clean on the inside and proved to work. This has now been fitted in the machine. Meanwhile armed with some new LM311s Phil was able to repair the other unit and this will now be the spare.
I first of all checked that we hadn't suffered any more regressions: the machine still boots to the same extent it did before, and the store is still working.
The main task of the day was cleaning all the board edge connectors in the OCP (or CPU in today's terminology). It's not clear when this was last done, and the maintenance logs for the system show it was a fairly routine operation which frequently 'cured' faults (although whether this was down to the cleaning or the physical movement of the boards is open to debate). This revealed that the clock distribution board in the scheduler wasn't actually plugged in, which clearly wouldn't have been helping matters! I also confirmed that all the boards were in the correct slots.
Unfortunately, none of this changed the fault condition at all, so no easy short-cut in the diagnostic process!
The OCP is by far the most complex part of the system. Unlike the rest of the system it's built using ECL (emitter coupled logic) technology, and consists of sixty individual boards mounted on two backplanes. ECL is much faster than TTL, but consumes a great deal more power. The OCP doesn't obey 2900 target level instructions directly, instead it has a microcoded instruction set known as MICOS II, aided by the scheduler which breaks down the target level instructions into one or more microcode 'tasks'. This makes it fairly easy to emulate other instruction sets: 1900 and System 4 were supported (our machine has a 1900 decoder board). The basic clock beat is 80ns (12.5MHz) although some steps occur at 40ns. It has a pipelined architecture which allows one microcode instruction to be completed every clock beat. Target level instructions take a variable number of clocks depending on how complex they are. Most data paths are 32-bit, with 36-bit extensions in some places, and also support for efficiently converting to and from the 24-bit 1900 architecture.
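The clock figures quoted above can be checked with a line of arithmetic; the sketch below simply converts a beat period in nanoseconds to a frequency in MHz. The function name is made up for the example.

```python
def beat_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns

basic_beat_mhz = beat_to_mhz(80)  # the 80ns basic clock beat
half_beat_mhz = beat_to_mhz(40)   # the steps that occur at 40ns
```

An 80ns beat corresponds to 12.5MHz and a 40ns step to 25MHz. Since the pipeline completes one microcode instruction per beat, the peak rate is 12.5 million microcode instructions per second, with target level instructions taking several of those each.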
Given the current completely non-functional state of the OCP, and without the aid of the diagnostic software it's difficult to know where to start. Over the last couple of months I've been scanning and studying all the detailed reference documentation from aperture cards in the archive. Armed with this knowledge the diagnostic registers are starting to make sense, but there's still a lot to learn!
21/02/2010 update from Delwyn Holroyd
Early in the day there was some difficulty in booting via the laptop interface, with the system indicating parity errors on the interface. This has happened before when the system is cold, but normally clears after a few attempts. Today after a few dozen attempts it was clear it wasn't going to. Cleaning the contacts on the off-card connector for the interface cable eventually cured it, and afterwards it worked reliably.
There were no problems with the store, and all blocks passed self-test again.
Unfortunately the -5V power supply in the DCU cabinet then chose to die, and the only spare looked in a very sorry state, so no more work can be done on the main cabinets until these have been looked at by our PSU expert Phil H.
15/02/2010 update from Delwyn Holroyd
The objective for this week was to get the store working, and I'm happy to say this was achieved.
Following the board replacements last time the store self-tester worked as expected, and soon showed that two of the four sub-stores had different stuck bits when reading back from any memory location. To narrow down the fault I swapped two of the sub-store control boards to see if the fault followed - however the fault actually disappeared! Reseating the control board for the other faulty sub-store also cured that fault. Presumably these are dry joints and I expect we haven't seen the last of them, but at least the cause should be clear if/when they do re-occur.
With a working store, the next task was to start replacing the boards swapped out last time to identify which actually had faults. It became clear that at least one fault was associated with the cabling between the second and third cabinets - the interface between the store and the coupler in the SCU. It now seems likely that some of the symptoms of the mysterious faults last time were cured by reseating the cabling. The cables are very solidly made woven ribbon with a sealed termination onto a standard header, and look to be in good condition. Hopefully contact cleaner will resolve any lingering reliability issues with these, although it's possible the terminations have degraded.
During the board replacement process, another of the sub-stores started to fail intermittently, and then permanently. This time when the control board was exchanged with another the fault followed, and the faulty control board was replaced with a spare.
Although the day ended with a full 8MB of working store, it's likely some of the intermittent faults will return. Hopefully they will become permanent, which makes them a lot easier to find!
03/02/2010 update from Delwyn Holroyd
After swapping many boards in the SCU, the cause of last week's fault was traced to the SM64 control module in the store cabinet. The behaviour during store initialisation is now different in several respects: not only does it take much longer, but the SCP configures the store into 'non-interleaved' mode.
The store module consists of four sub-stores, each divided into logical blocks. The sub-stores are normally operated in parallel, or 'interleaved' to speed up accesses to store - modern server motherboards use a similar scheme to increase memory bandwidth. If there are faults in one or more sub-stores the system can instead fall back to non-interleaved mode. Individual logical store blocks can also be marked bad and the system will avoid using them. A diagnostic status register indicates which store blocks are good.
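The difference between the two modes can be sketched as an address mapping. This is a simplified model that assumes the lowest address bits pick the sub-store when interleaved; the real SM64 mapping may well differ, and the function names are invented for the example.

```python
SUB_STORES = 4  # the store module has four sub-stores

def interleaved(addr, words_per_sub_store):
    """Consecutive addresses rotate round the sub-stores (assumed scheme).

    words_per_sub_store is unused here; it is kept so both functions
    share the same signature.
    """
    return addr % SUB_STORES, addr // SUB_STORES

def non_interleaved(addr, words_per_sub_store):
    """Each sub-store holds one contiguous range of addresses."""
    return addr // words_per_sub_store, addr % words_per_sub_store
```

Both functions return a (sub-store, offset) pair. Interleaved, a run of four consecutive addresses lands on four different sub-stores, which can then work in parallel. Non-interleaved, the same run lands on a single sub-store, which is slower but allows a faulty sub-store to be avoided entirely.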
Prior to the most recent fault, the 'good block' register had a consistent but unexpected collection of bits set. The machine only has a half-populated store module (8MB, maximum is 16MB) so all the bits should be set in one half of the register, but this was not the case. This behaviour together with some other oddities had made me suspect the store wasn't previously functioning properly at all.
With the replacement boards, the set bits are now all in one half as expected, and it appears that two of the four sub-stores are not functioning. This is of course entirely believable!
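The sanity check applied to the 'good block' register can be expressed as a small function. The 16-bit width and the example values are assumptions for illustration; the actual register layout is not given here.

```python
def confined_to_one_half(value, width=16):
    """True if every set bit lies in the same half of the register.

    width=16 is a hypothetical register size, not the real one.
    """
    half = width // 2
    low = value & ((1 << half) - 1)
    high = value >> half
    return low == 0 or high == 0
```

For a half-populated store module any pattern with bits set in both halves indicates something is wrong, whichever individual blocks are flagged as good.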
We don't have nearly as many spares for the SCU and Store modules as for the DCU, so repairs on failed boards will be necessary. Away from the museum work is continuing on building an expanded board test rig. The new test rig will also be compatible with DCU boards, but will have a larger number of I/O channels to interface with the SCU boards. Work is also continuing on scanning the relevant technical descriptions and logic diagrams from aperture cards.
25/01/2010 update from Delwyn Holroyd
Last week the boot process started to fail with an error message 'Invalid SCU coupler type', which I thought probably referred to an incorrect entry in the configuration file that had just been sent to the SCP at that stage - the file hadn't changed but possibly it was being corrupted on the way. In view of this and other strange behaviour seen last week I swapped the processor board in the SCP (system control processor), but to no avail. I also swapped the serial interface boards at each end of the link between the SCP and the DCU to eliminate that as a source of corruption (somewhat unlikely, given that the SCP's control program had just been loaded successfully via the same route).
A bit more digging using diagnostic commands from the SCP showed that it wasn't possible to access any of the coupler registers in the SCU. This is done via the DCM (Diagnostic Control Module) which is attached to another serial interface on the SCP. In addition to their normal operation, all the registers in the SCU and its couplers are connected together serially to form a number of loops. To read a register, the relevant loop is 'spun' so that the required bits are loaded serially into a buffer in the DCM. To write a register, bits are loaded from the buffer into the loop.
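The loop mechanism is similar in spirit to a scan chain, and a toy model makes the read and write operations concrete. The register names, widths and the full-revolution behaviour below are all assumptions made for the example; the real DCM protocol is more involved.

```python
from collections import deque

class ScanLoop:
    """Toy model of registers chained serially into one loop."""

    def __init__(self, layout):
        # layout: list of (register name, width in bits), hypothetical
        self.layout = layout
        self.bits = deque([0] * sum(width for _, width in layout))

    def _span(self, name):
        start = 0
        for reg, width in self.layout:
            if reg == name:
                return start, width
            start += width
        raise KeyError(name)

    def spin(self, replace=None):
        """Shift the loop one full revolution, copying bits to a buffer.

        If 'replace' is given, its bits are fed back into the loop
        instead of the bits that came out.
        """
        buffer = []
        for i in range(len(self.bits)):
            bit = self.bits.popleft()
            buffer.append(bit)
            self.bits.append(bit if replace is None else replace[i])
        return buffer

    def read(self, name):
        start, width = self._span(name)
        buffer = self.spin()
        return int("".join(map(str, buffer[start:start + width])), 2)

    def write(self, name, value):
        start, width = self._span(name)
        value &= (1 << width) - 1  # keep the value within the register
        buffer = self.spin()
        buffer[start:start + width] = [int(b) for b in format(value, f"0{width}b")]
        self.spin(replace=buffer)
```

In this model a break anywhere in the loop would make every register on it unreadable at once, which fits the symptom of all the coupler registers becoming inaccessible together.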
After some fruitless attempts to read from coupler registers, it spontaneously started to work again! The boot now progressed beyond the invalid coupler type error, so this was clearly referring to a failed attempt to read the property code from a coupler (which identifies the coupler type). After a power cycle of the SCU cabinet, the registers were no longer accessible and we were back where we started. I suspect the cause of this fault probably lies within the DCM.
Even whilst the coupler registers were accessible, the boot process still failed at the store initialisation stage, and this time the 'SCU Reset' trick from last week didn't help. Unfortunately I have to conclude this was a red herring, and there is probably another intermittent fault that just happened to be taking a break last week.
I refitted a repaired 5V/150A power supply module in the DCU cabinet, testing it first with no load and then with a partial load (only some of the logic boards plugged in). Finally with all boards plugged in I balanced the three 5V supply modules so that they were sharing the load equally. Thanks to Phil H for the repair, which involved replacing a failed IC in the switching circuit.
18/01/2010 update from Delwyn Holroyd
I arrived at the museum fully prepared for a day of debugging the store, this being the area where the boot process has been failing. Before getting into that, I realised that following power-up the 'store running' indicator light on the SCU control panel was not illuminated, and a manual activation of the 'SCU Reset' control was necessary. When I attempted to boot again the store initialized successfully! The following steps loaded the initial OCP microprogram and started it running. At this point it stopped with another error, saying the OCP is faulty... unfortunately activating the OCP Reset control didn't help, not all problems are so easily cured!
Not having prepared for conducting OCP diagnostics, I decided to verify diagnostic access to the store from the SCP console. Even before the system has fully booted, diagnostic commands can be entered on the SCP in engineers mode. These allow access to the internal state of the SCU and its couplers, main store and OCP. Enough of the registers were behaving as expected to convince me that the diagnostic interface was working, but a number of things did not behave as documented in the fault-finding and reference guides. It could be that the documentation doesn't match the hardware, or it could be a side-effect of a more subtle fault.
The store is divided into blocks, and part of the initialization procedure runs a self-test to validate each block and mark it as valid in a diagnostic register. From this register it appears that several store blocks are failing. A lot of further investigation will be required.
Towards the end of the afternoon, the boot process stopped working altogether, with a consistent error occurring at a much earlier stage. I suspect the problem is once again the serial interface between the DCU and the SCP: the original boards at both ends of the link had faults and were swapped out just before Christmas.
10/01/2010 update from Delwyn Holroyd
Over the Christmas break I spent some time examining a CME installation tape which I have in virtual form as a file on a PC. CME (Concurrent Machine Environment) allows the system to host VME (the native 2900 operating system), and a 1900 operating system at the same time. The installation tape contains a package of microcode for the machine, and it proved possible to extract all the necessary IPL elements from this to construct a virtual IPL tape. Most of the effort was spent in analyzing the first and second level bootstrap programs for the DCU to figure out what format they expect to find on the tape, and the commands being sent to the tape deck. I then adapted the software I wrote at the end of last year for the PC interface to make it emulate a tape deck in the required fashion. I tested it this week and successfully booted the system to the same point we had reached when booting from disk.
This is great news as it allows fault finding to proceed on the SCU and OCP without needing a working disk drive, and without any risk to our valuable bootable disk packs.
I also removed the spindle from one of the EDS80 drives for further examination.