

Interesting Instruction Sets

This page describes two processor designs (MAL1 and Cobweb) that were heavily influenced by the sheer elegance and simplicity found in three externally published processor designs:

von Neumann-style Imperative Language Processors

The Manchester University Prototype (MU0) was the first stored-program computer in the world to run a computer program (21-Jun-1948). It was only a demonstration machine, though, and lacked, amongst other things, the ability to perform indirect addressing (although this could be achieved using self-modifying code: performing arithmetic on the instruction itself, in the accumulator, storing the result back to memory and then executing it).

The AGC implemented indirection extremely neatly, by having an INDEX instruction. The execution of this instruction caused the contents of the addressed cell to be held in a special-purpose register, and to be added to the address field of the next instruction to be executed. However, the final AGC ended up as a very bitty (though pragmatic) design, and lacked the orthogonality that was later so impressive on, for example, the PDP11. Unfortunately, though, the PDP11's orthogonality came at the cost of complexity (CISC rather than RISC).

The instruction set that I called MAL1 (as described in my Ph.D. thesis, Shute 1983) was based on the best parts of that of the AGC, and was intended to be a semi-minimalist design. The goal was not instruction-set minimalism for its own sake (Bown and Hopkins, of the University of Manchester, demonstrated an instruction set with just one instruction: Reverse Subtract, Store back to memory, and Jump if not Zero). The goal, as with the NVDAC (Shute, 1984a, 1984b), was to minimise the amount of hardware needed for the central processor. The instruction set was as follows:

0:  JMS adrs  Jump (indirect) to Subroutine
1:  STR adrs  Store ACC at (Adrs)
2:  IND adrs  Index on (Adrs)
3:  BIC adrs  Bit-clear (Adrs) into ACC
4:  ROR adrs  Load right-rotation of (Adrs) into ACC
5:  COM adrs  Load 2s complement of (Adrs) into ACC
6:  ADD adrs  Add (Adrs) into ACC
7:  DCS adrs  Decrement ACC and jump to Adrs if not zero
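
To make the semantics concrete, the following is a minimal sketch, in Python, of how a MAL1-style machine might be simulated. The 16-bit word, the 3-bit opcode / 13-bit address layout, and the exact conventions for JMS, BIC and subroutine linkage are illustrative assumptions here, not details taken from the thesis.

# Hypothetical, minimal simulator for a MAL1-style machine.  The word size,
# the instruction layout and the JMS/BIC conventions are assumptions made
# purely for illustration.

WORD = 16
MASK = (1 << WORD) - 1

def rotate_right(x):
    return ((x >> 1) | ((x & 1) << (WORD - 1))) & MASK

def run(mem, pc=0, max_steps=100):
    acc, index = 0, 0
    for _ in range(max_steps):
        opc, adrs = mem[pc] >> 13, mem[pc] & 0x1FFF
        adrs = (adrs + index) & 0x1FFF      # IND adds to the next instruction's address field
        index = 0
        pc += 1
        if opc == 0:                        # JMS: jump (indirect) to subroutine
            dest = mem[adrs] & 0x1FFF       # assumed: pointer to the subroutine held at (adrs)
            mem[dest] = pc                  # assumed: return address planted at the destination
            pc = (dest + 1) & 0x1FFF
        elif opc == 1:                      # STR: store ACC at (adrs)
            mem[adrs] = acc
        elif opc == 2:                      # IND: index on (adrs)
            index = mem[adrs]
        elif opc == 3:                      # BIC: bit-clear (adrs) into ACC (assumed: ACC := (adrs) AND NOT ACC)
            acc = mem[adrs] & ~acc & MASK
        elif opc == 4:                      # ROR: ACC := (adrs) rotated right by one bit
            acc = rotate_right(mem[adrs])
        elif opc == 5:                      # COM: ACC := two's complement of (adrs)
            acc = (-mem[adrs]) & MASK
        elif opc == 6:                      # ADD: ACC := ACC + (adrs)
            acc = (acc + mem[adrs]) & MASK
        else:                               # DCS: decrement ACC, jump to adrs if not zero
            acc = (acc - 1) & MASK
            if acc != 0:
                pc = adrs
    return acc

def instr(opc, adrs):
    return (opc << 13) | adrs

mem = [0] * 32
mem[0] = instr(6, 10)       # ADD (10)    ; ACC := 0 + 5
mem[1] = instr(6, 11)       # ADD (11)    ; ACC := 5 + 7
mem[2] = instr(1, 12)       # STR (12)    ; store ACC at address 12
mem[10], mem[11] = 5, 7
run(mem, max_steps=3)
print(mem[12])              # -> 12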

For the extended MAL2 design, one exploration into expanding the arithmetic and logic operations of the instruction set involved a data-path that would take in two integers, A and B, and deliver the result of FA( g1.A, op(A,B), g2 ), where FA is a full-adder, g1 is a simple AND-gated input for argument A, g2 is a carry-in bit, and op(A,B) is one of the sixteen bitwise operations on the two integers A and B. One bit in the opcode is used to choose between a bitwise logic operation (opc4=0) and an arithmetic operation (opc4=1). For the former, g1 and g2 are held at 0, and the four bits opc3, opc2, opc1 and opc0 are read as a truth table for the bitwise logic operation. This truth table is fed into a bank of four-to-one multiplexers, one for each bit position of the operands and the result, with the two control bits of each multiplexer supplied by the corresponding bits of the A and B inputs.

0000:CLR  1000:AND  0100:RBIC  1100:LDA  0010:BIC  1010:LDB  0110:XOR  1110:OR
0001:NOR  1001:XNOR  0101:NOTB  1101:NBIC  0011:NOTA  1011:NRBC  0111:NAND  1111:M1
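
The multiplexer mechanism can be sketched in a few lines of Python. The bit-ordering convention used here (which truth-table entry corresponds to which combination of A and B bits) is an assumption for illustration, and may differ from the ordering used in the table above.

# Sketch of the "truth table drives a bank of 4-to-1 multiplexers" idea.
# Convention assumed here: tt[0] is the output for (A,B) = (0,0), tt[1] for
# (0,1), tt[2] for (1,0) and tt[3] for (1,1).

def bitwise_op(tt, a, b, width=16):
    """Apply a 4-entry truth table tt bitwise to the integers a and b."""
    result = 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= tt[(abit << 1) | bbit] << i   # the per-bit 4-to-1 multiplexer
    return result

AND = (0, 0, 0, 1)   # output 1 only when A=1 and B=1
XOR = (0, 1, 1, 0)
LDB = (0, 1, 0, 1)   # output follows B, ignoring A

assert bitwise_op(AND, 0b1100, 0b1010) == 0b1000
assert bitwise_op(XOR, 0b1100, 0b1010) == 0b0110
assert bitwise_op(LDB, 0b1100, 0b1010) == 0b1010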

For the arithmetic operations (opc4=1), g1 is set to 1, g2 can be one of "0, 1, c, msb(A)", and the truth table is artificially forced to one of "Z0, LDA, LDB, NOTB, NOTA, M1" (where Z0 is a synonym for CLR). The forced truth-table bits are given by ftt3=XOR(opc3,AND(opc2,opc1)), ftt2=XOR(opc3,opc2), ftt1=XOR(opc3,opc1), ftt0=XOR(opc3,OR(opc2,opc1)). (Concatenated together, these four truth tables form a symmetrical 4x4 matrix, which is surprising given the somewhat ad hoc way in which the elements of the matrix were added.) The opc0 bit is used to select the contents of the carry-latch, and the "msb" setting is used as a cheap-to-implement minor adjustment to convert an otherwise duplicate instruction into something more useful. That just leaves the duplicate instruction "LD A" (already provided by "LDA"), and a desire to find a place for a "ROL A" instruction, "FA(A,LDA,msb)" (though a case could be made for this replacing the "SHLC A" instruction).

0000:LD A  FA(A,Z0,0)    0001:INCC A  FA(A,Z0,c)    0010:SHL A  FA(A,LDA,0)    0011:SHLC A  FA(A,LDA,c)
0100:ADD A,B  FA(A,LDB,0)    0101:ADC A,B  FA(A,LDB,c)    0110:DEC A  FA(A,M1,0)    0111:DECB A  FA(A,M1,c)
1000:SETC  FA(A,M1,1)    1001:DECP A  FA(A,M1,msb)    1010:SGN A  FA(A,NOTA,msb)    1011:LDC  FA(A,NOTA,c)
1100:SUB A,B  FA(A,NOTB,1)    1101:SBB A,B  FA(A,NOTB,c)    1110:INC A  FA(A,Z0,1)    1111:INCN A  FA(A,Z0,msb)

Overall, this leads to:

 tt3 = XOR(opc3, AND(opc4,opc2,opc1))
 tt2 = XOR(AND(opc4,opc3), opc2)
 tt1 = XOR(AND(opc4,opc3), opc1)
 tt0 = IF(opc4 then XOR(opc3,OR(opc2,opc1)) else opc0)
 g1  = opc4
 g2  = opc4.(opc0.IF(AND(opc3,XNOR(opc2,opc1)) then msb else c) OR NOTopc0.IF(AND(opc3,NOTopc2,opc1) then msb else opc3))
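
As a cross-check, the following sketch evaluates the FA( g1.A, op(A,B), g2 ) data-path for a few rows of the arithmetic table above, with the forced truth tables written out directly as functions rather than decoded from the opcode bits; the word width is an illustrative assumption.

# Sketch of the arithmetic data-path: result = FA(g1.A, op(A,B), g2), where FA
# is a full adder (here, ordinary addition modulo 2^WIDTH), g1 gates the A
# input, and g2 is the carry-in.

WIDTH = 16
MASK = (1 << WIDTH) - 1

def fa(a_in, op_out, g2):
    return (a_in + op_out + g2) & MASK

def datapath(a, b, op, g1, g2, carry=0):
    gated_a = a if g1 else 0
    carry_in = {"0": 0, "1": 1, "c": carry, "msb": (a >> (WIDTH - 1)) & 1}[g2]
    return fa(gated_a, op(a, b) & MASK, carry_in)

# A few rows of the table above (Z0 = 0, LDB = B, NOTB = ~B, M1 = -1, LDA = A).
a, b = 25, 9
assert datapath(a, b, lambda A, B: B,  1, "0") == a + b    # ADD A,B = FA(A, LDB, 0)
assert datapath(a, b, lambda A, B: ~B, 1, "1") == a - b    # SUB A,B = FA(A, NOTB, 1)
assert datapath(a, b, lambda A, B: A,  1, "0") == 2 * a    # SHL A   = FA(A, LDA, 0)
assert datapath(a, b, lambda A, B: -1, 1, "0") == a - 1    # DEC A   = FA(A, M1, 0)
assert datapath(a, b, lambda A, B: 0,  1, "1") == a + 1    # INC A   = FA(A, Z0, 1)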

One might ask whether such attempts at minimalism are still relevant today, in the days of CISCs with floating-point instruction sets, highly optimised RISCs, and pipelined, eagerly executing instruction streams. There are many applications, though, where the processing is not the primary function. For instance, a memory chip or communications chip might be given rudimentary processing ability to get on with simple housekeeping tasks, without it taking too much of the silicon real estate away from the primary function.

Non-von-Neumann-style Functional Language Processors

The goal of the instruction set of the Cobweb machine was different (Shute, 1992). This machine was designed to be completely non-von-Neumann.

SKIM had shown that it was possible to conceive of a machine that executed combinators (Curry and Feys, 1958) as its primitive instruction set. But this machine was emulated in programming code, or at best in microcode, on a conventional, von Neumann-style computer. Similarly, the Manchester University Dataflow Machine (as described, for example, in Duce (1984)) was emulated using a conventional von Neumann-style program.

But the beauty of the S, K and I combinators (along with B, C, W, B', C' and K") is that they each involve extremely simple hardware operations (even after allowing for the needs of the garbage collection mechanism).

Cobweb, therefore, was given an instruction set whose assembler language was itself a functional programming language, not unlike a simplified version of Lisp.

As well as the combinators, primitives were also included for operations on the integers (ADD, SUB, MULT, DIV, etc.). A program to evaluate a simple polynomial, for example, might be written in Sugar (Glaser, Hankin and Till, 1984) as:

 poly = [x] x*x - 9;
 poly(5) ?
and would compile to Cobweb code as:
 (poly: C' - (t1: S' * I I) 9)
 (outp: poly 5)
or, rather, as:
 (poly: C' - t1 9)
 (t1: S' * I I)
 (outp: poly 5)
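
Using the usual reduction rules for these combinators (I x = x, S' f g h x = f (g x) (h x), C' f g x y = f (g y) x), the compiled code can be checked directly, here with the combinators written as curried Python functions purely for illustration:

# Sketch: evaluating the compiled code with the usual definitions of the
# combinators:
#   I  x       = x
#   S' f g h x = f (g x) (h x)
#   C' f g x y = f (g y) x

I  = lambda x: x
S_ = lambda f: lambda g: lambda h: lambda x: f(g(x))(h(x))
C_ = lambda f: lambda g: lambda x: lambda y: f(g(y))(x)

times = lambda x: lambda y: x * y
minus = lambda x: lambda y: x - y

t1   = S_(times)(I)(I)        # (t1: S' * I I)
poly = C_(minus)(t1)(9)       # (poly: C' - t1 9)

print(poly(5))                # (outp: poly 5)  ->  16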

Execution in the hardware

A Cobweb machine processor was, in fact, an extremely simple block of hardware. Indeed, the aim was for it to be so small that it would be feasible to build a vast array of them on a single piece of silicon, and for the network to function as a highly parallel multiprocessor computer.

The processors would be connected simply to their physical neighbours, passing data as discrete packets. (4- or 8-way connectivity could be used, though 6-way connectivity between hexagonal processors was generally assumed.) Packets destined for processors further afield would thus be communicated, in effect, over a packet-switched network.

Each packet would consist of up to six fields, with each field able to hold anything that can traditionally be represented in a word of memory (an integer, a floating-point number, an address or a primitive instruction). (Since combinator code binds from left to right, strings of any length are merely split up into several packets and linked together by addresses.)

The compiled code that was given earlier would be injected into the array as three packets:

 (poly: C' - t1 9 ;)
 (t1: S' * I I ;)
 (poly 5 ; outp:)
(Notice how the compiler has rotated the last packet left by one field, having recognised that outp: is not an address but a memory-mapped I/O primitive. The semicolon is merely an internal marker, and does not represent an extra field. Moreover, this internal pointer is the mechanism by which the information content of the centralised program counter of a traditional von Neumann processor becomes fragmented and distributed among the concurrent, asynchronous processors.)
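
Purely as an illustration, such a packet might be pictured in software as follows; the field names, the use of a trailing colon to mark definitions, and the representation of the internal marker are my own, not details taken from the Cobweb design.

# Illustrative packet representation: up to six fields, each holding a number,
# a primitive (here just a string), or the address of another packet.  The
# internal ';' marker is represented by recording how many fields precede it.

from dataclasses import dataclass

@dataclass
class Packet:
    fields: list        # at most six fields
    marker: int         # number of fields before the internal ';' marker

    def __post_init__(self):
        assert len(self.fields) <= 6, "a packet holds at most six fields"

# The three packets injected for the example program:
poly_def = Packet(["poly:", "C'", "-", "t1", 9], marker=5)
t1_def   = Packet(["t1:", "S'", "*", "I", "I"], marker=5)
app      = Packet(["poly", 5, "outp:"], marker=2)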

Each packet would be injected into a processor in the network, which would then pass it on to an appropriate neighbour. In order to achieve this, the addresses would need to be organised systematically. For reasons discussed later, Cartesian coordinates are not feasible; instead, a spiral of working processors (a Catt spiral) would configure itself from the centre of the array, with radial connections across the layers of the spiral (hence the reference to cobwebs) as faster ways of travelling from one end of the spiral to the other. The processors would then simply be allocated addresses according to their linear position on the spiral, and routing would be achieved by comparing the address in the first field of the incoming packet with the address of the current processor.
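
The routing decision itself might be sketched as follows; the neighbour bookkeeping, and the use of plain integers for spiral addresses, are invented for illustration, and a real implementation would exploit the radial short-cuts more systematically.

# Illustrative routing step: each processor knows its own position on the Catt
# spiral and the positions of its immediate (spiral and radial) neighbours.  A
# packet is delivered locally if it is addressed to this processor; otherwise
# it is forwarded to whichever neighbour lies closest to the destination.

def route(own_address, neighbour_addresses, packet_fields):
    destination = packet_fields[0]       # the first field holds the target address
    if destination == own_address:
        return None                      # deliver locally (store, or match and process)
    return min(neighbour_addresses, key=lambda n: abs(n - destination))

# e.g. a processor at spiral position 41, with two spiral and two radial neighbours:
print(route(41, [40, 42, 7, 75], [58, 9]))   # -> 42 (the hop towards 58)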

Thus, in the above example, the first packet would be routed for storage in the processor whose address is poly, and the second to the processor whose address is t1. The third packet would also be routed to the processor whose address is poly. On arrival there, that processor would discover that it was already holding a packet in storage, and the two matching packets would be routed to the processor's ALU for processing.

In the case of matching packets, the ALU simply removes the first field of the application packet (the one starting with poly) and inserts in its place the body fields of the definition packet (the one starting with poly:). This implements the construct of function application, even of higher-order functions, using the simple juxtaposition of the function followed by its arguments. The definition packet would then be destroyed if its garbage counter reached zero as a result of this match.

 (C' - t1 9 5 ; outp:)
would emerge from the ALU.
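
This match step might be sketched as follows, working directly on lists of fields and ignoring, for brevity, the six-field limit, the internal marker and the garbage counter:

# Sketch of the match step: the first field of the application packet (the
# address) is removed, and the body fields of the matching definition packet
# are inserted in its place.

def match(application, definition):
    assert definition[0] == application[0] + ":"    # e.g. "poly" matches "poly:"
    return definition[1:] + application[1:]

application = ["poly", 5, "outp:"]
definition  = ["poly:", "C'", "-", "t1", 9]
print(match(application, definition))
# -> ["C'", '-', 't1', 9, 5, 'outp:']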

Packets emerging from the ALU are treated just like packets arriving from a neighbouring processor: a decision is made where to route them next. In this case, with a primitive operator, C', in the first field, the packet is routed back into the ALU for the operator to be executed.

In the case of executing a C' combinator, the fields of the original packet emerge rearranged on two packets:

 (- t2 9 ; outp:)
 (t2: t1 5 ;)
This behaviour is true of all of the combinators: none of them involve anything more complicated than taking an incoming packet, and redistributing its fields on one or two outgoing packets. The only two minor complications are those of obtaining a unique address, t2 in this case, for the name of the new definition packet, and organising the garbage collection information (which involved a partial-reference-counting mechanism).
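
For example, the C' rewrite above might be sketched as follows; the fresh-address generator is an illustrative assumption, and the internal marker is again ignored:

# Sketch of the C' rewrite: one incoming packet is redistributed over two
# outgoing packets, (C' f g x y ...rest) -> (f t x ...rest) and (t: g y),
# where t is a freshly allocated address.

import itertools
_fresh = itertools.count(2)

def fresh_address():
    return "t%d" % next(_fresh)                     # t2, t3, ... (illustrative)

def reduce_c_prime(fields):
    assert fields[0] == "C'"
    f, g, x, y, rest = fields[1], fields[2], fields[3], fields[4], fields[5:]
    t = fresh_address()
    return [f, t, x] + rest, [t + ":", g, y]

print(reduce_c_prime(["C'", "-", "t1", 9, 5, "outp:"]))
# -> (['-', 't2', 9, 'outp:'], ['t2:', 't1', 5])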

The second of these packets would be routed off to the processor at address t2, while the first, with a primitive operator in its first field, would be routed to the ALU yet again.

Arithmetic operators can only be executed, though, if both of their arguments are numerical. The ALU behaves differently if one or both arguments have not yet been evaluated: it rotates the fields to the left again:

 (t2 9 ; outp: -)
and so the emerging packet is routed off to t2, too. When it arrives, it matches the previous packet, and results in the generation of a single new packet (which has been rotated right again following the match):
 (- t1 5 9 ; outp:)
Again, this is routed through to the ALU (albeit in a different processor this time) because of the primitive operator in the first field, but there is still an unevaluated argument.
 (t1 5 9 ; outp: -)
This packet is, of course, routed off to the processor at address t1, where it is matched with the earlier packet. But instead of generating:
 (- S' * I I 5 9 ; outp:)
which would be too long to fit into a six-field packet, the ALU leaves the fields split across two packets:
 (t3 9 ; outp: -)
 (S' * I I 5 ; t3:)
And so the execution continues.

In fact, there is much scope for optimisation that the compiler could have applied to the original object code. Wherever an I combinator appears, an optimisation can probably be applied; the string (S' * I I), for example, can be optimised to (W *).
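
That particular optimisation is easy to check, assuming the usual definitions W f x = f x x and S' f g h x = f (g x) (h x):

# Check that (S' * I I) and (W *) compute the same function.

I  = lambda x: x
W  = lambda f: lambda x: f(x)(x)
S_ = lambda f: lambda g: lambda h: lambda x: f(g(x))(h(x))

times = lambda x: lambda y: x * y

for n in range(10):
    assert S_(times)(I)(I)(n) == W(times)(n) == n * n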

This gives a hint of the strengths and weaknesses of the machine. On the positive side, parallelism is extracted automatically by the hardware, with no need for the programmer to analyse where it can safely be introduced explicitly; and all of the operations are extremely simple, and easy to implement with very little hardware. On the negative side, the operations are indeed very simple, and program execution involves long chains of them (even with the optimised code), with much communication over the network of processors between instruction executions.

Why bother with such an idea?

  • Partly out of academic interest, to show that alternatives to the von Neumann way of organising a computation are possible;
  • Partly out of philosophical interest, as a tool in the search for a unifying theory of quantum gravity;
  • Partly in the hope of harnessing the enormous benefits of parallelism in the software that research into functional programming was promising;
  • Partly in the hope of harnessing the enormous benefits of parallelism in the hardware that research into wafer scale integration (WSI) was promising.

Wafer Scale Integration (WSI)

The usual approach for making a multiprocessor computer is to design the processor as an integrated circuit, and to replicate it across the area of a silicon wafer. Then, the individual processors are tested, and the faulty ones are marked with ink. Then the wafer is diced into its individual integrated circuit chips, and the unmarked (working) ones are put into packages. The packaged processor chips are then each soldered on to printed circuit boards, in our case to form a multiprocessor computer.

So, why go to all the trouble of separating the processors on the wafer, if the intention is to solder them back together again on the printed circuit board? Why not just package the complete silicon wafer as a multiprocessor computer?

This is one reason for the attraction of wafer scale integration (though there are also several others).

One of the problems is the random positions of the faulty processors on the wafer. The wafer starts by being laid out as a regular grid of processors, but this is not what is delivered to the user. The Catt spiral technique allows a linear chain of working processors to be "harvested" from the irregular array, but still does not generate a regular 2D grid. Thus conventional, imperative programming languages (many of which had already been enhanced to include matrix operations to be executed on large array-processors, like DAP-Fortran) appeared not to be the answer.

Functional programming was being researched as a means of managing the complexity of unpredictably shaped program execution graphs on single-processor computers. It seemed that it also offered the best way of mapping these graphs on to unpredictably shaped layouts of multiple processors.

Why stop?

The aims of these two directions in research were eventually achieved, but not in the way that their researchers had originally intended. The lessons were instead taken on board by the designers of conventional computing systems.

  • Imperative programming languages absorbed declarative language concepts, such as object-orientated message passing.
  • The von Neumann style of instruction execution adopted pipelining and very long instruction words (VLIW), thereby giving concurrency in execution, and out-of-order instruction execution, thereby giving data-driven parallelism.

As a result, the industry could gain the advantages of the new research without the disadvantage of having to change too far from the old ways of working.

  • Functional programming software
    • Faster instruction execution can still be bought by increasing the clock speed.
    • The von Neumann way of organising a computation is extremely versatile, able even to incorporate the main out-of-order-execution advantages of dataflow and functional reduction processors. The arguments of Hennessy and Patterson (1990) are certainly compelling, and unavoidable in the context of the real world.
  • Wafer scale integration hardware
    • Increased transistor counts can still be bought by decreasing the line width.
    • Wafer sizes continue to evolve faster than chip areas, so it continues to be more economic to dice the wafers than to find technical solutions for keeping the processors together on the wafer.
© Malcolm Shute, Valley d'Aigues Research, 2005-2023