| Opcode | Instruction | Operation |
|---|---|---|
| 0: | JMS adrs | Jump (indirect) to subroutine |
| 1: | STR adrs | Store ACC at (Adrs) |
| 2: | IND adrs | Index on (Adrs) |
| 3: | BIC adrs | Bit-clear (Adrs) into ACC |
| 4: | ROR adrs | Load right-rotation of (Adrs) into ACC |
| 5: | COM adrs | Load two's complement of (Adrs) into ACC |
| 6: | ADD adrs | Add (Adrs) into ACC |
| 7: | DCS adrs | Decrement ACC and jump to Adrs if not zero |
For the extended MAL2 design, one exploration into expanding the arithmetic and logic operations of the instruction set involved a data-path that would take in two integers, A and B, and deliver the result of FA(g1.A, op(A,B), g2), where FA is a full adder, g1 is a simple AND-gated input for argument A, g2 is a carry-in bit, and op(A,B) is one of the sixteen bitwise operations on the two integers A and B. One bit of the opcode chooses between a bitwise logic operation (opc4=0) and an arithmetic operation (opc4=1). For the former, g1 and g2 are held at 0, and the four bits opc3, opc2, opc1 and opc0 are read as a truth table for the bitwise logic operation. This truth table is fed into a bank of four-to-one multiplexers, one for each bit position of the result, with the two control bits of each multiplexer supplied by the corresponding bits of the A and B inputs.
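As a software sketch of this data-path (illustrative only; the original is a multiplexer bank and a full adder in hardware, not a program), a four-bit truth-table label from the tables below is read with its leftmost bit giving the result for input bits (1,1) and its rightmost for (0,0):

```python
# A minimal sketch of the proposed data-path. The word width and the
# use of label strings are assumptions made for illustration.

WIDTH = 8  # word width is an assumption of this sketch

def bitwise_op(tt, a, b, width=WIDTH):
    """Bank of 4-to-1 multiplexers, one per bit, selected by (a, b)."""
    r = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if tt[3 - (2 * ai + bi)] == '1':
            r |= 1 << i
    return r

def datapath(tt, g1, g2, a, b, width=WIDTH):
    """FA(g1.A, op(A, B), g2), truncated to the word width."""
    return ((g1 * a) + bitwise_op(tt, a, b, width) + g2) & ((1 << width) - 1)

# Logic mode: g1 = g2 = 0.
assert bitwise_op('1000', 0b1100, 0b1010) == 0b1000   # AND
assert bitwise_op('0110', 0b1100, 0b1010) == 0b0110   # XOR
# Arithmetic mode: g1 = 1, forced table LDB ("1010"), g2 = 0 is ADD A,B.
assert datapath('1010', 1, 0, 7, 5) == 12
```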
| Code | Op | Code | Op | Code | Op | Code | Op | Code | Op | Code | Op | Code | Op | Code | Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0000: | CLR | 1000: | AND | 0100: | RBIC | 1100: | LDA | 0010: | BIC | 1010: | LDB | 0110: | XOR | 1110: | OR |
| 0001: | NOR | 1001: | XNOR | 0101: | NOTB | 1101: | NBIC | 0011: | NOTA | 1011: | NRBC | 0111: | NAND | 1111: | M1 |
For the arithmetic operations (opc4=1), g1 is set to 1, g2 can be one of "0, 1, c, msb(A)", and the truth table is artificially forced to one of "Z0, LDA, LDB, NOTB, NOTA, M1" (where Z0 is a synonym for CLR). The forced truth-table bits are given by ftt3 = XOR(opc3, AND(opc2, opc1)), ftt2 = XOR(opc3, opc2), ftt1 = XOR(opc3, opc1) and ftt0 = XOR(opc3, OR(opc2, opc1)). (Concatenated together, these four truth tables form a symmetrical 4x4 matrix, which is surprising given the somewhat ad hoc way in which the elements of the matrix were added.) The opc0 bit is used to select the contents of the carry-latch, and the "msb" setting is used as a cheap-to-implement minor adjustment to convert an otherwise duplicate instruction into something more useful. That leaves one remaining duplicate instruction, "LD A" (already provided by "LDA"), and a desire to find a place for a "ROL A" instruction, "FA(A,LDA,msb)" (though a case could be made for this replacing the "SHLC A" instruction).
| Code | Op | Data-path | Code | Op | Data-path | Code | Op | Data-path | Code | Op | Data-path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0000: | LD A | FA(A,Z0,0) | 0001: | INCC A | FA(A,Z0,c) | 0010: | SHL A | FA(A,LDA,0) | 0011: | SHLC A | FA(A,LDA,c) |
| 0100: | ADD A,B | FA(A,LDB,0) | 0101: | ADC A,B | FA(A,LDB,c) | 0110: | DEC A | FA(A,M1,0) | 0111: | DECB A | FA(A,M1,c) |
| 1000: | SETC | FA(A,M1,1) | 1001: | DECP A | FA(A,M1,msb) | 1010: | SGN A | FA(A,NOTA,msb) | 1011: | LDC | FA(A,NOTA,c) |
| 1100: | SUB A,B | FA(A,NOTB,1) | 1101: | SBB A,B | FA(A,NOTB,c) | 1110: | INC A | FA(A,Z0,1) | 1111: | INCN A | FA(A,Z0,msb) |
Overall, this leads to: tt3 = XOR(opc3, AND(opc4, opc2, opc1)); tt2 = XOR(AND(opc4, opc3), opc2); tt1 = XOR(AND(opc4, opc3), opc1); tt0 = IF opc4 THEN XOR(opc3, OR(opc2, opc1)) ELSE opc0; and g2 = AND(opc4, OR(AND(opc0, IF AND(opc3, XNOR(opc2, opc1)) THEN msb ELSE c), AND(NOT opc0, IF AND(opc3, NOT opc2, opc1) THEN msb ELSE opc3))).
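A minimal sketch, transcribing these decode equations directly (all arguments are single bits; c is the carry latch and msb the top bit of A; the function name is invented here):

```python
def decode(opc4, opc3, opc2, opc1, opc0, c, msb):
    """Derive the truth-table bits tt3..tt0, the A gate g1, and the
    carry-in g2, following the decode equations in the text."""
    tt3 = opc3 ^ (opc4 & opc2 & opc1)
    tt2 = (opc4 & opc3) ^ opc2
    tt1 = (opc4 & opc3) ^ opc1
    tt0 = opc3 ^ (opc2 | opc1) if opc4 else opc0
    if not opc4:
        g2 = 0                 # logic operations hold g2 at 0
    elif opc0:
        # carry latch selected, except that opc3 with opc2 == opc1
        # substitutes msb (the DECP A and INCN A instructions)
        g2 = msb if opc3 and opc2 == opc1 else c
    else:
        # otherwise g2 comes from opc3, except that 1010 (SGN A)
        # substitutes msb
        g2 = msb if opc3 and not opc2 and opc1 else opc3
    g1 = opc4                  # A is gated in only for arithmetic
    return (tt3, tt2, tt1, tt0), g1, g2
```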
One might ask whether such attempts at minimalism are still relevant today, in an era of CISCs with floating-point instruction sets, highly optimised RISCs, and pipelined, eagerly executing instruction streams. But there are many applications where the processing is not the primary function. For instance, a memory chip or communications chip might be given rudimentary processing ability to get on with simple housekeeping tasks, without taking too much of the silicon real estate away from the primary function.
The goal of the instruction set of the Cobweb machine was different (Shute, 1992). This machine was designed to be completely non-von-Neumann.
SKIM had shown that it was possible to conceive of a machine that executed combinators as its primitive instruction set (Curry and Feys, 1958). But that machine was emulated in programming code, or at best in microcode, on a conventional, von Neumann-style computer. Similarly, the Manchester University Dataflow Machine (as described, for example, in Duce, 1984) was emulated using a conventional von Neumann-style program.
But the beauty of the S, K and I combinators (along with B, C, W, B', C' and K') is that each involves only extremely simple hardware operations (even after allowing for the needs of the garbage-collection mechanism).
Functional languages can be readily compiled down to variable-free Curry combinator expressions (Curry and Feys, 1958), which can then be reduced by a network of asynchronous processing nodes, each applying local rewrite rules. Combinators eliminate the alpha-reduction (argument-substitution) phase of lambda calculus evaluation, explicitly routing the arguments to their correct positions in the input string. In effect, the alpha reduction is moved partly into the beta-reduction (expression-execution) process, and partly into the compiler (which has to parse the source text and annotate it with the correct combinators, thereby implementing a static call-by-value mechanism). This compilation process is graphically illustrated by the Λ, \ and / combinators (or their synonyms S’, B’ and C’, or S(1), B(1) and C(1), or S1, B1 and C1) that are readily generated from a conventional parse tree (Turner, for example as reported in Sleep and Kennaway, 1984):
S’ = λ w x y z . w (x z) (y z)
B’ = λ w x y z . w x (y z)
C’ = λ w x y z . w (x z) y
Ultimately, even these expressions can be compiled further, on down to just two universal combinators, such as S and K.
K = λ x y . x
S = λ x y z . x z (y z)
Taken to the extreme, this would involve the adoption of Peano arithmetic (for example, using K(SKK) for "zero", and S(S(KS)K) for "successor"). To the engineer, though, it is more practical to include conventional binary integers (and perhaps floating-point and complex numbers) in the usual way within the instruction set.
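To make these rewrite rules concrete, the following minimal sketch (Python chosen purely for illustration; the term representation is an assumption) performs head-first S/K/I reduction on terms held as nested application pairs, confirming, for instance, that S K K behaves as the identity combinator:

```python
# A minimal sketch of S/K/I reduction. An application "f x" is
# represented by the Python tuple (f, x).

def rebuild(head, args):
    """Re-apply head to a list of arguments, left to right."""
    for a in args:
        head = (head, a)
    return head

def reduce_step(t):
    """Try one rewrite at the head of the term's application spine."""
    args = []
    while isinstance(t, tuple):      # unwind the spine
        t, a = t
        args.append(a)
    args.reverse()
    if t == 'I' and len(args) >= 1:  # I x     -> x
        return rebuild(args[0], args[1:]), True
    if t == 'K' and len(args) >= 2:  # K x y   -> x
        return rebuild(args[0], args[2:]), True
    if t == 'S' and len(args) >= 3:  # S x y z -> x z (y z)
        x, y, z = args[:3]
        return rebuild(((x, z), (y, z)), args[3:]), True
    return rebuild(t, args), False

def reduce_head(t):
    """Repeat head rewrites until no rule applies."""
    changed = True
    while changed:
        t, changed = reduce_step(t)
    return t

# S K K x reduces to x, so S K K is the identity combinator I.
assert reduce_head(((('S', 'K'), 'K'), 'x')) == 'x'
```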
Cobweb, therefore, was given an instruction set whose assembly language was itself a functional programming language, not unlike a simplified version of Lisp.
As well as the combinators, primitives were also included for operations on integers (ADD, SUB, MULT, DIV, etc.). A toy program to evaluate a simple polynomial, for example, might be written in Sugar (Glaser, Hankin and Till, 1984) as:
poly = [x] x*x - 9; poly(5) ?

and would compile to Cobweb code as:
(poly: C' - (t1: S' * I I) 9) (output: poly 5)

or, rather, as:
(poly: C' - t1 9) (t1: S' * I I) (output: poly 5)
A Cobweb machine processor was, in fact, an extremely simple block of hardware. Indeed, the aim was for it to be so small that it would be feasible to build a vast array of them on a single piece of silicon, and for the network to function as a highly parallel multiprocessor computer.
The Cobweb machine processors would be connected simply to their physical neighbours, and would pass data as asynchronous packets. Each packet would consist of up to six fields, each able to hold anything that can traditionally be represented in a word of memory (such as an integer, an address or an in-built instruction), typically in the order "(id, opcode, operand1, operand2)".
Straying from the pure mathematics of lambda calculus into the realm of engineering expediency, the compiled code given earlier would be injected into the processor array as the following three packets. The last packet has been skewed left by one field by the compiler, which has recognised that "output:" is, in fact, not an address but a memory-mapped I/O primitive. This skewing is done so that the next action to be performed on each packet appears in the first physical field, even if this is not the first logical field of the packet. The semicolon itself is not implemented as a physical field, but merely as a 3-bit pointer in the metadata of the packet.
(poly: C' - t1 9 ;) (t1: S' * I I ;) (poly 5 ; output:)
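As a minimal sketch (assuming, purely for illustration, that a packet's fields are held as a Python list, with the semicolon kept in metadata rather than as a field), skewing is just a rotation:

```python
def skew(fields, n=1):
    """Rotate left by n, bringing the next field to act on to the front."""
    return fields[n:] + fields[:n]

def unskew(fields, n=1):
    """Undo a left rotation of n fields."""
    return fields[-n:] + fields[:-n]

logical = ['output:', 'poly', '5']   # logical field order
physical = skew(logical)             # as injected: ['poly', '5', 'output:']
assert unskew(physical) == logical
```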
On receiving a packet, the processor passes it on to an appropriate neighbour, via an all-to-all routing switch (a crossbar exchange). The processors would already have allocated themselves addresses according to their linear position on the spiral of the cobweb topology, and packet routing would simply entail comparing the magnitude of the address in the first field of the incoming packet to the magnitude of the address of the current processor. Thus, in the above example, the first packet would be routed for storage in the processor whose address is “poly”, and the second to the processor whose address is “t1”. The third packet would also be routed to the processor whose address is “poly”. On arrival there, that processor would discover that it was already holding a packet in storage, and the two matching packets would be routed to the processor's ALU for processing.
The ALU would simply remove the first field of the application packet (the one starting with “poly”) and insert in its place the remaining fields of the definition packet (the one starting with “poly:”). This implements function application as the simple juxtaposition of the function followed by its arguments. The definition packet would then be destroyed if its reference count reached zero as a result of this match. The following packet would therefore emerge from the ALU:
(C' - t1 9 5 ; output:)
Packets emerging from the ALU are treated by the processor’s crossbar exchange just like packets arriving from a neighbouring processor, with a decision being made where to route them next. In this case, with a primitive operator, C', in the first field, the packet is routed back into the ALU for the operator to be executed, and the fields of the original packet emerge, rearranged on two packets:
(- t2 9 ; output:) (t2: t1 5 ;)
Each of the other combinators is handled in a similar way; none of them involve anything more than taking an incoming packet, and redistributing its fields on one or two outgoing packets. The only two complications are those of obtaining a unique free address, t2 in this case, for the name of the new definition packet, and organising for the partial-reference-counting garbage collection.
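For example, the C' step just shown can be sketched as follows (packet representation as in the earlier sketch, with the fresh address supplied from outside and the skew metadata omitted):

```python
def rewrite_c_prime(fields, rest, fresh):
    """C' w x y z -> w t y, with a new definition packet (t: x z).

    fields is [w, x, y, z], the four fields following C'; rest is the
    packet's trailing fields; fresh is a newly allocated address.
    """
    w, x, y, z = fields
    result = [w, fresh, y] + rest     # e.g. (- t2 9 ; output:)
    definition = [fresh + ':', x, z]  # e.g. (t2: t1 5 ;)
    return result, definition

# The worked example from the text:
print(rewrite_c_prime(['-', 't1', '9', '5'], ['output:'], 't2'))
```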
The second of these packets would be routed off to the processor at address “t2”, while the first, with a primitive operator in its first field, would be routed to the ALU again. Arithmetic operations, being strict, can only be executed once both of their arguments are numerical. To achieve this, the packet-skewing mechanism that was mentioned earlier is used again:
(t2 9 ; output: -)
The emerging packet would therefore also be routed off to “t2”. When it arrives, it matches the previous packet; the match generates a single new packet, unskewed again:
(- t1 5 9 ; output:)

Again, this is routed through to the ALU (albeit in a different processor this time) because of the primitive operator in the first field, but there is still an unevaluated argument, so the packet is skewed once more:
(t1 5 9 ; output: -)

This packet is, of course, routed off to the processor at address t1, where it is matched with the earlier packet. But instead of generating:
(- S' * I I 5 9 ; output:)

which would be too long to fit into a six-field packet, the ALU leaves the fields split across two packets:
(t3 9 ; output: -) (S' * I I 5 ; t3:)
And so on. Overall, program execution is non-von-Neumann, inasmuch as it executes functional programs directly in the hardware, avoiding the use of a centralised program counter. In effect, the skewing mechanism on each packet takes on the role of the program counter, but in a way that is distributed across the array of packets.
In the case of arithmetic operations, there is scope for a demand-driven, eager evaluation of both of the arguments in parallel. For example:
(* t4 t9 ; output:)
could generate two packets:
(t4 t5 ; output: *) (t9 ; t5:)
As an engineering bonus, if there happened to be a shortage of free tags (“t5” in this case), the processor could inhibit this mechanism and simply generate the single packet “(t4 t9 ; output: *)”. This would have the effect of throttling back the amount of parallelism being extracted whenever the hardware was becoming saturated, approaching the capacity of the multiprocessor computer.
As a further engineering tweak, in special cases, such as multiplication in which one of the values (t4, say) evaluates quickly and turns out to be 0, it would be possible to cut short the wait for the evaluation of the other parameter by unskewing the top packet immediately. A new packet would also be created, with the sole purpose of awaiting the return of the value of “t5” and garbage-collecting it immediately; the computational resources would still have been wasted on evaluating the second argument, but at least its execution time would not have further delayed the computation. (A further engineering tweak would have been needed to allow the right argument to behave similarly.)
There could be speed advantages, too, if a mechanism could be found for overriding the usual weak head normal form (WHNF) when evaluating Curried functions, allowing eager evaluation as the parameters become available. For example:
quadroot1 = λ a b c . (-b+sqrt(b*b-4*a*c)) / (2*a); mycase = quadroot1(2,3);
would ideally allow “mycase” to reduce to a single parameter function:
mycase = λ c . (-3+sqrt(9-8*c)) / 4;
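In Python terms (purely illustrative, and written here in fully Curried form), the ordinary Curried function delays all arithmetic until the last argument arrives, whereas the hoped-for mechanism would specialise eagerly:

```python
from math import sqrt

# Ordinary Curried form: no arithmetic happens until c is supplied.
quadroot1 = lambda a: lambda b: lambda c: (-b + sqrt(b*b - 4*a*c)) / (2*a)
mycase = quadroot1(2)(3)          # still just a closure over a = 2, b = 3

# What eager partial evaluation would ideally reduce it to:
mycase_specialised = lambda c: (-3 + sqrt(9 - 8*c)) / 4

assert mycase(0.5) == mycase_specialised(0.5)
```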
All of the operations were to be extremely simple, and easy to implement directly in hardware. Unfortunately, this simplicity also meant that program execution would involve long chains of them, with much communication over the network of processors. The research discussion at the time (for example, Hockney and Jesshope, 1988) seemed to kill the idea of Cobweb being of any practical value: worse than being limited by Amdahl’s Law, the gains from parallelism would peak at some small number, regardless of how many hundreds of processors were actually provided.
If the design were to be limited to a small number of processors, each processor would then have to be equipped with enough storage to allow the overall machine to accommodate the peak number of packets during the program reduction. This can be achieved by including a memory array (one packet wide by N packets long) for storing the peak number of packets. In this way, even a single-processor version of Cobweb would still have the distinction of being a computer that executes functional-language programs in a non-von-Neumann way.
The possible lessons to be learned for the use of functional languages and Curry combinators in quantum hardware were also considered.
The usual approach to making a multiprocessor computer is to design the processor as an integrated circuit and replicate it across the area of a silicon wafer. The individual processors are then tested, and the faulty ones marked with ink. The wafer is then diced into its individual integrated-circuit chips, and the unmarked (working) ones are put into packages. The packaged processor chips are then each soldered on to printed circuit boards, in our case to form a multiprocessor computer.
So, why go to all the trouble of separating the processors on the wafer, if the intention is to solder them back together again on the printed circuit board? Why not just package the complete silicon wafer as a multiprocessor computer?
This is one reason for the attraction of wafer scale integration (though there are also several others).
One of the problems is the random positions of the faulty processors on the wafer. The wafer starts by being laid out as a regular grid of processors, but this is not what is delivered to the user. The Catt spiral technique allows a linear chain of working processors to be "harvested" from the irregular array, but still does not generate a regular 2D grid. Thus conventional, imperative programming languages (many of which had already been enhanced to include matrix operations to be executed on large array-processors, like DAP-Fortran) appeared not to be the answer.
Functional programming was being researched as a means of managing the complexity of mapping unpredictably shaped program execution graphs on to single-processor computers. It seemed that it also offered the best way of mapping these graphs on to unpredictably shaped layouts of multiple processors.
The aims of these two directions in research were eventually achieved, but not in the way that their researchers had originally intended. The lessons were instead taken on board by the designers of conventional computing systems.
As a result, the industry could gain the advantages of the new research without the disadvantage of having to change too far from the old ways of working.