From Natural Genetics to Genetic Computation

Some of the diverse subjects that are evoked by the title "Genetic design" include:

Genetic engineering and conventional breeding for use in:

Raising farm livestock and crops for food
Manufacturing materials, including for pharmaceuticals, components for material devices, and biofuels
Raising pedigree show animals
Observing human pedigree, viz. genealogy

DNA computing
Evolutionary computing systems such as:

Genetic algorithms
Genetic programming
Classifier systems
Simulated annealing
Boltzmann machines
Artificial neural nets

The links above each lead to discussions that are set in the context of information technology. This is the justification for grouping such a diverse list of subjects together on this single web page.

Genetic Engineering and Conventional Breeding

At one level of analysis, there is little difference between genetic engineering and conventional breeding. They both involve mixing natural DNA, with a view to producing better progeny: combining the desirable traits of the parent stocks, while excluding the undesirable traits. At another level of analysis, though, there is a whole world of difference, inasmuch that conventional breeding uses nature's established machinery, using genetic material borne by members of the same species; and is, itself, a technique that has been tried and tested by humanity for millennia.

That said, genetic engineering uses techniques that are also part of nature's established machinery inasmuch that viruses are able to carry genetic material between species. Man is merely harnessing this technique, but at the risks of creating unintentional havoc at a far greater rate than conventional breeding has, over the millennia.

As well as producing fuel for humans (the traditional food industry), agriculture has always been involved, too, in producing fuel for homes, machines and industry. This line, though, is becoming more and more significant as our sources of fossil fuels, the low-hanging fruit anyway, start to run out.

Just a quick note about Lamarckism: genetics is inevitably talking about the transfer of information, and whether Lamarckism occurs or not amounts to establishing whether an information channel can be shown to exist. Past supporters of the idea have assumed, maybe implicitly (especially in the days before DNA was discovered) that such a channel is biochemical, and within the body of the individual organism. However, there might be evidence that it occurs, all the time, via the environment, and that it goes hand in hand with natural selection (NS, 17-Jan-2015, p26).

Genealogy and its Computerisation

KinCode was invented by Knud Højrup for his Knot System, and is a notation for navigating round a family tree; in particular, for representing the routing information for climbing up two sides of a family tree, to an ancestor who is common to both sides. In this way, the notation succinctly represents all the blood relationships, such as father, child, great-uncle, etc, in a way that can be manipulated mathematically.

Starting from self=1, this system numbers the nodes in the pedigree tree consecutively, with father=2, mother=3, paternal-grandfather=4, paternal-grandmother=5, maternal-grandfather=6, maternal-grandmother=7, and so on. When expressed in base-2, binary notation, these integers give a very succinct representation of the path up the pedigree tree.

Blood relations are then expressed by a pair of these integers: one to represent the relationship of the nearest common ancestor to the first person, and the other for the second.

All of my great⁵-grandfathers, for example, are represented, from my perspective, by even numbers in the range from 128 to 254 (and their wives by the corresponding odd numbers), but from their own perspective by the number 0 (male) or 1 (female). In this way, the (partial) list of my great⁵-grandparents also gives their mapping in my pedigree tree: Shute (128), Spearing (130), Hannoun (136), Harris (144), Scammel (152), Young (156), Pretty (160), Trapp (161), Blake (168), Highet (169), Lane (170), Home (172), Fleming (176), Ricketts (177), (Uncles), Goulden (178), Shergold (180), Dowty (182), Musselwhite (184), ([=Mussell, Potticary, Whitaker, Graunt, White, Fortie, Ivie, Iver], Hibbard, Ford, Eyrea, Coney), Carpenter (185), Goodfellow (186), (Harvy), Dominick (187), (Baker), Webb (188), Cooper (190), Mullins (192), Farthing (194), Sprackling (196), Jukes (208), Forward (209), [Warford], Hill (210), White (211), [Tripp], Read (212), Horskins (213), Finney (214), Thomas (216), Curtis (224), Warren (232), Myall (240). (A large proportion of these all living in a 3:1 Somerset-Dorset-Wiltshire Lozenge traced out from: Wincanton, North Cheriton, Stoke Trister, Bayford, Shute's Lane, Lox Lane Farm, Shaftesbury, Ludwell (source of the River Nadder), Wardour, the Stubbles in Tisbury, Quidhampton, Salisbury Cathedral, Wilton, Barford St Martin, Baverstock, Dinton, Chilmark, Hindon, Mere, Bourton.)

Various subsets of this notation allow different operations to be formed, in particular, automatically on the computer that stores the family tree.

An Algebra for Genealogy
An Algebra for Direct Male or Direct Female Descent
An Algebra for Calculating Relationship Coefficients
Narrowing down the Year Ranges

An Algebra for Genealogy

In the KinCode notation, the relationship 6.4 (110.100 in binary) represents one particular type of cousin, and encodes simply as 2y2 in my own notation. In general, the relationship A.B in the former encodes as MyN or MxN in the latter, where M is the integer floor of lg(A) (logarithm to base-2 of A), and N is the integer floor of lg(B). Originally, both notations were developed so as to provide a succinct and consistent way of abbreviating blood-relationships on documents, such as family trees. As such, they would also serve as a useful intermediate representation for translating such documents from one language to another. Using the notation, too, a number of mathematical operations are possible, such as recomputing relationships from different viewpoints.

An Algebra for Direct Male or Direct Female Descent

Another subset of KinCode notation allows for the ready extraction of the lines of direct male descent, and direct female descent. The notation, therefore, could be useful to automated systems designed to extract information from databases pertaining to historical land ownership, museum exhibits that have survived as family heirlooms, etc. Equally, it could be of use to people who study surnames.

With genetic finger-printing, genetic demographers, interested in ancestry and the movements of ancient tribes across Europe, and across the planet, are precisely limited to the lines of direct male descent (via the information in the Y chromosome) and direct female descent (via the information in the mitochondrial DNA).

An Algebra for Calculating an Index of Relatedness

Another subset of KinCode notation allows for the ready extraction of relationship coefficients, and inbreeding coefficients. Applications for this might be medical, to calculate the probability of inheriting certain diseases from previous generations. There is also potential for application in demography (putting a value on the proportion of genes coming from different ancestors spread across Europe, and the world).

The system could be used for non-humans, too. Dog, cat and horse breeders, for example, wishing to put a precise percentage on the pedigree. Similarly, for breeding farm livestock and arable crops.

There could be a legal application, too. (If uncle and niece cannot marry, but cousins can, that defines the threshold value for incest, and these coefficients could be used to determine on which side of the line other relationships fall).

Geneticists need this sort of information, too, to ascertain the degree of evolutionary pressure, as described, for example, by Richard Dawkins in The Selfish Gene. The index of relatedness is given as a/2^g, where 'a' is the number of common ancestors that the two given indivuals share, and 'g' is the genetic distance between the two given individuals. If the relationship is expressed in MzN notation, g=M+N and a=if((M=0)OR(N=0),1,2). As Dawkins notes, though, the simple relationship breaks down in the presence of incest or inbreeding anywhere in the intervening family.

Narrowing down the Year Ranges

There is certainly a great deal of other information that can be extracted from family tree databases, such as that on narrowing down the year ranges when searching for certificates in the public records office.

Evolutionary Computing Systems

Evolutionary computing systems have very little to do with biology, and DNA, except that they harness some sort of "survival of the fittest" mechanism.

As an illustrative example of how genetic algorithms work, consider the process of designing aircraft wings.

Before rushing off to the model shop, and then to the wind tunnel, the first step is to simulate the designs on the computer. This, in turn, implies that we must have some method of describing an aircraft wing to the computer, using a computer description language (as opposed to a computer programming language). At its simplest, the description might just consist of a long list of 3D coordinates, tracing out the shape of the wing; or it might consist of a long list of parameters, expressed as angles and lengths of key components of the wing surface.

In all cases, though, the description takes the same overall form: a long list of numerical parameters. Furthermore, the parameters are represented as integers (floating-point notation being just a fixed-point approximation to mathematical real numbers). This long list of integers can then be treated as if it were a chromosome: a genotype whose phenotype is an aircraft wing, when expressed within the environment of the simulator.

Instead of just taking one aircraft wing description, and running it within the simulator, a genetic algorithm computer system might allow for fifty such chromosomes to be stored at a time. The original chromosome is placed in the first position in the "pool", and is copied 49 times, with each copy being given a slight random mutation to one or more of its parameters. Next, the system takes each one in turn, and runs it in the simulator. The simulator takes the wing design through a set of tests, specified by the designer, and rates the performance of the design in each test. These ratings can then be combined as a single "figure of merit" for each design.

At the end of the simulation of the fifty chromosomes, the system might delete the worst 25, and make one slightly mutated copy of each of the best 25 to replace them. Then the simulation cycle is repeated, deriving the "figures of merit" for each of the new chromosomes. As in nature, most of the mutated copies will turn out to be absolute disasters, but occasionally one will just happen to hit on a better design than its parent. At the end of a run of about 200 cycles, or "generations", the chromosome with the highest figure of merit is offered back to the designer for further consideration.

That, in essence, is how genetic algorithms work. The model can be extended, though, to using other mechanisms that are borrowed from natural genetic systems. For instance, cross-over between two parent chromosomes can be used, as well as mutation, as a method for deriving the 25 new designs at each generation.

Any man-made system can be designed in this way. All that is needed is a description language in which the design can be expressed, and a simulator in which the design can be tested. Computer hardware can be designed using computer hardware description languages; and computer software can be designed using a process that is called genetic programming.

Genetic programming relies on the observation that a computer program, too, is just a list of integers (each integer representing an instruction from the processor's instruction set). Thus, a computer program, too, can be treated as a chromosome within a genetic algorithm system.

For example, the simulation part of the cycle might apply each chromosome to the integers 1, 2, 3, 4, 5, 6 in turn, and rate it as to how close it gets to returning the desired results 1, 2, 6, 24, 120, 720. At the end of the 200th generation there is a possibility (nothing is ever certain with evolutionary computing systems) that the best chromosome will have evolved into a "factorial" function.

In effect, the computer programmer has not had to write the desired factorial program, but has merely specified it using a training example (in this case, the list 1, 2, 6, 24, 120, 720). The difficulty, though, is in designing the system for deriving the figure of merit for ranking the chromosomes (the so-called "fitness function") in such a way that subsequent mutations and crossings-over are likely to be steered in the appropriate direction. Importantly, the representation of the chromosome, too, must be chosen carefully, so that mutations are more likely to have a gradual effect on the change that is caused, rather than a dramatic one; this is the concept of brittleness, that also applies to the natural world (NS, 29-Nov-2014, p48). Rather than having the problem of designing the finished product, the programmer has the problem of designing the inverse function: the mould into which the final product fits.

There is no free lunch, therefore. However, there are many problems for which designing the inverse function is easier than designing the function itself. Consequently, genetic programming is a useful tool to add to the toolbox, to be used when other tools are less appropriate.

Genetic algorithms can, in fact, also be used by theoretical biologists. They can be used as a platform for experimenting with mechanisms that might have been, or could have been, at work in the natural world, to understand better how the natural world works. Avida (NS, 07-Aug-2010, p6), and before it, Tierra, for example allow experiments to investigate how the computer equivalent of DNA might evolve once it finds itself in a rich primordial soup. The proposal for the bottle experiment is aimed at the stage before that, to see if self-replicating DNA would evolve spontaneously; and the mathematicians deme network was an experiment for studying the evolution of one small part of human knowledge.

Meanwhile, the techniques of genetic algorithms can be applied back in the real biological world (NS, 25-Jun-2011, p34) with real DNA being used at each cycle of the machine. Although not exactly what is described for the present system, microbes could be pumped round a loop of chambers, with one chamber where they are stressed to provoke mutations, and another chamber to screen for the ones that will be used for breeding in the next cycle.

Another example of a mix of biological and artificial algorithm is found in the use of a chamber containing a population of eukaryote single-cell organisms being used as the non-linear reservoir for a reservoir computing artificial neural net (NS, 98-Oct-2021, p12) where the input is the temperature of the incubation chamber, and the readout is the population count.

Human viral infectious diseases (VID)

In the days since Covid-19 was identified, much more activity has been seen in the pharmaceuticals industry to seek out vaccines and treatments for all human viral infectious diseases. The following tree is compiled mainly from information found on Wikipedia, and based roughly on patient-perceived symtoms (as in, having the pox, the 'flu, or a common cold).