The Strategy Behind Fitness

Research > Genetic Algorithms > Articles > Strategy Behind Fitness

One of the beauties of the Genetic Algorithm is that the general algorithm is the same regardless of its application. The only piece of the GA that is tailored for the given problem is the fitness function. But since the fitness function is the only connection to the given problem, the programmer must be quite careful in designing the function so that the GA can optimize it correctly.

Also, so far I have only discussed the Genetic Algorithm itself without going into any examples. In this article, I will talk about some theoretical problems that you could use a Genetic Algorithm to solve, and how one could design good fitness function for them. Hopefully things start to fit together for you once you see some concrete examples.

Anatomy of a Genome and Its Fitness
Each genome is just a list of integers or binary digits, which must somehow represent a sample solution to your problem. Generally speaking, the longer your genome is, the more creative your solution can be. It is common practice for problems which a sample solution is just a number that you code that number in binary because that has the greatest number of digits. The reason why you want your genome's list to be long has to do with how the genomes mate: you take two genomes and randomly recombine parts to get a new child genome (more on that in the next article). For now, think of lengthening your genome's list as "expanding the search space". (Note: the term "search space" is my own concoction. I have not found an equivalent term in the literature.)

There are two main parts of a fitness function: decoding and evaluation. How you choose to translate your genome into one particular solution to your problem is called the encoding scheme. The part where you take your translated solution and measure how "good" it is we'll call the evaluation process. When you're done evaluating, you give the genome a number called it's fitness.

Exemplar Gratis: A Multi-Variable Equation
This is a classic problem for GAs. The task is that you have an equation which has many variables and you want to find the highest value in a certain domain. We'll observe this 2-variable function graphed below:


Graph of
Notice how there could be several different highest points.
So the problem is to find the x and y values for the highest value of f(x). The encoding scheme here should be obvious: a sample solution to this is just a value for x and a value for y, which maybe you want to encode with 0's and 1's. Let's say that you want each value for x and y to have 3 digits past the decimal point, so that means that in binary your two values for x and y could look like this:

x=10.100111111001110111 (2.6235 in base 10)
y=0.10000100110111010011 (0.5190 in base 10)

It's easy enough to represent this genome as an array of bits in your program, as long as you keep track of where the decimal point is and where the x values stop and the y values start. Both the decoding and the evaluation process for this kind of genome are simple - just convert the values to base 10 and plug them into the f(x,y) and return the value for the fitness.

Optimizing Evaluation
When looking at how a GA works, you can see that the fitness function gets called all the time. In fact, it gets called for each genome, each generation. That can be a whole lot of computation depending on what your problem is. What is worse is that you never know how many generations it will take for the population to converge. It is crucial, therefore, to program your fitness function so that it runs quickly because it is often the bottleneck in the entire program.

Alternative: Marker-Based Encoding

This is more of a cutting-edge technique for encoding genomes. Let me introduce the idea using an example from mathematics: graphs. A graph is a structure with vertices and edges (ie points with connections). Here is an example of what a simple graph might look like:


A graph: circles are vertices and lines are edges
Now you can use these things for tons of practical applications; imagine the vertices being airports and the edges being flights, or vertices being websites and edges being links. There is a whole set of mathematic study which goes into this called Graph Theory.

Suppose your job is to make a graph. You don't want to specify beforehand how many vertices, how many edges, and how those vertices are connected - all of that needs to be randomly generated in the genome. So how do we code up a graph into a list of integers? Consider this scheme:

  • Each genome is a list of 200 integers between 0 and 99
  • A "start marker" is divisible by 11
  • After the start marker, the sequence of integers tells which vertex to connect to
  • A "stop marker" is the next integer divisible by 11.

To decode this genome, you need to pass through it twice. The first pass, look for start and stop markers and count how many vertices you have (call it n). Name each vertex starting at 0. The second pass is where you fill in each edge. Take the edges and evaluate it mod n and put an edge there.

Take a look at how this random sequence of integers might be decoded. Note that n=3. The second line are the names of each vertex.

34 49 55 23 81 44 92 66 19 02 99 13 23 11 82 88 82 15
012

  • The start markers are highlighted in green .
  • The used numbers representing edges are in blue.
  • The stop markers are in red.
  • The unused integers are in grey.

So when you evaluate each number mod 3, you get this:

23 mod 3 = 2
81 mod 3 = 0

19 mod 3 = 1
02 mod 3 = 2

82 mod 3 = 1

In the end, it codes for this graph:


Note: vertices 0 and 1 have edges pointing to themselves; this is allowed

Now it's important to choose the length of your genomes and frequency of start/stop markers carefully. If you have a long genome with frequent starts and stops, then expect to have lots of vertices, for example.

It has been shown that this encoding scheme works reasonably well for coding things like neural networks. The idea came from looking at how real genomes work: they have start and stop codons within the mRNA which function the same way. The advantages to this approach are also that you can have entire sections of unused genome, so it allows for a lot of flexibility. One mutation or mating can create or remove a whole new vertex, as well as shift how the whole structure works.

Conclusion
Designing a good fitness function is crucial to the entire Genetic Algorithm. In some ways, once you get the other parameters and operators chosen and working okay, you won't ever need to adjust them again. All the changes you make to get different results will be how you evaluate your fitness, which is one of the many beauties of GAs. Having a centralized description of your problem also means that you have to make your fitness function behave exactly how you want it to, otherwise you won't get the answer you're looking for. Get your fitness function right and your whole GA should go smoothly.



News | Research | Personal Profile | Photo Album
Contact Me

Page last updated: May 14, 2006