Sunday, April 14, 2013

Implementing a Genetic Algorithm in Bash

For this first experiment I'm going to be implementing a Ginetic Algorithm in Bash.  I like Bash for these little experiments since it's easy to fine tune as I work through the details.  If you take the proper precautions it can run fairly fast... I'll do a Blog entry on how to speed up Bash at some point (I promise :-)  Of course there's only so much you can do with an interpreted language, for an actual implementation this should be ported to a compiled language.

I've always been interested in Genetic Algorithms, but unlike other areas of AI I've never found a good use for it at work.  What really interests me is whether there is ever a good time to use it (where it is more effective than other algorithms) and the idea of tuning the algorithm.

So to start with we need to think about what a Ginetic Algorithm actually does, it attempts to mimic darwinian evolution to evolve a best-fit child to solve a problem:
create a pool of candidates
determine the fitness of the candidates
each generation we select parents for the next generation
create children from the parents by giving some DNA from parent 1 and some from parent 2
allow for some form of random mutation of the child
rinse-and-repeat

I read some interesting articles at some point about genetics and have some ideas about how to tune our algorithm, some things I want to look at:
how big should the pool size be?
what should be the reproduction rate of parents?
should parents live for multiple generations?
how many of the best fitness parents should be allowed to have children?
how many of the least fit should be eliminated (logans run style)?
would it help if we periodically brought back the best candidate?
how long should we let things run for?
how many mutations should be allowed at once?

So, first we need to discuss how we will encode our candidate solution (the genetic material) and how we will determine the fitness.  For a simple experiment we will be trying to match a string, and our genes will be individual characters within the string.  Our fitness will be determined by how close (in the alphabet) we get to each character added together; we wish to minimize this number for this experiment.  In other implementations, we might be dealing with points on a map and minimizing the distance to travel between them, or factors in an equation.

So, let's see what we can come up with...



#! bash
##########################################################################################
# imporant global variables                                                              #
##########################################################################################
     # constants                                                                         #
     #####################################################################################
GENERATION=0           # our current generation, how long the simulation has been running
POOL_SIZE=${1:-6}      # how many candidates are alive at once
REPRO_CHANCE=${2:-30}  # in a generation, the chance (0 to 100) that a candidate will choose a mate
BEST_FITS=${3:-70}
MUTATION_RATE=${4:-75} # for each birth, the chance (0 to 100) that a mutation will occur
NUM_MUTATIONS=${5:-1}  # max number of mutations to apply to a single child
RETURN_CHANCE=${6:-40} # chance the best candidate comes back to save us...
CHURN_RATE=${7:-40}
NUM_SECONDS=${8:-300}
let "CHURN_RATE = POOL_SIZE - ((POOL_SIZE * CHURN_RATE) / 100)"

let "LAST = POOL_SIZE - 1"

let "BEST_FITS = (POOL_SIZE * BEST_FITS) / 100"   # our hottest candidates
if [ $BEST_FITS -lt 1 ]; then
   BEST_FITS=1
fi

BEST_VAL=99999
BEST_MATCH=""


TARGET_STRING="Hello, World!"   # determine our target string
LENGTH=${#TARGET_STRING}        # get the length of the target string
declare -a CANDIDATES           # our candidate population
declare -a TARGET_ARRAY         # used for our fitness function
declare -a CANDIDATE_ARRAY      # used for our fitness function
declare -a FITNESS              # matches the CANDIDATE_ARRAY, the fitness value of each member

CHROMOSOMES='AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz ~1!2@3#456^7&890_=+|;:,./<>?'
NUM_CHROMOSOMES=${#CHROMOSOMES}

echo "Pool Size: $POOL_SIZE          Repro Rate: $REPRO_CHANCE     Mutate Rate: $MUTATION_RATE"
echo "Return Rate: $RETURN_CHANCE       Best Fits: $BEST_FITS       Num Mutations: $NUM_MUTATIONS"
echo "Churn Rate: $CHURN_RATE/$POOL_SIZE       Run for $NUM_SECONDS seconds"
     #####################################################################################
     # useful scratch variables, I am including them here for reference only...          #
     #####################################################################################
LAST_PARENT=0      # marks the end of the parents in this generation, after this are the kids
POPULATION_SCORE=0 # used to show how close the population as a whole is getting

#echo ">>> $CHROMOSOMES <<<>>> $NUM_CHROMOSOMES <<<"
##########################################################################################
# helper functions                                                                       #
##########################################################################################
     #####################################################################################
     # int2char( index ) - will return the character at the indexed position in our
     #                     CHROMOSOMES string.
     #####################################################################################
int2char()
{
   TNDX=$1
   TCHAR=${CHROMOSOMES:$TNDX:1}
}

     #####################################################################################
     # char2int( myChar ) - gives a value to a given character based on the position of
     #                      that character in the CHROMOSOMES string.
     #####################################################################################
char2int()
{
   TCHAR="$1"
   TEMPSTR=${CHROMOSOMES%%${TCHAR}*}
   TVAL=${#TEMPSTR}
}

     #####################################################################################
     # get_value_array( index ) - produces an array of indexed values for each char in
     #                            our candidate string.  These can then be used to create
     #                            our fitness values by comparing against the target array
     #                            "abc" ==> 0 1 2
     #####################################################################################
get_value_array()
{
   INDEX=$1
   TSTR="${CANDIDATES[${INDEX}]}"
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      TCHAR=${TSTR:$TNDX:1}
      char2int "$TCHAR"
      CANDIDATE_ARRAY[$TNDX]=$TVAL
      let "TNDX += 1"
   done
}

     #####################################################################################
     # determine_fitness( index ) - compares a given candidate against the target string
     #                              by converting the candidate string into values and
     #                              seeing how far apart those values are from the target
     #                              string values.
     #####################################################################################
determine_fitness()
{
   INDEX=$1
   get_value_array $INDEX
   TFITNESS=0
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      let "TVAL = TARGET_ARRAY[$TNDX] - CANDIDATE_ARRAY[$TNDX]"
      if [ $TVAL -lt 0 ]; then
         let "TFITNESS = TFITNESS - TVAL"
      else
         let "TFITNESS = TFITNESS + TVAL"
      fi
      let "TNDX += 1"
   done
   FITNESS[${INDEX}]=$TFITNESS
}

     #####################################################################################
     # create_new_candidate() - creates a new random string of the appropriate length
     #####################################################################################
create_new_candidate()
{
   TEMPVAR=0
   TSTRING=""
   while [ $TEMPVAR -lt $LENGTH ]; do
      let "TNDX = RANDOM % NUM_CHROMOSOMES"
      int2char $TNDX
      TSTRING="${TSTRING}${TCHAR}"
      let "TEMPVAR += 1"
   done
}

     #####################################################################################
     # display_candidate( index ) - displays a single candidate in our population
     #####################################################################################
display_candidate()
{
   TMBR=$1
   printf "   %3s   %5s   >>> ${CANDIDATES[${TMBR}]} <<< \n" $TMBR ${FITNESS[${TMBR}]}
}


     #####################################################################################
     # generate_report() - displays the current population
     #####################################################################################
generate_report()
{
   TMBR=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      display_candidate $TMBR
      let "TMBR += 1"
   done
}

     #####################################################################################
     # swap_candidates( index_1, index_2 ) - swaps entries in our arrays for these two
     #                                       candidates.
     #####################################################################################
swap_candidates()
{
   TMBR1=$1
   TMBR2=$2

   TSTR="${CANDIDATES[${TMBR1}]}"
   TVAL=${FITNESS[${TMBR1}]}

   CANDIDATES[${TMBR1}]="${CANDIDATES[${TMBR2}]}"
   FITNESS[${TMBR1}]=${FITNESS[${TMBR2}]}

   CANDIDATES[${TMBR2}]="$TSTR"
   FITNESS[${TMBR2}]=$TVAL
}

display_prodigy()
{
   printf ">>> %4s   Generation: %-5s   " $SECONDS $GENERATION
   printf "%15s" "( ${FITNESS[0]} to ${FITNESS[$LAST]} )"
   printf "   Prodigy: > $BEST_MATCH < with a score of $BEST_VAL\n"
}

     #####################################################################################
     # sort_candidate_pool() - sorts the pool arrays (CANDIDATES and FITNESS) placing the
     #                         most fit candidates at the top.  The most fit candidates
     #                         have a better chance of surviving and reproducing into the
     #                         next generation.
     #####################################################################################
sort_candidate_pool()
{
   # implement a simple bubble sort
   TMBR1=0                                      # the slot currently being inspected
   let "TMBR2 = POOL_SIZE - 1"                  # our last non-sorted slot
   while [ $TMBR2 -gt 0 ]; do
      TMBR1=0
      while [ $TMBR1 -lt $TMBR2 ]; do
         if [ ${FITNESS[${TMBR1}]} -gt ${FITNESS[${TMBR2}]} ]; then
            swap_candidates $TMBR1 $TMBR2       # place the biggest value in the last slot
         fi
         let "TMBR1 += 1"
      done
      let "TMBR2 -= 1"
   done

   if [ ${FITNESS[0]} -lt $BEST_VAL ]; then
      BEST_VAL=${FITNESS[0]}
      BEST_MATCH="${CANDIDATES[0]}"
      display_prodigy
   fi
}

     #####################################################################################
     # mutate_candidate( index ) - randomly change one character in our string
     #####################################################################################
mutate_candidate()
{
   TMBRX=$1
   TCNT=1
   while [ $TCNT -le $NUM_MUTATIONS ]; do
      PARENT1="${CANDIDATES[${TMBRX}]}"
      let "TNDX = RANDOM % LENGTH"   # pick a random character to mutate
      let "TLEN = LENGTH - TNDX - 1"
      if [ $TNDX -eq 0 ]; then
         SUBSTR_L=""
         SUBSTR_R="${PARENT1:1:${TLEN}}"
      elif [ $TLEN -eq 0 ]; then
         let "TLEN = LENGTH - 1"
         SUBSTR_L="${PARENT1:0:${TLEN}}"
         SUBSTR_R=""
      else
         SUBSTR_L="${PARENT1:0:${TNDX}}"
         let "TNDX += 1"
         SUBSTR_R="${PARENT1:${TNDX}:${TLEN}}"
      fi

      # now we mutate that spot
      let "TNDX = RANDOM % NUM_CHROMOSOMES"
      int2char $TNDX
      CANDIDATES[${TMBRX}]="${SUBSTR_L}${TCHAR}${SUBSTR_R}"
      let "TCNT += 1"
   done
   # echo ">>>>> Mutating child...     > $PARENT1 < ==> > ${SUBSTR_L}${TCHAR}${SUBSTR_R} <"
}

     #####################################################################################
     # create_offspring( index1, index2 ) - creates children for two parents
     #   "ABCDEFGH"    + "MNOPQRST"   ==> "????????" + "????????"
     #   "ABC","DEFGH" + "MNO","PQRST" - pick a random spot to do our split, eg. 3
     #   "ABCPQRST" , "MNODEFGH" - swap the pieces for our children
     #####################################################################################
create_offspring()
{
   TMBR1=$1
   TMBR2=$2
   PARENT1="${CANDIDATES[${TMBR1}]}"
   PARENT2="${CANDIDATES[${TMBR2}]}"

   let "TNDX = RANDOM % (LENGTH - 2) + 1"
   let "TLEN = LENGTH - TNDX"
   SUBSTR_X1="${PARENT1:0:${TNDX}}"
   SUBSTR_X2="${PARENT1:${TNDX}:${TLEN}}"
   SUBSTR_Y1="${PARENT2:0:${TNDX}}"
   SUBSTR_Y2="${PARENT2:${TNDX}:${TLEN}}"
   CHILD1="${SUBSTR_X1}${SUBSTR_Y2}"
   CHILD2="${SUBSTR_Y1}${SUBSTR_X2}"

   # the children will replace the worst candidates
   CANDIDATES[${TMBR1}]="$CHILD1"
   #   give each child a chance for a random mutation
   let "TEMP = RANDOM % 100"
   if [ $TEMP -le $MUTATION_RATE ]; then
      mutate_candidate $TMBR1
   fi

   CANDIDATES[${TMBR2}]="$CHILD2"
   let "TEMP = RANDOM % 100"
   if [ $TEMP -le $MUTATION_RATE ]; then
      mutate_candidate $TMBR2
   fi

#   echo ">>>>> $TMBR1 + $TMBR2   ===>  > ${CANDIDATES[${TMBR1}]} < + > ${CANDIDATES[${TMBR2}]} <"
}

     #####################################################################################
     # initialize() - does some set-up stuff to prepare for our testing
     #####################################################################################
initialize()
{
   printf "Target String:   >>> $TARGET_STRING <<< \n"
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      TCHAR=${TARGET_STRING:$TNDX:1}
      char2int "$TCHAR"
      TARGET_ARRAY[$TNDX]=$TVAL
      # printf " %3s" $TVAL
      let "TNDX += 1"
   done
   printf "\n"

   TMBR=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      create_new_candidate
      CANDIDATES[${TMBR}]="$TSTRING"
      determine_fitness $TMBR
      let "TMBR += 1"
   done
}

##########################################################################################
# now we can get started...                                                              #
##########################################################################################
initialize

SUCCESS=0                                         # loop until we hit our target string...
while [ $SUCCESS -eq 0 ]; do
   # go through each of our candidates and give them a fitness score
   TMBR=0
   POPULATION_SCORE=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      determine_fitness $TMBR
      let "POPULATION_SCORE += FITNESS[${TMBR}]"
      let "TMBR += 1"
   done

   sort_candidate_pool       # place our candidates into fitness-order

   if [ $SECONDS -gt $NUM_SECONDS ]; then
      display_prodigy
      generate_report
      echo ""
      exit
   fi

   # if any of our candidates is the target string we are done, report success
   if [ ${FITNESS[0]} -eq 0 ]; then
      SUCCESS=1
   else
      let "LAST_PARENT = POOL_SIZE - 1"  # all candidates can be parents
      TMBR=0
      while [ $TMBR -lt $LAST_PARENT ]; do
         # go through each of our candidates and see if they will be parents
         let "TEMP = RANDOM % 100"
         if [ $TEMP -le $REPRO_CHANCE ]; then     # if we are looking for love...
            let "TMBR2 = RANDOM % BEST_FITS"      # randomly pick one of the available parents
            if [ $TMBR -eq $TMBR2 ]; then         # we can't have a kid by ourselves
               if [ $TMBR -eq 0 ]; then           #    so use the best available candidate
                  TMBR2=1
               else
                  TMBR2=0
               fi
            fi
            create_offspring $TMBR $TMBR2         # now let's have those kids!
         fi
         let "TMBR += 1"
      done

      let "TEMP = RANDOM % 100"
      if [ $TEMP -le $RETURN_CHANCE ]; then
         # echo ">>>>> Bring back the prodigal son!"
         CANDIDATES[${LAST_PARENT}]="$BEST_MATCH" 
      else
         create_new_candidate
         CANDIDATES[${LAST_PARENT}]="$TSTRING"
         FITNESS[${LAST_PARENT}]=$BEST_VAL
      fi

      while [ $LAST_PARENT -gt $CHURN_RATE ]; do
         let "LAST_PARENT -= 1"
         create_new_candidate
         CANDIDATES[${LAST_PARENT}]="$TSTRING"
      done

      let "GENERATION += 1"          # increase the generation number
   fi
done

display_prodigy
printf "\n\n\nMatch found in Generation $GENERATION \n\n\n"




That ended up fairly well... it appears to be exactly what I wanted.  Fairly flexible, I dropped a few ideas early on: I found the ability to carry parents over one generation to the next too beneficial to drop, and the ability to bring back the best candidate (returning hero) was also too beneficial.  I will give some of the results in another Blog Post, in the meantime feel free to play with this.

Enjoy,
David Mitchell

2 comments:

  1. This really is excellent work. And you're right, bash is an under-appreciated language. And if all you need is the results in text format - bash is a go-to for just getting it done. I've always deferred to c# or vb.net for quick and dirty work that needed some sort of graphical representation of the work being done by the GA and it's solution - very important when you do the learning exercises such as the traveling salesman.

    ReplyDelete
  2. Really Excelent, thank you very much!!

    ReplyDelete