Thursday, April 25, 2013

Bash/Korn Shell Diary - Part 4

The beauty of scripts is that you can craft your logic in a persistent way.  From the command line, building a complicated loop or performing a large number of sequential tasks can be tricky.  The command line is for testing; a script file is for building.  And in order to build properly you need to understand your building-blocks, which brings us to flow-control.

Normal program flow in a script is sequential: we execute one line at a time, starting from the top and working our way down until we reach the end.  Bash provides us with several ways to alter this flow of execution.

If-Then-Else blocks are the most fundamental type of flow control.  They allow you to branch execution based on testable conditions.  Bash also gives us the “elif” (Else-If) operator so that we don’t have to use nested If/Else blocks unnecessarily.  Since I’ve been forced to work in some professional languages that don’t support this simple feature I’d like to say “Thank you Bash”.

   if [ test_condition ]; then
      :
   elif [ test_condition ]; then
      :
   fi


Test conditions come in a couple of flavors (a short example follows this list):
-    File checks, which allow you to do things like check if a file exists (“-f MyFile.txt”), see if the file is writable (“-w MyFile.txt”), see if an inode (file) is a symbolic link (“-h MyFile.txt”), or see if a file is newer than another file (“myFile.txt -nt myFile.bak”).
-    Simple string comparisons: (“$MYVAR” = “TESTING”).
-    Simple math comparisons: ($MYVAR -gt 12) inside test brackets, or (( MYVAR > 12 )) in an arithmetic context: -lt, -le, -eq, -ne, -ge, -gt.
-    If you use double brackets you can even use and (&&) and or (||) operators.  Double brackets also expand octal and hex values when they run across them in the standard formats (010, 0xf3).
-    When using multiple compound comparisons you get short-circuit evaluation: (“... || …” the second test will only be executed if the first test fails) and (“… && …” the second test will only be executed if the first test succeeds).  This gets complicated when short-circuiting more than two tests, so make sure you test thoroughly.
-    “-a” and “-o” can also be used for “and” and “or” inside single brackets.
-    The colon operator is actually a valid command in Shell, and will basically do nothing in this case.  Bash expects at least one valid command to be executed in an If statement.
-    Executed command exit values (`grep "xyz" myfiles.* 2>/dev/null`).  This gives you the same benefit as checking against the exit value stored in the variable ($? -eq 0), but allows you to do it in one place if that kind of thing appeals to you.
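
Here is a quick sketch pulling a few of these flavors together (the file name and the threshold are made up for illustration):
   LOGFILE="myfile.txt"
   COUNT=15

   if [ ! -f "$LOGFILE" ]; then
      echo "$LOGFILE does not exist"
   elif [ ! -w "$LOGFILE" ]; then
      echo "$LOGFILE exists but is not writable"
   elif [ $COUNT -gt 12 ] && grep -q "ERROR" "$LOGFILE" 2>/dev/null; then
      echo "count is high and the file mentions errors"
   else
      :   # nothing to do, but the block still needs a command
   fi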

Case Statements are a more convenient form of If-Else checks.  Rather than do a long series of elif's you can do simple string matches in one place:

   case $var in
      a* ) ... ;;
      [0-9] | [0-9][0-9] ) ... ;;
      * ) ;;
   esac


A few things to note (a short example follows this list):
- "case" is ended with "esac" which is "case" spelled backwards.  We saw this before with "if" and "fi".  I have nothing more to say about this.
- We can use shell pattern matching (globs, not full regular expressions) in our string matches: "a*" means a string that starts with an "a", so it would match "alf" as well as "a".  "[0-9]" is a single digit, etc...
- You can put multiple patterns on a single line by using the pipe operator ("|") as an "or".
- Our final case here was a single "*" which will match anything.
- Each case is ended with a double semicolon (";;").  There is no fall-through.
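
As a concrete sketch (the prompt and the responses are made up; "read -p" is Bash's prompt-and-read shortcut):
   read -p "How many files should I process? " COUNT
   case $COUNT in
      q* | Q* )      echo "Quitting..." ;;
      [0-9] )        echo "Processing $COUNT files" ;;
      [0-9][0-9] )   echo "That is a lot of files..." ;;
      * )            echo "I don't understand '$COUNT'" ;;
   esac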

Bash provides a couple of flavors of pre-check loops (loops whose condition is checked before the enclosed commands run for the first time):

   for VAR in ... ; do
      :
   done

   while [ ... ]; do
      :
   done

   until [ ... ]; do
      :
   done


Our For loop allows us to cycle a variable through many different states.  This is useful if you know all of the states to be evaluated prior to entering the loop.  It is a convenient way to rifle through all of the *.csv files in a directory, for instance, or to perform the same logic on a pre-set series of strings (tokens to be found in files, say).
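
For example, here is a minimal sketch that walks every *.csv file in the current directory, plus a second loop over a made-up list of tokens (the wc and grep calls are just placeholders for whatever real work you need to do):
   for FILE in *.csv ; do
      echo "Processing $FILE"
      wc -l "$FILE"
   done

   for TOKEN in alpha beta gamma ; do
      grep -c "$TOKEN" *.csv
   done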

A While loop will rifle through its commands as long as the condition is true, and Until does the same thing as long as the condition is false (not really necessary to have both of these operators since you could just negate the test... but whatever).

In a previous lesson I showed you how you could redirect standard output and error out of your scripts, but there is another trick you can use to redirect standard input into a command by using the "<" operator.  By doing the following we can rip through a file one line at a time:
   while read LINE ; do
      :
   done < input_file.txt


Our $LINE variable now contains the full contents of each line of the file, one at a time.

There is another flow-control operator that is intended for user interaction:
   echo "Please make a selection: "
   select word in quit *.txt ; do
      echo "word is $word"
      echo "REPLY is $REPLY"
      if [ "$word" = "quit" ]; then
        break
      fi
      echo ""
   done


Running it looks something like this:

   Please make a selection:
   1) quit              4) combinations.txt
   2) alchemy.txt       5) myout.txt
   3) alchemy2.txt      6) output.txt
   #? 1
   word is quit
   REPLY is 1


This structure deserves a few comments:
- The "#?" prompt you see above is the default value of "$PS3".  PS3 is a special variable that provides the prompt specifically for select queries; set it yourself before the loop if you want something friendlier.
- If you don't provide any choices explicitly the options given will come from the $@ variable (all arguments, quoted arguments can contain spaces).
- The text of the selected choice is stored in the variable you provided ("word" in this case).
- The number of the selection (that the user typed in) is stored in "$REPLY".
- You can also read user input one line at a time with "read MYVAR".  There are some more advanced tricks for reading a single character we'll try to get to later.
- We used the "break" keyword to jump out of one level of flow-control looping.  We could use "break 2" to jump out of two levels of nested loops if desired (a quick sketch follows this list).
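
Since breaking out of nested loops comes up a lot, here is a tiny sketch (the values are arbitrary):
   for I in 1 2 3 ; do
      for J in a b c ; do
         if [ "$I" = "2" ] && [ "$J" = "b" ]; then
            break 2      # jump out of both the inner and the outer loop
         fi
         echo "${I}${J}"
      done
   done
   # prints 1a 1b 1c 2a (one per line), then stops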

Bash also supports a more traditional, C-style For loop:
   for (( a = 1; a <= LIMIT ; a++ )); do
      :
   done


Here we are using roundy-braces.  We talked about these earlier when we were discussing the "let" operator.  They do have their uses.

There is one other mechanism I want to discuss here: scoping (grouping commands inside curly braces).  Although not normally associated with flow-control, it does allow one special trick:
   echo "$MYVAR" >> outfile.txt
   echo "$MYVAR2" >> outfile.txt


can be replaced with:
   {
      echo "$MYVAR"
      echo "$MYVAR2"
   } >> outfile.txt


This may seem rather minor, but by handling all of our redirection in one chunk like this we can cut down on forks or file open/close operations, and this adds up over time.  In one test run it took me 6 seconds to do 10,000 individual file redirects.  With the output scoped like this, it took many millions of such writes to reach six seconds.
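
If you want to see the difference on your own system, a rough timing sketch like this will show it (the loop count is arbitrary):
   # one open/append per write
   time for I in {1..10000} ; do
      echo "$I" >> out1.txt
   done

   # one open for the whole block
   time {
      for I in {1..10000} ; do
         echo "$I"
      done
   } >> out2.txt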

That's enough for now.  We'll pick up again later.

Bash/Korn Shell Diary - Part 3

Shell Script Basics

Although the command line can be very powerful, at some point (either due to complexity or a desire for reuse) you will want to move into scripting.  A script is a series of commands stored in a plain-text file that starts with a shebang line (the special characters “#!”) followed by a path to the executable (and optional arguments) that will act as an interpreter for this script.

Script files may either be sourced (run in the current environment or shell) or they may be executed, which will start them as a child process.  Sourced scripts (“source ./myscript.sh”) run in your current shell; this means that hitting an “exit” condition will exit your current shell.  Further, you run the risk of changing (and possibly corrupting) the current environment as your script runs.  These scripts also have access to your full environment, including any variables that you have set.

Executed scripts (“./myscript.sh”) will spawn a new shell to run in.  These tasks may be backgrounded if desired (provided no user interaction is needed).  But, since this is a new shell, it will not share the same environment as your command line shell.  If you want variables to be visible inside the new shell they need to be explicitly exported (“export MYVAR”) prior to running, or you can pass them in as arguments:

   #!/bin/bash
   # on every line except the shebang a pound sign starts a "to-end-of-line" comment
   NUM=0
   while [ $# -gt 0 ]; do
      let "NUM += 1"
      echo "Argument ${NUM} = ${1}"
      shift
   done

If we have called our script with arguments we can parse those arguments out.

   $ ./temp.sh one "two three" four
   Argument 1 = one
   Argument 2 = two three
   Argument 3 = four


The shift operator allows us to rotate out some of our arguments ($2 becomes $1, $3 becomes $2, etc.).  You can skip over multiple arguments by using “shift 2” or “shift 3”.  “$#” is a special variable that tells us how many arguments are sitting in the queue for our current scope (functions use the same methodology for handling arguments).  You can also access your arguments directly by using the “${2}”, “${3}” syntax; the curly braces around variable names are used to remove ambiguity.  And you can access all arguments at once by using “$*”.

You can even go one step further and either check if expected arguments are set, or provide default values for when they are not set:
   MYVAR=${1:-default}   # the colon operator treats an empty (NULL) string as not being set
   MYVAR=${1-default}    # the dash operator "-" says to use the default if the argument is not set


Although there are other ways to deal with unset arguments (such as checking if a variable is set using the “${MYVAR?}” syntax) either setting default values or forcing the operator to provide the inputs seem to work best. 
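
Here is a tiny sketch showing the difference between the two forms; drop it into a test script and run it once with no argument and once with an empty string ("") as the first argument:
   echo "With colon:    ${1:-default}"   # falls back to "default" if $1 is unset OR empty
   echo "Without colon: ${1-default}"    # falls back to "default" only if $1 is unset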

Variables

And speaking of variables, now that we are using a shell script I can tell you that they are a lot of fun to play with.  For instance if you know ahead of time that you will only use a variable in a certain way you can optimize the performance in the shell and make it easier to work with by using declare.

   declare -i MYINT   # MYINT only holds integers, if you shove a string in there it becomes a 0
   declare -u MYSTR   # MYSTR converts all letters to upper-case
   declare -l MYSTR   # MYSTR converts all letters to lower-case
   declare -r MYSTR="YES"   # MYSTR is now read-only and cannot change


Unless otherwise specified all variables can be used interchangeably as Strings or (if they hold an appropriate value) Integers.  To assign a new value to a variable you use the "=" operator: MYVAR="a string of text".  And to access the variable you use the ${MYVAR} syntax.  In most cases the curly-braces are optional, but they help remove ambiguity for the shell and allow us to use some more advanced string manipulation features… so get used to them.  Things on the right hand side of your assignment are evaluated fully before assignment, so the following is legal and will append to your string: MYVAR="${MYVAR} more text here"

Now that we have some data in our strings we can start playing with them; the shell provides us with a rich set of built-in string manipulation features.  The following code snippet:
   MYSTRING="ab cd,ef gh,ij kl,mn op"
   typeset -u MYSTRING
   echo "Uppercase: $MYSTRING"

   typeset -l MYSTRING
   echo "Lowercase: $MYSTRING"

   echo "MYSTRING contains ${#MYSTRING} characters."

   echo "Let's change the vowels to x's: ${MYSTRING//[aeiou]/x}"
   echo "How about just the first comman to an underscore: ${MYSTRING/,/_}"

   echo ""
   echo "Characters 11-15: ${MYSTRING:10:5}"
   echo "Characters 11-end: ${MYSTRING:10}"

   echo ""
   echo "Hey is that CSV (Comma-Seperated-Values) format?"
   echo "Field 1 = ${MYSTRING%%,*}"
   echo "Field 4 = ${MYSTRING##*,}"
   echo "Field 1,2 and 3 = ${MYSTRING#*,}"
   echo "Field 2,3 and 4 = ${MYSTRING%,*}"

   echo ""
   echo "or how about this neat trick:"
   IFS=","
   set -- ${MYSTRING}
   echo "Field 1 = $1"
   echo "Field 2 = $2"
   echo "Field 3 = $3"
   echo "Field 4 = $4"

   echo ""
   printf "And how about some pretty formatting?  %7s %-3s \n" $1 $2

produces this output:
   Uppercase: ab cd,ef gh,ij kl,mn op
   Lowercase: ab cd,ef gh,ij kl,mn op
   MYSTRING contains 23 characters.
   Let's change the vowels to x's: xb cd,xf gh,xj kl,mn xp
   How about just the first comma to an underscore: ab cd_ef gh,ij kl,mn op

   Characters 11-15: h,ij
   Characters 11-end: h,ij kl,mn op

   Hey is that CSV (Comma-Separated-Values) format?
   Field 1 = ab cd
   Field 4 = mn op
   Field 1,2 and 3 = ab cd,ef gh,ij kl
   Field 2,3 and 4 = ef gh,ij kl,mn op

   or how about this neat trick:
   Field 1 = ab cd
   Field 2 = ef gh
   Field 3 = ij kl
   Field 4 = mn op

   And how about some pretty formatting?    ab cd ef gh

Let’s see what we can learn from this code:
-    The typeset built-in allows you to do some broad-stroke manipulation of variables, similar to declare.
-    You can count the number of characters in a string with ${#...}; empty variables have 0 characters, btw.
-    You can do find-and-replace with the ${…//find-this/replace-with-this} syntax, or even delete matches with ${…//delete-this/}.
-    You can also find/replace only the first match with ${…/find-this/replace-with-this}.
-    Those search terms we just talked about support glob-style pattern matching (wildcards and character classes, not full regular expressions).
-    You can easily remove everything up to a pattern either from the left (“#”) or from the right (“%”).
-    Variables can replace the standard arguments by using the “set -- VARIABLE” syntax.
-    The $IFS variable is a special variable that determines the default field separator for splitting words.  It defaults to whitespace (space, tab or newline), but can be overridden for special processing.
-    Although not demonstrated here, you can also make matching characters upper or lowercase by using ${…^^uppercase-this-match} and ${…,,lowercase-this-match}; these match globally, so use a single caret (“^”) or comma (“,”) to convert only the first character.
- The printf built-in provides the same special formatting used by C/C++ or the Unix printf command: %-3s says I'm expecting a string that is at least 3 characters wide, left justified.  Toss the arguments at the end, outside of the quotes.

By using these special tricks we can replace most of the string manipulation features that would otherwise require us to fork commands.  Forked commands run in their own process and carry a lot of overhead; they are the major reason scripts end up running orders of magnitude slower than they should.  With these tricks we can replace most of the uses for external UNIX commands like “sed”, “awk”, “cut”, “tr”, etc…
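
A few sketches of what those replacements look like in practice (the file name is made up, and the typeset -u trick assumes Bash 4 or a recent Ksh):
   FILE="report_2013-04-25.csv"

   BASE=${FILE%.*}           # "report_2013-04-25"     - instead of forking `cut -d. -f1`
   EXT=${FILE##*.}           # "csv"                   - instead of forking `awk -F. '{ print $NF }'`
   CLEAN=${FILE//-/_}        # "report_2013_04_25.csv" - instead of forking `sed 's/-/_/g'`

   typeset -u UPPER          # instead of forking `tr 'a-z' 'A-Z'`
   UPPER="$FILE"             # "REPORT_2013-04-25.CSV"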

Strings can also be used in comparison checks, but more on these later:
   if [ "$MYSTRING" = "$THATSTRING" ]; then …    are the strings equal?
   if [ ! "$MYSTRING" = "$THATSTRING" ]; then …  are they not equal?
   if [ "${MYSTRING?}" ]; then …                 is this thing even set? (if not, this aborts with an error)


Integer type variables are basically strings (they can be manipulated just like strings) that happen to contain integer values (-100, 0, 123784, etc.).  This special happenstance gives us some new toys to play with:
   MYVAR=24
   let "MYVAR = 12" "MYVAR += 8" "MYVAR += 16"
   echo "$MYVAR"
   let "MYVAR = (MYVAR / 3) % 7"
   echo "$MYVAR"

   let "MYVAR = 8#20"
   echo "$MYVAR"

gives us:
   36
   5
   16

We can even work with some very large numbers, at least 32-bit signed values (64-bit in most modern builds)!  But we cannot work with floats like this.
-    When using the “let” operator we are telling the shell we are going to work with variables, so we don’t need the “$VAR” syntax.  You could also use the “(( … ))” syntax if you prefer; at one point some timing studies showed me the “let” operator was evaluated by the shell quicker than the round-brace operators by 3ns, and at one point that was important enough to make me stick with the “let” operator for awhile.
-    Multiple operations can be done on a single “let” line, each encased in double quotes.
-    You have a rich variety of operators available to you, basically all of the C++/Java type operators using the standard order of precedence.  You can do bitwise shifting with “>>” and “<<”, you have all the normal math operators plus “%” for the remainder (modulus), and even “**” for exponents.  Use round braces for clarification or to change the normal operator precedence.
-    You can take input in other bases using the “base#value” notation (the value is written in that base) for conversion.

Integers are fun… but they still aren’t floaty-numbers :-(  You can however use special techniques to simulate fixed point floats (currency for example) or convert in-and-out of hex values.  I keep special functions in source-able script files for just this type of problem.  I still wish we had better float handling though.
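
As a sketch of the fixed-point idea, you can keep currency in whole cents and only fake the decimal point on output (the prices here are made up):
   PRICE=1999                # $19.99 stored as 1999 cents
   QTY=3
   let "TOTAL = PRICE * QTY"
   let "DOLLARS = TOTAL / 100" "CENTS = TOTAL % 100"
   printf "Total: \$%d.%02d\n" $DOLLARS $CENTS     # Total: $59.97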



In Shell you also have access to Arrays.  Let’s dive right in:
   declare -a myChars
   CNT=1
   while [ $CNT -le 12 ] ; do
      myInts[$CNT]=$CNT
      let "CNT += 1"
   done

   CNT=1
   for CHAR in {a..e} ; do
      myChars[$CNT]=$CHAR
      let "CNT += 1"
   done

   echo "how's this work?  ${myInts}"
   echo "nope... let's try something else:"
   I=1

   while [ $I -le ${#myInts[*]} ]; do
      echo "myInts[$I] = ${myInts[${I}]}"
      let "I += 1"
   done

   for CHAR in ${myChars[*]} ; do
      echo ">>> $CHAR"
   done

gives us:
   how's this work?
   nope... let's try something else:
   myInts[1] = 1
   myInts[2] = 2
   myInts[3] = 3
   myInts[4] = 4
   myInts[5] = 5
   myInts[6] = 6
   myInts[7] = 7
   myInts[8] = 8
   myInts[9] = 9
   myInts[10] = 10
   myInts[11] = 11
   myInts[12] = 12
   >>> a
   >>> b
   >>> c
   >>> d
   >>> e

This shows us a couple of things:
-    Although arrays can be declared explicitly it isn’t necessary.  Bash notices when you use array syntax and marks the variables as arrays on-the-fly.  I could prove this to you if there was some way to list all arrays dynamically... oh wait, there is: “typeset -a” will list all arrays in the current shell, and by-the-way try the “-i” flag for integers, “-r” for readonly, “-f” for functions when you get the time.
-    Arrays are single-dimension so “MYARRAY[12]” is possible, but “MYARRAY[12,6]” will require a few tricks to implement in bash (see the sketch after this list).
-    Arrays can be staggered, you can feel free to assign to arrays willy-nilly.  Have an array with only a 3rd index filled in if you’d like.
-    To assign to an array use “MYARRAY[5]=value” or “MYARRAY[$INDEX]=value”.
-    To read from an array use “${MYARRAY[5]}” or “${MYARRAY[$INDEX]}”.
-    To find the number of items in an array use “${#MYARRAY[*]}”.
-    To list all members of an array at once use “${MYARRAY[*]}”.
-    And if you do just access “${MYARRAY}” you will only get the element at index 0 of that array.
-    Oh, and we almost missed the little curly brace expansion trick we pulled up there "{a..e}", this is a special trick that fills in all of the blanks between the two ends ("a b c d e" in this case).  You can use this trick on the command line to generate tons of test files: ("touch myfile{1..5}.{txt,bat}").  This type of expansion only works in certain circumstances though, so we can't use it to cleanly assign our variables.
-    As a side note, earlier when we used the "set --" command we split our string into the positional parameters ($1, $2, …), which behave very much like an array.
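
And since I teased it above, one way to fake the two-dimensional “MYARRAY[12,6]” case is to flatten the row and column into a single index.  This is just a sketch, with a made-up grid width:
   declare -a GRID
   COLS=10                           # made-up width of our pretend 2-D array

   set_cell()                        # set_cell row col value
   {
      let "NDX = ($1 * COLS) + $2"
      GRID[$NDX]="$3"
   }

   get_cell()                        # get_cell row col
   {
      let "NDX = ($1 * COLS) + $2"
      echo "${GRID[$NDX]}"
   }

   set_cell 12 6 "hello"
   get_cell 12 6                     # prints "hello"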

That should be enough for one Blog... we'll pick up again in the next entry :-)

Tuesday, April 23, 2013

Bash/Korn Shell Diary - Part 2

Understanding the Command Line

Bash (I will refer to the entire family of “Ksh/Bash/Sh” as “Bash” from now on to make things simpler for me) provides two environments – an interactive command-line and a scripting engine which can parse commands from a user provided shell script file.

When running in command-line mode, Bash provides the user with a running environment.  Everything that takes place is within a single process, with additional (non-built-in) commands running as child processes.  The shell provides the user with special variables that can customize the environment, and like every Unix process it also provides three streams (0 – standard input, which defaults to the keyboard; 1 – standard output and 2 – standard error, which both default to the terminal screen).

To redirect these streams we can use the following formats:
    Commands  > output.txt    # redirect standard-output to a file
    Commands >> output.txt    # redirect to a file, appending to the end
    Commands  < inputs.txt    # take standard-input from a file
    Commands 2> /dev/null     # redirect standard-error to nothing (dump it)
    Commands  > out.txt 2>&1  # send standard-out and error to a file
    Commands | tee -a out.txt # split the output stream, append to a file
    Commands |& tee out.txt   # split the output and error stream


Bash also allows you to temporarily redirect a sequence of commands by using the following notation:
    {
        Commands
        …
    } > output.txt


So: “>” indicates an overwrite, “>>” indicates an append, and “|” indicates a pipe – which passes the standard-output from one command to the standard-input of the next command.  Please note that using a Pipe (“|”) will fork a new process; this takes significantly longer than using shell built-in commands and is the major cause of slow-downs in scripts.

Each command returns a value.  This value can be tested either directly in an “if” check or indirectly by looking at the exit status stored in “$?”.
    $ if `echo "DATE" | grep "DAY" | tr 'A-Z' 'a-z'`; then
    >    echo "Success"
    > else
    >    echo "Failure"
    > fi
    Success

    $ echo "DATE" | grep "DAY" | tr 'A-Z' 'a-z'
    $ echo "$?"
    0


A return value of 0 indicates success, while a different value indicates failure.  This is because there is only one way for a command to succeed, but there can be multiple ways for a command to fail.  Looking at the above example we see that we received a successful exit status from our piped command even though we failed to produce any output; this is because piped commands (or functions) return the value of the last command executed.  By looking a little deeper we can see that our second command is the one that actually failed:

    $ echo "DATE" | grep "DAY" | tr 'A-Z' 'a-z'
    $ echo "${PIPESTATUS[*]}"
    0 1 0


In these last couple of examples we’ve seen a few of the environment variables of importance to our shell:
    $? – exit status of the last executed command
    ${PIPESTATUS[*]} – exit status of each command in the most recently executed pipeline


Some other useful variables for setting up your shell:
    $ PS1=" \[\e[33m\]\w\[\e[0m\]\n\$"
    $ PS2=">   "


The PS1 variable sets up your command line prompt.  In this case we are using special codes to change the color to brown/yellow (“\[\e[33m\]”) and back to the default (“\[\e[0m\]”), an escape sequence to display the path (“\w”), and a final newline (“\n”) followed by the traditional Bash-type prompt (“$”).  PS2 is a special prompt that is displayed when you are entering a multi-line statement (such as our compound “if” statement above).  To assign new values to our variables we use this format; to display a variable we can either dump all values (“env”) or display them using the special $variable syntax:
    $PWD      - current directory (as determined by the shell)
    $OLDPWD   - previous directory
    $PATH     - the search path for any command that is kicked out to the external shell

   
Some characters need to be escaped out (“\$”) because those characters can have special meaning in our shell.  When we are inside of double quotes ("") the shell is allowed to pass over the input and do variable expansion ($MYVAR shows the value held in the variable “MYVAR”).  Other quotes have other special behaviors: single quotes (' – on the same key as the double quote) do not expand variables, and backticks (` – on the same key as the tilde ~) will execute their contents.

    TMP="a"

    #     |--- evaluate/execute between these ticks ----|
    WORD4=`echo "this is $TMP test" | awk '{ print $4 }'`
    #           |-- expand vars --|       |-- don't ---|
    #                 in here               expand here
    echo "$WORD4"   # <-- "test"


I’m going to assume you know the basics of moving around and manipulating your environment: cd, pwd, ls, cp, mv, rm.  As a fun side-note if you ever wanted to write your own shell the “cd” and “pwd” commands are the only things you really have to implement since the shell needs to know where you are, all other commands could be forked to the outer shell (a bare minimum shell can be written in about a page of C++ code and makes for a fun topic when people ask you if you’ve ever done any shell programming).  Fortunately, Bash provides significantly more than that.

From the command line you have the ability to remember previous commands:

    $ cd ~/SCRIPTS/TEST01/RUN01
    $ history                     # list previously executed commands
    1 cd ~/SCRIPTS/TEST01/RUN01
    2 history

    $ !cd                         # re-run the last command starting with "cd"
    cd ~/SCRIPTS/TEST01/RUN01
    $ ^01^02                      # replace the first "01" with "02" in the last command I ran
    cd ~/SCRIPTS/TEST02/RUN01
    $ !!:gs/0/3                   # replace every "0" with "3" in the previous command
    cd ~/SCRIPTS/TEST32/RUN31
    $ !h                          # repeat the last command starting with this "h"
    history
    1 cd ~/SCRIPTS/TEST01/RUN01
    2 history
       :
    6 history

    $ !!                          # execute the last command again
    history
    1 cd ~/SCRIPTS/TEST01/RUN01
       :
    6 history
    7 history


Bash also gives us some level of Job control: the ability to place jobs into the background (“command &”), examine what jobs are in the background (“jobs”), and manipulate jobs by job ID (“%1”, “%2”…) as opposed to the unique process ID for that job ($$ gives us our own PID, $! gives the PID of the most recently backgrounded child).  We can bring a job to the foreground (“fg %1”; only one job can be running in the foreground at once, and this will pause your current shell until it has completed), and we can kill jobs (“kill %1”).
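
A quick sketch of those job-control commands in action (the sleep is just standing in for real long-running work):
    sleep 300 &        # start a long-running command in the background
    echo "$!"          # PID of the most recently backgrounded child
    jobs               # list the background jobs and their job IDs
    fg %1              # bring job 1 to the foreground (Ctrl-Z will suspend it again)
    kill %1            # or kill it by job ID instead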

Bash/Korn Shell Diary - Part 1

The Cygwin Shell

Over the years I have worked with a lot of shells and each of them have their own benefits and drawbacks.  I’ve been able to work with Perl, Sh, Bash, Ksh, CShell, Make, TCL/TK and probably a few others as well.  Currently I am working in a Windows-Only environment and I have settled on Gnu Bash 4.1+ running in Cygwin as my shell of choice.  Although a lot of the tips and tricks I am using here are specific to this shell (or the “Sh” family) there are a few items, such as speeding up your scripts, that will be applicable to any scripting language; even so, if you wish to follow along I would recommend downloading Cygwin.  Cygwin brings a lot of functionality to your Windows environment and is worth learning if you are serious about becoming a Windows Power-User.

http://www.cygwin.com

Crafting a Re-usable Script

Whenever I create a new shell script I always start by looking at a problem.  Let’s say I have information in multiple files I need to pull together to create a single Excel file; for me this is a very common scenario.   First, I try to think about my inputs and my outputs; where can I pull the information I want?  Do I need to compute or generate any information?  If information needs to be computed I need to work out how to do this…

I copy all of the source files into one directory and then get started on the command line.  I take a few sample cases and practice extracting or generating the desired output.  I test out placing the final output in the correct format, making it pretty or parse-able depending on the end-goal.  As each piece is perfected I place the commands into an empty script (a plain-text file with a “.sh” extension and the appropriate first line, “#!/bin/bash”).

As I move along I test out my script to make sure there are no special cases that will jump out and bite me; special cases need to be handled.  If I am dealing with intermediary files, once I have them in the correct format I stop removing and regenerating them in the script (tossing those lines into uncalled function blocks so they aren’t lost) to speed up testing.

Then I start to generalize things to make the script more re-usable.  I start moving variables to optional command-line arguments.  I make the script able to parse more input files.  I remove operator interaction.  And, most importantly I try to make the script more user-friendly: If the input files are in a known location I start grabbing the originals so the operator doesn’t have to copy them in all the time.  I try to make the information that could change from run-to-run encased in variables and stored at the top of the script or driven off of command-line arguments (or an external global variables script).

Then I look for ways to speed things up.  I start with the heavy hitters – removing as many forks as I can.  If the script will get a lot of usage, or if we might need quick turn-around I might start gathering timing statistics on how long chunks take to run, and then work on optimizing the logic and applying more advanced speed-up routines (yes, I have timed out how long ‘let “MYVAR += 1”’ takes compared to “MYVAR=$(( $MYVAR + 1 ))”).

And finally, even though I’ve been commenting my script all along, I do a final scrub of the comments and make sure everything is well-documented and makes perfect sense.  Even though I may be the only person to ever look at the innards of my script I want to make sure that years from now when I need to tweak things I will understand it immediately and can make the changes in seconds as opposed to hours.

So, in summary:
  1. Take a few minutes to plan things out.
  2. Practice on the command line extracting and formatting the information you need.
  3. Build up your script one chunk at a time until it is perfect.
  4. Run your script in chunks to ensure it is running correctly.
  5. Make the script more generic/reusable.
  6. Optimize for performance.
  7. Comment everything.

Sunday, April 14, 2013

Optimizing the Genetic Algorithm

Well, I took some time to optimize the GA I described earlier; the results were a little disappointing.  First let's see the non-optimized results:

Pool Size: 11          Repro Rate: 40     Mutate Rate: 40
Return Rate: 20       Best Fits: 7       Num Mutations: 5
Churn Rate: 1/11       Run for 300 seconds
Target String:   >>> Hello, World! <<<

>>>    1   Generation: 0        ( 204 to 447 )   Prodigy: > dahm=YLrwwvep < with a score of 204
>>>    3   Generation: 5        ( 181 to 436 )   Prodigy: > cd&dT433R!2?3 < with a score of 181
>>>    6   Generation: 15       ( 170 to 431 )   Prodigy: > jbgJjX/&tandW < with a score of 170
>>>   28   Generation: 75       ( 161 to 475 )   Prodigy: > mgOIi=U#!Wh~1 < with a score of 161
>>>   31   Generation: 85       ( 158 to 404 )   Prodigy: > mWlyIz2rjVKB8 < with a score of 158
>>>   36   Generation: 98       ( 148 to 422 )   Prodigy: > mWlyIz2rjomG  < with a score of 148
>>>   38   Generation: 105      ( 137 to 457 )   Prodigy: > HPltpzZRuWvF; < with a score of 137
>>>   72   Generation: 200      ( 127 to 372 )   Prodigy: > IeGhzy QNalbZ < with a score of 127
>>>   94   Generation: 261      ( 126 to 445 )   Prodigy: > IeGVo/7whHHbM < with a score of 126
>>>  107   Generation: 297      ( 123 to 460 )   Prodigy: > IeGVo9+pPgOn  < with a score of 123
>>>  119   Generation: 333      ( 114 to 479 )   Prodigy: > IeGVo9+pOKMdP < with a score of 114
>>>  277   Generation: 772      ( 111 to 399 )   Prodigy: > IeGVo91SL2Oaq < with a score of 111
>>>  301   Generation: 838      ( 262 to 502 )   Prodigy: > IeGVo91SL2Oaq < with a score of 111
     0     262   >>> TuPNI52Hdh#Id <<<
     1     272   >>> IZGRHhIzukqrP <<<
     2     291   >>> WgNdL>,A;MBuc <<<
     3     301   >>> =AwF03qv&nBPJ <<<
     4     314   >>> D 1xMZmgqB&x2 <<<
     5     333   >>> 0^We=,70jUixb <<<
     6     338   >>> j5VWZ/510f 9E <<<
     7     341   >>> n<v~FYdygWyun <<<
     8     357   >>> XZK1vAGc@HqK& <<<
     9     360   >>> #r_7d#39IF 4= <<<
    10     502   >>> ,.V/+B8&y+=vv <<<


The closest we could come to "Hello, World!" in five minutes was "IeGVo91SL2Oaq", which is a bit disappointing, especially since we have an 11-member pool with 838 generations... so over 9000 candidate solutions evaluated to get here.

To optimize, I held all of the values steady except one (at a time) and modified the remaining variable slowly; after five runs at each level I ended up with some more optimized values.  I then did this once again using the new values and ended up with my somewhat-optimized results:

Pool Size: 5          Repro Rate: 62     Mutate Rate: 20
Return Rate: 75       Best Fits: 1       Num Mutations: 1
Churn Rate: 5/5       Run for 300 seconds
Target String:   >>> Hello, World! <<<

>>>    1   Generation: 0        ( 286 to 439 )   Prodigy: > 6gmp3Iw|lq;uX < with a score of 286
>>>    1   Generation: 1        ( 249 to 428 )   Prodigy: > aoX3UVw|lqKzp < with a score of 249
>>>    1   Generation: 3        ( 248 to 336 )   Prodigy: > agmp3IWzU+Kzp < with a score of 248
...

>>>  268   Generation: 2467        ( 4 to 84 )   Prodigy: > HeMlo. WoSld1 < with a score of 4
>>>  283   Generation: 2602        ( 3 to 74 )   Prodigy: > HeMlo. WoSld! < with a score of 3
>>>  300   Generation: 2762        ( 2 to 64 )   Prodigy: > Hello. WoSld! < with a score of 2
>>>  301   Generation: 2764        ( 2 to 55 )   Prodigy: > Hello. WoSld! < with a score of 2
     0       2   >>> Hello. WoSld! <<<
     1      25   >>> HeMlo. WoSwd! <<<
     2      26   >>> Tello. WoSld! <<<
     3      28   >>> HeMloy WoSld! <<<
     4      55   >>> 9ello. WoSld! <<<
 
A bit better... still can't hit our target string in five minutes but we got pretty darn close.  And clocking in at over 13,000 candidates evaluated... I have nothing to say.

So, what lessons did I learn through all this?

Pool Size and Return Chance seem to be the most important variables to tune at the beginning.  Favoring more generations seems to bring the most benefit, but as you start tuning, other variables become more important.  Tuning is very difficult, and based on some quick experience with the Traveling Salesman problem, the tuning from one problem has no bearing on another.  It appears that the better combinations resulted in more candidate evaluations.

The amount of tuning required to get even marginal results is absurd... perhaps I need to create a GA to tune the GA... but based on the number of computations involved I'm not overly impressed.  For most problems if you already know the fitness algorithm you are better off just iterating your way to a solution.

All of this has been interesting... but I'm still not convinced that it is useful. 

Next up, onto something new.

Implementing a Genetic Algorithm in Bash

For this first experiment I'm going to be implementing a Genetic Algorithm in Bash.  I like Bash for these little experiments since it's easy to fine-tune as I work through the details.  If you take the proper precautions it can run fairly fast... I'll do a Blog entry on how to speed up Bash at some point (I promise :-)  Of course there's only so much you can do with an interpreted language; for an actual implementation this should be ported to a compiled language.

I've always been interested in Genetic Algorithms, but unlike other areas of AI I've never found a good use for it at work.  What really interests me is whether there is ever a good time to use it (where it is more effective than other algorithms) and the idea of tuning the algorithm.

So to start with we need to think about what a Genetic Algorithm actually does: it attempts to mimic Darwinian evolution to evolve a best-fit child to solve a problem:
- create a pool of candidates
- determine the fitness of the candidates
- each generation, select parents for the next generation
- create children from the parents by giving some DNA from parent 1 and some from parent 2
- allow for some form of random mutation of the child
- rinse-and-repeat

I read some interesting articles at some point about genetics and have some ideas about how to tune our algorithm; some things I want to look at:
- how big should the pool size be?
- what should the reproduction rate of parents be?
- should parents live for multiple generations?
- how many of the best-fitness parents should be allowed to have children?
- how many of the least fit should be eliminated (Logan's Run style)?
- would it help if we periodically brought back the best candidate?
- how long should we let things run for?
- how many mutations should be allowed at once?

So, first we need to discuss how we will encode our candidate solution (the genetic material) and how we will determine the fitness.  For a simple experiment we will be trying to match a string, and our genes will be individual characters within the string.  Our fitness will be determined by how close (in the alphabet) we get to each character added together; we wish to minimize this number for this experiment.  In other implementations, we might be dealing with points on a map and minimizing the distance to travel between them, or factors in an equation.

So, let's see what we can come up with...



#!/bin/bash
##########################################################################################
# important global variables                                                             #
##########################################################################################
     # constants                                                                         #
     #####################################################################################
GENERATION=0           # our current generation, how long the simulation has been running
POOL_SIZE=${1:-6}      # how many candidates are alive at once
REPRO_CHANCE=${2:-30}  # in a generation, the chance (0 to 100) that a candidate will choose a mate
BEST_FITS=${3:-70}     # percentage of the pool (the most fit, after sorting) eligible to be picked as a mate
MUTATION_RATE=${4:-75} # for each birth, the chance (0 to 100) that a mutation will occur
NUM_MUTATIONS=${5:-1}  # max number of mutations to apply to a single child
RETURN_CHANCE=${6:-40} # chance the best candidate comes back to save us...
CHURN_RATE=${7:-40}    # percentage of the pool replaced with brand-new random candidates each generation
NUM_SECONDS=${8:-300}  # how many seconds to let the simulation run before reporting and exiting
let "CHURN_RATE = POOL_SIZE - ((POOL_SIZE * CHURN_RATE) / 100)"

let "LAST = POOL_SIZE - 1"

let "BEST_FITS = (POOL_SIZE * BEST_FITS) / 100"   # our hottest candidates
if [ $BEST_FITS -lt 1 ]; then
   BEST_FITS=1
fi

BEST_VAL=99999
BEST_MATCH=""


TARGET_STRING="Hello, World!"   # determine our target string
LENGTH=${#TARGET_STRING}        # get the length of the target string
declare -a CANDIDATES           # our candidate population
declare -a TARGET_ARRAY         # used for our fitness function
declare -a CANDIDATE_ARRAY      # used for our fitness function
declare -a FITNESS              # matches the CANDIDATE_ARRAY, the fitness value of each member

CHROMOSOMES='AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz ~1!2@3#456^7&890_=+|;:,./<>?'
NUM_CHROMOSOMES=${#CHROMOSOMES}

echo "Pool Size: $POOL_SIZE          Repro Rate: $REPRO_CHANCE     Mutate Rate: $MUTATION_RATE"
echo "Return Rate: $RETURN_CHANCE       Best Fits: $BEST_FITS       Num Mutations: $NUM_MUTATIONS"
echo "Churn Rate: $CHURN_RATE/$POOL_SIZE       Run for $NUM_SECONDS seconds"
     #####################################################################################
     # useful scratch variables, I am including them here for reference only...          #
     #####################################################################################
LAST_PARENT=0      # marks the end of the parents in this generation, after this are the kids
POPULATION_SCORE=0 # used to show how close the population as a whole is getting

#echo ">>> $CHROMOSOMES <<<>>> $NUM_CHROMOSOMES <<<"
##########################################################################################
# helper functions                                                                       #
##########################################################################################
     #####################################################################################
     # int2char( index ) - will return the character at the indexed position in our
     #                     CHROMOSOMES string.
     #####################################################################################
int2char()
{
   TNDX=$1
   TCHAR=${CHROMOSOMES:$TNDX:1}
}

     #####################################################################################
     # char2int( myChar ) - gives a value to a given character based on the position of
     #                      that character in the CHROMOSOMES string.
     #####################################################################################
char2int()
{
   TCHAR="$1"
   TEMPSTR=${CHROMOSOMES%%${TCHAR}*}
   TVAL=${#TEMPSTR}
}

     #####################################################################################
     # get_value_array( index ) - produces an array of indexed values for each char in
     #                            our candidate string.  These can then be used to create
     #                            our fitness values by comparing against the target array
     #                            "abc" ==> 0 1 2
     #####################################################################################
get_value_array()
{
   INDEX=$1
   TSTR="${CANDIDATES[${INDEX}]}"
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      TCHAR=${TSTR:$TNDX:1}
      char2int "$TCHAR"
      CANDIDATE_ARRAY[$TNDX]=$TVAL
      let "TNDX += 1"
   done
}

     #####################################################################################
     # determine_fitness( index ) - compares a given candidate against the target string
     #                              by converting the candidate string into values and
     #                              seeing how far apart those values are from the target
     #                              string values.
     #####################################################################################
determine_fitness()
{
   INDEX=$1
   get_value_array $INDEX
   TFITNESS=0
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      let "TVAL = TARGET_ARRAY[$TNDX] - CANDIDATE_ARRAY[$TNDX]"
      if [ $TVAL -lt 0 ]; then
         let "TFITNESS = TFITNESS - TVAL"
      else
         let "TFITNESS = TFITNESS + TVAL"
      fi
      let "TNDX += 1"
   done
   FITNESS[${INDEX}]=$TFITNESS
}

     #####################################################################################
     # create_new_candidate() - creates a new random string of the appropriate length
     #####################################################################################
create_new_candidate()
{
   TEMPVAR=0
   TSTRING=""
   while [ $TEMPVAR -lt $LENGTH ]; do
      let "TNDX = RANDOM % NUM_CHROMOSOMES"
      int2char $TNDX
      TSTRING="${TSTRING}${TCHAR}"
      let "TEMPVAR += 1"
   done
}

     #####################################################################################
     # display_candidate( index ) - displays a single candidate in our population
     #####################################################################################
display_candidate()
{
   TMBR=$1
   printf "   %3s   %5s   >>> ${CANDIDATES[${TMBR}]} <<< \n" $TMBR ${FITNESS[${TMBR}]}
}


     #####################################################################################
     # generate_report() - displays the current population
     #####################################################################################
generate_report()
{
   TMBR=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      display_candidate $TMBR
      let "TMBR += 1"
   done
}

     #####################################################################################
     # swap_candidates( index_1, index_2 ) - swaps entries in our arrays for these two
     #                                       candidates.
     #####################################################################################
swap_candidates()
{
   TMBR1=$1
   TMBR2=$2

   TSTR="${CANDIDATES[${TMBR1}]}"
   TVAL=${FITNESS[${TMBR1}]}

   CANDIDATES[${TMBR1}]="${CANDIDATES[${TMBR2}]}"
   FITNESS[${TMBR1}]=${FITNESS[${TMBR2}]}

   CANDIDATES[${TMBR2}]="$TSTR"
   FITNESS[${TMBR2}]=$TVAL
}

display_prodigy()
{
   printf ">>> %4s   Generation: %-5s   " $SECONDS $GENERATION
   printf "%15s" "( ${FITNESS[0]} to ${FITNESS[$LAST]} )"
   printf "   Prodigy: > $BEST_MATCH < with a score of $BEST_VAL\n"
}

     #####################################################################################
     # sort_candidate_pool() - sorts the pool arrays (CANDIDATES and FITNESS) placing the
     #                         most fit candidates at the top.  The most fit candidates
     #                         have a better chance of surviving and reproducing into the
     #                         next generation.
     #####################################################################################
sort_candidate_pool()
{
   # implement a simple bubble sort
   TMBR1=0                                      # the slot currently being inspected
   let "TMBR2 = POOL_SIZE - 1"                  # our last non-sorted slot
   while [ $TMBR2 -gt 0 ]; do
      TMBR1=0
      while [ $TMBR1 -lt $TMBR2 ]; do
         if [ ${FITNESS[${TMBR1}]} -gt ${FITNESS[${TMBR2}]} ]; then
            swap_candidates $TMBR1 $TMBR2       # place the biggest value in the last slot
         fi
         let "TMBR1 += 1"
      done
      let "TMBR2 -= 1"
   done

   if [ ${FITNESS[0]} -lt $BEST_VAL ]; then
      BEST_VAL=${FITNESS[0]}
      BEST_MATCH="${CANDIDATES[0]}"
      display_prodigy
   fi
}

     #####################################################################################
     # mutate_candidate( index ) - randomly change one character in our string
     #####################################################################################
mutate_candidate()
{
   TMBRX=$1
   TCNT=1
   while [ $TCNT -le $NUM_MUTATIONS ]; do
      PARENT1="${CANDIDATES[${TMBRX}]}"
      let "TNDX = RANDOM % LENGTH"   # pick a random character to mutate
      let "TLEN = LENGTH - TNDX - 1"
      if [ $TNDX -eq 0 ]; then
         SUBSTR_L=""
         SUBSTR_R="${PARENT1:1:${TLEN}}"
      elif [ $TLEN -eq 0 ]; then
         let "TLEN = LENGTH - 1"
         SUBSTR_L="${PARENT1:0:${TLEN}}"
         SUBSTR_R=""
      else
         SUBSTR_L="${PARENT1:0:${TNDX}}"
         let "TNDX += 1"
         SUBSTR_R="${PARENT1:${TNDX}:${TLEN}}"
      fi

      # now we mutate that spot
      let "TNDX = RANDOM % NUM_CHROMOSOMES"
      int2char $TNDX
      CANDIDATES[${TMBRX}]="${SUBSTR_L}${TCHAR}${SUBSTR_R}"
      let "TCNT += 1"
   done
   # echo ">>>>> Mutating child...     > $PARENT1 < ==> > ${SUBSTR_L}${TCHAR}${SUBSTR_R} <"
}

     #####################################################################################
     # create_offspring( index1, index2 ) - creates children for two parents
     #   "ABCDEFGH"    + "MNOPQRST"   ==> "????????" + "????????"
     #   "ABC","DEFGH" + "MNO","PQRST" - pick a random spot to do our split, eg. 3
     #   "ABCPQRST" , "MNODEFGH" - swap the pieces for our children
     #####################################################################################
create_offspring()
{
   TMBR1=$1
   TMBR2=$2
   PARENT1="${CANDIDATES[${TMBR1}]}"
   PARENT2="${CANDIDATES[${TMBR2}]}"

   let "TNDX = RANDOM % (LENGTH - 2) + 1"
   let "TLEN = LENGTH - TNDX"
   SUBSTR_X1="${PARENT1:0:${TNDX}}"
   SUBSTR_X2="${PARENT1:${TNDX}:${TLEN}}"
   SUBSTR_Y1="${PARENT2:0:${TNDX}}"
   SUBSTR_Y2="${PARENT2:${TNDX}:${TLEN}}"
   CHILD1="${SUBSTR_X1}${SUBSTR_Y2}"
   CHILD2="${SUBSTR_Y1}${SUBSTR_X2}"

   # the children will replace the worst candidates
   CANDIDATES[${TMBR1}]="$CHILD1"
   #   give each child a chance for a random mutation
   let "TEMP = RANDOM % 100"
   if [ $TEMP -le $MUTATION_RATE ]; then
      mutate_candidate $TMBR1
   fi

   CANDIDATES[${TMBR2}]="$CHILD2"
   let "TEMP = RANDOM % 100"
   if [ $TEMP -le $MUTATION_RATE ]; then
      mutate_candidate $TMBR2
   fi

#   echo ">>>>> $TMBR1 + $TMBR2   ===>  > ${CANDIDATES[${TMBR1}]} < + > ${CANDIDATES[${TMBR2}]} <"
}

     #####################################################################################
     # initialize() - does some set-up stuff to prepare for our testing
     #####################################################################################
initialize()
{
   printf "Target String:   >>> $TARGET_STRING <<< \n"
   TNDX=0
   while [ $TNDX -lt $LENGTH ]; do
      TCHAR=${TARGET_STRING:$TNDX:1}
      char2int "$TCHAR"
      TARGET_ARRAY[$TNDX]=$TVAL
      # printf " %3s" $TVAL
      let "TNDX += 1"
   done
   printf "\n"

   TMBR=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      create_new_candidate
      CANDIDATES[${TMBR}]="$TSTRING"
      determine_fitness $TMBR
      let "TMBR += 1"
   done
}

##########################################################################################
# now we can get started...                                                              #
##########################################################################################
initialize

SUCCESS=0                                         # loop until we hit our target string...
while [ $SUCCESS -eq 0 ]; do
   # go through each of our candidates and give them a fitness score
   TMBR=0
   POPULATION_SCORE=0
   while [ $TMBR -lt $POOL_SIZE ]; do
      determine_fitness $TMBR
      let "POPULATION_SCORE += FITNESS[${TMBR}]"
      let "TMBR += 1"
   done

   sort_candidate_pool       # place our candidates into fitness-order

   if [ $SECONDS -gt $NUM_SECONDS ]; then
      display_prodigy
      generate_report
      echo ""
      exit
   fi

   # if any of our candidates is the target string we are done, report success
   if [ ${FITNESS[0]} -eq 0 ]; then
      SUCCESS=1
   else
      let "LAST_PARENT = POOL_SIZE - 1"  # all candidates can be parents
      TMBR=0
      while [ $TMBR -lt $LAST_PARENT ]; do
         # go through each of our candidates and see if they will be parents
         let "TEMP = RANDOM % 100"
         if [ $TEMP -le $REPRO_CHANCE ]; then     # if we are looking for love...
            let "TMBR2 = RANDOM % BEST_FITS"      # randomly pick one of the available parents
            if [ $TMBR -eq $TMBR2 ]; then         # we can't have a kid by ourselves
               if [ $TMBR -eq 0 ]; then           #    so use the best available candidate
                  TMBR2=1
               else
                  TMBR2=0
               fi
            fi
            create_offspring $TMBR $TMBR2         # now let's have those kids!
         fi
         let "TMBR += 1"
      done

      let "TEMP = RANDOM % 100"
      if [ $TEMP -le $RETURN_CHANCE ]; then
         # echo ">>>>> Bring back the prodigal son!"
         CANDIDATES[${LAST_PARENT}]="$BEST_MATCH" 
      else
         create_new_candidate
         CANDIDATES[${LAST_PARENT}]="$TSTRING"
         FITNESS[${LAST_PARENT}]=$BEST_VAL
      fi

      while [ $LAST_PARENT -gt $CHURN_RATE ]; do
         let "LAST_PARENT -= 1"
         create_new_candidate
         CANDIDATES[${LAST_PARENT}]="$TSTRING"
      done

      let "GENERATION += 1"          # increase the generation number
   fi
done

display_prodigy
printf "\n\n\nMatch found in Generation $GENERATION \n\n\n"




That ended up fairly well... it appears to be exactly what I wanted, and it's fairly flexible.  I dropped a few ideas early on, but I found the ability to carry parents over from one generation to the next too beneficial to drop, and the ability to bring back the best candidate (the returning hero) was also too beneficial to lose.  I will give some of the results in another Blog Post; in the meantime feel free to play with this.

Enjoy,
David Mitchell

Saturday, April 13, 2013

And so it Begins...

For a while I have been toying with the idea of writing a Blog; a place to gather my thoughts, breathe life into my ideas, and investigate the world around me.  In this Blog I intend to write about technical stuff such as the internals of the Solaris Operating System, cutting edge AI, performance tuning and details about how I write computer programs I'm particularly proud of.

I also intend to write about fundamental truths I discover about the world around us... like how fear and excitement are the same thing, or why it's so difficult for some people to admit that they are wrong.  Things I've gleaned from Psychology or Physics or even my Wiccan faith.

I might also feel the need to share some things I just find funny or cute, probably a picture or two of my children since they bring me so much happiness.  I guess I'm not really sure where this journey will end up, but I feel the need to take it.  I think that it will help me to grow as a programmer, a writer and perhaps a person.  Let's find out.