Genetic Programming Theory and Practice X Papers

This post describes the 10th Workshop on Genetic Programming Theory and Practice (GPTP X), held May 12-14, 2012, at the Center for the Study of Complex Systems at the University of Michigan in Ann Arbor.

The goal of this workshop series is to facilitate the exchange of research results and ideas between those who focus on the theory of genetic programming (GP) and those who focus on applying GP to a variety of real-world problems. Additional information about the workshop, addenda to the chapters, and a site for ongoing discussion by participants and others can be found at http://cscs.umich.edu/gptp-workshops/.

Genetic programming was proposed by John Koza around 1990. While the other major branches of evolutionary computation (genetic algorithms, evolution strategies, and evolutionary programming) arose independently of one another, genetic programming was conceived from the start as an extension of genetic algorithms, which places it in a rather different position from the other three methods. Specifically, whereas genotypes in genetic algorithms are typically represented as fixed-length strings, genetic programming uses tree structures. It can therefore represent structured data such as mathematical expressions and program code, which string representations handle poorly. Although this is the only major difference, GP's search behaviour differs from that of genetic algorithms, and it exhibits its own characteristic phenomena and problems, such as code bloat. Improvements to GP are actively studied, and this research now proceeds largely independently of genetic algorithm research. A growing number of books deal exclusively with genetic programming. Because Koza's original system was written in Lisp, it has become something of a convention to express solutions as Lisp S-expressions, even though GP is now implemented in virtually every programming language.
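
As a minimal illustration of this representation difference (a sketch in Python, not tied to any particular GP system), a candidate program such as x*(x+1) can be held as a nested structure mirroring the S-expression (* x (+ x 1)) and evaluated recursively:

```python
# Minimal sketch: a GP genotype as a nested tuple mirroring the
# S-expression (* x (+ x 1)), plus a recursive evaluator.
import operator

PRIMITIVES = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, env):
    """Recursively evaluate an expression tree against variable bindings."""
    if isinstance(node, tuple):                 # internal node: (op, child, child)
        op, left, right = node
        return PRIMITIVES[op](evaluate(left, env), evaluate(right, env))
    if isinstance(node, str):                   # terminal: a variable name
        return env[node]
    return node                                 # terminal: a numeric constant

tree = ("*", "x", ("+", "x", 1))                # i.e. (* x (+ x 1))
print(evaluate(tree, {"x": 3}))                 # -> 12
```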

Further details are provided below.

Large databases are becoming ever more ubiquitous, as are the opportunities for discovering useful knowledge within them. Evolutionary computation methods such as genetic programming have previously been applied to several aspects of the problem of discovering knowledge in databases. The more specific task of producing human-comprehensible SQL queries has several potential applications but has thus far been explored only to a limited extent. In this chapter we show how developmental genetic programming can automatically generate SQL queries from sets of positive and negative examples. We show that a developmental genetic programming system can produce queries that are reasonably accurate while excelling in human comprehensibility relative to the well-known C5.0 decision tree generation system.
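
The chapter details the developmental GP system itself; purely to illustrate the example-driven fitness idea, a candidate query can be scored by how well its result set matches the positive examples and avoids the negative ones. A minimal sketch using sqlite3 (the table, columns, and query are invented for illustration):

```python
# Sketch: score a candidate SQL query against positive/negative example sets.
# The table, columns, and candidate query are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (id INTEGER, age INTEGER, marker REAL)")
con.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                [(1, 34, 0.2), (2, 61, 1.4), (3, 58, 1.1), (4, 29, 0.3)])

positives, negatives = {2, 3}, {1, 4}           # ids the query should / should not return

def query_fitness(sql):
    """Fraction of examples classified correctly by the candidate query."""
    returned = {row[0] for row in con.execute(sql)}
    correct = len(returned & positives) + len(negatives - returned)
    return correct / (len(positives) + len(negatives))

candidate = "SELECT id FROM patients WHERE age > 50"
print(query_fitness(candidate))                  # -> 1.0 for this toy example
```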

There is growing interest in on-line evolution for autonomous robots. On-line learning is critical to achieve high levels of autonomy in the face of dynamic environments, tasks, and other variable elements encountered in real world environments. Although a number of successes have been achieved with on-line evolution, these successes are largely limited to fairly simple learning paradigms, e.g. training small neural networks of relatively few weights and in simulated environments. The shortage of more complex learning paradigms is largely due to the limitations of affordable robotic platforms, which tend to be woefully underpowered for such applications.

In this paper we introduce a simple robotics platform based on Commodity Off The Shelf (COTS) design principles that makes on-line genetic programming for robotics practical and affordable. We compare the relative strengths and weaknesses of a number of different build options. As a proof of concept, we compare three variations of evolutionary learning models for a color-following problem on a robot based on one of the designs: a simple neural network learning framework of the type typically seen in current research, a more extensive learning model that could not be supported by traditional low-cost research robots, and a simple evolutionary algorithm using a standard tree-based genetic programming representation, which is also beyond the scope of traditional low-cost research robots. Our results show that the more powerful evolutionary models enabled by more powerful robots significantly improve on-line evolutionary performance, and thus that there are practical benefits to the COTS-based approach.

Combining domain knowledge from both image processing and machine learning can expand the abilities of Genetic Programming when used for image processing. We successfully demonstrate our new approach on several different problem domains. We show that the approach is fast, scalable, and robust. In addition, by virtue of using off-the-shelf image processing libraries, we can generate human-readable programs that incorporate sophisticated domain knowledge.
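
As a loose sketch of the idea (not the authors' system), an individual can be a sequence of named calls into an image-processing library, which keeps evolved programs human-readable; here the "library" is two toy NumPy operations:

```python
# Sketch: an evolved image pipeline as a readable list of named library ops.
# The operations and pipeline are toy stand-ins for real library calls.
import numpy as np

def box_blur(img):
    """3x3 mean filter via shifted sums (edges handled by padding)."""
    padded = np.pad(img, 1, mode="edge")
    return sum(padded[y:y + img.shape[0], x:x + img.shape[1]]
               for y in range(3) for x in range(3)) / 9.0

def threshold(img, t=0.5):
    """Binarize: 1.0 where brighter than t, else 0.0."""
    return (img > t).astype(float)

# An evolved individual is just an ordered list of (name, function) steps,
# so the resulting program reads like ordinary image-processing code.
pipeline = [("box_blur", box_blur), ("threshold", threshold)]

image = np.random.rand(8, 8)
for name, op in pipeline:
    image = op(image)
print("applied:", " -> ".join(name for name, _ in pipeline))
```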

Lévy flights are a class of random walks directly inspired by observing animal foraging habits, where a power-law distribution of the stride length can often be observed. This implies that, while the vast majority of the strides will be short, on rare occasions the strides are gigantic. We propose a mutation mechanism in Linear Genetic Programming inspired by this ethological behavior, thus obtaining a self-adaptive mutation rate. We experimentally test this approach on three different classes of problems: Boolean regression, quadratic polynomial regression, and surface reconstruction. We find that in all cases our method outperforms the generic, commonly used constant mutation rate of one over the size of the genotype. Moreover, we compare several common values of the power-law exponent, as well as another self-adaptive mutation mechanism directly inspired by Simulated Annealing. We conclude that our novel method is a viable alternative to constant and self-adaptive mutation rates, especially because it tends to reduce the number of parameters of genetic programming.
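
The power-law stride can be realized in a few lines; the sketch below applies a Lévy-distributed number of point mutations to a linear genome (the exponent and encoding are illustrative choices, not the chapter's exact setup):

```python
# Sketch: Lévy-flight-style mutation for a linear (register-machine) genome.
# Stride lengths follow a discrete power law: mostly small, rarely huge.
import random

def levy_stride(alpha=1.5, max_stride=100):
    """Sample a stride k >= 1 with P(k) proportional to k**(-alpha)."""
    weights = [k ** -alpha for k in range(1, max_stride + 1)]
    return random.choices(range(1, max_stride + 1), weights=weights)[0]

def mutate(genome, n_symbols=16):
    """Apply a power-law-distributed number of random point mutations."""
    genome = list(genome)
    for _ in range(min(levy_stride(), len(genome))):
        genome[random.randrange(len(genome))] = random.randrange(n_symbols)
    return genome

parent = [random.randrange(16) for _ in range(50)]
child = mutate(parent)
print(sum(p != c for p, c in zip(parent, child)), "positions changed")
```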

We present a method for estimating fitness functions that are computationally expensive to evaluate exactly. The proposed estimation method applies a number of partial evaluations based on incomplete information or uncertainties. We show how this method can yield results close to those of similar methods where fitness is measured over the entire dataset, but at a fraction of the computational time and memory usage, and in a parallelizable manner. We describe our experience in applying this method to a real-world application: evolving equity trading strategies.
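
The gist of partial evaluation can be sketched as scoring candidates on small random subsets of the fitness cases rather than the full dataset (the error measure, batch size, and toy model below are illustrative choices):

```python
# Sketch: estimate an expensive fitness by averaging partial evaluations
# on random subsets of the fitness cases instead of the full dataset.
import random

cases = [(x, 2 * x + 1) for x in range(10_000)]      # toy fitness cases

def candidate(x):                                     # toy evolved model
    return 1.9 * x + 1.2

def partial_fitness(model, n_samples=3, batch=64):
    """Mean absolute error over a few small random batches of cases."""
    total, count = 0.0, 0
    for _ in range(n_samples):
        for x, y in random.sample(cases, batch):
            total += abs(model(x) - y)
            count += 1
    return total / count

print(partial_fitness(candidate))   # close to full-data MAE at a fraction of the cost
```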

We describe a new Genetic Programming system named EC-Star. It is supported by an open-infrastructure, commercial volunteer-client parallelization framework. The framework enables robust, massive-scale evolution and motivates the hub-and-spoke network topology of EC-Star's distributed GP model. In this model an Evolution Coordinator occupies the hub and an Evolutionary Engine occupies each spoke. The Evolution Coordinator uses a layered framework to dispatch high-performing, partially evaluated candidate solutions to its Evolutionary Engines for additional fitness-case exposure, genetic mixing, and evolution. It operates asynchronously with each Evolutionary Engine and never blocks waiting for results.
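
A rough sketch of this hub-and-spoke flow (data structures and policies are illustrative, not EC-Star's actual protocol): the coordinator pools partially evaluated candidates and hands promising ones back out, never blocking on any engine:

```python
# Sketch of the hub-and-spoke flow: a coordinator keeps partially evaluated
# candidates and hands them back out for more fitness-case exposure.
# Data structures and policies here are illustrative, not EC-Star's own.
import heapq, random

class Coordinator:
    def __init__(self):
        self.pool = []                      # (-score, cases_seen, genome)

    def report(self, genome, score, cases_seen):
        """Non-blocking: an engine reports a partially evaluated candidate."""
        heapq.heappush(self.pool, (-score, cases_seen, genome))

    def dispatch(self):
        """Hand a promising candidate back out for further evaluation."""
        if self.pool:
            neg_score, seen, genome = heapq.heappop(self.pool)
            return genome, seen
        return None, 0

hub = Coordinator()
for engine_id in range(5):                  # spokes report asynchronously
    hub.report(genome=[engine_id], score=random.random(), cases_seen=100)
print(hub.dispatch())                       # best-so-far goes out for more cases
```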

Given unlimited time, a human would model complex data in a manner dependent on prior expert knowledge. The goal of the present study is to extend and enhance a computational evolution system (CES) whose ultimate objective is to tinker with data as a human would. This is accomplished by providing flexibility in the model-building process and a meta-layer that learns how to generate better models. The key to the CES system is the ability to identify and exploit expert knowledge from biological databases or prior analytical results. Our prior results have demonstrated that CES is capable of efficiently navigating these large and rugged fitness landscapes toward the discovery of biologically meaningful genetic models of disease. Further, we have shown that the efficacy of CES improves dramatically when the system is provided with statistical or biological expert knowledge. Here we apply CES to the genetic analysis of prostate cancer aggressiveness in a large sample of European Americans. We introduce the use of Pareto-optimization to help address overfitting in the learning system. We further introduce a post-processing step that uses hierarchical cluster analysis to generate expert knowledge from the landscape of best models and their predictions across patients. We find that the combination of Pareto-optimization and post-processing of results greatly improves the genetic analysis of prostate cancer.
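
The post-processing idea can be pictured generically: cluster the best models by their per-patient prediction vectors, so that behaviourally similar models group together. A minimal sketch with simulated predictions and SciPy's standard hierarchical clustering (not the chapter's actual pipeline):

```python
# Sketch: hierarchical clustering of models by their prediction vectors,
# illustrating the post-processing idea. Predictions are simulated.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_models, n_patients = 12, 200
predictions = rng.integers(0, 2, size=(n_models, n_patients))  # 0/1 calls

# Ward linkage over per-patient prediction vectors; cut into 3 clusters.
tree = linkage(predictions, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")
print("model cluster labels:", labels)
# Cluster memberships can then feed back into the system as expert knowledge.
```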

The search for the underlying heritability of complex traits has led to an explosion of data generation and analysis in the field of human genomics. With these technological advances, we have made some progress in identifying genes and proteins associated with common, complex human diseases. Still, our understanding of the genetic architecture of complex traits remains limited, and additional research is needed to illuminate the genetic and environmental factors important to the disease process, much of which will involve looking at variation in DNA, RNA, protein, etc., in a meta-dimensional analysis framework. We have developed a machine learning technique, ATHENA: Analysis Tool for Heritable and Environmental Network Associations, to address the issue of integrating data from multiple "-omics" technologies to identify models that explain or predict the genetic architecture of complex traits. In this chapter, we discuss the challenges of handling meta-dimensional data using grammatical evolution neural networks (GENN), one modeling component of ATHENA, and characterize the models identified in simulation studies that explore the ability of GENN to build complex, meta-dimensional models. Challenges remain in further understanding the evolutionary process of GENN and in explaining the simplicity of the resulting models. This work highlights potential areas for extension and improvement of the GENN approach within ATHENA.
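
The grammatical-evolution machinery underlying GENN can be illustrated generically: integer "codons" select productions from a grammar to derive a model. A minimal sketch with a toy grammar (not ATHENA's actual grammar):

```python
# Sketch: generic grammatical-evolution mapping, the mechanism underlying
# GENN. Integer codons pick productions; the toy grammar is illustrative.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["snp1"], ["snp2"], ["expr_level1"]],   # mixed "-omics" inputs
}

def ge_map(genome, symbol="<expr>", max_wraps=2):
    """Expand the start symbol left-to-right, consuming codons mod #choices."""
    out, stack, i = [], [symbol], 0
    while stack and i < len(genome) * max_wraps:
        sym = stack.pop(0)
        if sym in GRAMMAR:
            choices = GRAMMAR[sym]
            stack = list(choices[genome[i % len(genome)] % len(choices)]) + stack
            i += 1
        else:
            out.append(sym)
    return " ".join(out)

print(ge_map([0, 1, 2, 0, 1, 0]))   # -> "expr_level1 + snp1"
```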

Recent advances in symbolic regression (SR) have promoted the field into the early stages of commercial exploitation. This is the expected maturation history for an academic field that is progressing rapidly. The original published symbolic regression algorithms in (Koza 1992) have long since been superseded by techniques such as Pareto front optimization, age-layered population structures, and even age-Pareto front optimization. The lack of specific techniques for optimizing embedded real numbers in the original algorithms has been remedied by sophisticated techniques for optimizing embedded constants. Symbolic regression is coming of age as a technology.

As the discipline of Symbolic Regression (SR) has matured, the first commercial SR packages have appeared. At least one commercial package has been on the market for several years (http://www.rmltech.com/). There is now at least one well-documented commercial symbolic regression package available for Mathematica (www.evolved-analytics.com). There is also at least one very well done open-source symbolic regression package available for free download (http://ccsl.mae.cornell.edu/eureqa). Yet, even as the sophistication of commercial SR packages increases, there have been glaring issues with SR accuracy even on simple problems (Korns 2011). The depth and breadth of SR adoption in industry and academia will be greatly affected by the demonstrable accuracy of available SR algorithms and tools.

In this chapter we develop a complete public-domain algorithm for modern symbolic regression that is reasonably competitive with current commercial SR packages, and we calibrate its accuracy on a set of previously published sample problems. This algorithm is designed as a baseline for further public-domain research on SR algorithm simplicity and accuracy. No claim is made placing this baseline algorithm on a par with commercial packages, especially as the commercial offerings can be expected to improve relentlessly in the future. However, this baseline is a great improvement over the original published algorithms and is an attempt to consolidate the latest published research into a simplified baseline algorithm of similar speed and accuracy.

The baseline algorithm presented herein is called Age-Weighted Pareto Optimization. It is an amalgamation of recently published techniques in Pareto front optimization (Kotanchek et al. 2007), age-layered population structures (Hornby 2006), age-fitness Pareto optimization (Schmidt and Lipson 2010), and specialized embedded abstract constant optimization (Korns 2010). The complete pseudo-code for the baseline algorithm is presented in this chapter. It is developed step by step as a series of enhancements to the original published SR algorithm (Koza 1992), with justifications for each enhancement. Before-and-after speed and accuracy comparisons are made for each enhancement on a series of previously published sample problems.
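
The age-fitness Pareto idea at the core of this amalgamation can be sketched as survivor selection that keeps every candidate no other candidate beats on both age and error simultaneously (a toy sketch with random values, not the chapter's full algorithm):

```python
# Sketch: age-fitness Pareto survivor selection, the core of such baselines.
# An individual survives if no other is both younger and more accurate.
import random

population = [{"age": random.randint(1, 20), "error": random.random()}
              for _ in range(30)]

def dominated(a, b):
    """True if b is at least as good as a on both axes and better on one."""
    return (b["age"] <= a["age"] and b["error"] <= a["error"]
            and (b["age"] < a["age"] or b["error"] < a["error"]))

survivors = [a for a in population
             if not any(dominated(a, b) for b in population)]
print(len(survivors), "non-dominated of", len(population))
```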

Genetic programming (GP) is one of the best approaches today to discover symbolic regression models. To find models that trade off accuracy and complexity, the non-dominated sorting genetic algorithm II (NSGA-II) is widely used. Unfortunately, it has been shown that NSGA-II can be inefficient: in early generations, low-complexity models over-replicate and take over most of the population. Consequently, studies have proposed different approaches to promote diversity. Here, we study the root of this problem, in order to design a superior approach. We find that the over-replication of low-complexity models is due to a lack of evolvability, i.e., the inability to produce offspring with improved accuracy. We therefore extend NSGA-II to track, over time, the evolvability of models of different levels of complexity. With this information, we limit how many models of each complexity level are allowed to survive the generation. We compare this new version of NSGA-II, evoNSGA-II, with the use of seven existing multi-objective GP approaches on ten widely-used data sets, and find that evoNSGA-II is equal or superior to using these approaches in almost all comparisons. Furthermore, our results confirm that evoNSGA-II behaves as intended: models that are more evolvable form the majority of the population.
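
As a loose sketch of the bookkeeping such an extension implies (names and caps invented for illustration, not evoNSGA-II's exact rules), one can track, per complexity level, how often offspring improve on their parents, and scale the number of allowed survivors accordingly:

```python
# Sketch of the evolvability bookkeeping described above: per complexity
# level, track how often offspring improved on their parents' accuracy,
# then cap survivors at unproductive levels. Numbers are illustrative.
from collections import defaultdict

improved = defaultdict(int)   # offspring that beat their parent's accuracy
produced = defaultdict(int)   # offspring produced, keyed by complexity level

def record(parent_complexity, parent_error, child_error):
    produced[parent_complexity] += 1
    if child_error < parent_error:
        improved[parent_complexity] += 1

def survivor_cap(complexity, base_cap=20):
    """Allow more survivors at complexity levels that keep improving."""
    if produced[complexity] == 0:
        return base_cap
    evolvability = improved[complexity] / produced[complexity]
    return max(1, int(base_cap * evolvability))

record(3, parent_error=0.40, child_error=0.35)   # a low-complexity success
record(3, parent_error=0.40, child_error=0.45)   # ...and a failure
print(survivor_cap(3))                            # -> 10 with these toy counts
```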

This chapter introduces a framework for statistically sound, reproducible empirical research in Genetic Programming (GP). It provides tools to understand GP algorithms and heuristics and their interaction with problems of varying difficulty. Following an approach where scientific claims are broken down to testable statistical hypotheses and GP runs are treated as experiments, the framework helps to achieve statistically verified results of high reproducibility.
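
In that spirit, a claim such as "configuration A outperforms configuration B on this problem" becomes a hypothesis tested over repeated, independent runs. A minimal sketch using a standard rank-based test (the run results below are simulated placeholders):

```python
# Sketch: treating GP runs as experiments. Final best-fitness values from
# repeated runs of two configurations are compared with a rank-based test.
import random
from scipy.stats import mannwhitneyu

random.seed(42)
runs_a = [random.gauss(0.80, 0.05) for _ in range(30)]   # config A, 30 runs
runs_b = [random.gauss(0.74, 0.05) for _ in range(30)]   # config B, 30 runs

# H0: A's final fitness is not stochastically greater than B's.
stat, p = mannwhitneyu(runs_a, runs_b, alternative="greater")
print(f"U = {stat:.1f}, p = {p:.4g}")
if p < 0.05:
    print("Reject H0: configuration A outperforms B on this problem.")
```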

We present two opposing approaches to the evolution of game strategies, one wherein a minimal amount of domain expertise is injected into the process, the other infusing the evolutionary setup with expertise in the form of domain heuristics. We show that the first approach works well for several popular board games, while the second produces top-notch solvers for the hard game of FreeCell.

From a real-world perspective, "good enough" has been achieved in the core representations and evolutionary strategies of genetic programming, assuming state-of-the-art algorithms and implementations are used. What is needed for industrial symbolic regression are tools to (a) explore and refine the data, (b) explore the developed model space and extract insight and guidance from the available sample of the infinite possibilities of model forms, and (c) identify appropriate models for deployment as predictors, emulators, etc. This chapter focuses on the approaches used in DataModeler to address the modeling life cycle. A special focus of this chapter is the identification of driving variables and metavariables. Exploiting the diversity of search paths followed during independent evolutions, and then looking at the distributions of variable and metavariable usage, also provides an opportunity to gather key insights. The goal in this framework, however, is not to replace the modeler but, rather, to augment the inclusion of context and the collection of insight by removing mechanistic requirements and facilitating the ability to think. We believe that the net result is higher-quality and more robust models.
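
The driving-variable idea can be illustrated generically: given the models that survive several independent evolutions, count how often each variable appears. A sketch with invented model strings (not actual DataModeler output):

```python
# Sketch: variable-presence analysis over an ensemble of evolved models.
# The model expressions are invented stand-ins for DataModeler output.
from collections import Counter
import re

models = [
    "x1 * x3 + 2.1",
    "(x1 + x3) / x7",
    "x3 ** 2 - 0.4 * x1",
    "x2 + x3",
]

usage = Counter()
for m in models:
    usage.update(set(re.findall(r"x\d+", m)))   # presence, not repeat count

for var, count in usage.most_common():
    print(f"{var}: appears in {count}/{len(models)} models")
# Variables that recur across independent runs are candidate drivers.
```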

Running genetic programming on the cloud presents researchers with great opportunities and challenges. We argue that standard island algorithms do not have the properties of elasticity and robustness required to run well on the cloud. We present a prototype design for a decentralized, heterogeneous, robust, self-scaling, self-factoring, self-aggregating genetic programming algorithm, and we investigate its properties using a software "sandbox".

The overall goal of evolving algorithms for femtocells is continuous on-line evolution of the femtocell pilot power control algorithm to optimize coverage. Two aspects of intelligence, communication and learning, are used to increase the complexity of the input and the behaviour. In this initial study we investigate how to evolve more complex behaviour in decentralized control algorithms by changing the representation of communication and learning. Communication is addressed by allowing a femtocell to identify its neighbours and take their values into account when deciding whether to increase or decrease pilot power. Learning is considered in two variants: the use of input parameters and the implementation of a built-in reinforcement procedure. The reinforcement allows learning during the simulation in addition to the execution of fixed commands. The experiments compare the new representations in the form of different terminal symbols in a grammar. The results show that there are differences between the communication and learning combinations, and that the best solution uses both communication and learning.
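
As a rough illustration of the communication aspect only (the function name, thresholds, and step size are invented; the chapter evolves such rules from grammar terminals rather than hand-writing them), a femtocell might adjust its pilot power based on its neighbours' values:

```python
# Sketch: the communication idea, a femtocell adjusting pilot power using
# its neighbours' values. Names, thresholds, and step size are invented.
def adjust_pilot_power(own_power, neighbour_powers, step=0.5,
                       min_power=-10.0, max_power=10.0):
    """Decrease power when above the neighbourhood mean, else increase."""
    mean_neighbour = sum(neighbour_powers) / len(neighbour_powers)
    if own_power > mean_neighbour:
        own_power -= step          # shrink coverage, reduce overlap
    else:
        own_power += step          # grow coverage, fill gaps
    return max(min_power, min(max_power, own_power))

print(adjust_pilot_power(3.0, [1.0, 2.0, 2.5]))   # -> 2.5
```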
