One of the tricks to System-on-Chip (SoC) design is squeezing everything needed into a relatively small memory space. Though SRAM and flash sizes are improving rapidly, they are finite on SoCs. Clyde’s company, HI-TECH Software, has been tackling this exact problem on the Cypress PSoC for some time. He explores the thinking behind “omniscient” compilers and shows some compelling results from their use.
As applications for the popular Cypress PSoC grow, designers can stretch the mixed-signal array to the limits of its SRAM and flash densities. Until now, the only solutions when bumping up against SRAM and flash memory limitations have been to:
- Limit the functionality of the end product
- Migrate the application to a larger PSoC device with more SRAM and flash memory
- Handcraft assembly language code to reduce the program, stack, and variable sizes, an exceptionally cumbersome and time-consuming task that restricts program code portability
None of these alternatives is very attractive.
New compiler technology can effectively double the flash on an existing PSoC by achieving nearly twice the code density of existing compilers. Optimizing pointers, registers, and stacks frees up 10-15 percent of SRAM resources. With advanced compiler technology, designers can keep growing that application in the PSoC device already in use.
Pitfalls of existing compilation technology
Conventional compilation technology shadows the modular software design process. Programs typically are broken up into modules, partly to accommodate increased program complexity and partly to distribute programming tasks among teams of engineers to speed up the process. Compilers track this process, individually compiling each module into an independent sequence of low-level machine instructions. Once all the modules are compiled, a linker links the modules together along with any code being used from precompiled libraries, as shown in Figure 1.
Figure 1: Independent compilation
The compiler never has complete information about the program being compiled. Although many compiler vendors claim "global optimization," the optimization is done only within single modules. There is no optimization across all program modules, which leads to suboptimal stack, register, and memory allocation. Multiple instances of the same routine may be unnecessarily repeated in different program modules. Reentrant code may exist in interrupt routines and the main line program. Declarations of the same variable or object may not be consistent between modules. The result is code that takes up more of the PSoC’s flash memory than is necessary and pointers and stacks that use too much SRAM.
Almost invariably, it is necessary to make use of nonstandard compiler features to handcraft the program to the target device architecture, compromising code portability.
Until recently, it has not occurred to compiler vendors that, aside from current software development methodologies, there is no reason to compile each program module independently. In fact, compiling all the modules together with a view of the whole program yields better results.
"Omniscient" compiler technology to the rescue
A new compiler technology called Omniscient Code Generation (OCG) analyzes each variable, function, stack, register, and pointer in every program module before it generates the object code, as shown in Figure 2.
Figure 2: Compilation using OCG
Rather than relying completely on the linker to uncover errors between the independently compiled modules, an OCG compiler defers object code generation until a view of the whole program is available. It does this by completing the initial compilation stages for each module separately. Then, instead of compiling to machine code instructions in an object file, it compiles each module to an intermediate code file that represents a more abstract view of each module. Before producing any actual machine instructions or allocating any registers, it applies optimization algorithms across all program modules based on this global view of the program.
This approach resolves most of the problems associated with conventional compilation. By combining program modules into one large program, the compiler can identify and consolidate reentrant code, identify inconsistent variable declarations between modules, eliminate redundant code that may exist in more than one module, and optimize all the stacks, registers, pointers, and memories to exploit the PSoC architecture’s advantages.
Call graph: A clearer view
The intermediate code files generated by the OCG compiler are loaded into a call graph structure. Any library functions referenced by the program are also located and extracted.
Each calling convention contains a set of rules that defines which CPU registers must be preserved across calls, and all functions in every program module must adhere to the same convention. Unfortunately, when modules are compiled independently, the compiler cannot know which registers a called function in another module will actually use. To avoid the potentially catastrophic consequences of losing the contents of a needed register, compilers frequently save more registers than are really necessary, wasting scarce PSoC SRAM resources.
In the case of the PSoC mixed-signal array, the same SRAM space is used for both software function stacks and data variables. If the programmer or compiler does not allocate sufficient SRAM space to accommodate the dynamic stack's maximum depth, the stack can overflow into data variable space, potentially causing the program to crash. This is a particular problem for reentrant or recursive functions, which must either have dynamic stack space to store local variables or be managed to prevent them from overwriting existing data.
The OCG code generator uses the call graph (pictured in Figure 3) to identify any functions that are called recursively or reentrantly, such as those called from both main line code and interrupt functions. For these functions, the code generator allocates dynamic stack space for local variables to ensure that a reentrant call does not overwrite existing data. It also searches the call graph for any functions never called by the program and removes them.
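The call graph classification described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not HI-TECH's implementation: the `CallGraph` structure, the `classify` function, and the fixed table size are hypothetical, and the sketch covers only the main-line/interrupt case, not recursion detection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 16

/* Hypothetical call graph: calls[i][j] is true when function i calls function j. */
typedef struct {
    size_t count;
    bool calls[MAX_FUNCS][MAX_FUNCS];
} CallGraph;

typedef enum { FN_UNUSED, FN_SINGLE_CONTEXT, FN_REENTRANT } FnClass;

/* Depth-first walk that marks every function reachable from `root`. */
static void mark_reachable(const CallGraph *g, size_t root, bool seen[MAX_FUNCS])
{
    assert(root < g->count);
    if (seen[root]) return;
    seen[root] = true;
    for (size_t j = 0; j < g->count; j++)
        if (g->calls[root][j])
            mark_reachable(g, j, seen);
}

/* Classify `fn` by which entry points (main line, interrupt) can reach it. */
FnClass classify(const CallGraph *g, size_t main_fn, size_t isr_fn, size_t fn)
{
    bool from_main[MAX_FUNCS] = { false };
    bool from_isr[MAX_FUNCS] = { false };
    mark_reachable(g, main_fn, from_main);
    mark_reachable(g, isr_fn, from_isr);
    if (from_main[fn] && from_isr[fn]) return FN_REENTRANT;      /* needs dynamic stack */
    if (from_main[fn] || from_isr[fn]) return FN_SINGLE_CONTEXT; /* static stack is safe */
    return FN_UNUSED;                                            /* can be removed */
}
```

In this model, a helper called from both `main` and an interrupt handler is classified `FN_REENTRANT`, while a function that neither entry point reaches is classified `FN_UNUSED` and becomes a candidate for removal.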
Figure 3: Call graph
Most functions are nonreentrant and nonrecursive and may be implemented with a more predictable, static-compiled stack. The compiler’s OCG technology looks at all the program modules, identifies all nonreentrant and nonrecursive functions, and compiles an optimally sized function stack with sufficient memory to accommodate each function’s maximum depth. Because the call graph for all functions has already been determined, functions executed at different times can share the same SRAM space for their static-compiled stack. This feature reduces stack space to the absolute minimum required and frees up more SRAM for data. A compiled static stack also reduces the likelihood of stack overflows that occur when a dynamic stack expands into the data variables’ space in the SRAM.
At the end of this optimization, compiled stack space can be allocated before any machine code has been generated. OCG knows exactly how big the stack needs to be and where it is located before it generates any code.
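As a rough sketch of how a compiled stack can share SRAM between functions that are never active at the same time, the C code below places each function's frame just above the deepest frame of any of its callers. The `Program` structure, the frame sizes, and `layout_compiled_stack` are hypothetical names chosen for illustration, and the code assumes an acyclic call graph, that is, no recursive functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 16

typedef struct {
    size_t count;
    size_t frame_bytes[MAX_FUNCS];     /* locals plus parameters of each function */
    bool calls[MAX_FUNCS][MAX_FUNCS];  /* calls[i][j]: i calls j; must be acyclic */
} Program;

/* Raise `fn` to at least `offset`, then raise its callees above its frame. */
static void place(const Program *p, size_t fn, size_t offset,
                  size_t base[MAX_FUNCS], bool placed[MAX_FUNCS])
{
    if (placed[fn] && base[fn] >= offset) return;  /* already high enough */
    placed[fn] = true;
    base[fn] = offset;
    for (size_t j = 0; j < p->count; j++)
        if (p->calls[fn][j])
            place(p, j, offset + p->frame_bytes[fn], base, placed);
}

/* Fills base[] with frame offsets and returns the total compiled stack size. */
size_t layout_compiled_stack(const Program *p, size_t root, size_t base[MAX_FUNCS])
{
    bool placed[MAX_FUNCS] = { false };
    size_t total = 0;
    assert(root < p->count);
    for (size_t i = 0; i < MAX_FUNCS; i++) base[i] = 0;
    place(p, root, 0, base, placed);
    for (size_t i = 0; i < p->count; i++)
        if (placed[i] && base[i] + p->frame_bytes[i] > total)
            total = base[i] + p->frame_bytes[i];
    return total;
}
```

For example, if `main` (4 bytes) calls `a` (10 bytes) and `b` (6 bytes), and `a` calls `c` (2 bytes), then `a` and `b` both start at offset 4 because they are never live together. The whole stack needs 16 bytes instead of the 22 bytes required if every frame had its own storage.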
Pointer reference graph identifies inconsistent variable declarations
Determining the memory space for each pointer is one of the most important features of OCG. Allocating too much memory to pointers wastes scarce SRAM resources on the mixed-signal array.
When the stacks have been optimized, the compiler builds reference graphs for all objects and pointers in the program. The OCG code generator tracks each instance of a variable having its address taken and each assignment of a pointer value to a pointer (directly, via function return or function parameter passing, or indirectly via another pointer). It uses this information to build a data reference graph (the pointer reference graph) that identifies all objects that can possibly be referenced by each pointer. This graph determines exactly how much memory space each pointer must be able to access.
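The idea of a pointer reference graph can be illustrated with a small C model. The sketch below is an assumption-laden simplification, not the actual OCG algorithm: objects are identified by bits in a mask, `add_address_of` and `propagate` are hypothetical helpers, and the rule that a pointer needs one byte when all its targets share a 256-byte SRAM page and two bytes otherwise is an illustrative encoding choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PTRS 8

/* Hypothetical model: up to 32 objects, each identified by one bit. */
typedef struct {
    uint8_t  object_page[32];    /* SRAM page holding each object */
    uint32_t targets[MAX_PTRS];  /* objects each pointer may reference */
    size_t   ptr_count;
} PointerGraph;

typedef struct { size_t dst, src; } Copy;  /* models the assignment dst = src */

/* Record `ptr = &object`. */
void add_address_of(PointerGraph *g, size_t ptr, unsigned object)
{
    assert(ptr < g->ptr_count && object < 32);
    g->targets[ptr] |= (uint32_t)1 << object;
}

/* Propagate pointer-to-pointer copies until no target set grows. */
void propagate(PointerGraph *g, const Copy *copies, size_t n_copies)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n_copies; i++) {
            uint32_t merged = g->targets[copies[i].dst] | g->targets[copies[i].src];
            if (merged != g->targets[copies[i].dst]) {
                g->targets[copies[i].dst] = merged;
                changed = true;
            }
        }
    }
}

/* One byte suffices when every target sits in a single page, else two. */
unsigned pointer_bytes(const PointerGraph *g, size_t ptr)
{
    int page = -1;
    for (unsigned obj = 0; obj < 32; obj++) {
        if (!(g->targets[ptr] & ((uint32_t)1 << obj))) continue;
        if (page >= 0 && page != g->object_page[obj]) return 2;
        page = g->object_page[obj];
    }
    return 1;
}
```

A pointer that only ever refers to objects in one page stays at one byte. Once a copy from another pointer widens its target set across pages, the same query reports two bytes.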
Variables and other objects used in multiple modules must have consistent definitions across all modules for the program to function properly. However, with programming teams dispersed across various facilities (that also may be on different continents), the rule is extremely difficult to enforce.
Programmers can mitigate the potential for error by having the linker check for incompatible variable redeclarations between modules. In the best case, the programmer can correct the variable inconsistencies in the C code and recompile it – a cumbersome but effective approach. In the worst case, the linker doesn’t have enough information to detect the inconsistency, and the human error is compiled into the object code, adding to the debugging task.
Since the OCG code generator sees all program modules at once, it can immediately identify variable or object declarations that are inconsistent among the different modules. If inconsistencies are found, the code generator sends the programmer an error message that includes the names and locations of all inconsistently declared variables so they can be corrected before any object code is generated. The code generator also flags and removes variables that are never referenced and identifies functions whose return values are never used, so the code that prepares the return value can be eliminated.
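A cross-module consistency check of this kind might look like the following C sketch. It is a deliberately simple illustration: the `Decl` record and `find_conflict` are hypothetical, and types are compared as normalized strings rather than as a real compiler's type representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *module;  /* source file containing the declaration */
    const char *name;    /* identifier */
    const char *type;    /* normalized type spelling, e.g. "unsigned char" */
} Decl;

/* Returns the index of the first declaration that disagrees with an earlier
   declaration of the same name, or -1 when all declarations agree. */
int find_conflict(const Decl *decls, size_t n)
{
    assert(decls != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (strcmp(decls[i].name, decls[j].name) == 0 &&
                strcmp(decls[i].type, decls[j].type) != 0)
                return (int)i;
    return -1;
}
```

If `adc.c` declares `gain` as `unsigned char` and `ui.c` declares it as `int`, the check returns the index of the second declaration, and its `module` field gives the location to report.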
The OCG code generator uses the data on all the pointers and variables in the pointer reference graph (Figure 4) to determine the exact size of both the compiled and dynamic stacks and allocates memory to them. It also allocates the memory for global and static variables. Since the code generator has perfect information about pointer and variable usage, memory allocations are always optimized.
Figure 4: Pointer reference graph
Noncontiguous memory address optimization
Standard C assumes a single linear address space, while in reality many embedded processors including the PSoC mixed-signal array have complex, nonlinear memory spaces, often with different address widths.
For example, the PSoC mixed-signal array has a paged SRAM architecture in which only 256 bytes of SRAM are addressable at any one time. Accessing any other memory page requires the Page Select Register (PSR) to be reset. This is a cumbersome, cycle-intensive process that should be avoided if possible because each PSR reset takes 3 bytes of code and 12 cycles to execute. Thus, if data from a page already in use needs to be written to a different memory page, a great deal of additional program code and clock cycles will be added to the program. The memory space to which variables are assigned suddenly becomes a very important determinant of both execution speed and code size.
This is particularly true for PSoC interrupts because the device automatically selects Page0 for interrupt routines. If the interrupt routine requires access to a variable in any page other than Page0, the PSR must be saved, the memory access mode changed, and the PSR loaded with the other page address. Afterwards the PSR must be restored to its state prior to the interrupt – for a total of 12 bytes of extra program code and 50 clock cycles for every instance. In an extreme case, suboptimal variable allocation to the memory pages could easily double the program size and number of cycles required to execute the routine.
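To make the cost figures above concrete, the short C sketch below adds up the code and cycle overhead for a given number of page switches and off-Page0 accesses inside interrupt routines. The per-instance constants come from the text. The `Overhead` structure and `paging_overhead` function are hypothetical, and the cycle total assumes each instance executes exactly once.

```c
#include <assert.h>
#include <limits.h>

enum {
    PSR_SWITCH_BYTES  = 3,   /* code bytes per PSR reset */
    PSR_SWITCH_CYCLES = 12,  /* cycles per PSR reset */
    ISR_ACCESS_BYTES  = 12,  /* code bytes per off-Page0 access in an interrupt */
    ISR_ACCESS_CYCLES = 50   /* cycles per off-Page0 access in an interrupt */
};

typedef struct {
    unsigned bytes;
    unsigned cycles;
} Overhead;

/* Total paging overhead, counting each instance once. */
Overhead paging_overhead(unsigned page_switches, unsigned isr_offpage_accesses)
{
    Overhead o;
    /* Guard against unsigned wraparound in the products below. */
    assert(page_switches <= UINT_MAX / 100 && isr_offpage_accesses <= UINT_MAX / 100);
    o.bytes  = page_switches * PSR_SWITCH_BYTES
             + isr_offpage_accesses * ISR_ACCESS_BYTES;
    o.cycles = page_switches * PSR_SWITCH_CYCLES
             + isr_offpage_accesses * ISR_ACCESS_CYCLES;
    return o;
}
```

With 10 page switches in main line code and 4 off-Page0 accesses in interrupt routines, the overhead comes to 78 bytes of flash and 320 cycles, all of which disappears if the variables involved share a page.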
Conventional compilers are just as likely to assign variables used by interrupt routines to other pages as to Page0 because they do not have enough information to do otherwise. A programmer aware of this issue can manually craft the code to ensure that all variables required for interrupt routines are stored in Page0. However, handcrafted assembly code solutions compromise code portability.
Another means of solving this problem is using an OCG compiler that knows which variables will be used in interrupt routines and automatically assigns them to Page0.
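One simple way to picture this placement is a two-pass, first-fit allocator that fills pages from Page0 upward and handles interrupt variables first. The C sketch below is an assumption, not the allocator HI-TECH ships: the `Var` record and `assign_pages` are hypothetical, and the model ignores the stack and register space that a real Page0 must also hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 256u
#define MAX_PAGES  8

typedef struct {
    const char *name;
    unsigned    size;         /* bytes */
    bool        used_in_isr;  /* known only with a whole-program view */
    int         page;         /* output: assigned page, or -1 if it did not fit */
} Var;

/* Returns true when every variable was placed. */
bool assign_pages(Var *vars, size_t n, unsigned n_pages)
{
    unsigned free_bytes[MAX_PAGES];
    bool all_fit = true;
    assert(n_pages >= 1 && n_pages <= MAX_PAGES);
    for (unsigned p = 0; p < n_pages; p++) free_bytes[p] = PAGE_BYTES;

    /* Pass 0 places interrupt variables, pass 1 places everything else. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < n; i++) {
            if (vars[i].used_in_isr != (pass == 0)) continue;
            vars[i].page = -1;
            for (unsigned p = 0; p < n_pages; p++) {  /* first fit, Page0 first */
                if (vars[i].size <= free_bytes[p]) {
                    free_bytes[p] -= vars[i].size;
                    vars[i].page = (int)p;
                    break;
                }
            }
            if (vars[i].page < 0) all_fit = false;
        }
    }
    return all_fit;
}
```

Because interrupt variables are placed before anything else, a large main-line buffer declared first in the source cannot crowd them out of Page0.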
The OCG code generator also sizes each pointer and encodes it in an optimally efficient way for the PSoC architecture (depicted in Figure 5). Each pointer is automatically assigned an optimized set of address spaces without any intervention from the programmer.
Figure 5: SRAM utilization with conventional and OCG compilation
Bottom-up code generation
Once the pointers and variables are assigned memory space, machine code generation can begin. The OCG code generator begins at the bottom of the call graph, starting with those functions that do not call any other functions. Automatic in-lining of these functions may be performed if desired. In any case, the code can be generated without the constraints of rigid calling conventions. Code generation then proceeds up the call graph so that the code generator knows exactly which functions are called by the current function. The code generator knows which registers and other resources are available at each point. Calling conventions can be tailored to the register usage and argument type of a function instead of following a blind set pattern.
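The bottom-up ordering can be sketched as a post-order walk of the call graph that also accumulates which registers each function and its callees use. Everything named here is hypothetical (`Unit`, `bottom_up_order`, `must_save`, and the register bit masks), and the sketch assumes an acyclic call graph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FUNCS 16

typedef struct {
    size_t  count;
    uint8_t own_regs[MAX_FUNCS];          /* bit mask of registers each body uses */
    bool    calls[MAX_FUNCS][MAX_FUNCS];  /* acyclic: calls[i][j] means i calls j */
} Unit;

/* Visit callees first so each function sees its callees' final masks. */
static uint8_t clobbers(const Unit *u, size_t fn, uint8_t memo[MAX_FUNCS],
                        bool done[MAX_FUNCS], size_t order[MAX_FUNCS], size_t *n)
{
    if (done[fn]) return memo[fn];
    uint8_t mask = u->own_regs[fn];
    for (size_t j = 0; j < u->count; j++)
        if (u->calls[fn][j])
            mask |= clobbers(u, j, memo, done, order, n);
    memo[fn] = mask;
    done[fn] = true;
    order[(*n)++] = fn;  /* post-order: leaf functions come out first */
    return mask;
}

/* Fills order[] with the generation order and clobbered[] with register masks.
   Returns the number of functions reachable from `root`. */
size_t bottom_up_order(const Unit *u, size_t root,
                       size_t order[MAX_FUNCS], uint8_t clobbered[MAX_FUNCS])
{
    bool done[MAX_FUNCS] = { false };
    size_t n = 0;
    assert(root < u->count);
    for (size_t i = 0; i < MAX_FUNCS; i++) clobbered[i] = 0;
    clobbers(u, root, clobbered, done, order, &n);
    return n;
}

/* A caller only has to save registers that are both live and clobbered. */
uint8_t must_save(uint8_t live_in_caller, uint8_t clobbered_by_callee)
{
    return (uint8_t)(live_in_caller & clobbered_by_callee);
}
```

Because leaf functions are processed first, the register mask of every callee is final by the time its caller is generated. The caller can then save only the registers it actually needs to protect rather than following a fixed convention.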
Customized library functions
Most microcontrollers come with predefined C code libraries of routinely performed functions. Two cases in point are the workhorse sprintf() and printf() functions used for formatting text strings or output. These functions have options for outputting text strings in many different formats and are particularly useful. However, when implemented in their entirety, they can occupy a code footprint of 5 KB or more.
Most programs only need a fraction of the available options, so the code can be reduced accordingly. The OCG code generator can analyze all the program’s format strings supplied to these functions, determine the subset of format specifiers and modifiers required for the program, and create a customized version. The savings on code size can be immense. For example, the code for a minimal version of sprintf() that implements simple string copying can be as little as 20 or 30 bytes, whereas a version providing real number formats with specific numbers of digits could occupy 5,000 bytes or more. No programmer input is required to benefit from this C library code customization and optimization.
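The format string analysis can be approximated with a small scanner that collects the distinct conversion characters a program uses. This C sketch is illustrative only: `collect_conversions` is a hypothetical name, and a real code generator would presumably also track flags, precision, and length modifiers, which this version skips over.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Collects the distinct conversion characters (d, s, f, ...) found in the
   given format strings into `out` as a NUL-terminated string.
   Returns the number of distinct conversions stored. */
size_t collect_conversions(const char *const *formats, size_t n,
                           char *out, size_t out_size)
{
    size_t used = 0;
    assert(out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        for (const char *s = formats[i]; *s; s++) {
            if (*s != '%') continue;
            s++;
            if (*s == '%') continue;                 /* literal percent sign */
            s += strspn(s, "-+ #0");                 /* flags */
            s += strspn(s, "0123456789*");           /* field width */
            if (*s == '.') {                         /* precision */
                s++;
                s += strspn(s, "0123456789*");
            }
            s += strspn(s, "hlLjzt");                /* length modifiers */
            if (*s == '\0') break;                   /* truncated specifier */
            if (!strchr(out, *s) && used + 1 < out_size) {
                out[used++] = *s;
                out[used] = '\0';
            }
        }
    }
    return used;
}
```

For the format strings `"T=%d C"`, `"%5.2f%%"`, and `"%s: %ld"`, the scanner reports only `d`, `f`, and `s`, so a customized `printf()` would not need code for hexadecimal, octal, or character conversions.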
Customized runtime start-up code
The C language requires uninitialized static and global variables to be cleared to zero on start-up. Many newer embedded compilers provide canned start-up code that performs this housekeeping function. However, canned start-up code is often much larger than necessary for a given program. For example, if the program has no uninitialized global variables, there is no need to include code to clear them. OCG makes this information available to the code generator, which then creates custom runtime start-up code. In a minimal case, the start-up code may be completely empty.
Legacy code
Most software evolves over time, integrating existing handcrafted assembly code routines developed and refined for earlier generations of the program. The OCG compiler can combine the C program with externally supplied assembler and object modules. A pre-scan of these modules is completed before code generation, and information is extracted to identify any reserved memory areas, references to C functions, variables from assembler code, and similar details. This information is passed to the code generator, allowing the legacy code to successfully integrate with the highly optimized generated code.
Result: Code half the size
Unlike compilers that claim "global optimization" but really only optimize on individual program modules, OCG looks at every module in the entire program and optimizes across all program modules. One obvious advantage of OCG is smaller, faster code.
PSoC device code compiled using HI-TECH Software’s OCG-enabled HI-TECH C PRO for the PSoC Mixed-Signal Array is about 50 percent smaller than code produced by conventional compilers. OCG can effectively double the amount of program code that can be stored in the device’s on-chip flash, freeing it for more code or additional dynamically configurable functions.
A single C language source file compiled for the PSoC mixed-signal array was 51.3 percent smaller using the OCG compiler than that compiled by an existing third-party compiler (see Table 1).
Figure 6: Table 1
Code size can be used as a proxy for performance. With less code to execute, the OCG compiler also improves execution speed. OCG technology improves SRAM utilization by as much as 10-15 percent. Stack overflows are effectively eliminated.
Equally important, OCG allows embedded C programs to be written without architecture-specific extensions. Embedded microcontrollers’ somewhat irregular architectures are often an awkward fit with the standard C language, frequently requiring substantial architecture-specific handcrafting to achieve efficient code.
OCG simplifies and streamlines the programmer’s job by abstracting and hiding the underlying architecture while simultaneously delivering reduced code size and increased execution speed. By performing an analysis of the whole program at compile time, the omniscient code generator can make optimal decisions about memory placement, pointer scoping, and stack allocation without any special directives or language extensions. The analysis is performed every time the program is recompiled, so it is always accurate and up-to-date.
Additional information may be obtained at HI-TECH’s website focusing on Cypress PSoC tools at www.cypress.htsoft.com.
Clyde Stubbs is the founder and CEO of HI-TECH Software (Brisbane, Australia). His university research in compiler technology led to the foundation of the company in 1984, and his focus has been on advanced code generation technology for microcontrollers ever since. He graduated with honors in Computer Science from the University of Queensland, Queensland, Australia in 1982.
HI-TECH Software
+61 7 3722 7777
clyde@htsoft.com
www.htsoft.com


