WO2001001256A1 - Method and apparatus for static analysis of software code

Info

Publication number: WO2001001256A1
Application number: PCT/US2000/018213
Authority: WO
Inventors: George Fink
Original assignee: Sun Microsystems, Inc.
Priority date: 1999-06-30
Filing date: 2000-06-29
Publication date: 2001-01-04
Also published as: AU6204000A

Abstract

A method and apparatus for static analysis of program code. Embodiments of the invention allow for detection of run time bugs that may arise during the execution of a software application by implementing data structures that represent an image of the program and its variables in various execution instances. The invention is comprised of a context graph that represents various execution paths that are constructed from a series of related contexts. A context is a node in the context graph that represents the value of variables, state of methods, and the relationship between those variables and methods at an execution instance. The edges connecting the nodes represent one method calling the other, establishing a path of execution. Embodiments of the invention simplify the execution paths of large and complex programs into a context graph, using certain approximations and generalizations in analyzing the class files of a program. Once developed, a context graph can be queried for the status of different nodes and the relationship of those nodes at a certain instance of execution. Embodiments of the invention scale well to larger program codes, as they statically represent information regarding various execution possibilities that are critical for analyzing the program, excluding any unnecessary details.

Description

METHOD AND APPARATUS FOR STATIC ANALYSIS OF SOFTWARE CODE

BACKGROUND OF THE INVENTION

FIELD OF INVENTION

This invention relates to the field of computer software, and more specifically, the static analysis of software code.

Portions of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.

Sun, Sun Microsystems, the Sun logo, Solaris, "Write Once, Run Anywhere", Java, JavaOS, JavaStation and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

BACKGROUND ART

Software developers extensively test and debug software products prior to release to determine whether the software operates as expected at the time of execution. One method used to detect software problems (or bugs) is to dynamically analyze the software by executing it in all possible scenarios (paths) and inspecting the result for accuracy.

Certain software development environments and languages, such as Java, involve non-deterministic parallelism. This means that a software program's path of execution for a given input cannot be determined in advance (i.e., it can vary from one execution to the next). As a result, dynamic analysis of the program is not a good solution for detecting all problems that may arise at the time of execution, because even given perfect tests, not all program bugs may surface during testing and debugging, no matter how many times the program is executed. Software programs that require synchronized behavior are particularly vulnerable to this problem.

An alternative approach for debugging this type of software program is static analysis. This method involves the static representation of a program by tracing the values of program variables and the relationship between program functions in different paths of execution. For smaller scale program codes, with finite execution paths, this method can work well and efficiently. However, static analysis of a program can be a formidable task for complicated programs with many lines of code. A method is needed that can statically represent a software's different execution states in a more efficient manner.

The above referenced problems can be understood from a review of a general description of object-oriented programming languages and of the synchronization problems that can lead to failure states during execution of a software program.

Object Oriented Programming Languages and Program Execution

Object-oriented programming is a method of creating computer programs by combining certain fundamental building blocks, and creating relationships among and between the building blocks. The building blocks in object-oriented programming systems are called "objects." A software application can be written using an object-oriented programming language whereby the program's functionality is implemented using these objects.

An object is a programming unit that groups together a data structure (zero or more instance variables) and the operations (methods) that can use or affect that data. Thus, an object consists of data and one or more operations or procedures that can be performed on that data. The joining of data and operations into a unitary building block is called "encapsulation."

An object can be instructed to perform one of its methods when it receives a "message." A message is a command or instruction sent to the object to execute a certain method. A message consists of a method selection (e.g., method name) and zero or more arguments. A message tells the receiving object what operations to perform.

One advantage of object-oriented programming is the way in which methods are invoked. When a message is sent to an object, it is not necessary for the message to instruct the object how to perform a certain method. It is only necessary to request that the object execute the method.

Another advantage of object oriented programming is that execution of a program doesn't have to initiate from a main method (i.e., main point of entry). In other words, different initial points of entry may be chosen in different execution instances. An execution instance refers to a particular point during program execution or a certain state in a path of execution. In one execution instance, a method of object A can be the initial point of entry that invokes other methods in objects B and C, for example. Alternatively, at another execution instance, a method of object B may be the initial point of entry and can, for example, invoke methods in objects D and C.

Object-oriented programming languages are predominantly based on a "class" scheme. An example of a class-based object-oriented programming scheme is generally described in "Smalltalk-80: The Language," by Adele Goldberg and David Robson, published by Addison-Wesley Publishing Company, 1989.

An object class provides a definition for an object which typically includes both fields (e.g., variables) and methods. An object class is used to create a particular object "instance." (The term "object" by itself is often used interchangeably to refer to a particular class or a particular instance.) An instance of an object class includes the variables and methods defined for that class. Multiple instances can be created from the same object class. Each instance that is created from the object class is said to be of the same type or class.

To illustrate, an employee object class can include "name" and "salary" instance variables and a "set_salary" method. Instances of the employee object class can be created, or instantiated, for each employee in an organization. Each object instance is said to be of type "employee." Each employee object instance includes "name" and "salary" instance variables and the "set_salary" method. The values associated with the "name" and "salary" variables in each employee object instance contain the name and salary of an employee in the organization. A message can be sent to an employee's employee object instance to invoke the "set_salary" method to modify the employee's salary (i.e., the value associated with the "salary" variable in the employee's employee object).
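The employee example above can be sketched in Java as follows. This is only an illustrative rendering: the class name Employee, the accessor methods, and the sample names and salary figures are assumptions, and the "set_salary" method is written as setSalary to follow Java naming conventions.

```java
// Hypothetical Java rendering of the "employee" object class described above.
class Employee {
    private final String name;   // the "name" instance variable
    private double salary;       // the "salary" instance variable

    Employee(String name, double salary) {
        this.name = name;
        this.salary = salary;
    }

    // The "set_salary" method: modifies the value held in "salary".
    void setSalary(double newSalary) {
        this.salary = newSalary;
    }

    String getName() { return name; }

    double getSalary() { return salary; }
}

class EmployeeDemo {
    public static void main(String[] args) {
        // Two instances of the same class, each with its own variable values.
        Employee first = new Employee("Alice", 50000.0);
        Employee second = new Employee("Bob", 42000.0);

        // Sending the "set_salary" message to one instance leaves the other unchanged.
        first.setSalary(55000.0);

        System.out.println(first.getName() + " " + first.getSalary());
        System.out.println(second.getName() + " " + second.getSalary());
    }
}
```

Each instance carries its own copy of the instance variables, so invoking setSalary on one employee object does not affect any other instance of the class.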

A hierarchy of classes can be defined such that an object class definition has one or more subclasses. A subclass inherits its parent's (and grandparent's etc.) definition. Each subclass in the hierarchy may add to or modify the behavior specified by its parent class. Some object-oriented programming languages support multiple inheritance where a subclass may inherit a class definition from more than one parent class. Other programming languages, such as the Java programming language, support only single inheritance, where a subclass is limited to inheriting the class definition of only one parent class.

The Java programming language also provides a mechanism known as an "interface" which comprises a set of abstract methods and constant declarations. An object class can implement the abstract methods defined in an interface. Both single and multiple inheritance of interfaces are available to an object class. That is, an object class can inherit an interface definition from more than one parent interface.

The Java programming language allows objects to be referenced as any class or interface that they inherit from. Thus, depending on the implementation of a software application, in various points of execution, variables may hold objects of the variable's class or interface, objects of subclasses of the variable's class, or objects that implement the variable's interface. This is referred to as type overloading. To identify the properties of objects in an execution path, a method is needed that can trace the objects back to their point of instantiation from a particular class.

Multi-threading and Synchronization

Certain programming languages such as the Java programming language support and implement concurrent processing. Concurrent processing involves the simultaneous execution of multiple processes by a computer system. A process may use multiple resources during execution. On the other hand, a resource may be used by more than one process. Thus, multiple processes may use a single resource during execution. Problems may arise when multiple processes attempt to use a resource at the same time.

Multiple processes can be implemented by multiple "threads." A thread is a single flow of control within a computer program. A software program can implement multiple concurrent processes during execution, by instantiating multiple concurrent threads. When many concurrent threads are generated, a method is needed to detect any synchronization problems that may arise during the execution of the program. Synchronization refers to prescheduling the occurrence of events, such as access to a resource by multiple processes, in a chronological order to avoid any conflicts.

For example, concurrent processes may want to use the same resource simultaneously. If all processes are not synchronized in accessing or updating that resource the resource may be updated inconsistently. This is referred to as a "race condition." One process may update a value, while another process may update the same value at the same time. This can result either in an error state where the system may come to a halt, or an inaccurate value update.

To avoid this problem, a synchronization scheme is used that does not permit concurrent access to one resource by multiple processes. For example, a resource in use can be "locked" by the accessing process to prevent access by another process. Thus, processes have to take turns in locking a resource for use.
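As a minimal sketch of such a locking scheme, the Java fragment below guards a shared counter with the language's built-in synchronized keyword so that threads take turns updating it. The class names SharedCounter and LockingDemo and the helper runWorkers are hypothetical and serve only to illustrate the idea.

```java
// A shared resource whose updates are protected by the object's intrinsic lock.
class SharedCounter {
    private int value = 0;

    // Only one thread at a time can execute a synchronized method on this object.
    synchronized void increment() {
        value++;
    }

    synchronized int get() {
        return value;
    }
}

class LockingDemo {
    // Starts the given number of threads, each incrementing the shared counter,
    // waits for all of them to finish, and returns the final count.
    static int runWorkers(int workers, int incrementsEach) {
        SharedCounter counter = new SharedCounter();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < incrementsEach; j++) {
                    counter.increment();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(4, 10000));
    }
}
```

If the synchronized keyword were removed from increment(), concurrent updates could be lost and the final count could fall short of the expected total, which is the race condition described above.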

The locking scheme can cause a deadlock situation when multiple processes using multiple resources need to use a resource before releasing the lock on other resources. For example, process 1 while having a lock on resource A may be waiting to access resource B being used by process 2. If process 2 is waiting to use resource A before releasing resource B, then a deadlock situation arises, as both processes are waiting for the other to release its lock. A method that can detect synchronization problems would be beneficial to avoid deadlocks. Synchronization problems can be detected during execution by identifying locking states and determining any conflicts that may arise from those states.
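One common way to flag the circular wait described above is to record which lock is requested while another is held and then look for a cycle in the resulting graph. The sketch below illustrates that general technique; it is not taken from this disclosure, and the class LockOrderGraph and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Directed graph with an edge X -> Y whenever some process requests Y while holding X.
class LockOrderGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void recordAcquire(String held, String requested) {
        edges.computeIfAbsent(held, k -> new HashSet<>()).add(requested);
    }

    // A cycle means processes can end up waiting on each other indefinitely.
    boolean hasCycle() {
        Set<String> finished = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : edges.keySet()) {
            if (visit(node, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search; reaching a node already on the current path closes a cycle.
    private boolean visit(String node, Set<String> finished, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;
        }
        if (finished.contains(node)) {
            return false;
        }
        onPath.add(node);
        for (String next : edges.getOrDefault(node, Collections.emptySet())) {
            if (visit(next, finished, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }
}
```

For the scenario above, process 1 contributes the edge A -> B (holding resource A while requesting B) and process 2 contributes B -> A, so hasCycle() reports a potential deadlock.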

Some current solutions attempt to resolve synchronization problems by statically analyzing program code. Using this method, synchronization bugs, as well as other faults, can be detected because different paths of execution are statically represented to point out any potential conflicts. The static analysis of a program may scale well to small program codes; however, for complicated codes with many lines of instructions, static analysis can be very inefficient.

Call Graph

One method attempts to automate the static analysis of program code by building a model of the program using a call graph. A call graph is an image of a computer program's execution paths, comprised of nodes and edges connecting those nodes. Each node represents a method in the program, and an edge is indicative of the relationship between various methods. For example, an edge between node A and node B may indicate that method A invokes method B. Unfortunately, the above-described technique is impractical and does not scale well to larger program codes as it attempts to represent an exact model of the code's execution paths. Due to the amount of detail included in the call graph, and the recursive nature of some method calls, implementation and analysis of the call graph requires extensive human intervention that is time consuming and inefficient.

For example, nodes of a call graph do not include information regarding the type or the instantiation points of program methods. As such, to determine the call history or to perform a type analysis for a method in a call graph, each preceding node has to be traversed back along the connecting edges. A scheme is needed that can statically analyze a program code's different execution states in an efficient and finite manner, excluding any unnecessary details.
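The limitation can be seen in a minimal call graph sketch such as the one below, in which a node is nothing more than a method name. The class CallGraph and the method names used with it are hypothetical; the point is that finding the callers of a method requires searching back along the edges, and no type or instantiation information is available at the node itself.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// A plain call graph: nodes are method names, and an edge A -> B means A invokes B.
class CallGraph {
    private final Map<String, Set<String>> callees = new LinkedHashMap<>();

    void addCall(String caller, String callee) {
        callees.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
        callees.computeIfAbsent(callee, k -> new LinkedHashSet<>());
    }

    Set<String> calleesOf(String method) {
        return callees.getOrDefault(method, Collections.emptySet());
    }

    // Call history is not stored in a node, so every edge must be scanned to find callers.
    Set<String> callersOf(String method) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : callees.entrySet()) {
            if (entry.getValue().contains(method)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

A call to addCall("A.main", "B.foo") followed by addCall("B.foo", "B.bar") lets callersOf("B.bar") recover "B.foo", but nothing in the graph records which class of object was the receiver of the call.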

SUMMARY OF THE INVENTION

Embodiments of the invention allow for detection of run time bugs that may arise during the execution of a software application by implementing data structures that represent an image of the program and its variables in various execution instances. An execution instance refers to a particular point during program execution or a certain state in a path of execution.

Among other data structures, the invention is comprised of a context graph that represents various execution paths that are constructed from a series of related contexts. A context is a node in the context graph that represents the value of variables, state of methods, and the relationship between those variables and methods at an execution instance. The edges connecting the nodes represent one method calling the other, establishing a path of execution.

Embodiments of the invention simplify the execution paths of large and complex programs into a context graph, using certain approximations and generalizations in analyzing the class files of a program. Class files contain information about the objects in a program, their structures and relationships. This information can be used to build a context graph that represents the various states of the program during execution. Once developed, a context graph can be queried for the status of different variables and nodes and the relationship of those variables and nodes at a certain instance of execution.

Embodiments of the invention scale well to larger program codes, as they represent information regarding various execution possibilities that are critical for analyzing the program, excluding any unnecessary details.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram illustrating the components of a general purpose computer, that may be used in conjunction with one or more embodiments of the invention.

Figure 2 is a block diagram illustrating an object list also in form of a tree structure, implemented according to one or more embodiments of the invention.

Figure 3A is a block diagram illustrating a reference table, implemented according to one or more embodiments of the invention.

Figure 3B is a block diagram illustrating a tree structure including information noted in the reference table.

Figure 4 is a block diagram illustrating a context graph implemented according to one or more embodiments of the invention.

Figure 5 is a flow diagram illustrating a method of implementing a context graph for a program code, according to one or more embodiments of the invention.

Figure 6A is a flow diagram illustrating a method of creating a context in a context graph, according to one or more embodiments of the invention.

Figure 6B is a flow diagram illustrating a method of analyzing references pertaining to various contexts in the context graph, according to one or more embodiments of the invention.

Figure 7 is a block diagram illustrating a method of implementing a recursive procedure in a context graph, according to one or more embodiments of the invention.

DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus for static analysis of software code is described. In the following description numerous specific details are set forth in order to provide a more thorough description of the invention. It will be apparent however, to one skilled in the art, that the invention may be practiced without these specific details. In other instances, well known features have not been described in detail so as not to obscure the invention.

COMPUTER EXECUTION ENVIRONMENT (HARDWARE)

An embodiment of the invention can be implemented as computer software in the form of computer readable program code executed on a general purpose computer such as computer 100 illustrated in Figure 1, or in the form of byte code class files executable by a virtual machine running on such a computer. A keyboard 110 and mouse 111 are coupled to a bi-directional system bus 118. The keyboard and mouse are for introducing user input to the computer system and communicating that user input to central processing unit (CPU) 113. Other suitable input devices may be used in addition to, or in place of, the mouse 111 and keyboard 110. I/O (input/output) unit 119 coupled to bi-directional system bus 118 represents such I/O elements as a printer, A/V (audio/video) I/O, etc.

Computer 100 includes a video memory 114, main memory 115 and mass storage 112, all coupled to bi-directional system bus 118 along with keyboard 110, mouse 111 and CPU 113. The mass storage 112 may include both fixed and removable media, such as magnetic, optical or magnetic optical storage systems or any other available mass storage technology. Bus 118 may contain, for example, thirty-two address lines for addressing video memory 114 or main memory 115. The system bus 118 also includes, for example, a 32-bit data bus for transferring data between and among the components, such as CPU 113, main memory 115, video memory 114 and mass storage 112. Alternatively, multiplex data/address lines may be used instead of separate data and address lines.

In one embodiment of the invention, the CPU 113 is a SPARC™ microprocessor from Sun Microsystems, or a microprocessor manufactured by Motorola, such as the 680X0 processor, or a microprocessor manufactured by Intel, such as the 80X86, or Pentium processor. However, any other suitable microprocessor or microcomputer may be utilized. Main memory 115 is comprised of dynamic random access memory (DRAM). Video memory 114 is a dual-ported video random access memory. One port of the video memory 114 is coupled to video amplifier 116. The video amplifier 116 is used to drive the cathode ray tube (CRT) raster monitor 117. Video amplifier 116 is well known in the art and may be implemented by any suitable apparatus. This circuitry converts pixel data stored in video memory 114 to a raster signal suitable for use by monitor 117. Monitor 117 is a type of monitor suitable for displaying graphic images.

Computer 100 may also include a communication interface 120 coupled to bus 118. Communication interface 120 provides a two-way data communication coupling via a network link 121 to a local network 122. For example, if communication interface 120 is an integrated services digital network (ISDN) card or a modem, communication interface 120 provides a data communication connection to the corresponding type of telephone line, which comprises part of network link 121. If communication interface 120 is a local area network (LAN) card, communication interface 120 provides a data communication connection via network link 121 to a compatible LAN. Wireless links are also possible. In any such implementation, communication interface 120 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.

Network link 121 typically provides data communication through one or more networks to other data devices. For example, network link 121 may provide a connection through local network 122 to host computer 123 or to data equipment operated by an Internet Service Provider (ISP) 124. ISP 124 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 125. Local network 122 and Internet 125 both use electrical, electromagnetic or optical signals which carry digital data streams. The signals through the various networks and the signals on network link 121 and through communication interface 120, which carry the digital data to and from computer 100, are exemplary forms of carrier waves transporting the information.

Computer 100 can send messages and receive data, including program code, through the network(s), network link 121, and communication interface 120. In the Internet example, server 126 might transmit a requested code for an application program through Internet 125, ISP 124, local network 122 and communication interface 120. In accord with the invention, one such downloaded application is the method and apparatus for static analysis of software code described herein.

The received code may be executed by CPU 113 as it is received, and/or stored in mass storage 112, or other non-volatile storage for later execution. In this manner, computer 100 may obtain application code in the form of a carrier wave. The computer systems described above are for purposes of example only. An embodiment of the invention may be implemented in any type of computer system or programming or processing environment.

STATIC ANALYSIS OF PROGRAM CODE USING A CONTEXT GRAPH

Although the present invention has applicability to software applications developed in any programming language, the invention in parts is described, by way of example, in connection with the Java programming language and programming environment.

Object oriented programming languages, such as the Java programming language, provide for implementation of various classes of objects. Classes are like blueprints from which objects of a program are instantiated. Thus, each class defines the type of an object and describes the methods and variables used in that object. As discussed earlier, a method within an object can make calls to other methods defined within other objects. Due to the complex relationships that may exist between these methods and the objects accessed by them, it is difficult to statically trace each object type or the state of the program at every point of execution.

For example, sometimes objects instantiated from different classes share a common name. Hence, an object's type or class cannot be determined from the object's name, per se. This is referred to as "type overloading." In other instances, a deadlock may occur if multiple objects attempt to access the same resources simultaneously. To detect and overcome these sorts of problems, it is important to be able to determine the class from which an object was instantiated, or the state of any locks on the resources. Thus, it is desirable to know the state of all objects and to be able to distinguish them from one another at various points during program execution. The present invention has applicability to software applications developed in any programming language. However, for the purpose of illustrating some of the above indicated complications that may arise during execution of program code, Java Code I is introduced in Table 1 below, by way of example:

    if (e) v = new B(); else v = new C();
    v.foo();

    class B {
        Object fid;
        void foo() { fid = new Object(); bar(); }
        void bar() { fubar(); }
    }

    class C extends B {
        void bar() { fustar(); }
    }

TABLE 1

Execution of Java Code I results in instantiation of an object "v" of class B or C. Classes B or C define the variables and methods for object "v". For example, class B defines methods "foo()" and "bar()". Thus, v.foo() and v.bar() can be invoked by object "v", when "v" is an instance of class B. Similarly, class C has been declared as a subclass of class B. This means that an instance of class C can be substituted for an instance of class B. As such, when a redefined method of class B is to be invoked, the method will be invoked in accordance with the definitions set by class C. Thus, class C inherits method foo() as defined in class B. However, method bar() has been redefined by class C. Thus, v.bar() as defined by class C invokes a method distinct from v.bar() defined by class B.

Pursuant to Java Code I, if "e" is true then instruction "v = new B()" results in instantiation of object "v" that is of type B. Otherwise, "v" will be an object of type C, per "v = new C()". Thus, "v" depending on the value of "e" will be an instance of class B or C. If object "v" is an instance of class B (i.e., of type B) then v.foo() will call another method "bar()" and method bar() will in turn call method "fubar()." However, if the object is an instance of class C (i.e., of type C) then v.foo() will call method bar(), and method bar() will call fustar().
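The two execution paths described above can be reproduced with a runnable variant of Java Code I. The listing below is a sketch under stated assumptions: the source does not define fubar() or fustar(), so they appear here as placeholder methods, and the trace field and the JavaCodeOne wrapper class are hypothetical additions used only to record which methods ran.

```java
class B {
    Object fid;
    // Hypothetical addition: records the order in which methods are invoked.
    final StringBuilder trace = new StringBuilder();

    void foo() {
        fid = new Object();
        trace.append("foo ");
        bar();               // dispatches on the run-time type of the receiver
    }

    void bar() {
        trace.append("B.bar ");
        fubar();
    }

    // Placeholder bodies; the source does not say what these methods do.
    void fubar() { trace.append("fubar"); }

    void fustar() { trace.append("fustar"); }
}

class C extends B {
    @Override
    void bar() {
        trace.append("C.bar ");
        fustar();
    }
}

class JavaCodeOne {
    // Mirrors "if (e) v = new B(); else v = new C(); v.foo();" from Table 1.
    static String run(boolean e) {
        B v;
        if (e) {
            v = new B();
        } else {
            v = new C();
        }
        v.foo();
        return v.trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(true));   // path taken when v is of type B
        System.out.println(run(false));  // path taken when v is of type C
    }
}
```

The call v.foo() is written once, yet the method reached through bar() depends on the class from which "v" was instantiated, which is why the point of instantiation matters to the analysis.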

It is apparent from the example above that keeping track of all the values and different types of many instances of a number of classes can be complicated and time consuming, especially when some of the methods share the same name (e.g., bar()). For example, class C can override method bar() previously defined in class B. Since the names of the methods are the same and the types of the objects are similar, it is difficult to differentiate these methods from one another at various instances of execution, unless the call history and the point of instantiation of each object are traceable.

One or more embodiments of the invention comprise a context graph that is implemented to statically represent the call history and the points of instantiation of various objects during program execution. To implement the context graph, class files of a program are analyzed. Class files contain information about the objects in a program, their structures and relationships. Thus, using the context graph of the invention, the status of different variables after invocation of a method, or the type of an object at a certain point of execution can be easily determined.

Figure 4 illustrates a context graph 410 of Java Code I, implemented according to one or more embodiments of the invention. Each node in context graph 410 is represented by a method call, and includes additional information with respect to that method, such as the point of instantiation of the method and/or the type of object or objects invoking that method. Arrows (or edges) pointing from one node to another node, represent the invocation of a method by another method. Referring to Java Code I and Figure 4, node A.main() represents the state of object "v" during the execution of the main method in Java Code I, for example. "R1:B or C" indicates that object "v" can be an instance of classes B or C at that instance of execution. "B.init(), C.init(), foo()" indicates the method calls and object types that can be instantiated at that execution instance. Thus, by looking at node A.main() the status of object "v" and other variables in that context can be determined.

The change in status of objects during run time is reflected in the change in the value of information assigned to each node of the context graph. In embodiments of the invention, data tags are assigned to different fields in each node to detect these changes. For example, consider context 5 denoted by C5 in context graph 410. "R5: C" indicates that object "v" is an instance of class C at that point of program execution. "C.bar()" indicates that method bar() of object "v" has been invoked at that point of program execution. "fustar()" indicates that method bar() invokes method fustar() at that point of program execution. In addition to this information, the call history of method "bar()" at that point of execution can be determined. For example, by tracing the arrow one step up, it can be determined that the foo() method of context 3 (C3) had invoked method bar() at a previous instance of execution. Further, by tracing the arrows all the way back to the top of context graph 410, it can be determined that A.main() was the method where the foo() method was invoked.
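A context node of this kind, and the tracing of call history along the arrows, might be sketched as follows. The class and field names are hypothetical, each context is assumed to have a single caller for simplicity, and the contents of context C3 (in particular its receiver type) are an assumption, since only A.main() and C5 are described in detail above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// One node of a context graph: a method plus what is known at that execution instance.
class Context {
    final String id;                  // e.g. "C5"
    final String method;              // e.g. "C.bar()"
    final Set<String> receiverTypes;  // e.g. the "R5: C" annotation
    final List<String> calls;         // methods invoked from this context
    final Context caller;             // context that invoked this one; null at the root

    Context(String id, String method, Set<String> receiverTypes,
            List<String> calls, Context caller) {
        this.id = id;
        this.method = method;
        this.receiverTypes = receiverTypes;
        this.calls = calls;
        this.caller = caller;
    }

    // Follows the arrows back to the root, collecting the call history.
    List<String> callHistory() {
        List<String> history = new ArrayList<>();
        for (Context c = this; c != null; c = c.caller) {
            history.add(c.method);
        }
        return history;
    }
}

class ContextGraphSketch {
    static Set<String> types(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    // Builds the A.main() -> foo() -> C.bar() path and returns context C5.
    static Context buildC5() {
        Context c1 = new Context("C1", "A.main()", types("B", "C"),
                Arrays.asList("B.init()", "C.init()", "foo()"), null);
        // Assumption: C3 is the foo() context reached when "v" is of type C.
        Context c3 = new Context("C3", "foo()", types("C"),
                Arrays.asList("bar()"), c1);
        return new Context("C5", "C.bar()", types("C"),
                Arrays.asList("fustar()"), c3);
    }

    public static void main(String[] args) {
        System.out.println(buildC5().callHistory());
    }
}
```

Because the type and call information is stored in each context, a query such as "which types can the receiver have at C5" is answered from the node itself, while the call history is obtained by walking the caller links.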

Once a context graph for a program code is implemented, it can be queried for specific information about the program. As explained above, using data tags, it can be determined how certain variable values change during program execution. Additionally, the type or point of instantiation of objects, and/or method calls where each object originates from can be traced. In the following, a number of methods for utilizing the context graph of this invention are disclosed. However, in order not to obscure the invention with unnecessary detail, other possible methods are not included.

Value Based Analysis

In one or more embodiments of the invention, the value or other attributes of specific objects can be queried. Data tags are associated with specific values for each object, and each tag is given an initial value. The change in value of each tag is indicative of how an object's type or other attributes are modified during program execution. Depending on whether objects are merged, cloned or otherwise manipulated, tags associated with those object are combined, duplicated, or otherwise modified, in one or more embodiments of the invention to indicate and memorialize any changes.
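One possible reading of how tags are combined and duplicated is sketched below for the case where a tag records the set of types an object may have. The class TypeTag and its operations are hypothetical illustrations and are not the tag representation of any particular embodiment.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// A data tag recording the set of types an object may have at some execution instance.
class TypeTag {
    private final Set<String> possibleTypes;

    private TypeTag(Set<String> types) {
        this.possibleTypes = Collections.unmodifiableSet(new TreeSet<>(types));
    }

    // Initial value of a tag: the single type known at the point of instantiation.
    static TypeTag of(String type) {
        return new TypeTag(Collections.singleton(type));
    }

    // When objects are merged (e.g. two branches assign the same variable),
    // the resulting tag is the union of both tags.
    TypeTag merge(TypeTag other) {
        Set<String> union = new TreeSet<>(possibleTypes);
        union.addAll(other.possibleTypes);
        return new TypeTag(union);
    }

    // When an object is cloned, its tag is duplicated.
    TypeTag duplicate() {
        return new TypeTag(possibleTypes);
    }

    Set<String> types() {
        return possibleTypes;
    }

    @Override
    public String toString() {
        return String.join(" or ", possibleTypes);
    }
}
```

Under this reading, merging the tag of an object created by "new B()" with the tag of one created by "new C()" yields a tag that prints as "B or C", matching the kind of annotation shown for A.main() in Figure 4.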

Using the data tags it can be determined, for example, what data values various objects acquire during different paths of execution, and how those values change as the object is manipulated by different methods, starting from the point of instantiation. In embodiments of the invention, value based analysis is used to determine the types a specific object can have. Using this analysis, potential type problems may be identified. For example, the context graph can be queried for the value of local variables before and after a method is invoked and executed. This is highly beneficial in areas where certain pointer variables are used, and when it is difficult to determine the value of a pointer due to the intricacies of the computing environment, unless the value is dynamically acquired.

State Based Analysis

In embodiments of the invention, the context graph can be used to analyze the overall state of a program during execution. For example, the context graph can be queried about the locking state of resources after a method have been invoked. The analysis can be also focused on the state of thread activity. For example, the context graph can be queried on how a thread is acting, including thread synchronization activities (e.g., waking other threads, changing priorities, etc.).

Unlike value based analysis, state based analysis is used to track the transitions in the overall state of the program. In one or more embodiments of the invention, data tags are assigned to the initial state of a program. Any change in these data tags is traced during program execution to determine possible errors that may arise. For example, the context graph may be queried to determine whether two concurrent threads have a lock on a shared resource and whether that would cause a deadlock during execution.
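
One way such a deadlock query might be answered is sketched below in Java. The sketch substitutes a common lock-ordering check, which is not described in this disclosure: it records the order in which each thread acquires locks and reports a potential deadlock when two threads acquire the same pair of locks in opposite orders. The class LockStateTags and all of its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical state tags: the order in which each thread acquires locks.
final class LockStateTags {
    private final Map<String, List<String>> acquisitions = new HashMap<>();

    // Records that the given thread acquires the given lock next.
    void recordAcquire(String thread, String lock) {
        acquisitions.computeIfAbsent(thread, t -> new ArrayList<>()).add(lock);
    }

    // True if t1 takes lockA before lockB while t2 takes lockB before lockA.
    boolean mayDeadlock(String t1, String t2, String lockA, String lockB) {
        return takesBefore(t1, lockA, lockB) && takesBefore(t2, lockB, lockA);
    }

    private boolean takesBefore(String thread, String first, String second) {
        List<String> order = acquisitions.getOrDefault(thread, Collections.emptyList());
        int i = order.indexOf(first);
        int j = order.indexOf(second);
        return i >= 0 && j >= 0 && i < j;
    }
}
```

The check is conservative in the usual static-analysis sense: it flags an ordering that could deadlock, without establishing that the two threads actually run concurrently.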

Scan Based Analysis

It may be necessary to analyze a program for specific information at particular instances of execution. The state and value based analyses provide an overall image of the program code, defining all transitions and modifications from one state to the next and/or one value to the other. In embodiments of the invention, the context graph can be used to collect information about different paths of execution, excluding any unnecessary details about other execution instances that are not of interest.

For example, specific data can be collected and correlated by scanning through the context graph to determine which execution routes contain the information that is of interest, rather than analyzing all program sequencing information from one context to the next. Scan based analysis is particularly useful after the program code has been generally analyzed using one or both of the above two analysis schemes, for gaining further detailed information about specific execution instances that may require additional attention.
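
A scan of this kind may be sketched in Java as a depth-first walk that keeps only the execution routes reaching a context of interest. The class ContextScanner, its methods, and the representation of contexts as plain identifiers are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical scanner over a context graph stored as caller -> callees.
final class ContextScanner {
    private final Map<String, List<String>> callees = new LinkedHashMap<>();

    // Records that the caller context invokes the callee context.
    void addCall(String caller, String callee) {
        callees.computeIfAbsent(caller, k -> new ArrayList<>()).add(callee);
        callees.computeIfAbsent(callee, k -> new ArrayList<>());
    }

    // Returns every path from the entry context to a context of interest.
    List<List<String>> pathsTo(String entry, Predicate<String> ofInterest) {
        List<List<String>> result = new ArrayList<>();
        walk(entry, new ArrayList<>(), ofInterest, result);
        return result;
    }

    private void walk(String ctx, List<String> path,
                      Predicate<String> ofInterest, List<List<String>> result) {
        if (path.contains(ctx)) {
            return; // do not follow a cycle back into the current path
        }
        path.add(ctx);
        if (ofInterest.test(ctx)) {
            result.add(new ArrayList<>(path));
        }
        for (String next : callees.getOrDefault(ctx, Collections.emptyList())) {
            walk(next, path, ofInterest, result);
        }
        path.remove(path.size() - 1); // backtrack
    }
}
```

Routes that never reach a context of interest are visited but not reported, which corresponds to excluding detail about execution instances that are not of interest.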

To accomplish this, in one or more embodiments of the invention, several data structures are implemented that hold the necessary information for static analysis of the program code. Embodiments of the invention comprise an object list, a reference table, and a context graph. The context graph is implemented from the information included in the object list and the reference table, as further described below.

Object List

One or more embodiments of the invention comprise an object list that includes a list of all unique objects created during program execution, with uniqueness being defined by the point of instantiation in the context graph. Figure 2 is a block diagram illustrating object list 210 in the form of a tree structure, implemented according to one or more embodiments of the invention. Object list 210 is implemented so that it includes the names and the points of instantiation of all possible objects created during program execution.

The point of instantiation of an object is the location within the context graph from which the object originates. For example, the two instances of object "v" according to Table 1 originate from the main method of the program. Therefore, object "v" is represented by the two nodes attached to A.main() in object list 210. If B.foo() were to allocate an object, it would be represented by foo().init()(B) and foo().init()(C) in object list 210, traceable to A.main(), indicating the point at which object "v" was instantiated. In Figure 2, the object list uniquely classifies each object by point of instantiation within the context graph.
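
A minimal Java sketch of an object list in which uniqueness is defined by the point of instantiation follows. The classes ObjectEntry and ObjectList, and the encoding of a point of instantiation as a string, are assumptions made for this example; Figure 2 depicts the list as a tree, whereas the sketch uses a flat set for brevity.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical object list entry: a class name plus its point of instantiation.
final class ObjectEntry {
    final String className;
    final String instantiationPoint;

    ObjectEntry(String className, String instantiationPoint) {
        this.className = className;
        this.instantiationPoint = instantiationPoint;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ObjectEntry)) {
            return false;
        }
        ObjectEntry other = (ObjectEntry) o;
        return className.equals(other.className)
                && instantiationPoint.equals(other.instantiationPoint);
    }

    @Override
    public int hashCode() {
        return Objects.hash(className, instantiationPoint);
    }
}

// Hypothetical object list: holds each unique object exactly once.
final class ObjectList {
    private final Set<ObjectEntry> entries = new LinkedHashSet<>();

    // Returns true if the object was new, false if it was already listed.
    boolean add(String className, String instantiationPoint) {
        return entries.add(new ObjectEntry(className, instantiationPoint));
    }

    int size() {
        return entries.size();
    }
}
```

Under these assumptions, an object of class B and an object of class C both instantiated in A.main() yield two entries, and analyzing the same instantiation a second time adds nothing.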

Reference Table

One or more embodiments of the invention comprise a reference table that includes information about a method's variables and objects, and any changes in their values during program execution. Figure 3A is a block diagram illustrating a reference table, implemented according to one or more embodiments of the invention. Reference table 310 comprises a plurality of references, R1 through R7. A reference table is a link between the variables in a context method and the elements of the object list that those variables can represent.

Different paths of execution are denoted by different contexts. For example, referring to Figure 3A and Java code I, R1 is a reference to classes B or C in context 1 (C1), indicating that object "v" can be an instance of class B or C in that context. Not shown here is the map to the exact elements of the object list that produces these instances. Referring to Figure 3B, C1 represents the main program method (A.main()). Thus, R1 defines the possible objects for variable "v" when the main method is executed.

Additional nodes in reference table 310 refer to the possible types that can be attributed to variables when other methods are invoked by the main method. For example, referring to Figure 3A, R2 is a reference to class B in context 2 (C2). Per Figure 3B, C2 is associated with method "foo()" as defined in class B. Thus, in a path of execution leading from context 1 to context 2, object "v" can be an instance of class B, and therefore the "foo()" method is implemented as defined in class B, denoted as B.foo(). Referring to Figure 3A, R3 corresponds to a different path of execution (C1, C3) where object "v" can be an instance of class C. Context 3 denotes a path of execution from C1 to C3, indicating that in that context object "v" can be an instance of class C. If "v" is an instance of class C, then the "foo()" method is implemented as defined in class C. However, note that in Java code I, class C extends B, and that there are no local declarations for the foo() method. As a result, method foo() is implemented as defined in class B. However, the reference is still to an object of class C.
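
The resolution of foo() for a receiver of class C can be illustrated with the following Java sketch, which walks up the class hierarchy until a declaration of the method is found. The class DispatchSketch and its methods are hypothetical, and the hierarchy is supplied by hand rather than parsed from class files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class hierarchy used to resolve which implementation runs.
final class DispatchSketch {
    private final Map<String, String> superclass = new HashMap<>();
    private final Map<String, Set<String>> declared = new HashMap<>();

    // Registers a class, its parent (or null), and the methods it declares locally.
    void declareClass(String name, String parent, String... methods) {
        superclass.put(name, parent);
        declared.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    // Returns "Class.method()" for the implementation selected for the
    // receiver class, or null if no class in the chain declares the method.
    String resolve(String receiverClass, String method) {
        for (String c = receiverClass; c != null; c = superclass.get(c)) {
            if (declared.getOrDefault(c, Collections.emptySet()).contains(method)) {
                return c + "." + method + "()";
            }
        }
        return null;
    }
}
```

With B declaring foo and C extending B without a local declaration, resolving foo for a receiver of class C selects the implementation in B, while the reference itself still denotes an object of class C.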

Similarly, R4 and R5, per Figures 3A and 3B, are references to further variables in the path of execution, indicating the possible objects and contexts associated with those variables. For example, R4 refers to the variable in the path of execution denoted by context C4.

Referring to Java code I and Figures 3A - 3B, class B defines a data member "fid" during execution. Each instance of class B, including instances of B's subclass C, contains a reference to a different data member. To simplify the analysis, objects that share a reference and similar data members, also share references to the data members. For example, in Figure 3A, references R1-R5 to objects of class B or C share the same reference to data member "fid."

Method foo() is invoked in various contexts during execution. It is invoked in the main body of the code (i.e., A.main), is defined in class B (i.e., void foo()), and is extended into class C. Hence, method foo() including object "fid", can be invoked in contexts 1 through 5. Therefore, contexts 1 through 5 share a common "fid" object which is referenced by R1 through R5 as illustrated in Figure 3A.

R6 and R7 refer to field information contained in the two objects of class B and C. Since the same reference to the two objects of class B and C is shared (R1), R6 and R7 are aliased (i.e., point to the same object). Thus, in one or more embodiments of the invention, when information about R6 changes, that information is propagated to R7 and to other related contexts referenced by R7.
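
The propagation of information between aliased references can be sketched in Java with a union-find structure, in which aliased references share one representative and therefore one set of recorded facts. Union-find is chosen here for illustration and is not named in this disclosure; the class AliasedReferences and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical alias sets: aliased references share one representative.
final class AliasedReferences {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Set<String>> info = new HashMap<>();

    // Finds the representative of a reference, compressing the path.
    String find(String ref) {
        parent.putIfAbsent(ref, ref);
        String p = parent.get(ref);
        if (!p.equals(ref)) {
            p = find(p);
            parent.put(ref, p);
        }
        return p;
    }

    // Aliases two references and merges whatever is known about them.
    void alias(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (ra.equals(rb)) {
            return;
        }
        parent.put(rb, ra);
        Set<String> moved = info.remove(rb);
        if (moved != null) {
            info.computeIfAbsent(ra, k -> new TreeSet<>()).addAll(moved);
        }
    }

    // Records a fact about a reference; every alias of it sees the fact.
    void addInfo(String ref, String fact) {
        info.computeIfAbsent(find(ref), k -> new TreeSet<>()).add(fact);
    }

    // Returns a copy of the facts known about a reference or any of its aliases.
    Set<String> infoFor(String ref) {
        Set<String> known = info.get(find(ref));
        return known == null ? new TreeSet<>() : new TreeSet<>(known);
    }
}
```

After R6 and R7 are aliased in this sketch, a fact recorded against R6 is returned by a query on R7 without any explicit copying.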

Context Graph

One or more embodiments of the invention comprise a context graph that represents a program's possible execution paths, the state of each method invoked during that path, and the type and value of any objects or variables defined therein. Referring to Figure 4, each node of the context graph represents a method and variable bindings in a path of execution. A variable binding is an association of variables in a method to elements in the reference table.

A context graph is relative to a single entry point in a program. For example, referring to Figure 4, C1 represents a context where the main method A.main() is the entry point of the program (i.e., the main method was invoked first). If another method were chosen as the entry point (e.g., method foo()), then a different context graph would be produced.

In embodiments of the invention, a context graph is produced along with the information stored in the reference table and/or object list. This information is obtained from parsing and analyzing the program code (i.e., class files). Referring to Figure 4, context graph 410, together with object list 210 and reference table 310, is a representation of the program listed in Table 1. For example, context 1 (C1) is a representation of the main method (A.main()), which can have object instances of classes B or C. C1 also indicates that method foo() can be invoked from within the main method. Moving down the execution path in Figure 4, context 2 (C2) is a representation of method foo() invoked on an instance of class B (B.foo()). Therefore, looking at C2, one can determine that the foo() method in that context was invoked by the main method and is of type B. It can also be determined that the foo() method invokes a bar() method in that context.

Similar information can be extracted from contexts 3 through 7 with respect to methods described therein, their variable bindings, and objects referenced therein.
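
A context node of the kind described above may be sketched in Java as follows. The class ContextNode, its fields, and the string encoding of methods and references are assumptions made for this example; the sketch records only the method, the variable bindings, and the caller and callee edges.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical context: a method plus its variable bindings and call edges.
final class ContextNode {
    final String id;
    final String method;
    // Variable name -> reference table element (e.g., "v" -> "R1").
    final Map<String, String> bindings = new LinkedHashMap<>();
    final List<ContextNode> callees = new ArrayList<>();
    ContextNode caller; // null for the entry context

    ContextNode(String id, String method) {
        this.id = id;
        this.method = method;
    }

    // Creates the context for a method invoked from this context.
    ContextNode invoke(String calleeId, String calleeMethod) {
        ContextNode child = new ContextNode(calleeId, calleeMethod);
        child.caller = this;
        callees.add(child);
        return child;
    }

    // Methods on the execution path from the entry context to this context.
    List<String> pathFromEntry() {
        List<String> path = new ArrayList<>();
        for (ContextNode n = this; n != null; n = n.caller) {
            path.add(0, n.method);
        }
        return path;
    }
}
```

Querying a node built this way answers the questions posed above: which method invoked the context (caller), which methods it invokes (callees), and which references its variables are bound to (bindings).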

Figure 5 is a flow diagram illustrating a method of producing a context graph for a program code, according to one or more embodiments of the invention. At step 510, the software's program code (i.e., class files) is analyzed, and a main context is created for a method that is chosen as the point of entry for the program. Figure 6A is a flow diagram illustrating a method of creating a context in a context graph, according to one or more embodiments of the invention.

At step 610, variable bindings are created for a context. For each distinct variable, an element of the reference table is created. These references are extracted from the reference table and the methods are extracted from the object list. For example, if a context graph for Java code I is being implemented, then a node is created for the main method of the main class (e.g., A.main()). This node will represent context 1 and will be encoded with information about variable bindings (e.g., R1) and methods that are invoked in the main method (e.g., foo()).

At step 620, any static and final methods in the node are expanded. Expanding a method includes creating additional nodes in the context graph for any methods that can be further invoked by the current method. Static and final methods are unique, in that unlike other methods they cannot be duplicated, altered, overridden or subclassed. Therefore, it is not possible for static or final methods to give rise to a type overloading scenario. As such, they can be finitely expanded into one or more contexts. The created contexts are added to a work list, at step 630. A work list is a list or other structure that includes contexts that need to be worked on. Each context added to the work list is reviewed to determine whether its neighbor contexts or methods are affected in any way by the changes made to that context. A special example of a static method is object instantiation. At this time new objects are added to the object list.

Referring to Figure 5, at step 520 it is determined whether there are any contexts on the worklist. If so, then at step 530, the first context on the worklist is selected. That context is analyzed based on information available for that context in object list 210 and reference table 310. At step 540, based on said information, various reference bindings are set and aliases are added. For example, if reference 1 binds to (refers to) object B, and reference 2 binds to object C and both are referred to by variable "v" in A.main(), then reference 1 and reference 2 are aliased together to simplify the reference table. Aliasing simplifies the analysis process because it identifies different instances of objects as behaving in the same or similar manner. This way the size of the reference table remains manageable without losing too much information about the program.

A method used to simplify a context graph is context aliasing to handle recursive procedures or methods. A recursive procedure can be found in a method that invokes itself either directly or through a chain of other methods invoked by it. As a recursive procedure may result in the creation of an infinite loop (i.e., a loop in program code that is repeated indefinitely), the context graph representation of the execution path created will be large and infinite in itself. In embodiments of the invention, recursive procedures are controlled by binding the context with a recursive method call to the first instance of the recursive method call, and aliasing references to that method. (See Figure 7)
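
The binding of a recursive call to the first instance of the method may be sketched in Java as follows. The class RecursionAliasing, its call-relation input, and the generated context identifiers are assumptions made for this illustration; the sketch tracks only the methods on the current execution path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical expansion that aliases recursive calls to their first context.
final class RecursionAliasing {
    // Call relation: method name -> methods it invokes.
    private final Map<String, List<String>> calls = new HashMap<>();
    // Resulting graph edges, written as "caller -> callee" over context ids.
    final List<String> edges = new ArrayList<>();
    private int nextId = 1;

    void addCall(String from, String to) {
        calls.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Expands the graph from the entry method; returns the entry context id.
    String expand(String entry) {
        return expand(entry, new HashMap<>());
    }

    private String expand(String method, Map<String, String> onPath) {
        String id = "C" + nextId++;
        onPath.put(method, id);
        for (String callee : calls.getOrDefault(method, Collections.emptyList())) {
            String first = onPath.get(callee);
            if (first != null) {
                // Recursive call: bind to the first instance, create no new context.
                edges.add(id + " -> " + first);
            } else {
                edges.add(id + " -> " + expand(callee, onPath));
            }
        }
        onPath.remove(method);
        return id;
    }

    // Number of contexts created so far.
    int contextCount() {
        return nextId - 1;
    }
}
```

Because a recursive call adds an edge back to an existing context instead of a new node, the expansion terminates and the graph stays finite for both direct and indirect recursion.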

Figure 6B is a flow diagram illustrating a method of analyzing references pertaining to various contexts in the context graph, according to one or more embodiments of the invention. After reference bindings are set, and aliasing is completed, at step 640, it is determined whether any changes are made to any references in reference table 310. If there are any changes to the references to a context, then that context is added to the work list at step 650 for further analysis.
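
The interplay of the work list with the change detection of steps 640 and 650 may be sketched in Java as a fixed-point loop. This is a deliberate simplification under stated assumptions: the class WorklistSketch and its propagate method are hypothetical, each context carries a single set of possible types, and every type is passed to each invoked context without the per-context narrowing described for R2 and R3.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical work-list loop: re-analyze a context whenever its references change.
final class WorklistSketch {
    static Map<String, Set<String>> propagate(Map<String, List<String>> callees,
                                              String entry, Set<String> entryTypes) {
        Map<String, Set<String>> types = new HashMap<>();
        types.put(entry, new TreeSet<>(entryTypes));
        Deque<String> worklist = new ArrayDeque<>();
        worklist.add(entry);
        while (!worklist.isEmpty()) {                 // step 520: any contexts left?
            String ctx = worklist.poll();             // step 530: select the first one
            for (String next : callees.getOrDefault(ctx, Collections.emptyList())) {
                Set<String> target = types.computeIfAbsent(next, k -> new TreeSet<>());
                if (target.addAll(types.get(ctx))) {  // step 640: did a reference change?
                    worklist.add(next);               // step 650: schedule re-analysis
                }
            }
        }
        return types;
    }
}
```

The loop terminates even when contexts invoke one another cyclically, because a context is re-queued only when its set of types grows and the number of types is finite.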

At step 550, instance methods are expanded and analyzed based on their class type, point of creation and other attributes. At this step, type ambiguities are resolved by analyzing the object's history and point(s) of instantiation. Instance methods are not static or final, so instance methods can be overridden, and the exact method invoked is ambiguous, depending upon the actual class type.

To determine an object's type it is necessary to know the call history and point of instantiation of that object. In one or more embodiments of the invention, steps 530 through 550 may be repeated for every remaining context on the worklist.

Thus, a method and apparatus for static analysis of program code has been described in conjunction with one or more specific embodiments. The invention is defined by the claims and their full scope of equivalents.

Claims

1. An apparatus for static analysis of program code, said apparatus comprising:

an object list, comprising references to one or more objects to be instantiated during program execution, and at the point of instantiation of said one or more objects;

a reference table, comprising references to one or more objects to be referred by one or more methods during program execution; and

a context graph, comprising references to one or more contexts, said contexts including information about said one or more methods during program execution.

2. An apparatus of claim 1, wherein said information about said one or more methods includes references of objects invoking said methods during program execution.

3. An apparatus of claim 1, wherein said information about said one or more methods includes the type of objects invoking said methods during program execution.

4. An apparatus of claim 1, wherein a method of a first context in said context graph invokes a second context in said context graph.

5. A method for static analysis of program code, said method comprising:

parsing a worklist to select a context including information about one or more methods to be invoked during program execution;

creating a context including information about one or more methods to be invoked during program execution, if said worklist is empty;

selecting a method to be referenced by said context;

associating one or more local variables of said method to one or more objects to be instantiated during program execution; and

merging said context with at least one other context having methods that are equivalent to said method.

6. A method of claim 5 wherein said step of creating a context comprises:

creating one or more value references for a method of said context;

expanding static and final methods for said context; and

adding said context to a worklist.

7. A method of claim 5 further comprising:

detecting a change in said value references;

adding contexts using said value references to said worklist.