WO2000031626A1

WO2000031626A1 - Method of identifying recurring code constructs

Info

Publication number: WO2000031626A1
Application number: PCT/CA1999/000993
Authority: WO
Inventors: Autumn Umanetz; Lingyan Shen
Original assignee: Netron Inc.
Priority date: 1998-11-19
Filing date: 1999-10-22
Publication date: 2000-06-02
Also published as: AU6455799A; CA2254494A1

Abstract

A method of identifying recurring or common logical code elements within the source code of a computer application. Parsing of the source code of a set of files within the application produces a syntax tree which is then traversed to identify blocks of code. A fingerprint is created for each block of code, each fingerprint containing a characteristic for each type of statement located within the block. Each characteristic consists of a vector containing: the statement type, the number of occurrences of the statement, the number of bytes of data input to the statement and the number of bytes of data output from the statement. The user may specify that only certain types of statements are to be considered in creating a fingerprint. The user may also choose to aggregate types of statements into a single characteristic when creating the fingerprint. The fingerprints for each block of code are then submitted to a Bayesian classification engine which places the blocks of code into common groups based upon their fingerprints and displays them to the user. The user may then browse the selected groups to determine if there exist modules within the application that may be reused, redeployed or re-engineered.

Description

Title: METHOD OF IDENTIFYING RECURRING CODE CONSTRUCTS

FIELD OF THE INVENTION

The present invention is a software application designed to facilitate the re-use or re-engineering of existing software applications. The invention identifies areas of code commonality, facilitating the extraction of the common code elements into reusable objects.

BACKGROUND OF THE INVENTION

It is well known that many organizations continue to use and maintain what are known in the computer software industry as "legacy" applications. These legacy applications have recently gained media attention with regard to the year 2000 problem. Regardless of the year 2000 problem, organizations are repeatedly deciding whether or not to renovate or replace their existing legacy applications, for example by transitioning an application to make it available on an internet, or by structuring the application so that it may be distributed. The process of renovating is selected in the hope of making legacy applications easier to maintain as well as more easily modified. The problem with renovating or re-engineering legacy applications is that quite often the original developers are unavailable to help explain the structure of the application, and no development or design documentation is available.

In re-engineering legacy applications there are two traditional approaches: a) code analysis; and b) re-engineering.

The purpose of code analysis is to provide understanding of the structure of a legacy application. In analyzing code the type of information that is generally provided is: a) graphs showing program flow; b) charts showing data flow; c) listings of variable aliases; and d) where used information.

Re-engineering tools, on the other hand, are utilized to convert an existing application into a different type of application. The most common target for conversion is to turn a mainframe application into a client server application. Neither of these solutions addresses the problem of identifying re-usability within the existing legacy application.

In any legacy application, there will be programmed a number of "business rules" or logic components, for example the method used to determine the sales taxes on a specific class of item. To effectively renovate legacy applications, an organization needs to first identify the inventory of their current business rules and then be able to reuse them in new applications. The present invention aids in analyzing an existing application, to identify the business rules in the application.

The ability to recognize common business rules or logic components has a number of advantages, namely:

a) when maintaining an application, having the ability to ensure that all instances of the logic to be altered have been located; b) improving the flexibility of the legacy application by replacing inline code with components; c) creating new applications based upon the business rule or logic components which are embedded in existing applications; and d) obtaining a better understanding of the components in the legacy application, in particular being able to understand what gets done where, and reorganizing code that may be duplicated in many portions of the application.

Thus there is a need for a software tool to identify and thus provide for isolation of common business rules within a legacy application.

BRIEF SUMMARY OF THE INVENTION

A method for identifying recurring code constructs within the source code of a software application, the method comprising the steps of:

a) parsing the source code to create a syntax tree; b) traversing the syntax tree to identify blocks of code; c) creating a fingerprint for each block of code; d) submitting the fingerprints obtained in step c) to a classification engine; and e) providing the output from step d) to a user for analysis.

A system for identifying recurring code constructs within the source code of a software application comprising: a) a code parser; b) a create coarse fingerprint module for analyzing the output of said parser to produce a raw fingerprint file; c) a classification engine to classify data contained within the raw fingerprint file; and d) an output module to format the data output from the classification engine to the user for analysis.

A method for determining whether variables in a block of code are to be considered as input to or output from said block of code, dependent upon access to or modification of said variables within and without said block of code.

A database interface, said interface providing methods for accessing an object within a syntax tree, said interface comprising methods for: a) retrieving said object from said syntax tree by type or reference; b) retrieving information regarding the attributes of said object; and c) retrieving abstract type or relationship data of said object, given a string representing the name of said object.

A method for excluding statements from within a block of code, said statements being excluded from consideration when calculating a fingerprint for said block of code.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the accompanying drawings which show a preferred embodiment of the present invention and in which:

Figure 1 is a schematic diagram illustrating the components utilized in fingerprinting a legacy application;

Figure 2 is an illustration of the advanced classification settings screen;

Figure 3 is an illustration of the results provided by the classification engine;

Figure 4 is an illustration of a finer classification screen;

Figure 5 is a screen capture of the results obtained by running a finer classification; and

Figure 6 is an illustration of the filter settings screen.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The term "fingerprint" for this specification including the claims means a set of characteristics that are present within a block of source code. In the preferred embodiment these characteristics are: statement type, frequency of statement type, bytes input to statement type, and bytes output from statement type. A fingerprint may, however, be based upon other characteristics within a block of code; such other characteristics are described later. The present invention relates to the developing of a "fingerprint" to identify the characteristics of a block of source code. The purpose of the preferred embodiment of the present invention is to provide a method of identifying recurring code constructs without depending on:

a) naming conventions; b) consistency of coding style; or c) consistency in alignment of code.

Further, the code fingerprint should be desensitised to arbitrary differences in ordering of statements, i.e. differences in order that do not impact the functionality of the code.

Most computer languages support the same basic types of operations:

1. Flow control such as branching, procedure calls or iterative constructs;

2. Conditionals such as if, case and evaluate;

3. Assignments such as set, move and equals;

4. Arithmetic; and

5. I/O

Many different statement types may be used to achieve essentially the same result. For example the choice between a FOR or WHILE loop control is arbitrary; it is most often dictated by the type of boundary condition the programmer was most comfortable with.

Figure 1 is a schematic diagram illustrating the components utilized in fingerprinting a legacy application. A fingerprinting system 10 is comprised of a parser 12 which takes source code 14 as input and produces syntax tree 16. In the preferred embodiment parser 12 is the Revolve COBOL parser provided by MicroFocus Corporation. As can be appreciated by one skilled in the art, the present invention is not restricted to the Revolve parser or the COBOL language. Syntax tree 16 provides input data for create coarse fingerprint process 18. The create coarse fingerprint process 18 analyzes syntax tree 16 and creates a persistent raw fingerprint database 20. Raw fingerprint database 20 serves as input to aggregate fingerprint process 22. A user request 24 provides input parameters to the aggregate fingerprint process 22 to aid in selecting the source code components within source code 14 that are of greatest interest with regard to re-usability. The output of the aggregate fingerprint process 22 is then passed to a filter 26. The user may interact with the filter 26 to prepare classification data for the source code modules of interest; the filter 26 then passes that classification data to a classification engine 28. The classification engine 28 outputs the result of the classification to a formatting process 30 for presentation to the user, who reviews it to determine whether there are any source code modules for potential reuse.

In order to establish a fingerprint for a block of code, the following information is required from the parser 12:

a) identification of boundaries for each block of code; b) identification of the set of statements within each block of code; and c) identification of the data acted on by each statement.

All of this information is readily available from any parser. Because the classification is most meaningful when done on large numbers of programs, it is helpful if the fingerprint information is persistent. Thus, the invention uses the raw fingerprint database 20 to retain fingerprint information. This allows multiple passes of the classification to be done without re-parsing the input source code 14.

Boundary of a Block of Code

Most programs are written with a combination of formal boundaries and informal boundaries. Formal boundaries are those provided by the language for the isolation of a set of statements required to accomplish some task. In FORTRAN and C, functions provide formal boundaries. In COBOL, paragraphs and sections provide formal boundaries. Informal boundaries are boundaries that a programmer might identify as separating sets of statements. These boundaries might be identified by white space, comments or the beginning and end of block oriented statements in the language (e.g., an if and an end-if).

This means that the identifiers for each block of code delimited by specific boundaries need to be extracted from the parser.

COBOL Example 1

Process-Transfer Section.
    Move "Y" to Process-Trans.
    Subtract Transfer-Amount from Current-Balance giving New-Balance.
    If New-Balance < 0
        Move Account-Number to Check-Account-Number
        Perform Check-Account
        If New-Balance < Safety-Balance AND Process-Trans = "Y"
            Move Customer-Number to Warn-Customer-Number
            Perform Warn-Customer.
    If Process-Trans = "Y"
        Perform Calculate-Transaction-Fee
        Compute New-Balance = Current-Balance - Transfer-Amount - Trans-Fee.

Check-Account Section.
    If Account-Type = "Z"
        Move "Y" to Process-Trans
        Go Politely-Remind-Customer.
    If Account-Type = "M"
        Go Friendly-Check-Account.
    Go Abort-Transfer.

Calculate-Transaction-Fee Section.
    Compute Trans-Fee = Transfer-Amount * .001.
    Perform Special-Discount-Check
        Varying Discount-Type From 1 by 1
        until Discount-Type Greater than Num-Discount-Types.
    If Trans-Fee > 100
        Move 100.00 to Trans-Fee.
    If Account-Type = "M"
        Divide Trans-Fee By 2 Giving Trans-Fee.
    If Account-Type = "Z"
        Move 0 to Trans-Fee.

Statements Within a Block of Code

All parsers produce an ordered set of statements. For fingerprinting purposes, the order is not significant. What is significant is the use of the verbs in the statement.

In the COBOL section titled "Calculate-Transaction-Fee" of COBOL Example 1 above, the statements found in the section and their number of occurrences are:

• 1 Compute

• 1 Perform

• 3 If

• 2 Move

• 1 Divide
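The tally above can be reproduced with a minimal sketch, written in Python purely for illustration (the specification does not prescribe an implementation language). The verb list is transcribed from the Calculate-Transaction-Fee section in source order; the function name `count_verbs` is hypothetical.

```python
from collections import Counter

# Verbs of the Calculate-Transaction-Fee section, in source order.
statements = ["Compute", "Perform", "If", "Move", "If", "Divide", "If", "Move"]


def count_verbs(verbs):
    """Tally occurrences of each verb; the order of statements is ignored."""
    return Counter(verbs)


counts = count_verbs(statements)

# Reordering the statements leaves the tally unchanged, which is the
# order-insensitivity the fingerprint relies on.
reordered_counts = count_verbs(reversed(statements))
```

Because a tally is a multiset, two blocks that contain the same statements in a different order yield the same counts.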

Data Acted on by a Statement

The last information required from the parser is what data is referenced or manipulated by the statement. All parsers provide this information. For each statement we need to know what data: a) is input to the statement (i.e., referenced or read); and b) is output or modified by the statement.

Consider the following statement from the section titled "Calculate-Transaction-Fee" of COBOL Example 1 above:

    Perform Special-Discount-Check
        Varying Discount-Type From 1 by 1
        until Discount-Type Greater than Num-Discount-Types.

We see that for this statement, Num-Discount-Types and 1 are input, while Discount-Type is output.

C Example 1

{
    for (j = 1; j <= upper; j++)
    {
        if (Location(j) == 1 ||
            Location(j) + 1 != WrapAround(LOWER, Location(j) + 1, UPPER))
            count++;
    }
    return count;
}

long WrapAround(long lower, long num, long upper)
{
    long NumReturned; // a long is 4 bytes
    NumReturned = num;
    if (num > upper)
        NumReturned = lower + (num - upper) - 1;
    if (num < lower)
        NumReturned = upper - (lower - num) + 1;
    return NumReturned;
}

long Bounded(long lower, long num, long upper)
{
    long NumBounded;
}

From the C Example 1 above, within the function WrapAround:

    if (num > upper)
        - num and upper are both input

    NumReturned = lower + (num - upper) - 1
        - lower, num, and upper are input; NumReturned is output.

Accessing Parser Information

The present invention provides an object oriented wrapper interface to the syntax tree 16. This insulates the fingerprinting methodology from changes made to the API (Application Programming Interface) of the parser, and provides a level of abstraction.

The object oriented wrapper of the present invention utilizes two types of objects. One type represents database objects stored in the syntax tree 16, and the other provides a database interface allowing access to the database objects within the syntax tree 16. Database objects may be considered as nodes and relationships within the syntax tree 16.

Database Objects

Database objects are of three forms: Nodes, Relationships, and Types.

Nodes are things which actually exist in the source code. Examples of nodes are:

a) a statement; b) a paragraph label; c) a variable definition; and d) a usage of a variable by a statement

Relationships indicate interaction between nodes. They are directional. Examples include:

a) an "OF" relationship points from a variable usage to its definition; b) a "HAS" relationship points from a paragraph or function to a statement; c) another "HAS" relationship points from a statement to its variable usage; and d) a "HAVING" relationship is a recursive "HAS" relationship.

Types describe the attributes of an object. For example, a type of STATEMENT_MOVE would be used to describe a COBOL move statement, and a type of USAGE_VARIABLE_MOD indicates usage of a variable such that its value is modified.
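The three forms of database object can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the wrapper of the preferred embodiment: the class names `Node` and `SyntaxTree`, their methods, and the type names `PARAGRAPH` and `VARIABLE_DEFINITION` are hypothetical, while `STATEMENT_MOVE`, `USAGE_VARIABLE_MOD`, "HAS", "OF" and "HAVING" come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    node_id: int
    type_name: str  # the Type describing this node, e.g. "STATEMENT_MOVE"


@dataclass
class SyntaxTree:
    nodes: Dict[int, Node] = field(default_factory=dict)
    # Directional relationships stored as (name, from_id, to_id).
    relationships: List[Tuple[str, int, int]] = field(default_factory=list)

    def add(self, node: Node) -> Node:
        self.nodes[node.node_id] = node
        return node

    def relate(self, name: str, src: Node, dst: Node) -> None:
        self.relationships.append((name, src.node_id, dst.node_id))

    def has(self, src_id: int) -> List[Node]:
        """Nodes reached by one "HAS" relationship from src_id."""
        return [self.nodes[dst] for name, src, dst in self.relationships
                if name == "HAS" and src == src_id]

    def having(self, src_id: int) -> List[Node]:
        """Nodes reached by following "HAS" recursively ("HAVING")."""
        found, stack, seen = [], [src_id], {src_id}
        while stack:
            for child in self.has(stack.pop()):
                if child.node_id not in seen:
                    seen.add(child.node_id)
                    found.append(child)
                    stack.append(child.node_id)
        return found


tree = SyntaxTree()
paragraph = tree.add(Node(1, "PARAGRAPH"))             # hypothetical type name
move = tree.add(Node(2, "STATEMENT_MOVE"))
usage = tree.add(Node(3, "USAGE_VARIABLE_MOD"))
definition = tree.add(Node(4, "VARIABLE_DEFINITION"))  # hypothetical type name
tree.relate("HAS", paragraph, move)
tree.relate("HAS", move, usage)
tree.relate("OF", usage, definition)
```

In this toy tree, `tree.has(1)` reaches only the move statement, whereas `tree.having(1)` also reaches the variable usage beneath it; the "OF" relationship is not followed by either call.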

Database Interface

The following table lists the methods provided by the object oriented wrapper of the present invention that are utilized to obtain information from the syntax tree 16.

The fingerprinting process gets the parser data according to the following algorithm:

Get the programs of interest from the user, save in SetOfProgramsOfInterest
Get the code blocks inside SetOfProgramsOfInterest, save in SetCodeBlocks
Get the statement types of interest from the user, save in SetStatementTypes
For each CodeBlock in SetCodeBlocks:
    For each StatementType in SetStatementTypes:
        Tally number of statements of StatementType in CodeBlock
        Tally size of data referenced by statements of StatementType in CodeBlock
        Tally size of data modified by statements of StatementType in CodeBlock

Classification Engine Interface

The classification engine 28 of the preferred embodiment is a software package known as AutoClass C v3.2.1, developed by the computational sciences division at NASA Ames Research Center. AutoClass makes use of Bayesian statistical methods. More information on AutoClass may be obtained at the web site: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/. The classification engine 28 works best when each item to be classified is represented by a set of characteristics that have numeric values. The input to the classification engine 28 comprises:

i) a set of control information dictating characteristics of the classification process (duration, number of attempts, expected number of classes or groups, convergence algorithm to be used); and ii) a list of items to be classified, where each item is represented by a tag for identification and a value for each of the characteristics of interest for the classification.

The classification engine 28 does not assign any meaning to any of the characteristics. Nor does it, a priori, weight any characteristics differently than others. It determines a weighting during classification based on the utility of the characteristic for grouping things. For example, if a characteristic has the same value for every single case then it is useless and has a weighting of zero. Nor does it care how many characteristics items have, though each item must have a value specified for each characteristic and the characteristics must be in the same order for each of the items being classified.

Non-numeric or discretely valued (index) numeric values can be assigned to a characteristic. The classification engine 28 recognizes when such values are the same, but makes no other attempt to compare non-numeric input. For the purpose of the present invention, the characteristics of each item being input for classification represent the fingerprint for a block of code.
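The requirement that every item carry a value for every characteristic, in the same order, can be sketched as follows. This Python fragment is illustrative only and does not reproduce the input file format of AutoClass; the function names, the fingerprint layout (count, bytes in, bytes out) per statement type, and all byte values are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# A fingerprint maps a statement type to (occurrences, bytes in, bytes out).
Fingerprint = Dict[str, Tuple[int, int, int]]


def to_rows(fingerprints: Dict[str, Fingerprint],
            statement_types: Sequence[str]) -> List[list]:
    """Flatten each fingerprint into [tag, v1, v2, ...] in one fixed order."""
    rows = []
    for tag in sorted(fingerprints):
        row = [tag]
        for statement_type in statement_types:
            # A statement type absent from a block still gets explicit zeros.
            row.extend(fingerprints[tag].get(statement_type, (0, 0, 0)))
        rows.append(row)
    return rows


def constant_columns(rows: List[list]) -> List[int]:
    """Indices of value columns holding the same value in every row."""
    if not rows:
        return []
    return [col for col in range(1, len(rows[0]))
            if len({row[col] for row in rows}) == 1]


# Hypothetical byte counts; the statement counts follow COBOL Example 1.
fingerprints = {
    "Calculate-Transaction-Fee": {"IF": (3, 20, 0), "MOVE": (2, 12, 12)},
    "Check-Account": {"IF": (2, 2, 0), "MOVE": (1, 1, 1), "GO": (3, 0, 0)},
}
rows = to_rows(fingerprints, ["GO", "IF", "MOVE"])
useless = constant_columns(rows)
```

Columns reported by `constant_columns` have the same value for every item, which is the case the text describes as carrying no weight in the grouping.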

Output

The output from the classification engine 28 is a set of groupings containing the items that were submitted for classification. Each submitted item will be placed in exactly one group. The set of groupings represents the classification engine's best classification of the input.

For the purpose of the present invention, each group represents a set of blocks of code that have similar fingerprints. If the blocks of code have similar fingerprints, they must contain a similar set of statements. The fact that every member of a group contains similar statements makes them candidates for further review to determine if blocks of code are in fact repetitions of the same code.

Fingerprinting

Create coarse fingerprint process 18 reads the syntax tree 16 and counts the number of occurrences of each distinct type of statement supported by the language. For COBOL this means counting the number of occurrences of:

PERFORM
GO
IF
EVALUATE
MOVE
SET
COMPUTE
ADD
SUBTRACT
MULTIPLY
DIVIDE
READ
WRITE
OPEN
CLOSE
non COBOL statements, e.g. EXEC CICS
ACCEPT
CALL
DISPLAY
DELETE
ENTRY
INITIALIZE
INSPECT
MERGE
RECEIVE
REWRITE
SEND
SORT
START
STRING
UNSTRING

Create coarse fingerprint 18 then counts the total number of bytes of input to all occurrences of each statement type within the block of code being fingerprinted. Finally, it counts the total number of bytes of output from each statement type within the block of code.

This means that the raw characteristics of a block of code consist of three numbers for each statement type supported by the language of interest:

1. number of occurrences;

2. total bytes of input; and

3. total bytes of output

This information is made persistent by storing it in the Raw fingerprint database 20 to speed up subsequent classifications.
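The three raw numbers per statement type can be computed with a sketch along the following lines. It is a simplified illustration in Python, not the create coarse fingerprint process 18 itself: the `Statement` class, the function `raw_fingerprint` and every variable size shown are hypothetical, and constants are left out of the byte totals for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Statement:
    verb: str
    inputs: Tuple[str, ...] = ()   # data items referenced (constants omitted)
    outputs: Tuple[str, ...] = ()  # data items modified


def raw_fingerprint(statements: Iterable[Statement],
                    sizes: Dict[str, int],
                    statement_types: Sequence[str]
                    ) -> Dict[str, Tuple[int, int, int]]:
    """Return {statement type: (occurrences, bytes in, bytes out)}."""
    tally = {statement_type: [0, 0, 0] for statement_type in statement_types}
    for statement in statements:
        if statement.verb not in tally:
            continue  # statement type not of interest
        entry = tally[statement.verb]
        entry[0] += 1
        entry[1] += sum(sizes[name] for name in statement.inputs)
        entry[2] += sum(sizes[name] for name in statement.outputs)
    return {verb: tuple(entry) for verb, entry in tally.items()}


# Hypothetical sizes in bytes, chosen only for this example.
sizes = {"Trans-Fee": 6, "Transfer-Amount": 8, "Account-Type": 1,
         "Discount-Type": 2, "Num-Discount-Types": 2}

# The Calculate-Transaction-Fee section of COBOL Example 1.
block = [
    Statement("COMPUTE", ("Transfer-Amount",), ("Trans-Fee",)),
    Statement("PERFORM", ("Num-Discount-Types",), ("Discount-Type",)),
    Statement("IF", ("Trans-Fee",)),
    Statement("MOVE", (), ("Trans-Fee",)),
    Statement("IF", ("Account-Type",)),
    Statement("DIVIDE", ("Trans-Fee",), ("Trans-Fee",)),
    Statement("IF", ("Account-Type",)),
    Statement("MOVE", (), ("Trans-Fee",)),
]
fingerprint = raw_fingerprint(
    block, sizes, ["COMPUTE", "PERFORM", "IF", "MOVE", "DIVIDE"])
```

Storing the resulting dictionary per block is one way the persistence described above could be pictured; the actual storage layout of the raw fingerprint database 20 is not specified here.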

Referring to the following portion of COBOL Example 1 above;

Calculate-Transaction-Fee Section.
    Compute Trans-Fee = Transfer-Amount * .001.
    Perform Special-Discount-Check
        Varying Discount-Type From 1 by 1
        until Discount-Type Greater than Num-Discount-Types.
    If Trans-Fee > 100
        Move 100.00 to Trans-Fee.
    If Account-Type = "M"
        Divide Trans-Fee By 2 Giving Trans-Fee.
    If Account-Type = "Z"
        Move 0 to Trans-Fee.

if we use variable sizes given as:

Then we have the raw characteristics for this block as:

With regard to constant values, a constant is treated as having the same number of bytes as the size of its target.

Similarly, for the C function within C Example 1 above, given by:

long WrapAround(long lower, long num, long upper)
{
    long NumReturned; // a long is 4 bytes
    if (num > upper)
        NumReturned = lower + (num - upper) - 1;
    if (num < lower)
        NumReturned = upper - (lower - num) + 1;
    return NumReturned;
}

Then we have the raw characteristics for this block as:

These characteristics are then stored in the raw fingerprint database 20.

To this point, we have discussed those attributes of code groups which are defined by microscopic characteristics of specific types of logic within those groups. We will now discuss attributes which derive from the macroscopic behaviour of the code group.

Specifically, a block of code can be treated as a gestalt, having inputs and outputs like any single statement within it. The characteristics of these inputs and outputs comprise the macroscopic attributes of the block of code. Languages such as COBOL do not explicitly declare their inputs and outputs, and even languages such as C do not necessarily distinguish between inputs and outputs. This section describes a method for automatic determination of those types from the structure of the code block and the code which surrounds it.

Loops complicate variables within and surrounding them as they make the order of statements less than clear. The following examples help to illustrate how the preferred embodiment of the present invention determines whether a variable associated with a loop is classified as input or output.

Problem: Loops complicate everything.

Solution: Redefine the problem. For input, given the last out-of-target modification prior to the first in-target reference, we simply require no interposing modification. For output the situation is reversed: there must be no interposing modification between the last in-target modification and the first subsequent out-of-target reference.

Algorithm: Search starting at the outermost in-target use, and searching for the corresponding out-of-target use (forward for output, backward for input). Branch whenever a loop is exited (top or bottom), continuing one's search as though there were no loop, and continuing the other search at the opposite end of the loop.

Conditions: An in-loop search branch fails instantly after completely traversing the loop once or returning to the original in-target use. A search branch fails instantly upon striking an in-target modification (input) or non-target modification (output). A search succeeds instantly upon striking a non-target modification (input) or a non-target reference (output). If any branch succeeds, the test has succeeded.

It is potentially possible for a target to enclose non-target code. For example, the most general case of an object selected for re-use may not include several lines which appear somewhere in the input code. Therefore, the user may wish to remove them from the definition of the re-usable object. As some parsers, including the Revolve parser, do not provide the ability to create a new parse tree from arbitrary code (i.e. the re-usable object not including the unwanted code), the existing parse tree of the original source must be utilized. Therefore, the logic of the preferred embodiment of the present invention includes a way to exclude the unwanted code for the purpose of fingerprinting.
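The success and failure conditions given above for classifying a variable as input or output can be sketched for the simplest case. The Python fragment below is a deliberately reduced illustration: it assumes straight-line code with no loops, so the branching at loop exits is not modelled, and the names `Use`, `is_input` and `is_output` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Use:
    var: str
    kind: str        # "ref" (referenced) or "mod" (modified)
    in_target: bool  # True when the use lies inside the block being tested


def is_input(uses: List[Use], var: str) -> bool:
    """Search backward from the first in-target reference of var."""
    first_ref = next((i for i, u in enumerate(uses)
                      if u.var == var and u.in_target and u.kind == "ref"),
                     None)
    if first_ref is None:
        return False
    for use in reversed(uses[:first_ref]):
        if use.var == var and use.kind == "mod":
            # An in-target modification fails; a non-target one succeeds.
            return not use.in_target
    return False


def is_output(uses: List[Use], var: str) -> bool:
    """Search forward from the last in-target modification of var."""
    last_mod = None
    for i, use in enumerate(uses):
        if use.var == var and use.in_target and use.kind == "mod":
            last_mod = i
    if last_mod is None:
        return False
    for use in uses[last_mod + 1:]:
        if use.var == var and not use.in_target:
            # A non-target reference succeeds; a non-target modification fails.
            return use.kind == "ref"
    return False


# Hypothetical use sequence around a target block, in execution order.
uses = [
    Use("Transfer-Amount", "mod", False),
    Use("Transfer-Amount", "ref", True),
    Use("Trans-Fee", "mod", True),
    Use("Trans-Fee", "ref", False),
    Use("Discount-Type", "mod", True),
    Use("Discount-Type", "ref", True),
]
```

Here Transfer-Amount is set outside the block and read inside it, so it is classified as input; Trans-Fee is set inside and read afterwards outside, so it is classified as output; Discount-Type is both set and read inside the block only and is neither.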

Returning now to Figure 1, aggregate fingerprint process 22 then reads the raw fingerprint data 20 output by create coarse fingerprint 18.

In this step the fact that many languages provide more than one way to accomplish the same thing is taken into account. For an end user of the product, the preferred embodiment has the following aggregations for COBOL:

a) PERFORM & GO are aggregated to represent control flow;
b) IF & EVALUATE are aggregated to represent conditionals;
c) MOVE & SET are aggregated to represent assignment;
d) all IO statements are aggregated to represent file access;
e) all math statements are aggregated to represent arithmetic; and
f) all non COBOL statements are aggregated together.
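The aggregations listed above amount to summing the raw characteristics of the member statement types. The following Python sketch illustrates this under stated assumptions: the group names, the function `aggregate` and the byte values are hypothetical, and the I/O and non COBOL groups are omitted because their exact membership is not enumerated in the text.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[int, int, int]  # (occurrences, bytes in, bytes out)

# Group membership taken from the aggregations described for COBOL.
AGGREGATES = {
    "flow": ("PERFORM", "GO"),
    "conditional": ("IF", "EVALUATE"),
    "assignment": ("MOVE", "SET"),
    "arithmetic": ("COMPUTE", "ADD", "SUBTRACT", "MULTIPLY", "DIVIDE"),
}


def aggregate(raw: Dict[str, Triple], enabled: Set[str]) -> Dict[str, Triple]:
    """Sum the triples of statement types belonging to each enabled group."""
    group_of = {verb: group
                for group, verbs in AGGREGATES.items() if group in enabled
                for verb in verbs}
    result: Dict[str, Triple] = {}
    for verb, triple in raw.items():
        key = group_of.get(verb, verb)  # ungrouped types keep their own name
        previous = result.get(key, (0, 0, 0))
        result[key] = tuple(a + b for a, b in zip(previous, triple))
    return result


# Raw characteristics with hypothetical byte values.
raw = {"COMPUTE": (1, 8, 6), "PERFORM": (1, 2, 2), "IF": (3, 8, 0),
       "MOVE": (2, 0, 12), "DIVIDE": (1, 6, 6)}

coarse = aggregate(raw, {"flow", "conditional", "assignment", "arithmetic"})
fine = aggregate(raw, set())  # no aggregation: distinct statement types
```

With every group enabled, COMPUTE and DIVIDE collapse into a single arithmetic characteristic; with none enabled, the raw characteristics pass through unchanged.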

Referring to the calculate-transaction-fee section of COBOL Example 1 above;

Figures 2 and 4 illustrate this use of the aggregation. Note, for example, that in Figure 2 the "flow" box is checked to enable aggregation of Perform and Go statements, whereas in Figure 4 it is not checked, so that distinct counts are provided for Perform and Go statements. The advanced control screen of Figures 2 and 4 has been provided for knowledgeable analysts to allow them to manipulate aggregation for any of the above groups. This allows them to differentiate between occurrences of statements with similar behaviour and, in doing so, highlight more subtle differences in blocks of code. This can be important when the analyst is interested in:

a) stylistic differences in code; or b) identifying usage of a particular construct that they may have determined is problematic.

Consider, for example, a COBOL calculation routine. If there were two versions of the same routine utilized in various places within the application, one of which used COMPUTE statements and the other of which used the older ADD or MULTIPLY statements, with aggregation on they would be lumped together. With aggregation off, they would be categorized separately. One of the versions may be problematic in that it may be wrong and need replacing with the other, because there may be variances in how the two statements are implemented. If they are meant to perform the same function, consistency in behaviour can be assured by using the same coding construct.

The filter step 26 filters out characteristics that should be excluded from the current classification. This is done under user control. Excluding some characteristics is useful to draw out different similarities between blocks of code. As illustrated in Figure 6, an end user of the product can filter out the following characteristics:

a) I/O - excluding these statements desensitizes the classification to the stylistic differences between programmers who isolate their IO statements and those who code them inline; b) data - excluding the consideration of data highlights similarities in the types of statements. This is useful when searching for commonality in code structure that may be repeated for use with very different sets of data. For example the same algorithm may be applied against a single instance of data in some cases and against arrays of varying sizes in other cases. Excluding data desensitizes the classification to variances in data; and c) Logic & flow - excluding these statements filters out the control structure around algorithmic logic. This flattens the code structure and highlights similarities in data movement and transformation.

In use, the user would select the source code modules 14 that comprise the project of interest, and input them to the parser 12 so that a syntax tree 16 may be generated.

Once the source code 14 has been parsed, the create coarse fingerprint process 18 analyzes the syntax tree 16 and creates raw fingerprint database 20. Raw fingerprint database 20 contains counts of each type of statement and the number of input and output fields within each functional block of the selected source code 14. The user may then manipulate the data contained in raw fingerprint database 20 to provide the desired input to classification engine 28. The typical user will have access only to the classification settings screen of Figure 6, which allows them to simply select aggregate groups of statements or data. Analysts more familiar with the functionality of classification engine 28 will have access to the advanced classification settings screen illustrated in Figures 2 and 4.

Figure 2 illustrates a screen capture of the advanced classification settings available to analysts more familiar with the functionality of the classification engine 28. As shown in Figure 2, the minimum statements option 32 has been set to five statements. The use data option 34 has been turned off so as not to consider the size of data fields when determining the characteristics to be selected for input to the classification engine 28. The statement type menu 36 has all of the high level grouping attributes selected. Thus, for example, both Perform and Go statements will be considered identical for the purpose of determining the characteristic of a block of code. As can readily be appreciated, the advanced classification settings screen provides for finer control over the level of granularity than that provided by the default aggregation classification settings provided to the typical user. The autoclass settings section 38 enables a user knowledgeable about the functionality of the classification engine 28 to set a plurality of input variables. The classification engine 28 uses a random seed to direct the search for its initial classification. Subsequent classifications are refinements of the first one. Using a known seed, rather than a random one, results in a reproducible (albeit potentially useless) set of classifications. This facility is particularly useful for demonstration purposes. The search length selection box 41 indicates the duration of the search for new classifications that better satisfy the criteria selected. It can be specified in seconds, or in terms of the number of attempts at refinement. As can be appreciated, any of the input variables accepted by the classification engine 28 may be provided for the user in this advanced classification settings interface.

Figure 3 illustrates the default groupings created by the classification engine 28 on a set of source code 14 input to the parser 12. The formatting process 30 has broken the source code files selected into a number of groups 42. Within each group 42 is a paragraph heading 44 (as this is a COBOL example), the file name 46 which contains the paragraph 44, and a probability ranking 48. The probability ranking is a simple comparison of the attributes of a given code item to the mean values for the classification in which it appears. The displaying of groups 42 in the output screen allows the user to focus on the types of logic that they are primarily interested in. For example, if a user is primarily interested in analyzing business logic, the user can look at a couple of representative members of a group 42, and quickly characterize the group as I/O handling. Such a group can then be deleted and discarded, or deferred for later analysis.

Having selected the groups of interest, the user may now re-classify the code blocks using a finer classification pattern. Referring now to Figure 4, a screen capture of the advanced classification settings screen, the use data option 34 has been selected. Additionally, within the statement type menu 36 the high level attributes have been turned off to eliminate grouping of statements and provide much finer statement resolution. The intent of this finer resolution is to create a group of sub-classifications, which in general terms may be similar but differ in minute ways. An example might be a coarse classification which finds a classification containing several date handling routines. Fine classification might create two sub groups of the date handling routines, one of which is year 2000 compliant and another which is not.

Figure 5 illustrates the output of the finer classification requested by the user in Figure 4. As can be seen, the groups of Figure 5 differ considerably from the groups of Figure 3. Further, group 1 of Figure 5 has been sub-divided into two sub groups. This creation of sub groups is a result of the characteristics of the code items being significantly different, although similar enough to have been grouped together. Re-classifying with different settings changes the nature of the attribute set that is input to the classification engine 28 and thus enables it to find sub groupings.

Figure 6 is a capture of the code classification settings screen available to the user when filtering the output of the aggregate fingerprint process 22. As shown, the user may choose to exclude three general types of code from consideration for a particular classification:

a) logic and flow; b) I/O; and c) data.
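As a hedged sketch only, such an exclusion could be expressed in Python as a filter over the characteristics of a fingerprint. The table CATEGORY_OF, which assigns each characteristic to one of the three general types, and the function filter_fingerprint are hypothetical; the description does not state which characteristics belong to which type.

```python
from typing import Dict, Set, Tuple

Characteristic = Tuple[int, int, int]  # (occurrences, bytes in, bytes out)

# Hypothetical assignment of characteristics to the three general types.
CATEGORY_OF = {
    "flow control": "logic and flow",
    "conditional": "logic and flow",
    "input/output": "I/O",
    "assignment": "data",
    "arithmetic": "data",
}


def filter_fingerprint(fp: Dict[str, Characteristic],
                       excluded: Set[str]) -> Dict[str, Characteristic]:
    """Drop every characteristic whose general type has been excluded."""
    return {name: value for name, value in fp.items()
            if CATEGORY_OF.get(name) not in excluded}


fp = {"flow control": (2, 0, 0), "input/output": (3, 0, 0),
      "assignment": (1, 0, 0)}
without_io = filter_fingerprint(fp, {"I/O"})
```

Characteristics that have no entry in the table are kept, so that an incomplete table never silently discards information.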

In the example of Figure 6 the user has also requested fast classification. In the preferred embodiment, fast classification is a pre-defined setting for the user dialogue indicating a value of ten tries. Non-fast classification equates to fifty tries. The user has also requested that there be a minimum of ten statements in a paragraph (this being a COBOL example) in order for a block of code to be considered for fingerprinting. As can be appreciated by those skilled in the art, various characteristics other than the characteristics selected for the preferred embodiment may be utilized in creating a fingerprint for a block of code. The inventors have found promise in the following characteristics:

a) number of input parameters to a block; b) number of output parameters to a block; c) (number of bytes input)/(number of bytes output) to a function or component; d) parameter types passed or returned to a function or component interface; e) label matching to identify similarities between functional blocks due to common labels; f) number of lines of code in a block; g) total variable references within a block; h) the number of flows into the block in the control flow graph, i.e. the number of different places in the code that invoke the block, sometimes referred to as fan-in; i) the number of flows out of the block in the control flow graph, i.e. the number of different blocks of code invoked by the block under consideration, sometimes referred to as fan-out; j) the McCabe metric for the block, a measure of cyclomatic complexity of the directed graph which represents the flow of control within each block, which reflects the number of logic decisions within the block; k) (lines of code)/(lines of non-blank comment), a measure of the level of code documentation; l) (McCabe metric)/(lines of code in block), a measure of complexity per line of code; m) the number of literal string and numeric constants used in the block; n) various Halstead complexity metrics which are based on the number of operators and operands in the block; and o) the value of (Halstead metric)/(lines of code).
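Some of the characteristics listed above can be computed directly from graph data. The following Python sketch is illustrative only and assumes that the control flow graph of a block and the call sites of the application are already available as lists of edges; the function names are hypothetical. The McCabe metric is computed with the standard formula E - N + 2 for a single connected control flow graph.

```python
from typing import Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def cyclomatic_complexity(nodes: Iterable[Hashable], edges: List[Edge]) -> int:
    """McCabe metric E - N + 2 for one connected control flow graph."""
    return len(edges) - len(set(nodes)) + 2


def fan_in_out(call_sites: List[Edge], block: Hashable) -> Tuple[int, int]:
    """Fan-in counts invoking call sites; fan-out counts distinct callees."""
    fan_in = sum(1 for _, callee in call_sites if callee == block)
    fan_out = len({callee for caller, callee in call_sites if caller == block})
    return fan_in, fan_out


# Control flow of a block containing a single IF ... ELSE decision.
nodes = ["entry", "then", "else", "exit"]
edges = [("entry", "then"), ("entry", "else"),
         ("then", "exit"), ("else", "exit")]
mccabe = cyclomatic_complexity(nodes, edges)   # 4 - 4 + 2 = 2

# One (caller, callee) pair per call site; MAIN invokes READ-REC twice.
calls = [("MAIN", "READ-REC"), ("MAIN", "CALC"),
         ("CALC", "READ-REC"), ("MAIN", "READ-REC")]
read_rec = fan_in_out(calls, "READ-REC")       # (3, 0)
```

Dividing the McCabe metric by the number of lines of code in the block would then give the ratio described in item l).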

As will be apparent to those skilled in the art, various modifications and adaptations of the method and system described above are possible without departing from the present invention, the scope of which is defined in the appended claims.

Claims

WE CLAIM:

1. A method for identifying recurring code constructs within the source code of a software application, the method comprising the steps of:

2. The method of claim 1 wherein each fingerprint for said block of code comprises one or more characteristics, each characteristic comprising:

a) a statement type; b) the number of occurrences of said statement type within said block of code; c) the number of bytes of data input to said statement type; and d) the number of bytes output by said statement type.

3. The method of claim 2 which includes aggregating occurrences of statement types counted by step b) into a single group based upon similarity of statement type, when creating said fingerprints.

4. The method of claim 2 or 3 which includes counting the number of bytes of data in selected fingerprints, when creating said fingerprints.

5. The method of claim 3 wherein said groups comprise statements of the type:

a) flow control; b) conditional; c) assignment; d) arithmetic; e) Input/Output; and f) all statements not native to the language being parsed.

6. The method of claim 2, which includes selecting the minimum number of statements required within a block of code for fingerprinting said block of code.

7. The method of claim 1 wherein said blocks of code are defined by a Section and/or Paragraph in the COBOL programming language.

8. The method of claim 1 wherein said blocks of code are defined by a function or procedure in the C programming language.

9. The method of claim 2 additionally comprising a filtering step to filter out characteristics or groups of characteristics prior to the submission of said fingerprints to said classification engine.

10. The method of claim 9 wherein said filtering step filters out at least one of the following characteristics:

a) Input/Output; b) Data; and c) Logic and flow.

11. A system for identifying recurring code constructs within the source code of a software application comprising: a) a code parser; b) a create coarse fingerprint module for analyzing the output of said parser to produce a raw fingerprint file; c) a classification engine to classify data contained within the raw fingerprint file; and d) an output module to format the data output from the classification engine to the user for analysis.

12. The system of claim 11 wherein said classification engine is the AutoClass software package provided by NASA.

13. The system of claim 11 or 12, which includes: an aggregate fingerprint module including a user input for selection of code components of interest, the aggregate fingerprint module being connected to the create coarse fingerprint module; and a filter connected between the aggregate fingerprint module and the classification engine, for filtering out selected characteristics.

14. A method for determining whether variables in a block of code are to be considered as input to or output from said block of code, dependent upon access to or modification of said variables within and without said block of code.

15. A database interface, said interface providing methods for accessing an object within a syntax tree, said interface comprising methods for: a) retrieving said object from said syntax tree by type or reference; b) retrieving information regarding the attributes of said object; and c) retrieving abstract type or relationship data of said object, given a string representing the name of said object.

16. A method for excluding statements from within a block of code, said statements being excluded from consideration when calculating a fingerprint for said block of code.

17. The method of claim 2 which includes instructing said classification engine to utilize a known seed, rather than a random seed, to create a reproducible set of classifications.