WO2005008572A1 - Method and apparatus for automated feature selection

Publication number
WO2005008572A1
Authority
WO
WIPO (PCT)
Application number
PCT/US2004/021981
Inventor
David E. Huddleston
Ronald J. Cass
Zhou Meng
Yoh-Han Pao
Qian Yang
Xinyu Mao
Original Assignee
Computer Associates Think, Inc.
Application filed by Computer Associates Think, Inc.
Priority to EP04756808A (patent EP1654692A1)
Publication of WO2005008572A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features


Abstract

A method for automated feature selection is provided. One or more initial sets of features are generated and evaluated to determine quality scores for the feature sets. (i) Selected ones of the feature sets are chosen according to the quality scores and modified to generate a generation of modified feature sets, (ii) the modified feature sets are evaluated to determine quality scores for the modified feature sets, and (i) and (ii) are repeated until a modified feature set is satisfactory.

Description

METHOD AND APPARATUS FOR AUTOMATED FEATURE SELECTION
TECHNICAL FIELD
This application relates to system modeling, pattern recognition and data mining. In particular, the application relates to automated feature selection for system modeling, pattern recognition, data mining, etc.
DESCRIPTION OF RELATED ART
Feature selection is of theoretical interest and practical importance in the practice of pattern recognition and data mining. Data objects typically can be described in terms of a number of feature values. Some features may be categorical and thus may be expanded to make each category a separate feature. Some features may be time series and thus may need time lagged values in addition to or in place of the current values. In practice, even a seemingly small problem may actually have a large number of features.
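As a rough illustration of how category expansion and time lags inflate a small problem, the following minimal sketch enumerates candidate inputs. The function name, the feature names and the choice to lag the category indicators as well are all assumptions made for illustration; none of them come from the application.

```python
from typing import Dict, List, Sequence


def expand_features(
    numeric: Sequence[str],
    categorical: Dict[str, Sequence[str]],
    max_lag: int,
) -> List[str]:
    """List every candidate input after category and time-lag expansion."""
    # Each category value becomes its own indicator feature.
    base = list(numeric)
    for name, values in categorical.items():
        base.extend(f"{name}={value}" for value in values)
    # Each base feature is offered at lag 0 (current value) through max_lag.
    return [f"{feat}@lag{lag}" for feat in base for lag in range(max_lag + 1)]


# Hypothetical raw inputs: two numeric series and one categorical feature.
candidates = expand_features(
    numeric=["volume", "queue_length"],
    categorical={"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]},
    max_lag=3,
)
print(len(candidates))  # 3 raw features become (2 + 7) * 4 = 36 candidates
```

Three apparent inputs already yield 36 candidates here, which is the growth the paragraph above describes.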
The task then is to determine what feature or subset of features is to be used as the basis for decision making in classification and for other related data mining tasks such as modeling. Although objects or data entities may be described in terms of many features, some features may be redundant or irrelevant for specific tasks, and therefore instead may serve primarily as a source of confusion. It is not necessarily true that a larger number of features provides better results in task performance. Inclusion of irrelevant features increases noise and computational complexity. For neural net modeling, it is widely accepted that for the same training error, a model with a small number of input features can generalize better than one with a larger number of input features, or in other words, the former is of higher quality than the latter. Therefore, feature selection is a matter of considerable interest and importance in multivariate data analysis.
For example, when a specific behavior or output of a specific system is modeled, it is typically desirable to include only parameters that contribute to the modeled system behavior and not other parameters which contribute to other behaviors of the system but are not particularly relevant to the specific modeled behavior.
Since the number of possible different groupings of features is combinatorial, i.e. 2^n groupings for a set of n features, straightforward exhaustive search methods such as breadth-first, depth-first, or A* cannot be applied effectively. Many methods have been proposed involving or based on neural networks, genetic algorithms, fuzzy sets, or hybrids of those methodologies.
Traditionally, feature selection is mostly associated with classification and different methods may be applied, and even neural networks, genetic algorithms, etc., may be used in the process of carrying out the feature selection. For example, one may analyze the weights in a neural network to choose features with small weight for removal. In another case, one may use a genetic algorithm tool to carry out feature selection based on multiple correlation.
With wider and wider use of computer models of systems, such as those using neural net technologies, the feature selection process is often carried out in a setting of creating an optimal (or at least better) model of the system given an available set of features, especially when categorical features or time lagged features are present.
The disclosures of the following publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art as known to those skilled therein as of the date of the invention described and claimed herein:
R. Battiti, "Using mutual information for selecting features in supervised neural net learning", IEEE Transactions on Neural Networks, Vol. 5, No. 4, 1994;
M. J.A. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support, John Wiley and Sons, 1997;
F. Z. Brill, et al., "Fast genetic selection of features for neural network classifiers", IEEE Transactions on Neural Networks, Vol. 3, No. 2, 1992;
C. Gao, et al., "A novel approach to intelligent scheduling based on fuzzy feature selection and fuzzy classifier", In Proceedings of the 38th Conference on Decision & Control, Phoenix, Arizona USA, December 1994;
N. Chaikla and Y. Qi, "Genetic Algorithms in Feature Selection", In IEEE International Conference on Systems, Man, and Cybernetics, pages V 538-540, IEEE, October 1999;
C. Guerra-Salcedo et al., "Fast and Accurate Feature Selection Using Hybrid Genetic Strategies", In CEC-1999, 1999;
T. Masters, Practical Neural Network Recipes in C++, Academic Press, 1993;
R. Setiono and H. Liu, "Neural-Network Feature Selector", IEEE Transactions on Neural Networks, Vol. 8, No. 3, 1997;
H. Vafaie and I. Imam, "Feature Selection Methods: Genetic Algorithms vs. Greedy-like Search", In Proceedings of the International Conference on Fuzzy and Intelligent Control Systems, 1994; and
P. D. Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, 1993.
SUMMARY
The application provides methods and apparatuses for automated feature selection.
In one embodiment, an apparatus includes a feature set generation module, a feature set evolution module, a feature set scoring module and an optimization module. The feature set generation module selects an initial set of features from a plurality of available features. The feature set evolution module modifies a feature set to generate one or more modified feature sets. The feature set scoring module evaluates a selected feature set (that is, one of the initial feature sets or modified feature sets) to determine a quality score for the selected feature set. The optimization module drives the feature set generation module, feature set evolution module and feature set scoring module to obtain a satisfactory feature set.
A method for automated feature selection, according to one embodiment, includes (a) generating one or more initial sets of features and evaluating the initial feature sets to determine quality scores for the initial feature sets, (b) choosing selected ones of the feature sets according to the quality scores and modifying the selected feature sets to generate a generation of modified feature sets, (c) evaluating the modified feature sets to determine updated quality scores for the modified feature sets, and (d) repeating (b) and (c) until a modified feature set is satisfactory.
According to another embodiment, a method for automated feature selection includes generating one or more initial sets of features, evaluating the initial feature sets to determine quality scores for the initial feature sets, selecting one or more of the feature sets according to the quality scores, modifying the selected feature sets to generate a generation of modified feature sets, and evaluating the modified feature sets to determine updated quality scores for the modified feature sets.
BRIEF DESCRIPTION OF THE DRAWINGS
The features of the present application can be more readily understood from the following detailed description with reference to the accompanying drawings wherein:
FIG. 1A shows a schematic diagram of an apparatus for automated feature selection, according to an embodiment of the present application;
FIG. 1B shows a flow chart of a method for automated feature selection, according to one embodiment of the present application;
FIG. 2 shows a flow chart of a method for automated feature selection, according to another embodiment; and
FIG. 3 shows a table of results obtained from a study of an exemplary problem.
DETAILED DESCRIPTION
This application provides tools (in the form of methodologies, apparatuses and systems) for automated feature selection. The tools may be embodied in one or more computer programs stored on a computer readable medium and/or transmitted via a computer network or other transmission medium.
The following exemplary embodiments are set forth to aid in an understanding of the subject matter of this disclosure, but are not intended, and should not be construed, to limit in any way the claims which follow thereafter. Therefore, while specific terminology is employed for the sake of clarity in describing some exemplary embodiments, the present disclosure is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents which operate in a similar manner.
An apparatus and a method for automated feature selection, according to an embodiment of this application, are described below with reference to FIGS. 1A and 1B.
Apparatus 10 includes a feature set generation module 11, a feature set evolution module 12, a feature set scoring module 13 and an optimization module 14. The feature set generation module 11 selects an initial set of features from a plurality of available features. The feature set evolution module 12 modifies a feature set to generate one or more modified feature sets. The feature set scoring module 13 evaluates a selected feature set (an initial feature set or modified feature set) to determine a quality score for the selected feature set. The optimization module 14 drives the feature set generation module 11, feature set evolution module 12 and feature set scoring module 13 to obtain a satisfactory feature set.
The feature set generation module 11 can generate the initial set of features based on heuristics, by using rules, randomly and/or by using results from a previous feature selection run as a starting point.
The feature set evolution module 12 can apply evolution rules and/or a parameter corresponding to a desired amount of change, to generate the modified feature sets. The feature set evolution module can generate at least one of the modified feature sets by adding or removing one or more features and/or time lags.
The optimization module 14 can instruct the feature set generation module 11 to generate the initial sets of features, and instruct the feature set evolution module 12 to generate another generation of feature sets based on the quality scores of parent feature sets. The optimization module can select one or more of the feature sets to be modified by the feature set evolution module, in order to generate the one or more modified feature sets. The optimization module can drive the feature set evolution module to generate additional modified feature sets, until the quality score of a modified feature set is a satisfactory value or until the quality score of a modified feature set converges. The satisfactory feature set typically has a satisfactory associated quality score.
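One way the evolution step could look in code is sketched below. This is only an illustration under stated assumptions: the `evolve` function, the `(name, lag)` representation of a feature and the interpretation of the change parameter as a fraction of the available pool are hypothetical and are not taken from the application.

```python
import random
from typing import FrozenSet, Sequence, Tuple

Feature = Tuple[str, int]  # (feature name, time lag); lag 0 is the current value


def evolve(
    parent: FrozenSet[Feature],
    available: Sequence[Feature],
    change: float,
    rng: random.Random,
) -> FrozenSet[Feature]:
    """Return a modified copy of `parent`.

    `change` (0..1) stands in for the "desired amount of change": here it is
    assumed to be the fraction of the available pool that gets toggled.
    """
    n_moves = max(1, round(change * len(available)))
    child = set(parent)
    for feature in rng.sample(list(available), n_moves):
        if feature in child:
            child.discard(feature)  # remove a feature or time lag
        else:
            child.add(feature)      # add a feature or time lag
    if not child:                   # never hand back an empty feature set
        child.add(rng.choice(list(available)))
    return frozenset(child)


# Hypothetical pool: two series, each offered at lags 0 through 2.
pool = [(name, lag) for name in ("volume", "logins") for lag in range(3)]
rng = random.Random(0)
parent = frozenset(rng.sample(pool, 3))
child = evolve(parent, pool, change=0.2, rng=rng)
print(sorted(parent), sorted(child))
```

With six available features and `change=0.2`, exactly one feature or lag is toggled, so the child differs from its parent in a single element.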
In the method for automated feature selection, the feature set generation module 11 initially generates one or more initial sets of features (step S101). The feature set scoring module 13 evaluates the initial feature sets to determine quality scores for the initial feature sets (step S102). The optimization module 14 selects one or more of the feature sets according to the quality scores (step S103). The feature set evolution module 12 modifies the selected feature sets to generate a generation of modified feature sets (step S104). The feature set scoring module 13 evaluates the modified feature sets to determine updated quality scores for the modified feature sets (step S105). Steps S103 through S105 can be repeated (S106, No) until a modified feature set is satisfactory (S106, Yes).
At least one of the initial sets of features can be selected randomly, based on heuristics, using results from a previous feature selection run as a starting point, and/or using rules. Similarly, at least one of the modified feature sets is generated by applying evolution rules and/or by using heuristics. A parameter corresponding to a desired amount of change can also be applied, and/or one or more time lags and/or features can be added or removed, to generate at least one of the modified feature sets.
Generally, a modified feature set can be deemed to be satisfactory, if the quality score of the modified feature set is a satisfactory value or if the quality score of the modified feature set converges.
At least one of the modified feature sets can be generated using Guided Evolutionary Simulated Annealing assisted feature selection (discussed below).
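The step sequence S101 through S106 can be sketched as a small loop. The version below is a simplified stand-in, assuming a keep-the-best selection rule, a single-feature toggle as the modification, and a "no improvement for `patience` generations" test for convergence. All function and parameter names are hypothetical, and the toy scoring function is invented for the example.

```python
import random
from typing import Callable, FrozenSet, Sequence, Tuple

FeatureSet = FrozenSet[str]


def select_features(
    available: Sequence[str],
    score: Callable[[FeatureSet], float],
    n_initial: int = 4,
    n_keep: int = 2,
    target: float = 0.95,
    patience: int = 5,
    max_generations: int = 100,
    seed: int = 0,
) -> Tuple[FeatureSet, float]:
    rng = random.Random(seed)

    def random_set() -> FeatureSet:
        k = rng.randint(1, len(available))
        return frozenset(rng.sample(list(available), k))

    def mutate(fs: FeatureSet) -> FeatureSet:
        toggled = fs ^ {rng.choice(list(available))}
        return frozenset(toggled) if toggled else fs  # avoid empty sets

    # S101: generate initial feature sets; S102: score them.
    scored = [(score(fs), fs) for fs in (random_set() for _ in range(n_initial))]
    best_score, best = max(scored, key=lambda pair: pair[0])
    stale = 0
    for _ in range(max_generations):
        # S106: stop when the best score is satisfactory or has converged.
        if best_score >= target or stale >= patience:
            break
        # S103: select the highest-scoring sets as parents.
        parents = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_keep]
        # S104: modify each parent to produce the next generation.
        children = [mutate(fs) for _, fs in parents for _ in range(2)]
        # S105: score the modified sets (parents are carried over unchanged).
        scored = parents + [(score(fs), fs) for fs in children]
        gen_score, gen_best = max(scored, key=lambda pair: pair[0])
        if gen_score > best_score:
            best_score, best, stale = gen_score, gen_best, 0
        else:
            stale += 1
    return best, best_score


# Toy score: reward the two "relevant" features, penalize every extra one.
relevant = frozenset({"a", "b"})


def toy_score(fs: FeatureSet) -> float:
    return len(fs & relevant) / 2 - 0.05 * len(fs - relevant)


names = ["a", "b", "c", "d", "e", "f"]
best, best_score = select_features(names, toy_score)
print(sorted(best), best_score)
```

In a real setting the scoring callback would train and validate a model on the given feature set, which is by far the most expensive part of each generation.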
The tools of this disclosure can be used for optimization of input features for model generation, and can be adapted to automatically select a group of features (for example, features with nominal values and features having time lags from available ones), so as to achieve a model of better quality.
In real-world modeling problems there is often more data available than is necessary and/or desirable to use when modeling a physical or procedural system. According to some numerical modeling techniques, it is an objective that the model input feature set be both effective in terms of predictive accuracy, and parsimonious in order to conserve computing resources. A typical strategy is to utilize statistical tools to look for correlation between candidate inputs and outputs, followed by trial and error to refine the set of inputs. The measure of model effectiveness might be predictive error or R² against a validation data set. Other generic objective functions can be used in place of R². Vertical domain specific methods of scoring model effectiveness can of course be employed instead.
In addition, various transformations of the raw candidate inputs can often improve model accuracy. An example might be transforming a date feature to the day of the week associated with that date. This process becomes exponentially more difficult for time series problems (for example, stock market modeling), where time-lagged values of the candidate inputs are often considered as candidate inputs themselves. Also, while lagged correlation techniques exist, they are less effective than traditional correlation calculations.
As mentioned, the tools can embody a Guided Evolutionary Simulated Annealing (GESA) assisted feature selection approach. The GESA-assisted feature selection (GAFS) approach strives to automate the selection of features in the setting of creating an optimal (or at least better) model. The GAFS methodology can build on Encapsulated Planning Optimization (EPO), to automate and optimize the feature selection process. Encapsulated Planning Optimization is described in commonly owned U.S. Provisional Application No. 60/487,035, filed July 11, 2003 and entitled "ENCAPSULATED PLANNING OPTIMIZATION", which is incorporated herein in its entirety by reference.
EPO relies on an implementation of GESA which allows management of plan generation, evolution, and scoring to be encapsulated externally to GESA. GESA then manages only the optimization process itself. A set of model input features can be thought of as a plan to model the system. Thus, GAFS can leverage EPO to find an optimal plan to model the system.
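The encapsulation idea can be illustrated with an optimizer that never inspects a plan and only calls three externally supplied functions. The sketch below is an assumption-laden stand-in: the `Planner` container and `optimize` function are hypothetical names, and the loop is a plain single-chain simulated annealing search substituted for GESA, whose internals are not detailed in this passage.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Generic, Tuple, TypeVar

Plan = TypeVar("Plan")


@dataclass
class Planner(Generic[Plan]):
    """Hypothetical bundle of the three externally managed operations."""
    generate: Callable[[random.Random], Plan]      # new plan from scratch
    evolve: Callable[[Plan, random.Random], Plan]  # new plan from an old one
    score: Callable[[Plan], float]                 # higher is better


def optimize(
    planner: "Planner[Plan]",
    steps: int = 200,
    t0: float = 1.0,
    cooling: float = 0.98,
    seed: int = 0,
) -> Tuple[Plan, float]:
    """Annealing-style search that treats plans as opaque values."""
    rng = random.Random(seed)
    current = planner.generate(rng)
    current_score = planner.score(current)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = planner.evolve(current, rng)
        candidate_score = planner.score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse plans with a probability
        # that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
        temperature *= cooling
    return best, best_score


# Here a "plan" is a set of feature names; the optimizer does not know that.
names = ["a", "b", "c", "d", "e"]
planner = Planner(
    generate=lambda rng: frozenset(rng.sample(names, 2)),
    evolve=lambda plan, rng: plan ^ {rng.choice(names)},
    score=lambda plan: len(plan & {"a", "b"}) - 0.1 * len(plan),
)
best_plan, best_score = optimize(planner)
print(sorted(best_plan), best_score)
```

Because the optimizer depends only on the three callbacks, the same loop could be reused for any other kind of plan by swapping the `Planner`.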
Three functions are performed by the external encapsulated planning module. The first task is generation of completely new plans from scratch, as requested by GESA. Second, procedures are implemented to modify or evolve new plans from old ones. In addition, a means to score each plan is provided. The analogous methods for GAFS are functions to generate new feature sets from scratch, to evolve new feature sets from old ones, and to score the effectiveness of each feature set.
According to one working embodiment, code has been developed to build model feature sets from scratch, both randomly and with heuristic elements. This embodiment also incorporates a set of procedures to evolve new feature sets from old ones. The feature set scoring methodology in the embodiment is the standard R² error measure, obtained after training a model from the feature set under consideration. Features are also incorporated to look at average performance across an ensemble of models, to account for statistical variability inherent in the model training process.
The GAFS approach can be used with Orthogonal Functional Link Net (OFLN) methodologies, and also is applicable to feature selection for any type of supervised learning, such as traditional Feed-Forward Backpropagation neural networks. OFLN is described in commonly owned U.S. application Serial No. 10/374,406, filed February 26, 2003 and entitled "AUTOMATIC NEURAL-NET MODEL GENERATION AND MAINTENANCE", which is incorporated herein in its entirety by reference.
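A minimal sketch of ensemble-averaged scoring follows, assuming the score is the coefficient of determination averaged over several independently trained models. The `train_and_predict` callback is a hypothetical placeholder for "train a model on this feature set with this random seed and return its predictions", and the noisy stand-in model exists only to make the example runnable.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_tot = sum((y - avg) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def ensemble_score(
    train_and_predict: Callable[[int], List[float]],
    actual: Sequence[float],
    n_models: int = 5,
    seed: int = 0,
) -> float:
    """Average R² over several training runs to smooth out run-to-run noise."""
    rng = random.Random(seed)
    scores = [
        r_squared(actual, train_and_predict(rng.randrange(2**32)))
        for _ in range(n_models)
    ]
    return mean(scores)


# Stand-in for model training: the true values plus seed-dependent noise.
actual = [1.0, 2.0, 3.0, 4.0]


def noisy_model(model_seed: int) -> List[float]:
    noise = random.Random(model_seed)
    return [y + noise.uniform(-0.1, 0.1) for y in actual]


print(ensemble_score(noisy_model, actual))
```

A larger ensemble gives a steadier score per feature set but multiplies the training cost by the ensemble size.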
Since large numbers of models are trained during the GAFS process, an attempt can be made to control resource-intensive parameters such as training and test data set sizes and/or candidate input feature set size. Good results were obtained through the working embodiments, with a small data set size (~150 records) and a relatively large feature set size (~400 candidate input features). Results were obtained in tens of hours, on a relatively powerful PC (2.2 GHz CPU and 1 GB RAM). Depending on the ensemble sizes for score averaging, the number of feature sets examined ranges from ~1000 per hour to ~10000 per hour. Measures to improve performance, such as improved evolution procedures and programmatic parallel processing, are under investigation. For larger data sets with large feature sets, a sampling strategy might be adopted during an initial GAFS run, to range over the available input features, followed by a higher resolution GAFS run with a filtered subset of input features. For this type of application, seemingly long cycle times are tolerable due to (i) the promised reduction in manpower for generating models, and (ii) the possibility of more optimized models than can be produced with human efforts alone.

For complex real-world systems, it is often the case that a large number of features are related to the behaviors of the whole system. However, for a specific behavior of the system, it is seldom clear what exact set of features affects it. When a specific behavior is modeled, including features that affect other system behaviors but are irrelevant to this specific behavior often can degrade the quality of the model. If the features include categorical data and/or time lagged data, the problem is exacerbated because handling them triggers creation of a large number of additional features (for example, one feature for each category, or one feature for each time lag). For time lagged data, it is also difficult to know what lags should be used.

The GESA-Assisted Feature Selection (GAFS) approach can be introduced to automate and optimize a feature selection process in order to arrive at an optimal (or at least better) model of a system. This approach uses the GESA (Guided Evolutionary Simulated Annealing) optimization technique to search for an optimal set of features to include in modeling a system, and is especially suitable for cases where categorical inputs and/or time lagged inputs are present. A typical implementation of this approach includes a number of modules, including an optimization module, a feature set generation module, a feature set evolution module and a feature set scoring module.

The feature set generation module randomly selects a set of features from the available ones. The feature set evolution module produces one or more alternative sets of features given an existing set of features and an optional parameter governing how much change is to be introduced in producing the alternative sets of features. The feature set scoring module evaluates the quality of a given set of features. The optimization module can drive the other modules to carry out a search for an optimal feature combination.

The search process can start with the optimization module instructing the feature set generation module to create initial sets of features. Optionally, users may configure GAFS to use one or more results from a previous GAFS run as starting points. The number of initial feature sets is user-configurable. The initial feature sets are evaluated by the feature set scoring module, and the optimization module can instruct the feature set evolution module to generate another generation of feature sets based on existing ones and their scores. The optimization module uses the feature set scoring module to evaluate the new feature sets and to choose some to start the next generation. This process can continue until convergence or until the best feature set found so far is deemed satisfactory.

The process, according to an exemplary GAFS embodiment, is illustrated in FIG. 2.
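The interplay of the four modules can be sketched as a small generational loop. This is only an illustration under stated assumptions: it uses a plain keep-the-best selection rule rather than GESA's annealing-based acceptance, and `run_search`, `mutate`, the parameter names and the toy scoring function are all hypothetical.

```python
import random
from typing import Callable, FrozenSet, List, Optional, Tuple

FeatureSet = FrozenSet[str]


def mutate(parent: FeatureSet, candidates: List[str], rng: random.Random) -> FeatureSet:
    """Evolution module (sketch): toggle one randomly chosen candidate feature."""
    child = set(parent)
    feature = rng.choice(candidates)
    if feature in child and len(child) > 1:
        child.remove(feature)
    else:
        child.add(feature)
    return frozenset(child)


def run_search(
    candidates: List[str],
    score: Callable[[FeatureSet], float],
    n_initial: int = 5,
    n_children: int = 4,
    n_keep: int = 2,
    target: Optional[float] = None,
    max_generations: int = 50,
    seed: int = 0,
) -> Tuple[FeatureSet, float]:
    """Optimization module (sketch): drive generation, evolution and scoring."""
    rng = random.Random(seed)
    size = max(1, len(candidates) // 2)
    # Generation module: random initial feature sets, scored immediately.
    initial = [frozenset(rng.sample(candidates, size)) for _ in range(n_initial)]
    scored = [(score(fs), fs) for fs in initial]
    for _ in range(max_generations):
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if target is not None and scored[0][0] >= target:
            break  # the best feature set is deemed satisfactory
        parents = scored[:n_keep]  # chosen to start the next generation
        children = [mutate(fs, candidates, rng) for _, fs in parents for _ in range(n_children)]
        scored = parents + [(score(c), c) for c in children]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1], scored[0][0]


# Toy scoring module (assumption): reward three "useful" inputs, penalize the rest.
useful = frozenset("abc")
toy_score = lambda fs: len(fs & useful) - 0.1 * len(fs - useful)
best, best_score = run_search(list("abcdefgh"), toy_score, target=3.0, max_generations=200)
```

Because the generation, evolution and scoring steps are passed in or isolated as small functions, each can be replaced with heuristics or a real model trainer without touching the loop.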
The feature set generation module picks one or more initial sets of features randomly and/or with certain rules and heuristics, and the feature set scoring module evaluates the initial sets of features (Step S201). In the presence of categorical data and/or time lagged data, it is also desirable to have the feature set generation module automatically create the derived features.

The feature set evolution module introduces some changes to an existing feature set randomly and/or with certain rules and heuristics, in order to generate the next generation, and the feature set scoring module evaluates the modified sets of features (Step S202). Examples of changes may include adding/removing a feature, selecting a different time lag, etc.
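The kinds of change listed for Step S202 can be sketched as a small evolution operator. This is a hypothetical illustration: representing each selected input as a `(name, lag)` pair, the `n_changes` parameter (standing in for the optional amount-of-change parameter) and the uniform choice among operations are assumptions, not details given in the source.

```python
import random
from typing import FrozenSet, Optional, Sequence, Tuple

Feature = Tuple[str, int]  # (input name, time lag); lag 0 means "no lag" here


def evolve(
    parent: FrozenSet[Feature],
    names: Sequence[str],
    max_lag: int,
    n_changes: int = 1,
    rng: Optional[random.Random] = None,
) -> FrozenSet[Feature]:
    """Apply n_changes random edits: add a feature, remove one, or pick a new lag."""
    if rng is None:
        rng = random.Random()
    child = set(parent)
    for _ in range(n_changes):
        operation = rng.choice(["add", "remove", "change_lag"])
        if operation == "add" or not child:
            child.add((rng.choice(names), rng.randint(0, max_lag)))
        elif operation == "remove":
            if len(child) > 1:  # never produce an empty feature set
                child.remove(rng.choice(sorted(child)))
        else:
            name, lag = rng.choice(sorted(child))
            child.remove((name, lag))
            child.add((name, rng.randint(0, max_lag)))
    return frozenset(child)


# Hypothetical input names, used only to exercise the operator.
names = ["volume", "day", "queue"]
parent = frozenset({("volume", 1), ("day", 0)})
child = evolve(parent, names, max_lag=3, n_changes=2, rng=random.Random(1))
```

Raising `n_changes` makes each child differ more from its parent, which widens the search at the cost of slower fine tuning.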
The quality scores of modified feature sets are examined to determine whether any modified feature set has converged or is satisfactory (Step S203). If there is no modified feature set which has converged or is satisfactory (Step S203, No), the optimization module chooses some of the feature sets and passes the chosen feature sets to the feature set evolution module to serve as the starting point for the next generation (Step S205), and then Step S202 is repeated. After a satisfactory feature set is obtained (Step S203, Yes), results can be reported (Step S204).

One advantage of using the GAFS approach is that additional insights can easily be incorporated in the feature set generation and/or evolution modules. GAFS does not require a specific form of problem representation, such as a bit string in the case of genetic algorithms. This allows one to start with simple feature set generation and evolution modules, such as one which picks random features, and to use this approach on a problem of reduced size (for example, with a small training sample). From the initial results, one can often identify candidates for important features and/or identify rules or heuristics for more effective feature set evolution. The gained experience can readily be used to adapt the feature set generation and/or evolution module to tackle the original problem more effectively.

The feature set scoring module is further discussed below. The general goal of GAFS is to create a quality model. One of the tasks is to define a measure of model quality. There are different measures, such as system error or R². Depending on the situation, one may select one measure or a combination of measures to serve the purpose. In the case of neural net modeling, it is also customary to split the available data into training and validation sets and consider the results from both together in judging model quality.
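As a concrete reading of the quality measure, the sketch below computes R² separately on a training and a validation portion of the data. It assumes that R² means the usual coefficient of determination, which the text does not define explicitly, and that the split is chronological; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


def split_train_validation(rows: List, validation_fraction: float = 0.25) -> Tuple[List, List]:
    """Keep the last rows for validation (chronological split, an assumption)."""
    cut = len(rows) - int(len(rows) * validation_fraction)
    return rows[:cut], rows[cut:]


train_rows, validation_rows = split_train_validation(list(range(8)))
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the mean of the actual values scores 0, and a perfect model scores 1, so the two R² values can be compared directly when judging a feature set.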
When categorical features are present, a categorical feature is often converted into a set of features with each category being a feature itself. This conversion works well for a small number of categories but may introduce too many features for a large number of categories. GAFS takes this into account by introducing a penalty in the scoring module that increases with the selected number of features expanded from a single categorical feature.

For time lagged data, such as in modeling of time series, the time lags to be used must be determined. This information is often unknown in advance. The most useful time lags also may not be continuous. But with GAFS, it is relatively easy to first try out a larger number of possible lags on a reduced data set to find a smaller set of more promising lag values and concentrate on them with the full scale problem. In practice, it is also desirable to keep the lag number small so that only recent history is used. Therefore, the GAFS approach introduces a penalty in the scoring module that increases with the lag number.
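The two kinds of derived feature discussed above can be illustrated with a short sketch. The helper names, the `column=category` naming scheme and the use of `None` for rows without enough history are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence


def expand_categorical(records: Sequence[dict], column: str) -> Dict[str, List[float]]:
    """Turn one categorical column into one 0/1 indicator feature per category."""
    categories = sorted({record[column] for record in records})
    return {
        f"{column}={category}": [1.0 if record[column] == category else 0.0 for record in records]
        for category in categories
    }


def lag_series(values: Sequence[float], lag: int) -> List[Optional[float]]:
    """Shift a series so that row t holds the value observed at row t - lag."""
    padded: List[Optional[float]] = [None] * lag + list(values)
    return padded[: len(values)]


# Hypothetical records, used only to exercise the helpers.
records = [
    {"day": "mon", "volume": 10.0},
    {"day": "tue", "volume": 12.0},
    {"day": "mon", "volume": 9.0},
]
indicators = expand_categorical(records, "day")
volume_lag1 = lag_series([r["volume"] for r in records], 1)
```

Each category and each lag adds one candidate input, which is why the candidate count grows quickly when both kinds of data are present.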
In one implementation of GAFS, the scoring is defined to be the following:

$$S = w_1 R_t^2 + w_2 R_v^2 - w_3 p_c - w_4 p_t$$

where $R_t^2$ is the R² of the training set, $R_v^2$ is that of the validation set, $p_c$ is the penalty related to categorical data and $p_t$ is the penalty related to the time lagged data. The weights $w_1$ through $w_4$ can be determined based on user preference. This scoring function is maximized during the GAFS process.

Since the GAFS scoring module includes model quality, the model is created first. Since random initialization is used in the neural net training methodology, several models may be tested for the same configuration. This process may be computationally intensive. Other parameters such as correlation may alternatively be used. However, with faster computers and fast modeling software such as OFLN technology, this approach becomes increasingly acceptable.
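The scoring function can be sketched in a few lines. The source states only that the categorical penalty increases with the number of selected expansions of a categorical feature and that the time lag penalty increases with the lag number; the specific forms below (a count of selected category expansions, and the largest selected lag) and the example weights are assumptions chosen for illustration.

```python
from typing import Iterable, Optional, Sequence, Tuple

# (raw input name, category or None for non-categorical inputs, time lag)
Selected = Tuple[str, Optional[str], int]


def categorical_penalty(selected: Iterable[Selected]) -> float:
    """Assumed form: number of selected features that are category expansions."""
    return float(sum(1 for _, category, _ in selected if category is not None))


def lag_penalty(selected: Iterable[Selected]) -> float:
    """Assumed form: the largest time lag among the selected features."""
    return float(max((lag for _, _, lag in selected), default=0))


def gafs_score(
    r2_train: float,
    r2_validation: float,
    p_c: float,
    p_t: float,
    weights: Sequence[float] = (0.5, 0.5, 0.01, 0.01),
) -> float:
    """S = w1 * R2_t + w2 * R2_v - w3 * p_c - w4 * p_t, to be maximized."""
    w1, w2, w3, w4 = weights
    return w1 * r2_train + w2 * r2_validation - w3 * p_c - w4 * p_t


# Hypothetical selection, used only to exercise the functions.
selection = [("day", "mon", 0), ("day", "tue", 1), ("volume", None, 2)]
p_c = categorical_penalty(selection)
p_t = lag_penalty(selection)
s = gafs_score(0.9, 0.8, p_c, p_t)
```

Raising the second weight relative to the first shifts the emphasis from training fit toward validation performance.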
As an example, the GAFS approach was applied to a prediction application for an e-mail-based problem reporting system. The problem was to predict future e-mail volumes within the system based on past volumes, and on other past behavior of the system. The data for this problem contained 17 possible raw input features. Six of these features were categorical, with a total of 27 separate expansions of the underlying categories. Further, the problem involved time-series prediction. It was decided to consider up to ten lags for any input feature. With 17 apparent inputs, there were actually 380 candidate inputs for this modeling problem. The test data was chosen to be the most recent two weeks of the available data.

The procedure followed for this problem was to first do a range-finding GAFS run to determine the most effective features in the full candidate set. Then a second GAFS run was done to tune only the lags for the features found in the first GAFS run. This is one variation of the possible two-phase application of GAFS. Other variations might involve using only a sample of the train and test data in the range-finding run, for example.

FIG. 3 shows a table which summarizes the results of this study. The first run started with five model configurations with an average of 83 inputs, an average train score of 87.26, and an average test score of 97.97. Approximately 1400 model configurations were examined in the first run, with ensembles of 5 model trainings per configuration, and the total run time was about an hour on a 2.2 GHz PC. The number of inputs was reduced to 15, along with the improvement in scores shown in the table.

For the second run the ensemble size was increased to 10, and approximately 300 model configurations were examined. The final tuned result had a slightly decreased number of features, and slightly increased test score. The train score was slightly decreased as well, but in this type of time-series prediction it was thought advisable to emphasize prediction accuracy on recent data. As noted above, the weighting factors in the scoring function can be used to emphasize train or test scores.
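The count of 380 candidate inputs in this example is not derived in the text. One accounting that is consistent with the stated figures, offered here only as an assumption, treats the 11 non-categorical raw features and the 27 category expansions as 38 expanded inputs, each considered at 10 lags:

```python
raw_features = 17
categorical_features = 6
category_expansions = 27
lags_per_input = 10

# Non-categorical inputs are used as-is; categorical ones are replaced by their expansions.
non_categorical = raw_features - categorical_features
expanded_inputs = non_categorical + category_expansions
candidate_inputs = expanded_inputs * lags_per_input
```

Under this reading the expansion step multiplies the apparent input count by more than twenty, which illustrates why an automated search over the candidate set is attractive.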
The above specific embodiments are illustrative, and many variations can be introduced on these embodiments without departing from the spirit of the disclosure or from the scope of the appended claims. Elements and/or features of different illustrative embodiments may be combined with and/or substituted for each other within the scope of the disclosure and the appended claims.

For example, additional variations may be apparent to one of ordinary skill in the art from reading the following commonly owned applications, which are incorporated herein in their entireties by reference:
U.S. Provisional Application No. 60/486,734, filed July 11, 2003 and entitled "GESA ASSISTED FEATURE SELECTION";
U.S. Application No. 10/418,659, filed April 18, 2003 and entitled "PROCESSING MIXED NUMERIC AND/OR NON-NUMERIC DATA";
U.S. Application No. 10/412,993, filed April 14, 2003 and entitled "METHOD AND APPARATUS FOR DISCOVERING EVOLUTIONARY CHANGES WITHIN A SYSTEM"; and
U.S. Application No. 10/615,885, filed July 8, 2003 and entitled "HIERARCHICAL DETERMINATION OF FEATURE RELEVANCY".

This application claims the priority of commonly-owned U.S. Provisional Application No. 60/486,734, filed July 11, 2003 and entitled "GESA ASSISTED FEATURE SELECTION", which is incorporated herein in its entirety by reference.

Claims

What is claimed is:
1. A method for automated feature selection, comprising:
(a) generating one or more initial sets of features and evaluating the initial feature sets to determine quality scores for the initial feature sets;
(b) choosing selected ones of the feature sets according to the quality scores and modifying the selected feature sets to generate a generation of modified feature sets; and
(c) evaluating the modified feature sets to determine updated quality scores for the modified feature sets.
2. The method of claim 1, wherein at least one of the initial sets of features is
selected using heuristics.
3. The method of claim 1, wherein at least one of the initial sets of features is
selected randomly.
4. The method of claim 1, wherein at least one of the initial sets of features is selected by using results from a previous feature selection run as a starting point.
5. The method of claim 1, wherein at least one of the initial sets of features is selected by using rules.
6. The method of claim 1, wherein at least one of the modified feature sets is
generated by applying evolution rules.
7. The method of claim 1, wherein at least one of the modified feature sets is generated by using heuristics.
8. The method of claim 1, wherein a parameter corresponding to a desired amount of change is applied to generate at least one of the modified feature sets.
9. The method of claim 1, wherein one or more features are added or removed to generate at least one of the modified feature sets.
10. The method of claim 1, wherein one or more time lags are added or removed
to generate at least one of the modified feature sets.
11. The method of claim 1, further comprising repeating (b) and (c) until a modified feature set is satisfactory.
12. The method of claim 11, wherein a modified feature set is satisfactory if the quality score of the modified feature set is a satisfactory value.
13. The method of claim 11, wherein a modified feature set is satisfactory if the quality score of the modified feature set converges.
14. The method of claim 1, wherein at least one of the modified feature sets is generated using Guided Evolutionary Simulated Annealing assisted feature selection.
15. A computer system, comprising: a processor; and a program storage device readable by the computer system, tangibly embodying a program of instructions executable by the processor to perform the method claimed in claim 1.
16. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform the method claimed in claim 1.
17. A computer data signal transmitted in one or more segments in a transmission medium which embodies instructions executable by a computer to perform the method claimed in claim 1.
18. A method for automated feature selection, comprising:
generating one or more initial sets of features;
evaluating the initial feature sets to determine quality scores for the initial feature sets;
selecting one or more of the feature sets according to the quality scores;
modifying the selected feature sets to generate a generation of modified feature sets; and
evaluating the modified feature sets to determine updated quality scores for the modified feature sets.
19. A computer system, comprising: a processor; and a program storage device readable by the computer system, tangibly embodying a program of instructions executable by the processor to perform the method claimed in claim 18.
20. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform the method claimed in claim 18.
21. A computer data signal transmitted in one or more segments in a transmission medium which embodies instructions executable by a computer to perform the method claimed in claim 18.
22. An apparatus for automated feature selection, comprising:
a feature set generation module adapted to select an initial set of features from a plurality of available features;
a feature set evolution module adapted to modify a feature set to generate one or more modified feature sets;
a feature set scoring module adapted to evaluate a selected one of the initial feature set and modified feature sets to determine a quality score for the selected feature set; and
an optimization module adapted to drive the feature set generation module, feature set evolution module and feature set scoring module to obtain a satisfactory feature set.
23. The apparatus of claim 22, wherein the feature set generation module generates at least one of the initial set of features using heuristics.
24. The apparatus of claim 22, wherein the feature set generation module
generates at least one of the initial set of features randomly.
25. The apparatus of claim 22, wherein the feature set generation module selects at least one of the initial set of features by using rules.
26. The apparatus of claim 22, wherein the feature set generation module generates at least one of the initial set of features by using results from a previous feature selection run as a starting point.
27. The apparatus of claim 22, wherein the feature set evolution module applies evolution rules to generate at least one of the modified feature sets.
28. The apparatus of claim 22, wherein the feature set evolution module applies a parameter corresponding to a desired amount of change to generate at least one of the modified feature sets.
29. The apparatus of claim 22, wherein the feature set evolution module generates at least one of the modified feature sets by adding or removing one or more features.
30. The apparatus of claim 22, wherein the feature set evolution module generates at least one of the modified feature sets by adding or removing one or more time lags.
31. The apparatus of claim 22, wherein the optimization module instructs the feature set generation module to generate the initial sets of features.
32. The apparatus of claim 22, wherein the optimization module instructs the feature set evolution module to generate another generation of feature sets based on the quality scores of parent feature sets.
33. The apparatus of claim 22, wherein the optimization module selects one or
more of the feature sets to be modified by the feature set evolution module, in order to generate the one or more modified feature sets.
34. The apparatus of claim 22, wherein the optimization module drives the
feature set evolution module to generate additional modified feature sets, until the quality score of a modified feature set is a satisfactory value.
35. The apparatus of claim 22, wherein the optimization module drives the feature set evolution module to generate additional modified feature sets, until the quality score of a modified feature set converges.
36. The apparatus of claim 22, wherein the satisfactory feature set has a
satisfactory associated quality score.
PCT/US2004/021981 2003-07-11 2004-07-09 Method and apparatus for automated feature selection WO2005008572A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP04756808A EP1654692A1 (en) 2003-07-11 2004-07-09 Method and apparatus for automated feature selection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US48673403P 2003-07-11 2003-07-11
US60/486,734 2003-07-11

Publications (1)

Publication Number Publication Date
WO2005008572A1 true WO2005008572A1 (en) 2005-01-27

Family

ID=34079289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/021981 WO2005008572A1 (en) 2003-07-11 2004-07-09 Method and apparatus for automated feature selection

Country Status (3)

Country Link
US (1) US7562054B2 (en)
EP (1) EP1654692A1 (en)
WO (1) WO2005008572A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060059145A1 (en) * 2004-09-02 2006-03-16 Claudia Henschke System and method for analyzing medical data to determine diagnosis and treatment
US20090083075A1 (en) * 2004-09-02 2009-03-26 Cornell University System and method for analyzing medical data to determine diagnosis and treatment
DE102005031117A1 (en) * 2005-07-04 2007-01-11 Siemens Ag Method and device for determining an operating parameter of a shockwave source
WO2007072256A2 (en) * 2005-12-23 2007-06-28 Koninklijke Philips Electronics N.V. Apparatus and method for classifying data
US8417783B1 (en) * 2006-05-31 2013-04-09 Proofpoint, Inc. System and method for improving feature selection for a spam filtering model
US8019594B2 (en) * 2006-06-30 2011-09-13 Robert Bosch Corporation Method and apparatus for progressively selecting features from a large feature space in statistical modeling
US8117137B2 (en) 2007-04-19 2012-02-14 Microsoft Corporation Field-programmable gate array based accelerator system
US8301638B2 (en) * 2008-09-25 2012-10-30 Microsoft Corporation Automated feature selection based on rankboost for ranking
US8131659B2 (en) * 2008-09-25 2012-03-06 Microsoft Corporation Field-programmable gate array based accelerator system
US10083459B2 (en) * 2014-02-11 2018-09-25 The Nielsen Company (Us), Llc Methods and apparatus to generate a media rank
EP3134823A4 (en) * 2014-06-03 2017-10-25 Excalibur IP, LLC Determining traffic quality using event-based traffic scoring
US10002329B2 (en) * 2014-09-26 2018-06-19 Facebook, Inc. Selection and modification of features used by one or more machine learned models used by an online system
US11288573B2 (en) * 2016-05-05 2022-03-29 Baidu Usa Llc Method and system for training and neural network models for large number of discrete features for information rertieval
KR102470145B1 (en) * 2017-01-03 2022-11-24 한국전자통신연구원 Data meta-scaling Apparatus and method for continuous learning
CN110837909A (en) * 2018-08-17 2020-02-25 北京京东尚科信息技术有限公司 Method and device for predicting order quantity
US11669758B2 (en) 2019-11-12 2023-06-06 Rockwell Automation Technologies, Inc. Machine learning data feature reduction and model optimization
US11742081B2 (en) 2020-04-30 2023-08-29 International Business Machines Corporation Data model processing in machine learning employing feature selection using sub-population analysis
US11429899B2 (en) 2020-04-30 2022-08-30 International Business Machines Corporation Data model processing in machine learning using a reduced set of features
CN111817805B (en) * 2020-07-08 2022-07-19 中国人民解放军国防科技大学 Method, device and medium for adjusting channel propagation model parameters

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0310366A (en) * 1989-05-19 1991-01-17 Philips Gloeilampenfab:Nv Artificial neural network
JP2000064933A (en) * 1998-08-19 2000-03-03 Yamaha Motor Co Ltd Method for starting two-cycle direct injection engine
US6886003B2 (en) * 2000-06-28 2005-04-26 Yamaha Hatsudoki Kabushiki Kaisha Method for controlling machine with control module optimized by improved evolutionary computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MICHALEWICZ Z ET AL: "Evolutionary computation techniques and their applications", INTELLIGENT PROCESSING SYSTEMS, 1997. ICIPS '97. 1997 IEEE INTERNATIONAL CONFERENCE ON BEIJING, CHINA 28-31 OCT. 1997, NEW YORK, NY, USA,IEEE, US, 28 October 1997 (1997-10-28), pages 14 - 25, XP010276493, ISBN: 0-7803-4253-4 *
SIEDLECKI W ET AL: "ON AUTOMATIC FEATURE SELECTION", INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, SINGAPORE, XX, vol. 2, no. 2, June 1988 (1988-06-01), pages 197 - 220, XP008034768, ISSN: 0218-0014 *
YIP P P ET AL: "COMBINATORIAL OPTIMIZATION WITH USE OF GUIDED EVOLUTIONARY SIMULATED ANNEALING", IEEE TRANSACTIONS ON NEURAL NETWORKS, IEEE INC, NEW YORK, US, vol. 6, no. 2, 1 February 1995 (1995-02-01), pages 290 - 295, XP000492663, ISSN: 1045-9227 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2497516A (en) * 2011-12-05 2013-06-19 Univ Lincoln Generating training data for automation of image analysis
US9367765B2 (en) 2011-12-05 2016-06-14 University Of Lincoln Method and apparatus for automatic detection of features in an image and method for training the apparatus
CN110135057A (en) * 2019-05-14 2019-08-16 北京工业大学 Solid waste burning process dioxin concentration flexible measurement method based on multilayer feature selection
CN110135057B (en) * 2019-05-14 2021-03-02 北京工业大学 Soft measurement method for dioxin emission concentration in solid waste incineration process based on multilayer characteristic selection
CN111242310A (en) * 2020-01-03 2020-06-05 腾讯科技(北京)有限公司 Feature validity evaluation method and device, electronic equipment and storage medium
CN111242310B (en) * 2020-01-03 2023-04-18 深圳市雅阅科技有限公司 Feature validity evaluation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
US20050049913A1 (en) 2005-03-03
EP1654692A1 (en) 2006-05-10
US7562054B2 (en) 2009-07-14

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004756808

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2004756808

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 2004756808

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2004756808

Country of ref document: EP