US20110119281A1 - Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database - Google Patents

Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database

Info

Publication number
US20110119281A1
Authority
US
United States
Prior art keywords
hop
view
dimensions
chain
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/775,125
Inventor
Cliff A. Joslyn
John S. Burke
Terence J. Critchlow
Emilie Hogan
Nicolas Hengartner
Judith Cohn
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Battelle Memorial Institute Inc
Triad National Security LLC
Original Assignee
Battelle Memorial Institute Inc
Los Alamos National Security LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Battelle Memorial Institute Inc, Los Alamos National Security LLC
Priority to US12/775,125
Assigned to BATTELLE MEMORIAL INSTITUTE. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: JOSLYN, CLIFF A, BURKE, JOHN S., CRITCHLOW, TERENCE J., HOGAN, EMILIE
Assigned to U.S. DEPARTMENT OF ENERGY. CONFIRMATORY LICENSE (SEE DOCUMENT FOR DETAILS). Assignors: BATTELLE MEMORIAL INSTITUTE, PACIFIC NORTHWEST DIVISION
Assigned to U.S. DEPARTMENT OF ENERGY. CONFIRMATORY LICENSE (SEE DOCUMENT FOR DETAILS). Assignors: LOS ALAMOS NATIONAL SECURITY
Publication of US20110119281A1
Assigned to TRIAD NATIONAL SECURITY, LLC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LOS ALAMOS NATIONAL SECURITY, LLC

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data

Definitions

  • Count and frequency functions convey to the projected count and frequency functions denoted c[I]: X ⁇ I ⁇ and f[I]:X ⁇ I ⁇ [0,1], so that
  • any set of record indices J ⁇ is called a filter.
  • the filtered count function can be considered c J :X ⁇ 0, 1, . . . ⁇ and frequency function ⁇ J :X ⁇ [0,1] whose values are reduced by the restriction in J ⁇ , now determining
  • each projector I ⁇ can be cast as a point in the Boolean lattice B N of dimension N called a projector lattice.
  • each filter J ⁇ is a point in a Boolean lattice B M called a filter lattice.
  • Operations on data views can then be defined as transitions from an initial view to another or , corresponding to a move in the view lattice B:
  • Flushing: Addition of records by weakening (reversing, flushing) the filter, so that $J' \supseteq J$. This corresponds to moving potentially multiple steps up in the filter lattice.
  • filters J defining which records to include in a view can be specified arbitrarily, for example through any SQL or MDX where clause, or through OLAP operations like top n, including the n records with the highest value of some feature.
  • filters are specified as relational expressions in terms of the dimensional values, as expressed in MDX where clauses.
  • each relational filter expression references a certain set of variables, in this case RPM Mfr and Month, denoted as R ⁇ .
  • Filtering expressions can have many sources, such as Show Only or Hide. It is common in full (hierarchical) OLAP to select a collection of siblings within a particular sub-branch of a hierarchical dimension. For example for a spatial dimension, the user within an OLAP database software system, such as ProClarity, might select All ⁇ USA ⁇ California, or its children California ⁇ Cities, all siblings. But those portions of filter expressions involving background variables do not change which rows or columns are displayed, but only serve to reduce the values shown in cells. In ProClarity, these are shown in the Background pane.
  • I ⁇ RPM Mfr, Location ⁇
  • R ⁇ RPM Mfr, Month ⁇
  • R f ⁇ RPM Mfr ⁇
  • the filter J is fixed and the superscript on f is suppressed.
  • I 2 ](x) is the probability of the vector x ⁇ I 1 ⁇ I 2 restricted to the I 1 ⁇ I 2 dimensions given that it is known that one can only choose vectors whose restriction to I 2 is x ⁇ I 2 .
  • ⁇ ](x) f[I 1 ](x),f[ ⁇
  • I 2 ] f[I 1 ⁇ I 2
  • conditional views live in a different combinatorial structure than the view lattice . Describing I 1
  • I 2 and J in a conditional view requires three sets I 1 ,I 2 ⁇ and J ⁇ with I 1 and I 2 disjoint. So define : 3 [N] ⁇ 2 M where 3 [N] is a graded poset with the following structure:
  • An element in the poset 3 [N] corresponds to an I 1
  • This poset is called 3[N] because its size is $3^N$ and it corresponds to partitioning the set of dimensions into three disjoint sets, the first being $I_1$, the second being $I_2$, and the third being the remaining dimensions outside $I_1 \cup I_2$.
  • the structure 3 [2] is shown in FIG. 2 .
  • a view ⁇ B which is identified with its frequency f J [I]
  • a conditional view ⁇ A which is identified with its conditional frequency f J [I 1
  • the aim is measuring how “interesting” or “unusual” it is, as measured by departures from a null model.
  • Such measures can be used for combinatorial search over the view structures B, A to identify noteworthy features in the data.
  • conditional entropy $H(f^J[I_1 \mid I_2]) := H(f^J[I_1 \cup I_2]) - H(f^J[I_2])$.
  • $D_\alpha(P \,\|\, Q) = \int \frac{\alpha\, p(x) + (1-\alpha)\, q(x) - p(x)^{\alpha}\, q(x)^{1-\alpha}}{\alpha (1-\alpha)} \, \mu(dx)$.
  • the Hellinger metric $\sqrt{D_{1/2}}$ is symmetric in both p and q, and satisfies the triangle inequality.
  • a variety of user-guided, and/or automated, navigational tasks can be embodied by the present invention.
  • “drill-down paths” can be described as creating a series of views with projectors I 1 ⁇ I 2 ⁇ I 3 of increasingly specified dimensional structure.
  • many analysts are challenged by complex views of high dimensionality, while still needing to explore many possible data interactions.
  • embodiments of the present invention can restrict analysts to two-dimensional views only, producing a sequence of projectors $I_1, I_2, I_3$ where $|I_k| = 2$ and $|I_k \cap I_{k+1}| = 1$, thus effecting a permutation of the variables $X_i$.
  • each calculated entropy can be supplemented with the probability, under the null distribution in which the row has the same distribution as the marginal, of observing an empirical entropy greater than or equal to the actual value. When that probability is large, say greater than 5%, the value can be considered spurious and set to zero before proceeding with the algorithm.
  • a hop operation and a chain operation can be performed in alternating order (i.e., a hop-chain operation).
  • One way of performing hop-chain view discovery is described below.
  • the marginal distribution is f X 1 x k 1 [I] of that individual row, using the superscript to indicate the relational expression filter. Also, the marginal f J [I ⁇ ⁇ X 1 ⁇ ] over all the rows for the current filter J is known. In light of the discussion just above, all the Hellinger distances can be calculated between each of the rows and this row marginal as
  • i′′ represents the variable with the most constraint against i′′, that may be the most appropriate selection, or it can be selected automatically.
  • ProClarity® is used in conjunction with SQL Server Analysis Services (SSAS) 2005 and the R statistical platform v. 2.7 (see http://www.r-project.org).
  • ProClarity® is a visual analytics tool that provides a flexible and friendly GUI environment with extensive API support which is used to gather current display contents and query context for row, column and background filter selections.
  • R is currently used in either batch or interactive mode for statistical analysis and development.
  • Microsoft Visual Studio .Net 2005® is used to develop plug-ins to ProClarity® to pass ProClarity® views to R for hop-chain calculations.
  • FIG. 3 is a screenshot from the ProClarity® tool.
  • the database is a collection of 1.9M records of RPM events.
  • the 15 available dimensions are shown on the left of the screen (e.g. “day of the month”, “RPM hierarchy”), tracking such things as the identities and characteristics of particular RPMs, time information about events, and information about the hardware, firmware, and software used at different RPMs.
  • FIG. 5 and FIG. 6 show the same distributions, but now in terms of their frequencies f relative to their corresponding marginals, allowing a comparison of the shapes of the distributions normalized by their absolute sizes. While the months still seem identical, the RPM roles are clearly different, although it is difficult to discern which one is most unusual with respect to the marginal (bold line).
  • the RPM roles “ECCF” and “Mail” are clearly the most significant, which can be verified by examining the anomalously shaped plots in FIG. 5 .
  • the most significant month is December, although this is hardly evident in FIG. 6 .
  • ⁇ i′′′ ⁇ ]) is calculated for all i′′′ ⁇ 3, 4, . . . , 15 ⁇ , which are shown in FIG. 7 b for all significant dimensions. On that basis, X 3 is selected as Day of Month with minimal H 3.22.

Abstract

Methods for discovering portions of a multi-dimensional database that are significant to an analyst can be computer-implemented. The methods can include specifying a data view having at least two dimensions and all records of the database. A plurality of operation iterations is then performed on the data view, wherein each iteration is a chain operation, a hop operation, or an anti-hop operation. The operation iterations cease upon satisfaction of a termination criterion. The resulting data view can then be presented to an analyst. The methods can facilitate users' knowledge discovery tasks and assist in finding relevant patterns, trends, and anomalies.

Description

    PRIORITY
  • This invention claims priority from U.S. Provisional Patent Application No. 61/262,403, entitled Methods for Discovering Significant Portions of a Multi-Dimensional Database, filed Nov. 18, 2009.
  • STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • This invention was made with Government support under Contract DE-AC0576RL01830 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.
  • BACKGROUND
  • The present invention is related to the field of relational database technology. OLAP technology is commonly credited with the ability to provide analysts with rapid access to summary, aggregated data views of a single large multi-dimensional database, and is recognized for its ability to provide knowledge representation and discovery in high-dimensional relational databases. OLAP tools can provide intuitive and graphical access to the massively complex set of possible summary views available in large relational structured data repositories. However, the ability to handle such data complexity also presents a wide-ranging, combinatorially vast space of options that can seem impossible to comprehend and/or analyze. Accordingly, there is a need for knowledge discovery techniques that guide users' knowledge discovery tasks and that assist in finding relevant patterns, trends, and anomalies.
  • SUMMARY
  • Embodiments of the present invention address the challenge of navigating a combinatorially vast space of data views of a multi-dimensional database by casting the space of data views as a combinatorial object comprising all projections and subsets and by casting the discovery of analyst-significant data views as a search process over that object. Statistical, information-theoretic measures are provided with the object and are sufficient to support a combinatorial optimization process. Accordingly, users can be guided, or taken automatically, across a permutation of the dimensions by searching for successive data views having two or more dimensions.
  • As used herein, a multi-dimensional database comprises a plurality of records with dimensions and is stored on a memory device. An exemplary multi-dimensional database is an online analytical processing (OLAP) database. A data view can refer to a subset of dimensions and data records from a multi-dimensional database and can represent a portion of the database that is significant to an analyst. In some embodiments, the data view comprises at most two dimensions because analysts typically experience difficulty comprehending additional dimensions.
  • In a particular embodiment of the present invention, the method for discovering portions of a multi-dimensional database that are significant to an analyst is computer-implemented and includes specifying a data view having at least two dimensions and all records of the database. A plurality of operation iterations is then performed on the data view, wherein each iteration is a chain operation, a hop operation, or an anti-hop operation. The operation iterations cease upon satisfaction of a termination criterion. Examples of termination criteria can include, but are not limited to, a command from an analyst, a uniform distribution of all remaining records across all remaining dimensions, a lack of remaining dimensions, or a lack of remaining records. The resulting data view can then be presented to an analyst.
  • A chain operation can comprise calculating a chain statistical significance measure for each value of each of the dimensions in the data view, selecting one or more chain values for a dimension in the view, adding the chain values to a filter, and removing the dimension of the chain values from the view. Exemplary chain statistical significance measures can include, but are not limited to, Hellinger distance, Hellinger distance augmented by p-value significance, relative entropy, and generalized alpha divergence. In some embodiments, the selecting of one or more chain values occurs automatically based on the values having maximal chain statistical significance measures.
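  • As a rough illustration of the chain measure, the sketch below scores each value of one dimension in a two-dimensional view by the Hellinger distance between that value's row distribution and the row marginal, and then picks the value with the maximal score. The function names, the normalization of the distance to the range [0, 1], and the small count table (RPM Mfr by Month) are assumptions made for illustration; they are not details fixed by this description.

```python
import math

def normalize(counts):
    """Turn a list of counts into a frequency distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumed normalization: 0 for identical distributions, 1 for disjoint ones.
    """
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def chain_scores(table):
    """table maps each row value to its counts over the column values.

    Returns the distance of every non-empty row from the marginal over all rows.
    """
    rows = list(table.values())
    marginal = normalize([sum(col) for col in zip(*rows)])
    return {value: hellinger(normalize(row), marginal)
            for value, row in table.items() if sum(row) > 0}

# Hypothetical two-dimensional view: rows are RPM Mfr values,
# columns are counts for Jan, Feb, Mar, Apr.
view = {"Ludlum": [12, 8, 7, 23], "SAIC": [7, 6, 7, 4]}

scores = chain_scores(view)
chain_value = max(scores, key=scores.get)  # automatic selection: maximal measure
print(chain_value, scores)
```

  • In a full chain operation the selected value would then be added to the filter and its dimension removed from the view; an analyst could equally override the automatic choice.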
  • A hop operation can comprise calculating a hop statistical significance measure, relative to the dimensions in the view and constrained by the filter, for each of the dimensions that is neither in the data view nor in the filter. The hop operation can further comprise selecting a hop dimension from the dimensions that are not in the view or in the filter and adding the hop dimension to the data view. Exemplary hop statistical significance measures can include, but are not limited to, conditional entropy and model likelihood metric. In some embodiments, the selecting of a hop dimension occurs automatically based on the dimensions having minimal hop statistical significance measures.
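  • The hop selection can be sketched in the same spirit. The example below computes a conditional entropy for every dimension that is neither in the view nor in the filter, and selects the dimension with the minimal value. Several choices here are assumptions for illustration only: records are represented as dictionaries, entropies use base-2 logarithms, the filter maps a dimension to a set of allowed values, and the measure is taken as the entropy of the view dimensions given the candidate dimension, using H(A | B) = H(A ∪ B) − H(B).

```python
import math
from collections import Counter

def entropy(records, dims):
    """Shannon entropy (bits) of the joint frequency distribution over dims."""
    counts = Counter(tuple(r[d] for d in dims) for r in records)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(records, target, given):
    """H(target | given) = H(target and given jointly) - H(given)."""
    return entropy(records, list(target) + list(given)) - entropy(records, list(given))

def hop(records, view_dims, filter_, all_dims):
    """Score each dimension outside the view and filter; return the minimal one."""
    kept = [r for r in records if all(r[d] in vals for d, vals in filter_.items())]
    candidates = [d for d in all_dims if d not in view_dims and d not in filter_]
    scores = {d: conditional_entropy(kept, view_dims, [d]) for d in candidates}
    return min(scores, key=scores.get), scores

# Hypothetical records: within role == "ECCF", vendor determines month exactly,
# while lane carries no information about month.
records = [
    {"role": "ECCF", "month": "Jan", "vendor": "A", "lane": "1"},
    {"role": "ECCF", "month": "Jan", "vendor": "A", "lane": "2"},
    {"role": "ECCF", "month": "Feb", "vendor": "B", "lane": "1"},
    {"role": "ECCF", "month": "Feb", "vendor": "B", "lane": "2"},
    {"role": "Mail", "month": "Jan", "vendor": "B", "lane": "1"},
]

best, scores = hop(records, ["month"], {"role": {"ECCF"}},
                   ["role", "month", "vendor", "lane"])
print(best, scores)
```

  • A low conditional entropy means the candidate dimension strongly constrains what is already in the view, which is why the minimal value is treated as the most informative hop in this sketch.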
  • An anti-hop operation can comprise calculating an anti-hop statistical significance measure, relative to other dimensions in the view and constrained by the filter, for each of the dimensions in the view. Exemplary anti-hop statistical significance measures can include, but are not limited to, relative entropy. The anti-hop operation can further comprise selecting an anti-hop dimension from the dimensions in the view and removing the anti-hop dimension from the view. In some embodiments, the selecting of an anti-hop dimension occurs automatically based on maximal relative entropy.
  • In a preferred embodiment, a hop operation and a chain operation are performed in alternating order.
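  • One possible shape of such an alternating loop is sketched below for two-dimensional views. It is a simplified illustration under stated assumptions rather than the described embodiment: records are dictionaries, each chain step keeps a single value, the chain measure is a Hellinger distance normalized to [0, 1], the hop measure is a conditional entropy in bits, and the loop stops when no records or no unused dimensions remain or after a hypothetical max_steps limit.

```python
import math
from collections import Counter

def dist(records, dim):
    """Frequency distribution of one dimension over the given records."""
    counts = Counter(r[dim] for r in records)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def hellinger(p, q):
    keys = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0)) - math.sqrt(q.get(k, 0))) ** 2
                         for k in keys) / 2)

def entropy(records, dims):
    counts = Counter(tuple(r[d] for d in dims) for r in records)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def chain(records, view):
    """Pick the (dimension, value) whose slice departs most from the marginal."""
    best = None
    for dim, other in (view, view[::-1]):
        marginal = dist(records, other)
        for value in {r[dim] for r in records}:
            rows = [r for r in records if r[dim] == value]
            score = hellinger(dist(rows, other), marginal)
            if best is None or score > best[0]:
                best = (score, dim, value)
    return best[1], best[2]

def hop(records, kept_dim, candidates):
    """Pick the candidate with minimal H(kept_dim | candidate)."""
    return min(candidates,
               key=lambda d: entropy(records, [kept_dim, d]) - entropy(records, [d]))

def hop_chain(records, dims, view, max_steps=10):
    filter_, path = {}, [tuple(view)]
    for _ in range(max_steps):
        if not records:                      # lack of remaining records
            break
        dim, value = chain(records, view)    # chain: move a value into the filter
        filter_[dim] = value
        records = [r for r in records if r[dim] == value]
        kept = [d for d in view if d != dim][0]
        candidates = [d for d in dims if d not in view and d not in filter_]
        if not candidates:                   # lack of remaining dimensions
            break
        view = [kept, hop(records, kept, candidates)]  # hop: add a new dimension
        path.append(tuple(view))
    return filter_, path

# Hypothetical records over four dimensions.
dims = ["role", "month", "day", "vendor"]
records = [
    {"role": "ECCF", "month": "Dec", "day": "1", "vendor": "A"},
    {"role": "ECCF", "month": "Dec", "day": "1", "vendor": "B"},
    {"role": "ECCF", "month": "Jan", "day": "2", "vendor": "A"},
    {"role": "Mail", "month": "Jan", "day": "1", "vendor": "A"},
    {"role": "Mail", "month": "Feb", "day": "2", "vendor": "B"},
    {"role": "Cargo", "month": "Jan", "day": "2", "vendor": "B"},
    {"role": "Cargo", "month": "Feb", "day": "1", "vendor": "A"},
    {"role": "Cargo", "month": "Feb", "day": "2", "vendor": "A"},
]
filter_, path = hop_chain(records, dims, ["role", "month"])
print(filter_, path)
```

  • Each consecutive pair of views in the returned path shares exactly one dimension, so the path walks across the dimensions one at a time while the filter accumulates the chained values.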
  • Embodiments of the present invention can be utilized at various degrees of automation for the analyst user. For example, in some embodiments, the data view can be initially populated with dimensions arbitrarily rather than relying on an analyst to specify the initial dimensions. Similarly, prior to performing the plurality of operation iterations, an empty filter can be created and arbitrarily populated with values for a dimension. In another example, while the chain, hop, and anti-hop operations can proceed substantially automatically as described above, the selection of one or more chain values, the selection of a hop dimension, or the selection of an anti-hop dimension can occur manually based on input from an analyst. When the selections are manual, the chain, hop, and/or anti-hop statistical significance measures can be considered by the analyst or they can be disregarded in favor of the analyst's knowledge or preference.
  • An analyst-guided approach can involve the present invention presenting suggested options, which the analyst can accept or override with manual selections.
  • The purpose of the foregoing abstract is to enable the United States Patent and Trademark Office and the public generally, especially the scientists, engineers, and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The abstract is neither intended to define the invention of the application, which is measured by the claims, nor is it intended to be limiting as to the scope of the invention in any way.
  • Various advantages and novel features of the present invention are described herein and will become further readily apparent to those skilled in this art from the following detailed description. In the preceding and following descriptions, the various embodiments, including the preferred embodiments, have been shown and described. Included herein is a description of the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of modification in various respects without departing from the invention. Accordingly, the drawings and description of the preferred embodiments set forth hereafter are to be regarded as illustrative in nature, and not as restrictive.
  • DESCRIPTION OF DRAWINGS
  • Embodiments of the invention are described below with reference to the following accompanying drawings.
  • FIG. 1 is an illustration depicting projection, extension, filtering, and flushing operations as well as an exemplary view operation according to embodiments of the present invention.
  • FIG. 2 is an illustration depicting the structure 3[2].
  • FIG. 3 is a screenshot of a first view of a data set as represented in a data visualization tool.
  • FIG. 4 is a plot showing the distribution of alarm counts by month.
  • FIG. 5 is a plot showing frequency distributions of radiation portal monitor (RPM) roles.
  • FIG. 6 is a plot showing frequency distributions of months.
  • FIG. 7 a is a plot showing Hellinger distances of rows and columns against their marginals.
  • FIG. 7 b is a plot showing relative entropy of months against each other significant dimension, given the RPM role=ECCF.
  • FIG. 8 is a screenshot of a subsequent view on the X2=Months×X3=Day of Month projector. Note the new background filter is RPM Role=ECCF.
  • DETAILED DESCRIPTION
  • The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore the present description should be seen as illustrative and not limiting. While the invention is susceptible of various modifications and alternative constructions, it should be understood, that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention as defined in the claims.
  • The following description of the present invention uses a mathematical formalism that is similar to the mathematical tools required to analyze OLAP databases, but is different in a number of ways as well. For example, projections, I, on dimensions and restrictions, J, on records are combined into a lattice-theoretical object called a view, $D_{I,J}$. Furthermore, OLAP concerns databases organized around collections of variables which can be distinguished as: dimensions, which have a hierarchical structure, and whose Cartesian product forms the data cube's schema; and measures, which can be numerically aggregated within different slices of that schema. The present description considers cubes with a single integral measure, which in some embodiments is the count of a number of records in the underlying database. However, any numerical measure could yield, through appropriate normalization, frequency distributions for use in the view discovery technique of the present invention.
  • The following examples and description are given in the context of an analyst and/or decision-maker responsible for analyzing a large relational database of records of events of personal vehicles, cargo vehicles, and others passing through radiation portal monitors (RPM) at US ports of entry. In OLAP database methodology, data cubes are multi-dimensional models of an underlying relational database. They are built by identifying a number of dimensions representing categories of interest from the database, each with a possibly hierarchical structure, and then forming their cross-product to represent all possible combinations of values of those dimensions, thus facilitating aggregation of critical quantities over multiple projections of interest. In this example database, the dimensions included multiple time representations, spatial hierarchies of collections of RPMs at different locations, and RPM attributes such as vendor. In this context, a vast collection of different views, focusing on different combinations of dimensions and different subsets of records, is available to the user.
  • Operations that can be performed in the view lattice of data tensor cubes can be described according to the following. Let $\mathbb{N} = \{1, 2, \ldots\}$ and $\mathbb{N}_N := \{1, 2, \ldots, N\}$. For some $N \in \mathbb{N}$, define a data cube as an N-dimensional tensor $\mathcal{D} := \langle X, \mathcal{X}, c \rangle$ where:
      • $\mathcal{X} := \{X_i\}_{i=1}^{N}$ is a collection of N variables or columns with $X_i := \{x_{k_i}\}_{k_i=1}^{L_i} \in \mathcal{X}$;
      • $X := \times_{X_i \in \mathcal{X}} X_i$ is a data space or data schema whose members are N-dimensional vectors $x = \langle x_{k_1}, x_{k_2}, \ldots, x_{k_N} \rangle = \langle x_{k_i} \rangle_{i=1}^{N} \in X$ called slots;
      • $c : X \to \{0, 1, \ldots\}$ is a count function.
  • Let $M := \sum_{x \in X} c(x)$ be the total number of records in the database. Then $\mathcal{D}$ also has relative frequencies f on the cells, so that $f : X \to [0,1]$, where
  • $$f(x) = \frac{c(x)}{M},$$
  • and thus $\sum_{x \in X} f(x) = 1$. An example of a data tensor with simulated data for our RPM cube is shown in Table 1, for $\mathcal{X} = \{X_1, X_2, X_3\}$ = {RPM Manufacturer, Location, Month}, with RPM Mfr={Ludlum, SAIC}, Location={New York, Seattle, Miami}, and Month={January, February, March, April}, so that N=3. The table shows the counts c(x), so that M=74, and the frequencies f(x).
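  • TABLE 1
    An example data tensor involving RPM data. Blank entries repeat
    the elements above, and rows with zero counts are suppressed.
    | RPM Mfr | Location | Month | c(x) | f(x) |
    |---|---|---|---|---|
    | Ludlum | New York | Jan | 1 | 0.014 |
    |  |  | Mar | 3 | 0.041 |
    |  |  | Apr | 7 | 0.095 |
    |  | Seattle | Jan | 9 | 0.122 |
    |  |  | Apr | 15 | 0.203 |
    |  | Miami | Jan | 2 | 0.027 |
    |  |  | Feb | 8 | 0.108 |
    |  |  | Mar | 4 | 0.054 |
    |  |  | Apr | 1 | 0.014 |
    | SAIC | New York | Jan | 1 | 0.014 |
    |  | Seattle | Feb | 4 | 0.054 |
    |  |  | Mar | 3 | 0.041 |
    |  |  | Apr | 3 | 0.041 |
    |  | Miami | Jan | 6 | 0.081 |
    |  |  | Feb | 2 | 0.027 |
    |  |  | Mar | 4 | 0.054 |
    |  |  | Apr | 1 | 0.014 |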
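  • The relationship between the counts c(x), the total M, and the frequencies f(x) in Table 1 can be checked with a short sketch. The sparse dictionary layout and the names `COUNTS` and `frequencies` are illustrative assumptions, not part of the described embodiments; only the numbers come from Table 1.

```python
# Sparse representation of the Table 1 data tensor: slot -> count c(x).
# Slots with zero count are simply absent, as in the table.
COUNTS = {
    ("Ludlum", "New York", "Jan"): 1, ("Ludlum", "New York", "Mar"): 3,
    ("Ludlum", "New York", "Apr"): 7, ("Ludlum", "Seattle", "Jan"): 9,
    ("Ludlum", "Seattle", "Apr"): 15, ("Ludlum", "Miami", "Jan"): 2,
    ("Ludlum", "Miami", "Feb"): 8, ("Ludlum", "Miami", "Mar"): 4,
    ("Ludlum", "Miami", "Apr"): 1, ("SAIC", "New York", "Jan"): 1,
    ("SAIC", "Seattle", "Feb"): 4, ("SAIC", "Seattle", "Mar"): 3,
    ("SAIC", "Seattle", "Apr"): 3, ("SAIC", "Miami", "Jan"): 6,
    ("SAIC", "Miami", "Feb"): 2, ("SAIC", "Miami", "Mar"): 4,
    ("SAIC", "Miami", "Apr"): 1,
}


def frequencies(counts):
    """Return (M, f), where M is the total count and f(x) = c(x) / M."""
    total = sum(counts.values())
    return total, {slot: c / total for slot, c in counts.items()}


M, f = frequencies(COUNTS)
print(M)                                          # 74
print(round(f[("Ludlum", "Seattle", "Apr")], 3))  # 0.203
```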
  • At any time, it is possible to look at a projection of $\mathcal{D}$ along a sub-cross-product involving only certain dimensions with indices $I \subseteq \mathbb{N}_N$. Call I a projector, and denote $x{\downarrow}I = \langle x_{k_i} \rangle_{i \in I} \in X{\downarrow}I$, where $X{\downarrow}I := \times_{i \in I} X_i$, as a projected vector and data schema. One can write $x{\downarrow}i$ for $x{\downarrow}\{i\}$, and for projectors $I \subseteq I'$ and vectors $x, x' \in X$, $x{\downarrow}I \subseteq x'{\downarrow}I'$ is used to mean $\forall i \in I,\ x{\downarrow}i = x'{\downarrow}i$.
  • Count and frequency functions carry over to the projected count and frequency functions denoted $c[I] : X{\downarrow}I \to \{0, 1, \ldots\}$ and $f[I] : X{\downarrow}I \to [0,1]$, so that

  • $$c[I](x{\downarrow}I) = \sum_{x'{\downarrow}I = x{\downarrow}I} c(x') \tag{1}$$

  • $$f[I](x{\downarrow}I) = \sum_{x'{\downarrow}I = x{\downarrow}I} f(x') \tag{2}$$
  • and $\sum_{x{\downarrow}I \in X{\downarrow}I} f[I](x{\downarrow}I) = 1$. In other words, the counts (resp. frequencies) are added over all vectors $x' \in X$ such that $x'{\downarrow}I = x{\downarrow}I$. This is just the process of building the I-marginal over f, seen as a joint distribution over the $X_i$ for $i \in I$.
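  • As a minimal sketch of equation (1), the projected counts can be accumulated by grouping slots on the projector's indices. The function name `project` and the 0-based index convention (so that the projector {1,2} of the text becomes `(0, 1)`) are assumptions made for illustration; the data are the Table 1 counts.

```python
from collections import Counter

# Table 1 counts, keyed by (RPM Mfr, Location, Month).
COUNTS = {
    ("Ludlum", "New York", "Jan"): 1, ("Ludlum", "New York", "Mar"): 3,
    ("Ludlum", "New York", "Apr"): 7, ("Ludlum", "Seattle", "Jan"): 9,
    ("Ludlum", "Seattle", "Apr"): 15, ("Ludlum", "Miami", "Jan"): 2,
    ("Ludlum", "Miami", "Feb"): 8, ("Ludlum", "Miami", "Mar"): 4,
    ("Ludlum", "Miami", "Apr"): 1, ("SAIC", "New York", "Jan"): 1,
    ("SAIC", "Seattle", "Feb"): 4, ("SAIC", "Seattle", "Mar"): 3,
    ("SAIC", "Seattle", "Apr"): 3, ("SAIC", "Miami", "Jan"): 6,
    ("SAIC", "Miami", "Feb"): 2, ("SAIC", "Miami", "Mar"): 4,
    ("SAIC", "Miami", "Apr"): 1,
}


def project(counts, projector):
    """Equation (1): add counts over all slots that agree on the projector."""
    out = Counter()
    for slot, c in counts.items():
        out[tuple(slot[i] for i in projector)] += c
    return dict(out)


# Projecting onto (RPM Mfr, Location) marginalizes out Month.
by_mfr_location = project(COUNTS, (0, 1))
print(by_mfr_location[("Ludlum", "Seattle")])  # 24
```

  • Because projection only regroups counts, the total over the projected slots is still M=74.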
  • Any set of record indices $J \subseteq \mathbb{N}_M$ is called a filter. The filtered count function $c^J : X \to \{0, 1, \ldots\}$ and frequency function $f^J : X \to [0,1]$ can then be considered, whose values are reduced by the restriction to $J \subseteq \mathbb{N}_M$, now determining

  • $$M' := \sum_{x \in X} c^J(x) = |J| \le M. \tag{3}$$
  • The frequencies $f^J$ can be renormalized over the resulting M′ to derive
  • $$f^J(x) = \frac{c^J(x)}{M'}, \tag{4}$$
  • so that still $\sum_{x \in X} f^J(x) = 1$. Finally, when both a projector I and a filter J are available, then $c^J[I] : X{\downarrow}I \to \{0, 1, \ldots\}$ and $f^J[I] : X{\downarrow}I \to [0,1]$ are defined analogously, where now $\sum_{x{\downarrow}I \in X{\downarrow}I} f^J[I](x{\downarrow}I) = 1$. Given a data cube $\mathcal{D}$, denote $\mathcal{D}_{I,J}$ as a view of $\mathcal{D}$, restricting attention to just the J records projected onto just the I dimensions $X{\downarrow}I$, and determining counts $c^J[I]$ and frequencies $f^J[I]$.
  • In a lattice-theoretical context, each projector $I \subseteq \mathbb{N}_N$ can be cast as a point in the Boolean lattice $B^N$ of dimension N, called a projector lattice. Similarly, each filter $J \subseteq \mathbb{N}_M$ is a point in a Boolean lattice $B^M$ called a filter lattice. Thus each view $\mathcal{D}_{I,J}$ maps to a unique node in the view lattice $\mathcal{B} := B^N \times B^M = 2^N \times 2^M$, the Cartesian product of the projector and filter lattices.
  • Operations on data views can then be defined as transitions from an initial view $\mathcal{D}_{I,J}$ to another view $\mathcal{D}_{I',J}$ or $\mathcal{D}_{I,J'}$, corresponding to a move in the view lattice $\mathcal{B}$:
  • Projection: Removal of a dimension so that $I' = I \setminus \{i\}$ for some $i \in I$. This corresponds to moving a single step down in $B^N$, and to marginalization in statistical analyses. This results in $\forall x'{\downarrow}I' \in X{\downarrow}I'$,

  • $$c^J[I'](x'{\downarrow}I') = \sum_{x{\downarrow}I \supseteq x'{\downarrow}I'} c^J[I](x{\downarrow}I). \tag{5}$$
  • This is also identified as an “anti-hop” operation.
  • Extension: Addition of a dimension so that $I' = I \cup \{i\}$ for some $i \notin I$. This corresponds to moving a single step up in $B^N$, which results in a disaggregation, or distribution, of information about the I dimensions over the $I' \setminus I$ dimensions. Notationally, this is the converse of (5), so that $\forall x{\downarrow}I \in X{\downarrow}I$,

  • $$\sum_{x'{\downarrow}I' \supseteq x{\downarrow}I} c^J[I'](x'{\downarrow}I') = c^J[I](x{\downarrow}I).$$
  • This is also identified as a “hop” operation.
  • Filtering: Removal of records by strengthening the filter, so that $J' \subseteq J$. This corresponds to moving potentially multiple steps down in $B^M$.
  • Flushing: Addition of records by weakening (reversing, flushing) the filter, so that $J' \supseteq J$. This corresponds to moving potentially multiple steps up in $B^M$.
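  • The four operations above can be sketched as moves on a pair (I, J) of sets. The class name `View` and its method names are hypothetical and chosen only for illustration; the example reproduces the FIG. 1 situation with dimensions {X, Y} and records {a, b}.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """A node (I, J) of the view lattice: a projector and a filter."""
    projector: frozenset  # I, a set of dimension labels
    records: frozenset    # J, a set of record labels

    def project(self, dim):
        """Projection ("anti-hop"): one step down in the projector lattice."""
        if dim not in self.projector:
            raise ValueError(f"{dim!r} is not in the view")
        return View(self.projector - {dim}, self.records)

    def extend(self, dim):
        """Extension ("hop"): one step up in the projector lattice."""
        if dim in self.projector:
            raise ValueError(f"{dim!r} is already in the view")
        return View(self.projector | {dim}, self.records)

    def filter(self, keep):
        """Filtering: J' is a subset of J, possibly several steps down."""
        return View(self.projector, self.records & frozenset(keep))

    def flush(self, add):
        """Flushing: J' is a superset of J, possibly several steps up."""
        return View(self.projector, self.records | frozenset(add))


# Project the records {a} from the two-dimensional view {X, Y} onto {X}.
start = View(frozenset({"X", "Y"}), frozenset({"a"}))
end = start.project("Y")
print(sorted(end.projector), sorted(end.records))  # ['X'] ['a']
```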
  • Repeated view operations thus map to trajectories in $\mathcal{B}$. Consider the example shown in FIG. 1 for N=M=2 with dimensions $\mathcal{X} = \{X, Y\}$ and two N-dimensional data vectors $a, b \in X \times Y$, and denote e.g. $X/ab = \{a{\downarrow}\{X\}, b{\downarrow}\{X\}\}$. The left side of FIG. 1 shows the separate projector and filter lattices (bottom nodes ∅ not shown), with extension as a transition to a higher rank in the lattice and projection as a downward transition. Similarly, filtering and flushing are the corresponding operations in the filter lattice. The view lattice is shown on the right, along with a particular view operation $\mathcal{D}_{\{X,Y\},\{a\}} \mapsto \mathcal{D}_{\{X\},\{a\}}$, which projects the subset of records {a} from the two-dimensional view $\{X, Y\} = \mathcal{X}$ to the one-dimensional view $\{X\} \subseteq \mathcal{X}$.
  • Regarding relational expressions and background filtering, typically M>>N, so that there are far more records than dimensions (in the present example, M=74 >3=N). In principle, filters J defining which records to include in a view can be specified arbitrarily, for example through any SQL or MDX where clause, or through OLAP operations like top n, including the n records with the highest value of some feature. In practice, filters are specified as relational expressions in terms of the dimensional values, as expressed in MDX where clauses. An example of a filter can include where RPM Mfr=“Ludlum” and (Month<=“February” and Month>=“January”), using chronological order on the Month variable to determine a filter J specifying just those 20 out of the total possible 74 records. For notational purposes, sometimes these relational expressions will be used to indicate the corresponding filters.
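  • The example filter can be sketched as a predicate over slots, followed by the renormalization of equation (4). The names `apply_filter`, `ludlum_jan_feb`, and `MONTH_ORDER` are illustrative assumptions; a real embodiment would evaluate the MDX or SQL where clause in the OLAP engine rather than in Python.

```python
# Table 1 counts, keyed by (RPM Mfr, Location, Month).
COUNTS = {
    ("Ludlum", "New York", "Jan"): 1, ("Ludlum", "New York", "Mar"): 3,
    ("Ludlum", "New York", "Apr"): 7, ("Ludlum", "Seattle", "Jan"): 9,
    ("Ludlum", "Seattle", "Apr"): 15, ("Ludlum", "Miami", "Jan"): 2,
    ("Ludlum", "Miami", "Feb"): 8, ("Ludlum", "Miami", "Mar"): 4,
    ("Ludlum", "Miami", "Apr"): 1, ("SAIC", "New York", "Jan"): 1,
    ("SAIC", "Seattle", "Feb"): 4, ("SAIC", "Seattle", "Mar"): 3,
    ("SAIC", "Seattle", "Apr"): 3, ("SAIC", "Miami", "Jan"): 6,
    ("SAIC", "Miami", "Feb"): 2, ("SAIC", "Miami", "Mar"): 4,
    ("SAIC", "Miami", "Apr"): 1,
}

# Chronological order on the Month variable.
MONTH_ORDER = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4}


def ludlum_jan_feb(slot):
    """where RPM Mfr = "Ludlum" and (Month <= "February" and Month >= "January")"""
    mfr, _location, month = slot
    return mfr == "Ludlum" and MONTH_ORDER["Jan"] <= MONTH_ORDER[month] <= MONTH_ORDER["Feb"]


def apply_filter(counts, predicate):
    """Return (c^J, M', f^J) for the records selected by the predicate."""
    kept = {slot: c for slot, c in counts.items() if predicate(slot)}
    m_prime = sum(kept.values())
    if m_prime == 0:
        raise ValueError("the filter selects no records")
    return kept, m_prime, {slot: c / m_prime for slot, c in kept.items()}


kept, m_prime, freqs = apply_filter(COUNTS, ludlum_jan_feb)
print(m_prime)                                  # 20
print(freqs[("Ludlum", "Seattle", "Jan")])      # 0.45
```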
  • Note that each relational filter expression references a certain set of variables, in this case RPM Mfr and Month, denoted as $R \subseteq \mathbb{N}_N$. Compared to the projector I, R naturally divides into two groups of variables:
  • Foreground: Those variables in $R_f := R \cap I$, which both appear in the filter expression and are included in the current projection.
  • Background: Those variables in $R_b := R \setminus I$, which appear only in the filter expression and are not part of the current projection.
  • The portions of filter expressions involving foreground variables restrict the rows and columns displayed in the OLAP tool. Filtering expressions can have many sources, such as Show Only or Hide. It is common in full (hierarchical) OLAP to select a collection of siblings within a particular sub-branch of a hierarchical dimension. For example for a spatial dimension, the user within an OLAP database software system, such as ProClarity, might select All→USA→California, or its children California→Cities, all siblings. But those portions of filter expressions involving background variables do not change which rows or columns are displayed, but only serve to reduce the values shown in cells. In ProClarity, these are shown in the Background pane.
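  • The foreground/background split is plain set arithmetic, as the following sketch shows. The function name `split_filter_variables` is hypothetical; the example values are those of the hybrid view discussed in the Example below (projector {RPM Mfr, Location}, filter variables {RPM Mfr, Month}).

```python
def split_filter_variables(filter_vars, projector):
    """Return (R_f, R_b): filter variables inside and outside the projector."""
    r = frozenset(filter_vars)
    i = frozenset(projector)
    return r & i, r - i


foreground, background = split_filter_variables(
    filter_vars={"RPM Mfr", "Month"},
    projector={"RPM Mfr", "Location"},
)
print(sorted(foreground))  # ['RPM Mfr']
print(sorted(background))  # ['Month']
```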
  • EXAMPLE
  • Table 2 shows the results of four view operations from the example data in Table 1, including a projection $I = \{1,2,3\} \mapsto I' = \{1,2\}$, a filter using relational expressions, and a filter using a non-relational expression. Table 2d shows a hybrid result of applying both the projector I′={1,2} and the relational filter expression where RPM Mfr=“Ludlum” and (Month<=“February” and Month>=“January”). Compare this to Table 2a, which has the same dimensionality: the Month condition, being a background filter, does not change which rows or columns are shown and only reduces the cell values. Here I={RPM Mfr, Location}, R={RPM Mfr, Month}, $R_f$={RPM Mfr}, $R_b$={Month}, M′=20.
  • Table 2a
    | RPM Mfr | Location | c[I′](x) | f[I′](x) |
    |---|---|---|---|
    | Ludlum | New York | 11 | 0.150 |
    |  | Seattle | 24 | 0.325 |
    |  | Miami | 15 | 0.203 |
    | SAIC | New York | 1 | 0.014 |
    |  | Seattle | 10 | 0.136 |
    |  | Miami | 13 | 0.176 |
    Table 2b
    | RPM Mfr | Location | Month | c^J′(x) | f^J′(x) |
    |---|---|---|---|---|
    | Ludlum | New York | Jan | 1 | 0.050 |
    |  | Seattle | Jan | 9 | 0.450 |
    |  | Miami | Jan | 2 | 0.100 |
    |  |  | Feb | 8 | 0.400 |
    Table 2c
    | RPM Mfr | Location | Month | c^J′(x) | f^J′(x) |
    |---|---|---|---|---|
    | Ludlum | Seattle | Apr | 15 | 0.333 |
    |  |  | Jan | 9 | 0.200 |
    |  | Miami | Feb | 8 | 0.178 |
    |  | New York | Apr | 7 | 0.156 |
    | SAIC | Miami | Jan | 6 | 0.133 |
    Table 2d
    | RPM Mfr | Location | c^J′[I′](x) | f^J′[I′](x) |
    |---|---|---|---|
    | Ludlum | New York | 1 | 0.050 |
    |  | Seattle | 9 | 0.450 |
    |  | Miami | 10 | 0.500 |
    Table 2a-2d: Results from view operations $\mathcal{D}_{I,J} \mapsto \mathcal{D}_{I',J'}$ from the data cube in Table 1. (Table 2a) Projection: I′ = {1, 2}, M′ = M = 74. (Table 2b) Filter: J′ = where RPM Mfr = “Ludlum” and (Month <= “Feb” and Month >= “Jan”), M′ = 20. (Table 2c) Filter: J′ determined from top 5 most frequent entries, M′ = 45. (Table 2d) I′ = {1, 2} and J′ determined by the relational expression where RPM Mfr = “Ludlum” and (Month <= “Feb” and Month >= “Jan”), M′ = 20.
  • In some instances, the filter J is fixed and the superscript on f is suppressed. The frequencies $f : X \to [0,1]$ represent joint probabilities $f(x) = f(x_{k_1}, x_{k_2}, \ldots, x_{k_N})$, so that from (2) and (5), $f[I](x{\downarrow}I)$ expresses the I-way marginal over a joint probability distribution f. Now consider two projectors $I_1, I_2 \subseteq \mathbb{N}_N$, so that a conditional frequency $f[I_1|I_2] : X{\downarrow}(I_1 \cup I_2) \to [0,1]$, where
  • $$f[I_1|I_2] := \frac{f[I_1 \cup I_2]}{f[I_2]},$$
  • can be defined. For individual vectors, this reads as follows.
  • $$f[I_1|I_2](x) = f[I_1|I_2](x{\downarrow}(I_1 \cup I_2)) := \frac{f[I_1 \cup I_2](x{\downarrow}(I_1 \cup I_2))}{f[I_2](x{\downarrow}I_2)}.$$
  • $f[I_1|I_2](x)$ is the probability of the vector $x{\downarrow}(I_1 \cup I_2)$ restricted to the $I_1 \cup I_2$ dimensions, given that one can only choose vectors whose restriction to $I_2$ is $x{\downarrow}I_2$. Note that $f[I_1|\emptyset](x) = f[I_1](x)$ and $f[\emptyset|I_2] \equiv 1$, and since $f[I_1|I_2] = f[I_1 \setminus I_2 | I_2]$, in general assume that $I_1$ and $I_2$ are disjoint.
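  • A minimal sketch of the conditional frequency follows, using the RPM Mfr × Location counts of Table 2a as the joint distribution. The helper names `marginal` and `conditional` and the 0-based dimension indices are assumptions made for illustration.

```python
from collections import defaultdict

# Counts over (RPM Mfr, Location), as in Table 2a; M = 74.
COUNTS_2A = {
    ("Ludlum", "New York"): 11, ("Ludlum", "Seattle"): 24, ("Ludlum", "Miami"): 15,
    ("SAIC", "New York"): 1, ("SAIC", "Seattle"): 10, ("SAIC", "Miami"): 13,
}
TOTAL = sum(COUNTS_2A.values())
FREQ = {slot: c / TOTAL for slot, c in COUNTS_2A.items()}


def marginal(freq, projector):
    """f[I]: add frequencies over slots that agree on the projector."""
    out = defaultdict(float)
    for slot, p in freq.items():
        out[tuple(slot[i] for i in projector)] += p
    return dict(out)


def conditional(freq, i1, i2):
    """f[I1|I2] := f[I1 u I2] / f[I2], keyed by slots over sorted(I1 u I2)."""
    union = tuple(sorted(set(i1) | set(i2)))
    given_dims = tuple(sorted(i2))
    joint = marginal(freq, union)
    given = marginal(freq, given_dims)
    position = {dim: n for n, dim in enumerate(union)}
    return {
        slot: p / given[tuple(slot[position[d]] for d in given_dims)]
        for slot, p in joint.items()
    }


# Frequency of each Location given the RPM manufacturer.
loc_given_mfr = conditional(FREQ, i1=(1,), i2=(0,))
print(round(loc_given_mfr[("Ludlum", "Seattle")], 2))  # 0.48
```

  • With an empty conditioning set the function reduces to the plain marginal, matching the note that $f[I_1|\emptyset] = f[I_1]$.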
  • The concept of a view can then be extended to a conditional view $\mathcal{D}_{I_1|I_2,J}$ as a view on $\mathcal{D}_{I_1 \cup I_2,J}$, which is further equipped with the conditional frequency $f^J[I_1|I_2]$. Conditional views $\mathcal{D}_{I_1|I_2,J}$ live in a different combinatorial structure than the view lattice $\mathcal{B}$. Describing $I_1|I_2$ and J in a conditional view requires three sets $I_1, I_2 \subseteq \mathbb{N}_N$ and $J \subseteq \mathbb{N}_M$ with $I_1$ and $I_2$ disjoint. So define $\mathcal{A} := 3^{[N]} \times 2^M$, where $3^{[N]}$ is a graded poset with the following structure:
      • N+1 levels numbered from the bottom 0, 1, . . . , N.
      • The i-th level contains all partitions of each of the sets in $\binom{[N]}{i}$, that is, the i-element subsets of $\mathbb{N}_N$, into two parts where
      • 1. The order of the parts is significant, so that [{1,3}, {4}] and [{4}, {1,3}] of {1,3,4} are not equivalent.
      • 2. The empty set is an allowed member of a partition, so [{1,3,4}, ∅] is in the third level of $3^{[N]}$ for N ≥ 4.
      • The two sets are written without set brackets and with a | separating them.
      • The partial order is given by an extended subset relation: if $I_1 \subseteq I'_1$ and $I_2 \subseteq I'_2$, then $I_1|I_2 \preceq I'_1|I'_2$, e.g. 1 2|3 ≼ 1 2 4|3.
  • An element in the poset $3^{[N]}$ corresponds to an $I_1|I_2$ by letting $I_1$ (resp. $I_2$) be the elements to the left (resp. right) of the |. This poset is called $3^{[N]}$ because its size is $3^N$ and it really corresponds to partitioning $\mathbb{N}_N$ into three disjoint sets, the first being $I_1$, the second being $I_2$, and the third being $\mathbb{N}_N \setminus (I_1 \cup I_2)$. The structure $3^{[2]}$ is shown in FIG. 2.
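  • The size and grading of this poset can be checked by enumeration: each dimension is assigned to $I_1$, to $I_2$, or to neither, which gives $3^N$ elements. The function names below are hypothetical and serve only as an illustration.

```python
from itertools import product


def conditional_projectors(n):
    """All ordered pairs (I1, I2) of disjoint subsets of {1, ..., n}."""
    elements = []
    # Label each dimension: 0 = unused, 1 = in I1, 2 = in I2.
    for labels in product((0, 1, 2), repeat=n):
        i1 = frozenset(d for d, lab in enumerate(labels, start=1) if lab == 1)
        i2 = frozenset(d for d, lab in enumerate(labels, start=1) if lab == 2)
        elements.append((i1, i2))
    return elements


def level(element):
    """Rank in the graded poset: the number of dimensions used."""
    i1, i2 = element
    return len(i1) + len(i2)


def precedes(a, b):
    """Extended subset order: I1|I2 precedes I1'|I2' iff I1 <= I1' and I2 <= I2'."""
    return a[0] <= b[0] and a[1] <= b[1]


poset = conditional_projectors(2)
print(len(poset))                              # 9
print([sum(1 for e in poset if level(e) == k) for k in range(3)])  # [1, 4, 4]
```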
  • For a view $\mathcal{D}_{I,J} \in \mathcal{B}$, which is identified with its frequency $f^J[I]$, or a conditional view $\mathcal{D}_{I_1|I_2,J} \in \mathcal{A}$, which is identified with its conditional frequency $f^J[I_1|I_2]$, the aim is measuring how “interesting” or “unusual” it is, as measured by departures from a null model. Such measures can be used for combinatorial search over the view structures $\mathcal{B}$, $\mathcal{A}$ to identify noteworthy features in the data. The entropy of an unconditional view $\mathcal{D}_{I,J}$

  • $$H(f^J[I]) := -\sum_{x \in X{\downarrow}I} f^J[I](x) \log(f^J[I](x))$$
  • is a well-established measure of the information content of that view. A view has maximal entropy when every slot has the same expected count. Given a conditional view $\mathcal{D}_{I_1|I_2,J}$, we define the conditional entropy $H(f^J[I_1|I_2])$ to be the expected entropy of the conditional distribution $f^J[I_1|I_2]$, which operationally is related to the unconditional entropy as

  • $$H(f^J[I_1|I_2]) := H(f^J[I_1 \cup I_2]) - H(f^J[I_2]).$$
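  • The two entropy definitions can be sketched as follows, again on the RPM Mfr × Location counts of Table 2a. The logarithm base is not stated in the description; base 2 is assumed here, and the helper names are illustrative.

```python
import math
from collections import defaultdict

# Counts over (RPM Mfr, Location), as in Table 2a; M = 74.
COUNTS_2A = {
    ("Ludlum", "New York"): 11, ("Ludlum", "Seattle"): 24, ("Ludlum", "Miami"): 15,
    ("SAIC", "New York"): 1, ("SAIC", "Seattle"): 10, ("SAIC", "Miami"): 13,
}
FREQ = {slot: c / 74 for slot, c in COUNTS_2A.items()}


def marginal(freq, projector):
    """f[I]: add frequencies over slots that agree on the projector."""
    out = defaultdict(float)
    for slot, p in freq.items():
        out[tuple(slot[i] for i in projector)] += p
    return dict(out)


def entropy(freq, base=2.0):
    """H(f) = -sum f(x) log f(x), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p, base) for p in freq.values() if p > 0)


def conditional_entropy(freq, i1, i2, base=2.0):
    """H(f[I1|I2]) := H(f[I1 u I2]) - H(f[I2])."""
    union = tuple(sorted(set(i1) | set(i2)))
    return entropy(marginal(freq, union), base) - entropy(marginal(freq, tuple(sorted(i2))), base)


# Entropy of Location given RPM Mfr; it cannot exceed the entropy of Location alone.
h_loc_given_mfr = conditional_entropy(FREQ, i1=(1,), i2=(0,))
h_loc = entropy(marginal(FREQ, (1,)))
print(h_loc_given_mfr <= h_loc)  # True
```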
  • Given two views $\mathcal{D}_{I,J}$ and $\mathcal{D}_{I,J'}$ of the same dimensionality I, but with different filters J and J′, the relative entropy (Kullback-Leibler divergence)
  • $$D(f^J[I] \,\|\, f^{J'}[I]) := \sum_{x \in X{\downarrow}I} f^J[I](x) \log\left(\frac{f^J[I](x)}{f^{J'}[I](x)}\right)$$
  • is a well-known measure of the dissimilarity of $f^J[I]$ from $f^{J'}[I]$. D is zero if and only if $f^J[I] = f^{J'}[I]$, but it is not a metric because it is not symmetric, i.e., in general $D(f^J[I] \,\|\, f^{J'}[I]) \neq D(f^{J'}[I] \,\|\, f^J[I])$.
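  • The asymmetry can be seen on two Location distributions from the running example: the Ludlum, January to February view of Table 2d, and the Location marginal of all 74 records of Table 1. The function name `relative_entropy` and the base-2 logarithm are assumptions made for this sketch.

```python
import math


def relative_entropy(p, q, base=2.0):
    """D(p || q) = sum p(x) log(p(x) / q(x)); infinite if q(x) = 0 < p(x)."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # convention 0 * log(0) = 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx, base)
    return total


# Location frequencies under the relational filter (Table 2d).
filtered = {"New York": 0.05, "Seattle": 0.45, "Miami": 0.50}
# Location marginal over all records (counts 12, 34, 28 out of 74 in Table 1).
unfiltered = {"New York": 12 / 74, "Seattle": 34 / 74, "Miami": 28 / 74}

forward = relative_entropy(filtered, unfiltered)
backward = relative_entropy(unfiltered, filtered)
print(forward != backward)  # True
```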
  • D is a special case of a larger class of α-divergence measures between distributions. Given two probability distributions P and Q, write the densities with respect to the dominating measure $\mu = P + Q$ as $p = dP/d(P+Q)$ and $q = dQ/d(P+Q)$. For any $\alpha \in \mathbb{R}$, the α-divergence is
  • $$D_\alpha(P \,\|\, Q) = \int \frac{\alpha p(x) + (1-\alpha) q(x) - p(x)^{\alpha} q(x)^{1-\alpha}}{\alpha(1-\alpha)} \, d\mu(x).$$
  • The α-divergence is convex with respect to both p and q, is non-negative, and is zero if and only if p=q μ-almost everywhere. For α≠0,1, the α-divergence is bounded. The limit when α→1 returns the relative entropy between P and Q. There are other special cases that are of interest to us:
  • $$D_2(P \,\|\, Q) = \frac{1}{2} \int \frac{(p(x) - q(x))^2}{q(x)} \, d\mu(x), \qquad D_{-1}(P \,\|\, Q) = \frac{1}{2} \int \frac{(q(x) - p(x))^2}{p(x)} \, d\mu(x), \qquad D_{1/2}(P \,\|\, Q) = 2 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x).$$
  • In particular the Hellinger metric $\sqrt{D_{1/2}}$ is symmetric in p and q, and satisfies the triangle inequality. We prefer the Hellinger distance over the relative entropy because it is a bona fide metric and remains bounded. In our case and notation, we have the Hellinger distance as
  • $$G(f^J[I], f^{J'}[I]) := \sqrt{\sum_{x \in X{\downarrow}I} \left(\sqrt{f^J[I](x)} - \sqrt{f^{J'}[I](x)}\right)^2}.$$
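  • For discrete distributions the integrals become sums over slots, and the link between G and $D_{1/2}$ can be checked numerically. The sketch below assumes the square-root form of the Hellinger distance (no extra normalizing constant), and, for α outside (0, 1), that every slot has positive mass in both distributions; the function names are illustrative.

```python
import math


def hellinger(p, q):
    """G(p, q) = sqrt(sum (sqrt p(x) - sqrt q(x))^2) over the union of slots."""
    keys = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys))


def alpha_divergence(p, q, alpha):
    """Discrete alpha-divergence; alpha = 0 and alpha = 1 exist only as limits."""
    if alpha in (0, 1):
        raise ValueError("alpha = 0 and alpha = 1 are defined only as limits")
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        total += alpha * pk + (1 - alpha) * qk - (pk ** alpha) * (qk ** (1 - alpha))
    return total / (alpha * (1 - alpha))


# Location frequencies: Table 2d view versus the Table 1 marginal (12, 34, 28 of 74).
filtered = {"New York": 0.05, "Seattle": 0.45, "Miami": 0.50}
unfiltered = {"New York": 12 / 74, "Seattle": 34 / 74, "Miami": 28 / 74}

g = hellinger(filtered, unfiltered)
d_half = alpha_divergence(filtered, unfiltered, 0.5)
print(abs(d_half - 2 * g ** 2) < 1e-12)  # True: D_1/2 equals 2 * G^2
```

  • Unlike the relative entropy, G gives the same value when its two arguments are swapped.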
  • Example: Hop-Chain View Discovery
  • Based on the data views, conditional views, and information measures described herein, a variety of user-guided and/or automated navigational tasks can be embodied by the present invention. For example, “drill-down paths” can be described as creating a series of views with projectors $I_1 \subseteq I_2 \subseteq I_3$ of increasingly specified dimensional structure. In practice, many analysts are challenged by complex views of high dimensionality, while still needing to explore many possible data interactions. Accordingly, embodiments of the present invention can restrict analysts to two-dimensional views only, producing a sequence of projectors $I_1, I_2, I_3$ where $|I_k| = 2$ and $|I_k \cap I_{k+1}| = 1$, thus effecting a permutation of the variables $X_i$.
  • An arbitrary permutation of the $i \in \mathbb{N}_N$ can be assumed so that one can refer to the dimensions $X_1, X_2, \ldots, X_N$ in order. The choice of the initial variables $X_1, X_2$ is a free parameter to the method, acting as a kind of “seed”.
  • The following point is critical. Consider a view $\mathcal{D}_{I,J}$ which is then filtered to include only records for a particular member $x_0^{i_0} \in X_{i_0}$ of a particular dimension $X_{i_0} \in \mathcal{X}$; in other words, let J′ be determined by the relational expression where $X_{i_0} = x_0^{i_0}$. Then in the new view $\mathcal{D}_{I,J'}$, $f^{J'}[I]$ is positive only on the fibers of the tensor X where $X_{i_0} = x_0^{i_0}$, and zero elsewhere. Thus the variable $X_{i_0}$ is effectively removed from the dimensionality of $\mathcal{D}_{I,J'}$, or rather, it is removed from the support of $\mathcal{D}_{I,J'}$.
  • Notationally, it can be said that $\mathcal{D}_{I,J'} = \mathcal{D}_{I \setminus \{i_0\},J'}$. Under the normal convention that 0·log(0)=0, the information measures H and G above are insensitive to the addition of zeros in the distribution. This allows for a comparison of the view $\mathcal{D}_{I,J'}$ to any other view of dimensionality $I \setminus \{i_0\}$.
  • This is illustrated in Table 3 through the continuing example, now with the filter where Location=“Seattle”. Although formally still an RPM Mfr×Location×Month cube, in fact this view lives in the RPM Mfr×Month plane, and so can be compared to the RPM Mfr×Month marginal.
  • TABLE 3
    Our example data tensor from Table 1 under
    the filter where Location = “Seattle”; M′ = 34
    | RPM Mfr | Location | Month | c(x) | f(x) |
    |---|---|---|---|---|
    | Ludlum | Seattle | Jan | 9 | 0.265 |
    |  |  | Apr | 15 | 0.441 |
    | SAIC |  | Feb | 4 | 0.118 |
    |  |  | Mar | 3 | 0.088 |
    |  |  | Apr | 3 | 0.088 |
  • Finally, some caution is necessary when the relative entropy $D(f^J[I] \,\|\, f^{J'}[I])$ or Hellinger distance $G(f^J[I], f^{J'}[I])$ is calculated from data, as their magnitudes between empirical distributions are strongly influenced by small sample sizes. To counter spurious effects, in preferred embodiments, each calculated value can be supplemented with the probability, under the null hypothesis that the row has the same distribution as the marginal, of observing an empirical value greater than or equal to the actual value. When that probability is large, say greater than 5%, the calculated value can be considered spurious and be set to zero before proceeding with the algorithm.
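  • The description does not say how this probability is obtained. One possible approach, shown here purely as an assumption, is a Monte Carlo estimate: draw samples of the row's size from the marginal, and count how often the simulated Hellinger distance reaches the observed one. The function name `null_p_value`, the trial count, and the fixed seed are all illustrative choices.

```python
import math
import random
from collections import Counter


def hellinger(p, q):
    """G(p, q) = sqrt(sum (sqrt p(x) - sqrt q(x))^2) over the union of slots."""
    keys = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys))


def null_p_value(row_counts, marginal, trials=1000, seed=0):
    """Estimate P(simulated distance >= observed distance) under the null.

    Assumed null model: the row is a sample of its own size from `marginal`.
    """
    rng = random.Random(seed)
    n = sum(row_counts.values())
    if n == 0:
        raise ValueError("the row contains no records")
    observed = hellinger({k: c / n for k, c in row_counts.items()}, marginal)
    keys = list(marginal)
    weights = [marginal[k] for k in keys]
    hits = 0
    for _ in range(trials):
        sample = Counter(rng.choices(keys, weights=weights, k=n))
        simulated = hellinger({k: c / n for k, c in sample.items()}, marginal)
        if simulated >= observed:
            hits += 1
    return observed, hits / trials


marginal = {"Jan": 0.25, "Feb": 0.25, "Mar": 0.25, "Apr": 0.25}
distance, p_value = null_p_value({"Jan": 100}, marginal)
# A probability above 0.05 would mark the distance as spurious (set to zero).
significant_distance = distance if p_value <= 0.05 else 0.0
```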
  • In the instant example, a hop operation and a chain operation can be performed in alternating order (i.e., a hop-chain operation). One way of performing the hop-chain view discovery is described below.
  • 1. Set the initial filter to $J = \mathbb{N}_M$. Set the initial projector I={1,2}, determining the initial view $f^J[I]$ as just the initial $X_1 \times X_2$ grid.
  • 2. For each row $x_{k_1} \in X_1$, the distribution of that individual row is $f^{X_1 = x_{k_1}}[I]$, using the superscript to indicate the relational expression filter. Also, the marginal $f^J[I \setminus \{X_1\}]$ over all the rows for the current filter J is known. In light of the discussion just above, the Hellinger distances between each of the rows and this row marginal can be calculated as

  • $$G(f^{X_1 = x_{k_1}}[I], f^J[I \setminus \{X_1\}]) = G(f^{X_1 = x_{k_1}}[I \setminus \{X_1\}], f^J[I \setminus \{X_1\}]),$$
  • retaining the maximum row value $G_1 := \max_{x_{k_1} \in X_1} G(f^{X_1 = x_{k_1}}[I], f^J[I \setminus \{X_1\}])$. The same can be done dually for columns against the column marginal:

  • $$G(f^{X_2 = x_{k_2}}[I], f^J[I \setminus \{X_2\}]) = G(f^{X_2 = x_{k_2}}[I \setminus \{X_2\}], f^J[I \setminus \{X_2\}]),$$
  • retaining the maximum value $G_2 := \max_{x_{k_2} \in X_2} G(f^{X_2 = x_{k_2}}[I], f^J[I \setminus \{X_2\}])$.
  • 3. The user can be prompted to select either a row $x_0^1 \in X_1$ or a column $x_0^2 \in X_2$. Since $G_1$ (resp. $G_2$) represents the row (resp. column) with the largest distance from its marginal, selecting the global maximum $\max(G_1, G_2)$ might be most appropriate; or this can be selected automatically. Letting $x'_0$ be the selected value from the selected variable (row or column) $i' \in I$, J′ is set to where $X_{i'} = x'_0$, and this is placed in the background filter.
  • 4. Let $i'' \in I$ be the variable not selected by the user, so that $I = \{i', i''\}$.
  • 5. For each dimension $i''' \in \mathbb{N}_N \setminus I$, that is, for each dimension which is neither in the background filter $R_b = \{i'\}$ nor retained in the view through the projector $\{i''\}$, calculate the conditional entropy of the retained view $f^{J'}[\{i''\}]$ against that variable: $H(f^{J'}[\{i''\}|\{i'''\}])$.
  • 6. The user is prompted to select a new variable $i''' \in \mathbb{N}_N \setminus I$ to add to the projector $\{i''\}$. Since
  • $$\operatorname*{argmin}_{i''' \in \mathbb{N}_N \setminus I} H(f^{J'}[\{i''\}|\{i'''\}])$$
  • represents the variable with the most constraint against $i''$, that may be the most appropriate selection, or it can be selected automatically.
  • 7. Let $I' = \{i'', i'''\}$. Note that I′ is a sibling to I in $B^N$, thus the name “hop-chaining”.
  • 8. Let I′,J′ be the new I,J and go to step 2.
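  • One automated pass through steps 2 to 7 can be sketched on the Table 1 data as follows. This is an illustration under stated assumptions, not the ProClarity/R implementation: the function `hop_chain_step`, the 0-based dimension indices, the base-2 logarithm, and the purely automatic selection of the maximum distance and minimum conditional entropy are all choices made for the sketch, and the significance check of the preceding paragraph is omitted.

```python
import math
from collections import defaultdict

# Table 1 counts, keyed by (RPM Mfr, Location, Month).
COUNTS = {
    ("Ludlum", "New York", "Jan"): 1, ("Ludlum", "New York", "Mar"): 3,
    ("Ludlum", "New York", "Apr"): 7, ("Ludlum", "Seattle", "Jan"): 9,
    ("Ludlum", "Seattle", "Apr"): 15, ("Ludlum", "Miami", "Jan"): 2,
    ("Ludlum", "Miami", "Feb"): 8, ("Ludlum", "Miami", "Mar"): 4,
    ("Ludlum", "Miami", "Apr"): 1, ("SAIC", "New York", "Jan"): 1,
    ("SAIC", "Seattle", "Feb"): 4, ("SAIC", "Seattle", "Mar"): 3,
    ("SAIC", "Seattle", "Apr"): 3, ("SAIC", "Miami", "Jan"): 6,
    ("SAIC", "Miami", "Feb"): 2, ("SAIC", "Miami", "Mar"): 4,
    ("SAIC", "Miami", "Apr"): 1,
}


def marginal(counts, dims):
    """Project counts onto `dims` and normalize to frequencies."""
    out = defaultdict(float)
    for slot, c in counts.items():
        out[tuple(slot[d] for d in dims)] += c
    total = sum(out.values())
    return {k: v / total for k, v in out.items()}


def hellinger(p, q):
    keys = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys))


def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def hop_chain_step(counts, projector, background, n_dims):
    """Return (filtered counts, new projector, new background filter)."""
    # Steps 2-3 (chain): find the row or column farthest from its marginal.
    best = None  # (distance, dimension, value)
    for dim in projector:
        other = next(d for d in projector if d != dim)
        reference = marginal(counts, (other,))
        for value in sorted({slot[dim] for slot in counts}):
            row = {s: c for s, c in counts.items() if s[dim] == value}
            dist = hellinger(marginal(row, (other,)), reference)
            if best is None or dist > best[0]:
                best = (dist, dim, value)
    _, chosen_dim, chosen_value = best
    filtered = {s: c for s, c in counts.items() if s[chosen_dim] == chosen_value}
    new_background = dict(background)
    new_background[chosen_dim] = chosen_value

    # Steps 4-6 (hop): keep the other variable, add the most constraining one.
    kept = next(d for d in projector if d != chosen_dim)
    candidates = [d for d in range(n_dims) if d not in projector and d not in new_background]
    if not candidates:
        raise ValueError("no dimension left to hop to")

    def conditional_entropy(d):
        return entropy(marginal(filtered, (kept, d))) - entropy(marginal(filtered, (d,)))

    hop = min(candidates, key=conditional_entropy)
    # Step 7: the new projector shares exactly one dimension with the old one.
    return filtered, (kept, hop), new_background


filtered, new_projector, new_background = hop_chain_step(COUNTS, (0, 1), {}, n_dims=3)
print(new_projector, new_background)
```

  • Calling the function again with its own outputs corresponds to step 8; with only three dimensions in Table 1, a second pass finds no dimension left to hop to and raises an error, which is one natural stopping point.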
  • Keeping in mind the arbitrary permutation of the $X_i$, the repeated result of applying this method is a sequence of hop-chaining steps in the view lattice, building up an increasing background filter:

  • 1. $I = \{1,2\}$, $J = \mathbb{N}_M$

  • 2. $I' = \{2,3\}$, J′ = where $X_1 = x_0^1$

  • 3. $I'' = \{3,4\}$, J″ = where $X_1 = x_0^1$, $X_2 = x_0^2$

  • 4. $I''' = \{4,5\}$, J‴ = where $X_1 = x_0^1$, $X_2 = x_0^2$, $X_3 = x_0^3$
  • In a particular example of the hop-chain operation, ProClarity® is used in conjunction with SQL Server Analysis Services (SSAS) 2005 and the R statistical platform v. 2.7 (see http://www.r-project.org). ProClarity® is a visual analytics tool that provides a flexible and friendly GUI environment with extensive API support which is used to gather current display contents and query context for row, column and background filter selections. R is currently used in either batch or interactive mode for statistical analysis and development. Microsoft Visual Studio .Net 2005® is used to develop plug-ins to ProClarity® to pass ProClarity® views to R for hop-chain calculations.
  • A first view of the data set used in the instant example is shown in FIG. 3, which is a screenshot from the ProClarity® tool. The database is a collection of 1.9M records of RPM events. The 15 available dimensions are shown on the left of the screen (e.g. “day of the month”, “RPM hierarchy”), tracking such things as the identities and characteristics of particular RPMs, time information about events, and information about the hardware, firmware, and software used at different RPMs.
  • For purposes of this description, only a single step for the hop-chaining procedure against the alarm summary data cube is shown.
  • FIG. 3 shows the two-dimensional projection of the X1=“RPM Role”×X2=“Month” dimensions within the 15-dimensional overall cube, drilled down to the first level of the hierarchies. Its plot shows the distributions of count c of alarms by RPM role (Busses Primary, Cargo Secondary, etc.) X1, while FIG. 4 shows the distribution by Month X2.
  • The distributions for roles seem to vary at most by overall magnitude, rather than shape, while the distributions for months appear almost identical. However, FIG. 5 and FIG. 6 show the same distributions, but now in terms of their frequencies f relative to their corresponding marginals, allowing a comparison of the shapes of the distributions normalized by their absolute sizes. While the months still seem identical, the RPM roles are clearly different, although it is difficult to discern which one is most unusual with respect to the marginal (bold line).
  • FIG. 7a shows the Hellinger distances $G(f^{X_i = x_{k_i}}[I], f^J[I \setminus \{X_i\}])$ for $i \in \{1,2\}$ for each row or column against its marginal. The RPM roles “ECCF” and “Mail” are clearly the most significant, which can be verified by examining the anomalously shaped plots in FIG. 5. The most significant month is December, although this is hardly evident in FIG. 6. The maximal row-wise Hellinger value, $G_1 = 0.011$, is selected for ECCF so that $i' = 1$, $x_0^1$ = ECCF. $X_{i'} = X_1$ = “RPM Role” is added to the background filter, $X_{i''} = X_2$ = Months is retained in the view, and $H(f^{J'}[\{2\}|\{i'''\}])$ is calculated for all $i''' \in \{3, 4, \ldots, 15\}$, which are shown in FIG. 7b for all significant dimensions. On that basis, $X_3$ is selected as Day of Month with minimal H=3.22.
  • The subsequent view for X2=Months×X3=Day of Month is then shown in FIG. 8. Note the strikingly divergent plot for April: it in fact does have the highest Hellinger distance at 0.07, an aspect which is completely invisible from the overall initial view, e.g. in FIG. 5.
  • While a number of embodiments of the present invention have been shown and described, it will be apparent to those skilled in the art that many changes and modifications may be made without departing from the invention in its broader aspects.

Claims (16)

1. A computer-implemented method for discovering portions of a multi-dimensional database that are significant to an analyst, wherein the multi-dimensional database comprises a plurality of records with dimensions and is stored on a memory device, the method characterized by the steps of:
Specifying a data view comprising at least two dimensions and all records of the database;
Performing a plurality of operation iterations on the data view, wherein each iteration is a chain operation, a hop operation, or an anti-hop operation;
Ceasing said operation iterations upon satisfaction of a termination criteria; and
Presenting to the analyst the data view resulting from said performing;
Wherein the chain operation comprises the steps of:
Calculating a chain statistical significance measure for each value of each of the dimensions in the data view;
Selecting one or more chain values for a dimension in the view;
Adding the chain values to a filter;
Removing the dimension of the chain values from the view;
Wherein the hop operation comprises the steps of:
Calculating a hop statistical significance measure, relative to the dimension(s) in the view and constrained by the filter, for each of the dimensions that is neither in the view nor in the filter;
Selecting a hop dimension from the dimensions that are not in the view or in the filter;
Adding the hop dimension to the data view; and
Wherein the anti-hop operation comprises the steps of:
Calculating an anti-hop statistical significance measure relative to other dimensions in the view and constrained by the filter, for each of the dimensions in the view;
Selecting an anti-hop dimension from the dimensions in the view; and
Removing the anti-hop dimension from the view.
2. The method of claim 1, wherein the chain statistical significance measure is a Hellinger distance.
3. The method of claim 1, wherein the chain statistical significance measure is a Hellinger distance augmented by p-value significance.
4. The method of claim 1, wherein the chain statistical significance measure is a relative entropy.
5. The method of claim 1, wherein the chain statistical significance measure is a generalized alpha divergence.
6. The method of claim 1, wherein the hop statistical significance measure is a conditional entropy measure.
7. The method of claim 1, wherein the hop statistical significance measure is a model likelihood metric.
8. The method of claim 1, wherein said selecting one or more chain values for a dimension in the view occurs automatically based on the values having maximal chain statistical significance measures.
9. The method of claim 1, wherein said selecting a hop dimension occurs automatically based on the dimensions having minimal hop statistical significance measures.
10. The method of claim 1, wherein said selecting one or more chain values, said selecting a hop dimension, or both occur manually based on input from an analyst.
11. The method of claim 1, wherein the termination criteria is a command from an analyst, a uniform distribution of all remaining records across all remaining dimensions, a lack of remaining dimensions, or a lack of remaining records.
12. The method of claim 1, further comprising performing hop and chain operations in alternating order.
13. The method of claim 1, wherein the data view is initially populated with dimensions arbitrarily.
14. The method of claim 1, prior to said performing, further comprising creating an empty filter and arbitrarily populating the empty filter with values for a dimension.
15. The method of claim 1, wherein the data view comprises two dimensions.
16. The method of claim 1, wherein the data view comprises three dimensions.
US12/775,125 2009-11-18 2010-05-06 Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database Abandoned US20110119281A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/775,125 US20110119281A1 (en) 2009-11-18 2010-05-06 Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US26240309P 2009-11-18 2009-11-18
US12/775,125 US20110119281A1 (en) 2009-11-18 2010-05-06 Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database

Publications (1)

Publication Number Publication Date
US20110119281A1 true US20110119281A1 (en) 2011-05-19

Family

ID=44012105


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150026207A1 (en) * 2013-07-22 2015-01-22 International Business Machines Corporation Managing sparsity in an multidimensional data structure

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633882B1 (en) * 2000-06-29 2003-10-14 Microsoft Corporation Multi-dimensional database record compression utilizing optimized cluster models
US20050262057A1 (en) * 2004-05-24 2005-11-24 Lesh Neal B Intelligent data summarization and visualization
US20070005566A1 (en) * 2005-06-27 2007-01-04 Make Sence, Inc. Knowledge Correlation Search Engine
US20070136256A1 (en) * 2005-12-01 2007-06-14 Shyam Kapur Method and apparatus for representing text using search engine, document collection, and hierarchal taxonomy
US20080195654A1 (en) * 2001-08-20 2008-08-14 Microsoft Corporation System and methods for providing adaptive media property classification
US20100121868A1 (en) * 2008-11-07 2010-05-13 Yann Le Biannic Converting a database query to a multi-dimensional expression query
US20100254573A1 (en) * 2009-04-07 2010-10-07 Centre National De La Recherche Scientifique Method for measuring the dissimilarity between a first and a second images and a first and second video sequences
US20100332474A1 (en) * 2009-06-25 2010-12-30 University Of Tennessee Research Foundation Method and apparatus for predicting object properties and events using similarity-based information retrieval and model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150026207A1 (en) * 2013-07-22 2015-01-22 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US20150026116A1 (en) * 2013-07-22 2015-01-22 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US10169406B2 (en) * 2013-07-22 2019-01-01 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US10275484B2 (en) * 2013-07-22 2019-04-30 International Business Machines Corporation Managing sparsity in a multidimensional data structure

Similar Documents

Publication Publication Date Title
US8234298B2 (en) System and method for determining driving factor in a data cube
US20040181519A1 (en) Method for generating multidimensional summary reports from multidimensional data
US20130207980A1 (en) Visualization of data clusters
CA2614060A1 (en) System and method for analyzing data in a report
Yang et al. A modified clustering method based on self-organizing maps and its applications
Bruzzese et al. DESPOTA: DEndrogram slicing through a permutation test approach
US20110119281A1 (en) Methods for Discovering Analyst-Significant Portions of a Multi-Dimensional Database
Bhatnagar et al. An efficient map-reduce algorithm for computing formal concepts from binary data
Samson et al. Spatial databases: An overview
Bimonte et al. From volunteered geographic information to volunteered geographic OLAP: A VGI data quality-based approach
Stoica Business intelligence and OLAP
US20180322435A1 (en) Performance & predictive dimensions for business intelligence data
Zhang et al. Selectivity estimation for relation-tree joins
Mohamed et al. Optimization challenge in decision supporting systems: An overview
Minartz et al. Multivariate correlations discovery in static and streaming data
Frentzos et al. On the effect of location uncertainty in spatial querying
De Lima et al. Graph-based relational data visualization
Joslyn et al. View discovery in OLAP databases through statistical combinatorial optimization
Abdullahi Banded Pattern Mining For N-Dimensional Zero-One Data
Djenouri et al. Organizing association rules with meta-rules using knowledge clustering
Baltzer et al. OLAP for trajectories
Dau et al. Combining business intelligence with semantic technologies: the CUBIST project
Kumar Scalable map-reduce algorithms for mining formal concepts and graph substructures
Vinh et al. Incremental spatial clustering in data mining using genetic algorithm and R-tree
CN114116757B (en) Data processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: BATTELLE MEMORIAL INSTITUTE, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JOSLYN, CLIFF A;BURKE, JOHN S.;CRITCHLOW, TERENCE J.;AND OTHERS;SIGNING DATES FROM 20100423 TO 20100503;REEL/FRAME:024347/0682

AS Assignment

Owner name: U.S. DEPARTMENT OF ENERGY, DISTRICT OF COLUMBIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:BATTELLE MEMORIAL INSTITUTE, PACIFIC NORTHWEST DIVISION;REEL/FRAME:024586/0031

Effective date: 20100526

AS Assignment

Owner name: U.S. DEPARTMENT OF ENERGY, DISTRICT OF COLUMBIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:LOS ALAMOS NATIONAL SECURITY;REEL/FRAME:025745/0097

Effective date: 20110111

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION

AS Assignment

Owner name: TRIAD NATIONAL SECURITY, LLC, NEW MEXICO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LOS ALAMOS NATIONAL SECURITY, LLC;REEL/FRAME:047447/0201

Effective date: 20181031