WO1993007585A1 - Method for determining sensor motion and scene structure and image processing system therefor - Google Patents

Method for determining sensor motion and scene structure and image processing system therefor

Info

Publication number
WO1993007585A1
WO1993007585A1 (PCT/US1992/008214, US9208214W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
models
scene
motion
model
Prior art date
Application number
PCT/US1992/008214
Other languages
French (fr)
Inventor
Keith James Hanna
Original Assignee
David Sarnoff Research Center, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by David Sarnoff Research Center, Inc. filed Critical David Sarnoff Research Center, Inc.
Priority to EP92921849A priority Critical patent/EP0606385B1/en
Priority to DE69231812T priority patent/DE69231812T2/en
Publication of WO1993007585A1 publication Critical patent/WO1993007585A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/12 Details of acquisition arrangements; Constructional details thereof
    • G06V 10/14 Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V 10/147 Details of sensors, e.g. sensor lenses
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/527 Global motion vector estimation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/537 Motion estimation other than block-based
    • H04N 19/543 Motion estimation other than block-based using regions

Definitions

  • the invention is a method for determining the motion of an image sensor through a scene and the structure of the scene from two or more images of the scene.
  • the invention is also a system for determining the motion of the image sensor in the scene and the structure of the scene.
  • a well known technique for locating a single moving object (undergoing coherent motion), contained in each of successive frames of a motion picture of an imaged scene is to subtract the level value of each of the spatially corresponding image data pixels in one of two successive image frames from the other to remove those pixels defining stationary objects in the given scene and leave only those pixels defining the single moving object in the given scene in the difference image data. Further, by knowing the frame rate and the displacement of corresponding pixels of the single moving object in the difference image data, the velocity of the single moving object can be computed.
  • the image data of the successive frames define two motions, for example a background region which moves with a certain global velocity pattern in accordance with the movement (e.g., translation, rotation and zoom) of the camera recording the scene, the problem is more difficult.
  • a scene region occupied by a foreground object that is locally moving with respect to the background region will move in the motion picture with a velocity which is a function of both its own velocity with respect to the background region and the global velocity pattern of the background region itself.
  • the global velocity pattern due to motion of the image sensor can be very complex since it depends upon the structure of the scene.
  • a problem is to employ, in real time, the image data in the series of successive frames of the motion picture to (1) measure and remove the effects (including those due to parallax) of the global motion and (2) detect and then track the locally-moving foreground object to the exclusion of this global motion.
  • a conventional general image-motion analysis technique is to compute a separate displacement vector for each image pixel of each frame of a video sequence. This is a challenging task, because it requires pattern matching between frames in which each pixel can move differently from one another.
  • the closer the size of each local-analysis window approaches that occupied by a single pixel (i.e., the greater the segmentation), the closer this assumption is to the truth
  • in practice, the size of each local-analysis window is substantially larger than that occupied by a single image pixel, so that the computed single translational-motion velocity of a local-analysis window is actually an average velocity of all the image pixels within that window
  • this segmentation approach is artificial in that the periphery of a locally moving imaged object in each successive frame is unrelated to the respective boundaries of those local-analysis windows it occupies in that frame. If it happens to occupy the entire area of a particular window, the computed single translational-motion velocity for that window will be correct.
  • the motion of an image sensor moving through an environment provides useful information for tasks like moving-obstacle detection and navigation.
  • for moving-obstacle detection, local inconsistencies in the image sensor motion model can pinpoint some potential obstacles.
  • for navigation, the image sensor motion can be used to estimate the surface orientation of an approaching object like a road or a wall.
  • prior art techniques have recovered image sensor motion and scene structure by fitting models of the image sensor motion and scene depth to a pre-determined flow-field between two images of a scene.
  • there are many techniques for computing a flow-field, and each technique aims to recover corresponding points in the images.
  • the problem of flow-field recovery is not fully constrained, so that the computed flow-fields are not accurate. As a result, the subsequent estimates of image sensor motion and three-dimensional structure are also inaccurate.
  • one approach to recovering image sensor motion is to fit a global image sensor motion model to a flow field computed from an image pair.
  • an image sensor motion recovery scheme that used both image flow information and local image gradient information has been proposed. The contribution of each flow vector to the image sensor motion model was weighted by the local image gradient to reduce errors in the recovered image sensor motion estimate that can arise from local ambiguities in image flow from the aperture problem.
  • the invention is a method for accurately determining the motion of an image sensor through a scene using local scene characteristics such as the brightness derivatives of an image pair.
  • a global image sensor motion constraint is combined with a local scene characteristic constancy constraint to relate local surface structures with the global image sensor motion model and local scene characteristics.
  • the method for determining a model for image sensor motion through a scene and a scene-structure model of the scene from two or more images of the scene at a given image resolution comprises the steps of:
  • step (d) determining a new value of the second of said models using the estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
  • the invention is also an image processing system for determining the image sensor motion and structure of a scene
  • image sensor means for obtaining more than one image of a scene
  • means for refining the local scene models and the image sensor motion model iteratively; means for warping the first image towards the second image using the current, refined estimates of the local scene models and image sensor motion model.
  • FIG. 1 diagrammatically illustrates the segmentation of a frame area into local-analysis windows employed by the prior-art "majority-motion" approach
  • Fig. 2 is a block diagram of a prior-art feedback loop for implementing the prior-art "majority-motion" approach
  • Fig. 3 is a block diagram of a feedback loop for implementing the invention.
  • Fig. 4a shows an image pair which have been synthesized and resampled from a known depth map and known image sensor motion parameters
  • Fig. 4b shows the difference image between the original image pair
  • Fig. 4c shows the absolute value of the percentage difference between the recovered depth map and the actual depth map
  • Fig. 4d shows an image of the local surface parameters (inverse depths) such that bright points are nearer the image sensor than dark points;
  • Fig. 4e shows the recovered image sensor motion at each resolution and also the actual image sensor motion
  • Fig. 5a - e show the results for a second image of a natural image pair where the image center has been estimated, and where the precise image sensor motion is unknown;
  • Fig. 6a - e show the results for a road sequence where there is less image texture
  • Fig. 7a - e show the results for another road sequence.
  • Figs. 1 and 2 illustrate a prior art approach to motion detection which will be helpful in understanding the present invention.
  • a moving image sensor (e.g., a video camera)
  • the camera produces a sequence of image frames of the ground area at a relatively high rate
  • the frame area 100 of each of the successive image frames is divided into a majority region, which is moving at a global velocity determined by the coherent motion of the aircraft, and a minority region occupied by a locally-moving automobile
  • the frame-area 100 of each of a pair of successive frames, excluding border-area 102 thereof, is divided into an array of sub-area windows 104-11 ... 104-mn, and the local velocity (designated in Fig. 1 by its vector) for each of these sub-area windows is computed. This may be done by displacing the image data in each sub-area window of one of the pair of successive frames with respect to the image data in its corresponding sub-area window of the other of the pair of successive frames to provide a match therebetween.
  • border-area 102 is excluded in order to avoid boundary problems.
  • the image data included in a sub-area window of a frame may overlap to some extent the image data included in an adjacent sub-area window of that frame. In any event, the size of each sub-area window is large compared to the maximum displacement of image data between a pair of successive frames.
  • the average velocity of all the local velocities is calculated and the size of the difference error between each local velocity and this average velocity determined.
  • in general, these errors will be small and result from such effects as parallax and the fact that the ground viewed by the moving camera is not flat.
  • the error for those two sub-area windows which include locally-moving automobile 101 is quite large, because the computed velocities therefor include both the global velocity of the moving camera on the aircraft and the local velocity of the automobile moving on the ground. Therefore, the two sub-area windows which include locally-moving automobile 101 are excluded by the fact that their respective errors exceed a given threshold, and the average velocity is then recomputed from only the remaining sub-area windows.
  • this recomputed average velocity constitutes an initial estimate of the global velocity of the motion picture due to the movement of the camera. Because only an initial estimate of the global velocity is being derived, the image data of each of the sub-area windows 104-11....104-mn employed for its computation is preferably of relatively low resolution in order to facilitate the required matching of the image data in each of the large number of corresponding sub-area windows 104-11....104-mn of the pair of successive frames.
  • in Fig. 2 a feedback loop for carrying out the prior-art approach is shown in generalized form.
  • the feedback loop comprises motion model 200 (that is derived in whole or at least in part by the operation of the feedback loop), residual motion estimator 202, summer 204, image warper 206, frame delays 208 and 210, and image data from a current frame and from a previous frame that has been shifted by image warper 206.
  • Residual motion estimator 202 in response to image data from the current frame and from the previous shifted frame applied as inputs thereto, derives a current residual estimate, which is added to the previous estimate output from motion model 200 by summer 204 and then applied as a warp control input to image warper 206.
  • Current-frame image data after being delayed by frame delay 208, is applied as an input to image warper 206.
  • Image warper 206 shifts the frame-delayed current-frame image data in accordance with its warp-control input, and then frame-delays the output therefrom by frame delay 210 to derive the next previous shifted frame.
  • the feedback loop of Fig. 2 performs an iterative process to refine the initial estimate of the global velocity to the point that substantially all of that portion of the respective computed sub-area window velocities of the minority region due to global velocity is eliminated. This iterative process derives the respective local residual velocities of the sub-area windows 104-11....104-mn of each consecutively-occurring pair of successive frames, and then uses each of these residual velocities to derive a current estimate of the residual global velocity.
  • the respective local velocities of each pair of successive frames are computed and a current estimate of residual global velocity is made during each cycle of the iterative process as described above, after the previous estimate of global velocity has, in effect, been subtracted out.
  • in the case of the first cycle, the previous estimate of global velocity is zero, since no previous estimate of global velocity has been made. Therefore, in this case, the residual velocity itself constitutes the initial estimate of the global velocity discussed above.
  • it is preferable that residual motion estimator 202 employ image data of the lowest resolution during the first cycle of the iterative process, and during each successive cycle employ higher-resolution image data than was employed during the immediately preceding cycle, in order to minimize the required precision for the matching of the image data in each successive cycle.
  • Residual motion estimator 202 may comprise hardware and/or software.
  • several alternative implementation species of residual motion estimator 202 are disclosed in the aforesaid Burt et al. articles. Each of these species provides effective division of the computational burden between general and special purpose computing elements.
  • the first step, ascertaining local motion within the respective sub-area windows, is ideally suited for implementation within custom hardware. Data rates are high because the analysis is based on real-time video-rate image data, but processing is simple and uniform because only local translations need be estimated.
  • the second step, in which a global model must be fit to the entire set of local-motion vectors of all the sub-area windows, is well suited for software implementation in a microprocessor because the computations are relatively complex and global, but the local-motion vector data set is relatively small.
  • the adjustment of the image-data resolution preferably employed in the different cycles of the iteration process can be efficiently performed by Laplacian and Gaussian pyramid techniques known in the image-processing art as shown for example by Anderson et al in U.S. Patent No. 4,692,806 and by van der Wal in U.S. Patent No. 4,703,514.
  • Burt et al. also describe an improvement of the "majority-motion" approach which employs a foveation technique where, after each cycle of the above-described iterative process has been completed, only the minority portion of the entire analysis area that has been determined during that cycle not to define the global motion (i.e., automobile 101 is contained within this minority portion) is employed as the entire analysis region during the next cycle of the iterative process. Further, the size of each of the sub-area windows is decreased during each successive cycle, so that the smaller analysis area during each successive cycle can still be divided into the same number of sub-area windows.
  • a global image sensor motion constraint is combined with the local scene characteristic constraint to relate local surface models with the global image sensor motion model and local scene characteristics.
  • the local surface models are first refined using the image sensor motion as a constraint, and then the image sensor motion model is refined using the local surface models as constraints.
  • the estimates of image sensor motion and scene-structure at a given resolution are refined by an iterative process to obtain increasingly more accurate estimates of image sensor motion and scene-structure; i.e., there is a "ping-pong" action between the local models of scene characteristics and the global image sensor model, with successive warps of the images to bring them into acceptable congruence with one another.
  • the refinement process starts with initial image sensor and local scene structure models, estimated from a previous frame or from any other source of an a priori estimate. This iterative process is then repeated at successively higher resolutions until an acceptable accuracy is obtained. Specifically, the models are fitted to an image pair represented at a coarse resolution, and the resultant models are then refined using the same fitting procedure at the next finer resolution.
  • Image flow is bypassed as the intermediary between local scene characteristic changes and the global image sensor motion constraint.
  • the local scene characteristic constancy constraint is combined with the image sensor motion constraint to relate local-planar or local-constant-depth models with an image sensor motion model and local scene characteristic derivatives.
  • a local-planar model assumes that the scene locally has the shape of a flat planar surface, such as a wall.
  • a local constant-depth surface model is a special case of a local-planar model. It assumes that the flat planar surface is oriented parallel to the surface of the sensor.
  • the local surface models are refined using the global image sensor motion model as a constraint.
  • the global image sensor motion model is then refined using the local surface models as constraints.
  • RᵀP = 1 (Eq. 4)
  • where R = (X, Y, Z)ᵀ is a point in world coordinates and P = (a, b, c)ᵀ defines the orientation and depth of the plane.
  • the error in this equation is used to refine both the local surface models and the global image sensor motion model.
  • the least squared error in Eq. 8 is minimized with respect to the local surface parameters over each local region.
  • the least squares error is then minimized with respect to the image sensor motion parameters over all the local regions.
  • the error is quadratic in Ω but non-quadratic in T, and a non-linear minimization technique is required.
  • the Gauss-Newton minimization is done using Ω₀ and T₀ as initial starting values. It is to be understood that other minimization techniques can also be used. If initial estimates of Ω₀ and T₀ are not available, for example from a previous frame in a sequence, trial translation values are inserted into Eq. 14, Eq. 14 is solved for Ω − Ω₀ (in closed form, since Eq. 14 is quadratic in Ω − Ω₀), and the T and Ω − Ω₀ that give the lowest error in Eq. 14 are chosen as the initial estimates. A computational sketch of the local refinement step follows.
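As a concrete illustration of the local refinement step, the closed-form update of Eq. 13 can be written in a few lines. This is a minimal sketch, not the patented implementation; the array shapes and the per-pixel 2-vector arguments (K·T, K·T₀ and the rotational flows A·Ω, A·Ω₀) are assumptions of this example.

```python
import numpy as np

def refine_local_depth(Ix, Iy, It, KT, A_Om, KT0, A_Om0, c0, eps=1e-6):
    """Closed-form least-squares update of the constant-depth parameter c
    over one local region (Eq. 13), holding the global motion estimate
    fixed.  Ix, Iy, It are per-pixel brightness derivatives over the
    region; KT, A_Om (and KT0, A_Om0) hold the per-pixel 2-vectors K*T
    and A*Omega for the current (and previous) motion estimates."""
    g = Ix * KT[..., 0] + Iy * KT[..., 1]                # grad(I) . (K T)
    r = (Ix * A_Om[..., 0] + Iy * A_Om[..., 1]           # grad(I) . (A Om)
         - (Ix * KT0[..., 0] + Iy * KT0[..., 1]) * c0    # - grad(I).(K T0) c0
         - (Ix * A_Om0[..., 0] + Iy * A_Om0[..., 1])     # - grad(I).(A Om0)
         + It)
    denom = np.sum(g * g)
    if denom < eps:      # local edge aligned with the focus of expansion:
        return c0        # leave the local parameter unrefined
    return -np.sum(g * r) / denom
```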
  • the invention is a method for determining a model for image sensor motion through a scene and a scene-structure model of the scene from two or more images of the scene at a given image resolution comprising the steps of:
  • step (c) resetting the initial estimates of the local scene models and the image sensor motion model using the new value of the one of said models determined in step (b);
  • step (d) determining a new value of the second of said models using the estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
  • the invention is also an image processing system for determining the image sensor motion and structure of a scene
  • image sensor means for obtaining more than one image of a scene
  • means for warping the first image towards the second image using the current estimates of the local scene models and image sensor motion model at a first image resolution; means for refining all local scene models and refining the image sensor motion model by performing one minimization step; and iteration means for repeating steps (b) and (c) several times.
  • the global image sensor motion constraint constrains the refinement of the surface parameter locally.
  • the local constraint provided by local image structures constrains the refinement of the global image sensor motion parameters.
  • the image sensor motion constraint and the local image brightness derivatives are used to refine each local surface parameter c.
  • du₀ = K T₀ (c − c₀) (16) where du₀ is the incremental motion introduced by an increment in the parameter c. Therefore, the increment in local motion is constrained to lie along a line in velocity space in the direction of vector K T₀ (the image sensor motion constraint line). The vector K T₀ points towards the current estimate of the focus-of-expansion of the image pair. Within a local region containing a single edge-like image structure, the brightness constraint equation constrains the motion to lie along a line in velocity space in the direction of the edge (perpendicular to ∇I).
  • the surface parameter c is refined such that the incremental motion introduced by the refinement lies at the intersection of the image sensor motion constraint line and the local brightness constraint line.
  • a local motion ambiguity arising from the aperture problem has been resolved using only the image sensor motion constraint.
  • local motion ambiguities cannot be resolved using the image sensor motion constraint when the image sensor motion constraint line and the local motion constraint line are parallel.
  • in this case, the denominator in Eq. 13 tends to zero.
  • the physical interpretation is that the local edge structure is aligned with the direction of the current estimate of the focus-of-expansion.
  • the local surface parameter cannot be refined reliably because the image sensor motion estimate adds little or no constraint to the local brightness constraint.
  • the local surface parameter is not refined if the denominator in Eq. 13 is below a threshold.
  • when both motion components can be resolved from local information, the local brightness constraint constrains the incremental motion to lie at a single point in velocity space.
  • the image sensor motion estimate constrains the incremental motion to lie along the image sensor motion constraint line in velocity space. If the point and line intersect in velocity space, then the incremental motion introduced by the refinement corresponds to the point in velocity space. If the point and line do not intersect, then the incremental motion lies between the line and the point in velocity space.
  • both motion components can be resolved from only local information, and both motion components contribute to the refinement of the global image sensor motion estimate. The intersection construction is restated compactly below.
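Restated for a single pixel, with the rotational terms omitted (a reconstruction from Eqs. 1 and 16, not verbatim from the patent):

```latex
\[
  \delta u_0 \;=\; K T_0\,(c - c_0) \tag{16}
\]
\[
  \nabla I^{\top}\,\delta u_0 + I_t \;=\; 0
  \quad\Longrightarrow\quad
  c \;=\; c_0 \;-\; \frac{I_t}{\nabla I^{\top} K T_0}
\]
```

The update is undefined when ∇Iᵀ K T₀ = 0, i.e. when the brightness constraint line (perpendicular to ∇I) is parallel to the sensor-motion constraint line (direction K T₀); the summed, squared form of this denominator is exactly the quantity thresholded in Eq. 13.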
  • a feedback loop 300 for implementing the invention comprises an image sensor 302, such as a video camera, whose output is a sequence of images of a scene at a given resolution.
  • other types of image sensors include radar detectors, optical line sensors or other electromagnetic or sonic detectors or any other source of signals.
  • the images are alternately applied by switch 304 to a first pyramid processor 306 and to a frame delay 308 and then to a second pyramid processor 310.
  • such pyramid processors are known in the image-processing art as shown for example by Anderson et al in U.S. Patent No. 4,692,806 and by van der Wal in U.S. Patent No. 4,703,514.
  • the two pyramid processors have as their output images separated in time by the delay provided by the frame delay 308 and corresponding to the original images but at a resolution r which is typically less than that of the original image.
  • the time delayed image is applied through a warper 312 and then to the estimator 314. While the warper is shown operating on the time delayed image, it can equally operate on the other image.
  • the other image is applied directly to estimator 314.
  • in the estimator 314, in the first step, the error function for the mismatch between the actual image motion and the models of the image sensor motion and the local scene structure is minimized with respect to each local scene model, keeping the current estimate of the global image sensor motion constant.
  • estimator 314 provides as its output estimates of the global motion model and the local scene structure (local depth) models for the images.
  • the initiator 315 provides the initial constraints on the local scene structure and the global motion model to the estimator 314. This information may be embedded in the initiator 315 or may come from another sensor.
  • the outputs of the estimator 314 are new estimates of the global sensor motion and the local scene structure models
  • the image sensor motion method was tested on both natural and computer-rendered image sequences.
  • the motion in the image sequence ranges from about 4 to 8 pixels at the original resolution, so that analysis at only the original resolution will be inaccurate since the motion will be outside the range of the incremental motion estimator.
  • four resolutions are used.
  • results of the method are shown on computer-rendered images that have size 256 x 256 pixels, and also on natural images that have size 256 x 240 pixels.
  • a Laplacian pyramid was used to produce reduced-resolution images of size 128 x 128, 64 x 64 and 32 x 32 pixels for the computer-rendered images, and size 128 x 120, 64 x 60 and 32 x 30 pixels for the natural images.
  • the global image sensor model is fitted to each point in the image.
  • Fig. 4b shows the difference image between the original image pair.
  • Fig. 4c shows the absolute value of the percentage difference between the recovered depth map and the actual depth map.
  • Fig. 4c illustrates the percentage depth error measure, [100 (real depth − computed depth) / (real depth)], demonstrating that the errors in the recovered depth map in the foreground portion of the scene are fairly uniform, and actual measurement in a 180 x 100 window in the foreground gives an rms error of approximately 1%. In the background portion of the scene (just over the ridge) the error is much larger, and measurement in a 100 x 15 window gives an rms error of approximately 8%. This difference is explained by observing that in both regions the difference between the actual motion and the recovered motion is approximately 0.05 - 0.1 pixels, whereas the actual motion is approximately 4 - 8 pixels in the foreground, and approximately 1 pixel in the background. We expect such accuracy in the recovered motion in the foreground and background portions of the image since the image is heavily textured there, but there are large errors in the recovered depth and motion at the very top of the image where there is no texture at all.
  • Fig. 4d shows an image of the local surface parameters (inverse depths) such that bright points are nearer the image sensor than dark points.
  • the bottom portion of the image shows a surface sloping away from the "camera" towards a ridge at which point the depth changes rapidly.
  • the very top of the image shows the parameters recovered at a blank portion of the image where there is no texture.
  • Fig. 4e shows the recovered image sensor motion at each resolution and also the actual image sensor motion.
  • the estimate of the image sensor motion component at the final resolution is very close to the actual image sensor motion of the "camera" despite an occlusion boundary across the center of the image where the brightness constancy assumption is violated
  • the least squares minimization technique should be sensitive to measurement outliers that might be introduced by such deviations in the model. Similar robustness to measurement outliers has also been observed in other motion-fitting techniques that use the same incremental motion estimator (Eq. 1) within the same coarse-fine analysis framework.
  • Fig. 5a shows the second image of a natural image pair where the image center has been estimated, and where the precise image sensor motion is unknown.
  • the image motion in the foreground is approximately 5 pixels towards the image sensor.
  • Fig. 5b shows the inverse depth image recovered at the finest resolution. The recovered depths are plausible almost everywhere except at the image border and near the recovered focus of expansion (near the gate at the image center). The bright dot at the bottom right hand side of the inverse depth map corresponds to a leaf in the original image that is blowing across the ground towards the image sensor. We might expect such plausible results from a scene that is heavily textured almost everywhere.
  • Fig. 5c shows the computed image sensor motion at each resolution.
  • the initial image sensor motion estimate is close to the estimates recovered at the two finest scales, yet the recovered estimates are different at the two coarsest resolutions.
  • the minimization procedure followed a low-gradient, incorrect direction in the error surface that led to the incorrect estimates. While this shows how the estimation procedure can recover from following the incorrect minimization path, it also shows how the error surfaces can differ very slightly between resolutions due to differences introduced by image blurring.
  • Fig. 5e shows the recovered image sensor motion at each resolution and also the actual image sensor motion.
  • Fig. 6a shows the second image of a road sequence where there is less image texture.
  • the image motion in the foreground is approximately 9 pixels towards the image sensor.
  • Fig. 6b shows the inverse depth image recovered at the finest resolution.
  • the inverse depth parameters corresponding to the top portion of the image (sky) are clearly incorrect, and in fact the local surface parameters should probably not be refined in image regions containing such small image gradients, but for the same reason, such regions have minimal effect on the recovered image sensor motion estimate.
  • Fig. 6c shows that in this case, the recovered solutions remain close to the initial image sensor motion estimate at each resolution. We computed the focus of expansion to lie at the end of the visible portion of the road, at the road center.
  • Fig. 6e shows the recovered image sensor motion at each resolution and also the actual image sensor motion.
  • Fig. 7 presents the results for another road sequence.
  • the recovered solutions remain close to the initial image sensor motion estimate.
  • the inverse depth parameters corresponding to the top portion of the image (sky) are clearly incorrect, and in fact the local surface parameters should probably not be refined in image regions containing such small gradients, but for the same reason, such regions have minimal effect on the recovered image sensor motion estimate.
  • the warped depth approach is much more accurate than alternative approaches; because of multi-resolution analysis, the allowable range of motion over which the method works is much greater than in many alternative methods.
  • the warped depth method can be used in many applications where alternative methods cannot be used, and the method is efficient.
  • real-time implementation in hardware and software is relatively simple.
  • the image sensor motion and scene-structure models do not have to be fitted to the entire images; specific regions can be selected and processing is only performed in those regions.
  • the coarse-fine resolution refinement of image sensor motion and scene-structure estimates can be extended to include refinement over time; that is, the estimates can be refined over an image sequence rather than just an image pair.
  • this method can be applied to many problems that require estimation of scene-structure and/or image sensor motion from two or more image pairs. Applications include vehicle navigation, obstacle detection, depth recovery and image stabilization.
  • the method disclosed here made use of local image brightness and image brightness derivatives as the local scene characteristic or constraint. It is understood that other image characteristics can also be used in the method of the invention. It is also understood that methods other than pyramid processing for expanding the local scene characteristics to a higher resolution can be used.

Abstract

The invention is a method for determining the motion of an image sensor (302) through a scene directly from brightness derivatives of an image pair. A global image sensor motion constraint is combined with the local brightness constancy constraint to relate local surface models with the global image sensor motion model and local brightness derivatives. In an iterative process (314), the local surface models are refined using the image sensor motion as a constraint, and then the image sensor motion model is refined using the local surface models as constraints. The analysis is performed at multiple resolutions to enhance the speed of the process.

Description

METHOD FOR DETERMINING SENSOR MOTION AND SCENE STRUCTURE AND IMAGE PROCESSING SYSTEM THEREFOR
The invention is a method for determining the motion of an image sensor through a scene and the structure of the scene from two or more images of the scene. The invention is also a system for determining the motion of the image sensor in the scene and the structure of the scene.
BACKGROUND OF THE INVENTION
Techniques for recognizing pattern shapes of objects graphically represented in image data are known in the art. Further, techniques for discriminating between moving and stationary objects having a preselected angular orientation, or objects having any other predetermined feature of interest, are also known in the art.
A well known technique for locating a single moving object (undergoing coherent motion), contained in each of successive frames of a motion picture of an imaged scene, is to subtract the level value of each of the spatially corresponding image data pixels in one of two successive image frames from the other to remove those pixels defining stationary objects in the given scene and leave only those pixels defining the single moving object in the given scene in the difference image data. Further, by knowing the frame rate and the displacement of corresponding pixels of the single moving object in the difference image data, the velocity of the single moving object can be computed. However, when the image data of the successive frames define two motions, for example a background region which moves with a certain global velocity pattern in accordance with the movement (e.g., translation, rotation and zoom) of the camera recording the scene, the problem is more difficult. In this case, a scene region occupied by a foreground object that is locally moving with respect to the background region will move in the motion picture with a velocity which is a function of both its own velocity with respect to the background region and the global velocity pattern of the background region itself. The global velocity pattern due to motion of the image sensor can be very complex since it depends upon the structure of the scene.
A problem is to employ, in real time, the image data in the series of successive frames of the motion picture to (1) measure and remove the effects (including those due to parallax) of the global motion and (2) detect and then track the locally-moving foreground object to the exclusion of this global motion. A conventional general image-motion analysis technique is to compute a separate displacement vector for each image pixel of each frame of a video sequence. This is a challenging task, because it requires pattern matching between frames in which each pixel can move differently from one another. More recently, a "majority-motion" approach has been developed for solving the aforesaid problem in real time, as disclosed by Burt et al. in Proceedings of the Workshop on Visual Motion, Irvine, California, March, 1989 and in Pattern Recognition for Advanced Missile Systems Conference, Huntsville, November, 1988.
The specific approaches disclosed by Burt et al. rely on segmenting the image data contained in substantially the entire area of each frame into a large number of separate contiguous small local-analysis window areas. This segmentation is desirable to the extent that it permits the motion in each local-analysis window to be assumed to have only its own computed single translational-motion velocity. The closer the size of each local-analysis window approaches that occupied by a single pixel (i.e., the greater the segmentation), the closer this assumption is to the truth. However, in practice, the size of each local-analysis window is substantially larger than that occupied by a single image pixel, so that the computed single translational-motion velocity of a local-analysis window is actually an average velocity of all the image pixels within that window. This segmentation approach is artificial in that the periphery of a locally moving imaged object in each successive frame is unrelated to the respective boundaries of those local-analysis windows it occupies in that frame. If it happens to occupy the entire area of a particular window, the computed single translational-motion velocity for that window will be correct. However, if it happens to occupy only some unresolved part of a particular window, the computed single translational-motion velocity for that window will be incorrect. Nevertheless, despite its problems, the "majority-motion" and other approaches employing segmentation disclosed by Burt et al. are useful in certain dynamic two-motion image analysis, such as in removing the effects of the global motion so that a locally-moving foreground object can be detected and then tracked to the exclusion of this global motion.
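To make the per-window computation concrete, the following is a minimal sketch of block-matching translation estimation for one local-analysis window; it is an illustration under stated assumptions (grayscale frames as NumPy arrays, windows placed inside the excluded border area), not the disclosed implementation.

```python
import numpy as np

def window_velocity(prev, curr, y0, x0, size=32, max_disp=4):
    """Estimate the single translational velocity of one local-analysis
    window by exhaustive SSD block matching between two successive
    frames.  The result is effectively an average velocity of all the
    pixels inside the window.  Assumes the window (plus max_disp) lies
    inside the frame, consistent with excluding the border area."""
    ref = prev[y0:y0 + size, x0:x0 + size].astype(float)
    best_ssd, best_disp = np.inf, (0, 0)
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            cand = curr[y0 + dy:y0 + dy + size,
                        x0 + dx:x0 + dx + size].astype(float)
            ssd = np.sum((ref - cand) ** 2)
            if ssd < best_ssd:
                best_ssd, best_disp = ssd, (dy, dx)
    return best_disp  # displacement per frame interval, i.e. velocity
```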
For many problems in computer vision, it is important to determine the motion of an image sensor using two or more images recorded from different viewpoints or recorded at different times. The motion of an image sensor moving through an environment provides useful information for tasks like moving-obstacle detection and navigation. For moving-obstacle detection, local inconsistencies in the image sensor motion model can pinpoint some potential obstacles. For navigation, the image sensor motion can be used to estimate the surface orientation of an approaching object like a road or a wall.
Prior art techniques have recovered image sensor motion and scene structure by fitting models of the image sensor motion and scene depth to a pre-determined flow-field between two images of a scene. There are many techniques for computing a flow-field, and each technique aims to recover corresponding points in the images. The problem of flow-field recovery is not fully constrained, so that the computed flow-fields are not accurate. As a result, the subsequent estimates of image sensor motion and three-dimensional structure are also inaccurate.
One approach to recovering image sensor motion is to fit a global image sensor motion model to a flow field computed from an image pair. An image sensor motion recovery scheme that used both image flow information and local image gradient information has been proposed. The contribution of each flow vector to the image sensor motion model was weighted by the local image gradient to reduce errors in the recovered image sensor motion estimate that can arise from local ambiguities in image flow from the aperture problem.
There is, however, a need in the art for a method and apparatus to accurately determine the motion of an image sensor when the motion in the scene, relative to the image sensor, is non-uniform. There is also a need in the art for a method and apparatus to accurately determine the structure of the scene from images provided by the imaging system. A system possessing these two capabilities can then automatically navigate itself through an environment containing obstacles.
SUMMARY OF THE INVENTION
The invention is a method for accurately determining the motion of an image sensor through a scene using local scene characteristics such as the brightness derivatives of an image pair. A global image sensor motion constraint is combined with a local scene characteristic constancy constraint to relate local surface structures with the global image sensor motion model and local scene characteristics. The method for determining a model for image sensor motion through a scene and a scene-structure model of the scene from two or more images of the scene at a given image resolution comprises the steps of:
(a) setting initial estimates of local scene models and an image sensor motion model;
(b) determining a new value of one of said models by minimizing the difference between the measured error in the images and the error predicted by the model;
(c) resetting the initial estimates of the local scene models and the image sensor motion model using the new value of the one of said models determined in step (b);
(d) determining a new value of the second of said models using the estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
(e) warping one of the images towards the other image using the current estimates of the models at the given image resolution; and
(f) repeating steps (b), (c), (d) and (e) until the differences between the new values of the models and the values determined in the previous iteration are less than a certain value or until a fixed number of iterations have occurred. A schematic sketch of this iteration is given below.
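Steps (a) through (f) amount to an alternation loop between the two model refinements, with a warp in between. The following is a minimal schematic sketch; refine_local, refine_global and warp are hypothetical callables standing in for the minimizations and warping the steps describe, not names from the patent.

```python
import numpy as np

def fit_models(img1, img2, local_models, motion_model,
               refine_local, refine_global, warp,
               max_iters=10, tol=1e-3):
    """Schematic of steps (a)-(f) at one image resolution."""
    warped = warp(img1, local_models, motion_model)        # initial warp
    for _ in range(max_iters):                             # step (f)
        # steps (b)-(c): refine one model and reset the estimates
        local_models = refine_local(warped, img2, local_models, motion_model)
        # step (d): refine the second model using the updated estimates
        new_motion = refine_global(warped, img2, local_models, motion_model)
        delta = np.linalg.norm(np.asarray(new_motion) -
                               np.asarray(motion_model))
        motion_model = new_motion
        # step (e): warp one image towards the other with current models
        warped = warp(img1, local_models, motion_model)
        if delta < tol:                                    # step (f) test
            break
    return local_models, motion_model
```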
The invention is also an image processing system for determining the image sensor motion and structure of a scene comprising image sensor means for obtaining more than one image of a scene; means for setting the initial estimate of a local scene model and an image sensor motion model at a first image resolution; means for refining the local scene models and the image sensor motion model iteratively; and means for warping the first image towards the second image using the current, refined estimates of the local scene models and image sensor motion model.
BRIEF DESCRIPTION OF THE DRAWING
Fig. 1 diagrammatically illustrates the segmentation of a frame area into local-analysis windows employed by the prior-art "majority-motion" approach; Fig. 2 is a block diagram of a prior-art feedback loop for implementing the prior-art "majority-motion" approach;
Fig. 3 is a block diagram of a feedback loop for implementing the invention;
Fig. 4a shows an image pair which have been synthesized and resampled from a known depth map and known image sensor motion parameters;
Fig. 4b shows the difference image between the original image pair;
Fig. 4c shows the absolute value of the percentage difference between the recovered depth map and the actual depth map; Fig. 4d shows an image of the local surface parameters (inverse depths) such that bright points are nearer the image sensor than dark points;
Fig. 4e shows the recovered image sensor motion at each resolution and also the actual image sensor motion;
Fig. 5a - e show the results for a second image of a natural image pair where the image center has been estimated, and where the precise image sensor motion is unknown;
Fig. 6a - e show the results for a road sequence where there is less image texture;
Fig. 7a - e show the results for another road sequence.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figs. 1 and 2 illustrate a prior art approach to motion detection which will be helpful in understanding the present invention. In Fig. 1 it is assumed that a moving image sensor (e.g., a video camera) is viewing the ground below from aboard an aircraft in search of an object, such as an automobile, which is locally moving with respect to the ground, for the purpose of detecting the locally-moving object and then tracking its motion with respect to the ground in real time. In this case, the camera produces a sequence of image frames of the ground area at a relatively high rate (e.g., 30 frames per second) so that the area being viewed changes only a small amount between any pair of successive frames. The frame area 100 of each of the successive image frames is divided into a majority region, which is moving at a global velocity determined by the coherent motion of the aircraft, and a minority region occupied by locally-moving automobile 101 on the ground. The frame-area 100 of each of a pair of successive frames, excluding border-area 102 thereof, is divided into an array of sub-area windows 104-11 ... 104-mn, and the local velocity (designated in Fig. 1 by its vector) for each of these sub-area windows is computed. This may be done by displacing the image data in each sub-area window of one of the pair of successive frames with respect to the image data in its corresponding sub-area window of the other of the pair of successive frames to provide a match therebetween. Border-area 102 is excluded in order to avoid boundary problems. Further, the image data included in a sub-area window of a frame may overlap to some extent the image data included in an adjacent sub-area window of that frame. In any event, the size of each sub-area window is large compared to the maximum displacement of image data between a pair of successive frames.
The average velocity of all the local velocities is calculated and the size of the difference error between each local velocity and this average velocity determined. In general, these errors will be small and result from such effects as parallax and the fact that the ground viewed by the moving camera is not flat. However, as shown in Fig. 1, the error for those two sub-area windows which include locally-moving automobile 101 is quite large, because the computed velocities therefor include both the global velocity of the moving camera on the aircraft and the local velocity of the automobile moving on the ground. Therefore, the two sub-area windows which include locally-moving automobile 101 are excluded by the fact that their respective errors exceed a given threshold, and the average velocity is then recomputed from only the remaining sub-area windows. This recomputed average velocity constitutes an initial estimate of the global velocity of the motion picture due to the movement of the camera. Because only an initial estimate of the global velocity is being derived, the image data of each of the sub-area windows 104-11....104-mn employed for its computation is preferably of relatively low resolution in order to facilitate the required matching of the image data in each of the large number of corresponding sub-area windows 104-11....104-mn of the pair of successive frames. In Fig. 2 a feedback loop for carrying out the prior-art approach is shown in generalized form. The feedback loop comprises motion model 200 (that is derived in whole or at least in part by the operation of the feedback loop), residual motion estimator 202, summer 204, image warper 206, frame delays 208 and 210, and image data from a current frame and from a previous frame that has been shifted by image warper 206. Residual motion estimator 202, in response to image data from the current frame and from the previous shifted frame applied as inputs thereto, derives a current residual estimate, which is added to the previous estimate output from motion model 200 by summer 204 and then applied as a warp control input to image warper 206. Current-frame image data, after being delayed by frame delay 208, is applied as an input to image warper 206. Image warper 206 shifts the frame-delayed current-frame image data in accordance with its warp-control input, and then frame-delays the output therefrom by frame delay 210 to derive the next previous shifted frame. The feedback loop of Fig. 2 performs an iterative process to refine the initial estimate of the global velocity to the point that substantially all of that portion of the respective computed sub-area window velocities of the minority region due to global velocity is eliminated. This iterative process derives the respective local residual velocities of the sub-area windows 104-11....104-mn of each consecutively-occurring pair of successive frames, and then uses each of these residual velocities to derive a current estimate of the residual global velocity. More specifically, the respective local velocities of each pair of successive frames are computed and a current estimate of residual global velocity is made during each cycle of the iterative process as described above, after the previous estimate of global velocity has, in effect, been subtracted out. In the case of the first cycle, the previous estimate of global velocity is zero, since no previous estimate of global velocity has been made. Therefore, in this case, the residual velocity itself constitutes the initial estimate of the global velocity discussed above. The initial-estimate computation is sketched below.
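The initial-estimate step described above (average all window velocities, discard windows whose error exceeds a threshold, re-average) reduces to a few lines. A minimal sketch, with illustrative names and per-window velocities as 2-vectors:

```python
import numpy as np

def estimate_global_velocity(local_velocities, threshold):
    """Initial global-velocity estimate from per-window local velocities,
    as in the prior-art majority-motion approach: average all window
    velocities, exclude windows whose error from that average exceeds a
    threshold (e.g. windows covering a locally-moving object), and
    recompute the average from the remaining windows."""
    v = np.asarray(local_velocities, dtype=float)   # shape (m*n, 2)
    mean = v.mean(axis=0)
    errors = np.linalg.norm(v - mean, axis=1)
    keep = errors <= threshold
    return v[keep].mean(axis=0)                     # recomputed average
```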
The effect of this iterative process is that the magnitude of the residual velocities becomes smaller and smaller for later and later occurring cycles. It is, therefore, preferable that residual motion estimator 202 employ image data of the lowest resolution during the first cycle of the iterative process, and during each successive cycle employ higher-resolution image data than was employed during the immediately preceding cycle, in order to minimize the required precision for the matching of the image data in each successive cycle.
Residual motion estimator 202 may comprise hardware and/or software. Several alternative implementation species of residual motion estimator 202 are disclosed in the aforesaid Burt et al. articles. Each of these species provides effective division of the computational burden between general and special purpose computing elements. The first step, ascertaining local motion within the respective sub-area windows, is ideally suited for implementation within custom hardware. Data rates are high because the analysis is based on real-time video-rate image data, but processing is simple and uniform because only local translations need be estimated. The second step, in which a global model must be fit to the entire set of local-motion vectors of all the sub-area windows, is well suited for software implementation in a microprocessor because the computations are relatively complex and global, but the local-motion vector data set is relatively small. Further, as is brought out in the aforesaid Burt et al. articles, the adjustment of the image-data resolution preferably employed in the different cycles of the iteration process can be efficiently performed by Laplacian and Gaussian pyramid techniques known in the image-processing art, as shown for example by Anderson et al in U.S. Patent No. 4,692,806 and by van der Wal in U.S. Patent No. 4,703,514.
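For example, the REDUCE step of a Gaussian pyramid can be sketched as follows; this is a minimal illustration using the common 5-tap generating kernel, not the circuitry of the cited patents. A Laplacian level is then the difference between a Gaussian level and the expanded next-coarser level.

```python
import numpy as np

# Common 5-tap generating kernel for Burt-style pyramids.
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def reduce_once(img):
    """One REDUCE step: separable low-pass filtering followed by
    subsampling by a factor of two in each dimension."""
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda row: np.convolve(row, KERNEL, mode='same'), axis, img)
    return img[::2, ::2]

def gaussian_pyramid(img, levels):
    """Return [full-res, half-res, ...], e.g. 256x256 down to 32x32 for
    levels=4, as in the experiments described later in this document."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(reduce_once(pyr[-1]))
    return pyr
```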
Burt et al. also describe an improvement of the "majority-motion" approach which employs a foveation technique where, after each cycle of the above-described iterative process has been completed, only the minority portion of the entire analysis area that has been determined during that cycle not to define the global motion (i.e., automobile 101 is contained within this minority portion) is employed as the entire analysis region during the next cycle of the iterative process. Further, the size of each of the sub-area windows is decreased during each successive cycle, so that the smaller analysis area during each successive cycle can still be divided into the same number of sub-area windows.
This ability in the prior art to determine the motion of an image sensor from analysis of a sequence of images of a scene is needed to enable an image sensor to navigate through a scene. The complexity arises, however, as the sensor moves through the scene, that objects at varying distances and orientations from the sensor (scene-structure) will move with different velocities (both speed and direction). These non-uniformities create substantial complexities in the analysis and necessitate using techniques different from those disclosed by Burt et al. I have developed a method and apparatus that fits image sensor motion and scene-structure models directly to the images to determine the local scene structure and the global image sensor motion. A global image sensor motion constraint is combined with the local scene characteristic constraint to relate local surface models with the global image sensor motion model and local scene characteristics. In an iterative process, the local surface models are first refined using the image sensor motion as a constraint, and then the image sensor motion model is refined using the local surface models as constraints. The estimates of image sensor motion and scene-structure at a given resolution are refined by an iterative process to obtain increasingly more accurate estimates of image sensor motion and scene-structure; i.e., there is a "ping-pong" action between the local models of scene characteristics and the global image sensor model, with successive warps of the images to bring them into acceptable congruence with one another. The refinement process starts with initial image sensor and local scene structure models, estimated from a previous frame or from any other source of an a priori estimate. This iterative process is then repeated at successively higher resolutions until an acceptable accuracy is obtained. Specifically, the models are fitted to an image pair represented at a coarse resolution, and the resultant models are then refined using the same fitting procedure at the next finer resolution.
Image flow is bypassed as the intermediary between local scene characteristic changes and the global image sensor motion constraint. The local scene characteristic constancy constraint is combined with the image sensor motion constraint to relate local-planar or local-constant-depth models with an image sensor motion model and local scene characteristic derivatives. A local-planar model assumes that the scene locally has the shape of a flat planar surface, such as a wall. A local constant-depth surface model is a special case of a local-planar model. It assumes that the flat planar surface is oriented parallel to the surface of the sensor. Beginning with initial estimates of the image sensor motion and the local surface parameters, the local surface models are refined using the global image sensor motion model as a constraint. The global image sensor motion model is then refined using the local surface models as constraints.
The following analysis uses changes in the local brightness as t local scene characteristic to illustrate the principles of the inventio Other local scene characteristics include edges, corners, landmarks an other features. The image brightness is related to local surface models an an image sensor motion model as follows. From the first order Taylor' expansion of the brightness constancy assumption, the brightnes constraint equation is VF du + It = 0 (1) where VIT is the gradient vector of the image brightness values, du is th incremental image motion vector, and It is the time derivative of the imag brightness values. Using the perspective projection image sensor mod and the derivative of the three dimensional position of a moving object, th image motion u of a static object that results from image sensor translatio
Using the perspective projection image sensor model and the derivative of the three-dimensional position of a moving object, the image motion u of a static object that results from image sensor translation T and image sensor rotation Ω can be written as

u = K T Z^-1 + A Ω    (2)

where Z is the depth of the object,
K = ( -f   0   x )      A = ( xy/f         -(f + x^2/f)    y  )
    (  0  -f   y )          ( f + y^2/f    -xy/f          -x  )    (3)
and x, y are image coordinates and f is the focal length of the image sensor. For a local planar patch model,
R^T P = 1,  R = (X, Y, Z)^T,  P = (a, b, c)^T    (4)

where R is a point in world coordinates and P defines the orientation and depth of the plane. By combining Eq. 4 with the standard perspective projection equations, x = Xf/Z, y = Yf/Z, and by eliminating X and Y,
Z"1 = TP F = (x/f, y/f,l)τ (5) Inserting Eq. 5 into Eq. 2 gives the image motion in terms of ima sensor image sensor motion, local surface orientation and depth: u = FTP + A Ω (6)
From a previous resolution or iteration, an estimate of the global image sensor motion parameters, T_0, Ω_0, and also an estimate, P_0, for each local surface model may exist. Eq. 6 can be used to write an incremental image sensor motion equation:

du = (K T F^T P + A Ω) - u_0 = (K T F^T P + A Ω) - (K T_0 F^T P_0 + A Ω_0)    (7)

where u_0 is the image motion corresponding to the previous estimates of the local surface and image sensor motion models. Inserting this incremental image sensor motion equation into the brightness constraint equation (Eq. 1) gives

∇I^T K T F^T P + ∇I^T A Ω - ∇I^T K T_0 F^T P_0 - ∇I^T A Ω_0 + I_t = 0    (8)
The error in this equation is used to refine both the local surface models and the global image sensor motion model. Specifically, the least-squares error in Eq. 8 is minimized with respect to the local surface parameters over each local region. The least-squares error is then minimized with respect to the image sensor motion parameters over all the local regions. In each local image region, the least-squares error measure

e_local = Σ_local [∇I^T K T F^T P + ∇I^T A Ω - ∇I^T K T_0 F^T P_0 - ∇I^T A Ω_0 + I_t]^2    (9)

is minimized with respect to P. Differentiating Eq. 9 with respect to P gives
de/dP = 2 Σ_local (∇I^T K T) F [∇I^T K T F^T P + ∇I^T A Ω - ∇I^T K T_0 F^T P_0 - ∇I^T A Ω_0 + I_t]    (10)

At the minimum, de/dP is zero and P_min is
P_min = -G^-1 Σ_local (∇I^T K T) F [∇I^T A Ω - ∇I^T K T_0 F^T P_0 - ∇I^T A Ω_0 + I_t]    (11)

where
G = Σ_local (∇I^T K T)^2 F F^T    (12)
The planar patch model is simplified to a constant-depth model so that P = (0, 0, c)^T. Eq. 11 then becomes
c_min = -Σ_local (∇I^T K T) [∇I^T A Ω - (∇I^T K T_0) c_0 - ∇I^T A Ω_0 + I_t] / Σ_local (∇I^T K T)^2    (13)

where c_0 is an estimate of the local depth from a previous scale or iteration. In the global image region, the least-squares error measure

e_global = Σ_global [∇I^T K T c_min + ∇I^T A Ω - (∇I^T K T_0) c_0 - ∇I^T A Ω_0 + I_t]^2    (14)

is minimized with respect to T and Ω, where c_min for each local region is given by Eq. 13. Eq. 14 is valid only for the local-constant-depth model. As formulated here, the error is quadratic in Ω but non-quadratic in T, and a non-linear minimization technique is required. In the current implementation of the method, the Gauss-Newton minimization is done using Ω_0 and T_0 as initial starting values. It is to be understood that other minimization techniques can also be used. If initial estimates of Ω_0 and T_0 are not available, for example from a previous frame in a sequence, trial translation values are inserted into Eq. 14, Eq. 14 is solved for Ω - Ω_0 (in closed form, since Eq. 14 is quadratic in Ω - Ω_0), and the T and Ω - Ω_0 that give the lowest error in Eq. 14 are chosen as the initial estimates. Preferably, the local and global minimizations are performed within a multi-resolution pyramid framework; a software sketch of this two-stage refinement is given after the method steps below.

The invention is a method for determining a model for image sensor motion through a scene and a scene-structure model of the scene from two or more images of the scene at a given image resolution comprising the steps of:
(a) setting initial estimates of local scene models and an image sensor motion model;
(b) determining a new value of one of said models by minimizing the difference between the measured error in the images and the error predicted by the model;
(c) resetting the initial estimates of the local scene models and the image sensor motion model using the new value of the one of said models determined in step (b);
(d) determining a new value of the second of said models using the estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
(e) warping one of the images towards the other image using the current estimates of the models at the given image resolution;

(f) repeating steps (b), (c), (d) and (e) until the differences between the new values of the models and the values determined in the previous iteration are less than a certain value or until a fixed number of iterations have occurred;

(g) expanding the images to a higher resolution; and
(h) repeating steps (b), (c), (d), (e) and (f) at the higher resolution using the current estimates of the models as the initial starting values.
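The two-stage minimization and the warp of step (e) can be illustrated with a compact single-resolution sketch, in Python with the numpy and scipy libraries (the disclosure prescribes no programming language), written for the local-constant-depth model with the K and A entries of Eq. 3. The sketch simplifies the global stage: after each warp it solves for small motion increments with c held fixed, which makes the linearized Eq. 14 a linear least-squares problem in (T, Ω) rather than requiring the full Gauss-Newton procedure, and it fixes the scale ambiguity between T and c by normalizing T. All names and thresholds are ours:

import numpy as np
from scipy.ndimage import uniform_filter, map_coordinates

def refine_one_resolution(img0, img1, f, T, Om, c, n_iters=5, win=5, thresh=1e-4):
    # Steps (a)-(f) at one resolution for the local-constant-depth model.
    # T, Om: global translation and rotation; c: local inverse-depth map.
    img0, img1 = np.asarray(img0, float), np.asarray(img1, float)
    T, Om, c = np.array(T, float), np.array(Om, float), np.array(c, float)
    h, w = img0.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    xc, yc = xx - w / 2.0, yy - h / 2.0      # coordinates about the image center
    warped = img1
    for _ in range(n_iters):                 # step (f), fixed iteration count
        Iy, Ix = np.gradient(0.5 * (img0 + warped))   # spatial gradient (Eq. 1)
        It = warped - img0                            # temporal derivative (Eq. 1)
        # Per-pixel row vectors (grad I)^T K and (grad I)^T A from Eq. 3.
        gK = np.stack([-f * Ix, -f * Iy, xc * Ix + yc * Iy])
        gA = np.stack([Ix * xc * yc / f + Iy * (f + yc**2 / f),
                       -Ix * (f + xc**2 / f) - Iy * xc * yc / f,
                       Ix * yc - Iy * xc])
        g = np.tensordot(T, gK, axes=1)               # (grad I)^T K T
        # Steps (b)-(c): local refinement of c over each win x win window
        # (Eq. 13); windows with a small denominator are left unrefined.
        num = uniform_filter(g * It, size=win)
        den = uniform_filter(g * g, size=win)
        c = np.where(den > thresh, c - num / np.maximum(den, thresh), c)
        # Step (d): global refinement of (T, Om) with c held fixed; the
        # linearized Eq. 14 becomes one least-squares system over all pixels.
        M = np.stack([c * gK[0], c * gK[1], c * gK[2],
                      gA[0], gA[1], gA[2]], axis=-1).reshape(-1, 6)
        d, *_ = np.linalg.lstsq(M, -It.reshape(-1), rcond=None)
        T, Om = T + d[:3], Om + d[3:]
        s = np.linalg.norm(T) + 1e-12        # fix the T/c scale ambiguity
        T, c = T / s, c * s
        # Step (e): warp the second image toward the first using the models
        # (Eq. 6 gives the predicted motion field u = (ux, uy)).
        ux = c * (-f * T[0] + xc * T[2]) + (xc * yc / f) * Om[0] \
             - (f + xc**2 / f) * Om[1] + yc * Om[2]
        uy = c * (-f * T[1] + yc * T[2]) + (f + yc**2 / f) * Om[0] \
             - (xc * yc / f) * Om[1] - xc * Om[2]
        warped = map_coordinates(img1, np.array([yy + uy, xx + ux]),
                                 order=1, mode='nearest')
    return T, Om, c

Resampling the original second image, rather than the previously warped one, avoids compounding interpolation error across iterations.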
The invention is also an image processing system for determining the image sensor motion and structure of a scene comprising image sensor means for obtaining one or more images of a scene; means for setting the initial estimates of a local scene model and the motion of the image sensor at a first image resolution; means for warping the first image towards the second image using the current estimates of the local scene models and image sensor motion model at the first image resolution; means for refining all local scene models and refining the image sensor motion model by performing one minimization step; and iteration means for repeating the warping and refining steps several times.
In the local minimization, the global image sensor motion constraint constrains the refinement of the surface parameters locally. Conversely, in the global minimization, the local constraints provided by local image structures constrain the refinement of the global image sensor motion parameters.
In the first part of the method, the image sensor motion constraint and the local image brightness derivatives are used to refine each local surface parameter c. The incremental image sensor motion equation (Eq. 7) can be rewritten for the simplified local-constant-depth model so that
du = (K T c + A Ω) - (K T_0 c_0 + A Ω_0)    (15)
At Ω = Ω_0 and T = T_0,

du_0 = K T_0 (c - c_0)    (16)

where du_0 is the incremental motion introduced by an increment in the parameter c. Therefore, the increment in local motion is constrained to lie along a line in velocity space in the direction of the vector K T_0 (the image sensor motion constraint line). The vector K T_0 points towards the current estimate of the focus of expansion of the image pair. Within a local region containing a single edge-like image structure, the brightness constraint equation constrains the motion to lie along a line in velocity space in the direction of the edge (perpendicular to ∇I). By combining the image sensor motion and brightness motion constraints, the surface parameter c is refined such that the incremental motion introduced by the refinement lies at the intersection of the image sensor motion constraint line and the local brightness constraint line. In this case, a local motion ambiguity arising from the aperture problem has been resolved using only the image sensor motion constraint. However, local motion ambiguities cannot be resolved using the image sensor motion constraint when the image sensor motion constraint line and the local motion constraint line are parallel. In this case, Σ_local (∇I^T K T_0)^2 ≈ 0, and the denominator in Eq. 13 tends to zero. The physical interpretation is that the local edge structure is aligned in the direction of the current estimate of the focus of expansion. The local surface parameter cannot be refined reliably because the image sensor motion estimate adds little or no constraint to the local brightness constraint. In the current implementation of the method, the local surface parameter is not refined if the denominator in Eq. 13 is below a threshold. Within a local region containing a corner-like image structure, both motion components can be resolved from local information, and the local brightness constraint constrains the incremental motion to lie at a single point in velocity space. However, the image sensor motion estimate constrains the incremental motion to lie along the image sensor motion constraint line in velocity space. If the point and line intersect in velocity space, then the incremental motion introduced by the refinement corresponds to the point in velocity space. If the point and line do not intersect, then the incremental motion lies between the line and the point in velocity space. Within a local region containing a single edge-like image structure, the brightness constraint equation (Eq. 1) shows that the error in the equation will remain constant for any du that is perpendicular to the gradient vector ∇I of the edge. As a result, only the local motion component normal to the edge is used to refine the global image sensor motion estimate. Since there is no contribution from the motion component along the edge direction, fewer errors in the global image sensor motion estimate are caused by local motion ambiguities arising from the aperture problem.
Within a local region containing a corner-like image structure, both motion components can be resolved from only local information, and both motion components contribute to the refinement of the global image sensor motion estimate.
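The velocity-space argument can be made concrete with a small geometric helper (an illustrative sketch, not part of the disclosure; names are ours). The brightness constraint of Eq. 1 restricts the incremental motion to the line {du : ∇I·du = -I_t}, and Eq. 16 restricts it to a line through the current estimate along K T_0:

import numpy as np

def intersect_constraints(grad, It, base, d, eps=1e-6):
    # Intersect the brightness constraint line {du : grad . du = -It}
    # with the image sensor motion constraint line {base + t * d},
    # where d points along K T0 (base = 0 for Eq. 16). Returns None
    # when the two lines are (near) parallel -- the ambiguous case in
    # which the local surface parameter is left unrefined.
    denom = grad @ d
    if abs(denom) < eps:
        return None
    t = (-It - grad @ base) / denom
    return base + t * d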
We use a Gaussian or Laplacian pyramid to refine the image sensor motion estimate and local surface parameters at multiple resolutions. In the pyramid framework, large pixel displacements at the resolution of the original image are represented as small pixel displacements at coarse resolutions. Therefore, the first-order Taylor expansion of the brightness constancy constraint (Eq. 1, approximately true only for small du) becomes valid at coarse resolutions even when the image motion is large at the original resolution. The local depth estimates from previous resolutions are used to bring the image pair into closer registration at the next finer resolution. As a result, the first-order Taylor expansion is expected to be valid at all resolutions in the pyramid framework, disregarding basic violations of the brightness assumption that will occur, for example, at occlusion boundaries. In addition, independently moving objects in the scene will also violate the image sensor motion constraint. Preliminary results have shown that the recovered image sensor motion estimate is not greatly sensitive to such failures in the models. In the image sensor motion recovery method presented here, the additional changes in image brightness introduced by Gaussian or Laplacian blurring within the pyramid have not been determined. The recovered image sensor motion estimates are often similar at each resolution, and the error surfaces computed as a function of image sensor translation using a flow-based, multi-resolution, image sensor motion recovery method are similar at all resolutions.
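A generic software construction of such pyramids might look as follows (a sketch assuming numpy and scipy; the Laplacian variant below uses same-size differences of blurred images rather than the usual EXPAND-and-subtract construction):

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, levels):
    # Blur-and-subsample (REDUCE): each level halves the pixel
    # displacements, which is what makes the first-order expansion of
    # Eq. 1 usable at coarse levels even for large motions.
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr                      # pyr[0] finest ... pyr[-1] coarsest

def laplacian_pyramid(img, levels):
    # Band-pass levels as differences of blurred images.
    return [g - gaussian_filter(g, sigma=1.0)
            for g in gaussian_pyramid(img, levels)]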
In Fig. 3, a feedback loop 300 for implementing the invention comprises an image sensor 302, such as a video camera, whose output is a sequence of images of a scene at a given resolution. Other types of image sensors include radar detectors, optical line sensors or other electromagnetic or sonic detectors or any other source of signals. The images are alternately applied by switch 304 to a first pyramid processor 306 and to a frame delay 308 and then to a second pyramid processor 310. Such pyramid processors are known in the image-processing art, as shown for example by Anderson et al in U.S. Patent No. 4,692,806 and by van der Wal in U.S. Patent No. 4,703,514. The two pyramid processors have as their output images separated in time by the delay provided by the frame delay 308 and corresponding to the original images but at a resolution r which is typically less than that of the original image. The time-delayed image is applied through a warper 312 and then to the estimator 314. While the warper is shown operating on the time-delayed image, it can equally operate on the other image. The other image is applied directly to estimator 314. In the estimator 314, in the first step the error function for the mismatch between the actual image motion and the models of the image sensor motion and the local scene structure is minimized with respect to each local scene model, keeping the current estimate of the global image sensor motion constant. In the second step, the error function for the mismatch between the global image sensor motion and the models of the image sensor motion and the local scene structure is minimized with respect to the global image sensor motion, keeping the current estimates of the local scene models constant. Estimator 314 provides as its output estimates of the global motion model and the local scene structure (local depth) model for the images. The initiator 315 provides the initial constraints on the local scene structure and the global motion model to the estimator 314. This information may be embedded in the initiator 315 or may come from another sensor. The outputs of the estimator 314 are new estimates of the global sensor motion and the local scene structure models.
These new estimates are then applied to synthesizer 316, which derives a warp-control signal which is applied to warper 312. The warper 312 then distorts the time-delayed image, bringing it closer to congruence with the other image. The cycle is then repeated until the required number of iterations have been completed or the differences between the two images have been reduced below a certain value. The local depth model information is then available at port 318 and the global motion model information is available at port 319. The images are then recalculated at a higher resolution and the iterative cycle is repeated. This sequence of iteration at a given resolution level and iteration at successively higher resolutions is repeated until the differences in the models between successive iterations are less than a certain value or a sufficient level of resolution has been attained.
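A software analogue of the Fig. 3 loop ties the pieces together. This is again only a sketch: gaussian_pyramid and refine_one_resolution are the helpers sketched earlier, the numbered elements refer to the figure, and the halving of f per level is our assumption about how the camera model scales with the pyramid:

import numpy as np

def feedback_loop(img0, img1, f, levels=4, iters_per_level=5):
    # img0, img1: the image pair from sensor 302, routed through
    # switch 304 and frame delay 308; pyramid processors 306 and 310.
    pyr0 = gaussian_pyramid(img0, levels)
    pyr1 = gaussian_pyramid(img1, levels)
    # Initiator 315: T = (0, 0, 1)^T, Omega = 0, inverse depths zero,
    # matching the defaults stated in the experiments below.
    T, Om = np.array([0.0, 0.0, 1.0]), np.zeros(3)
    c = np.zeros(pyr0[-1].shape)
    for lvl in range(levels - 1, -1, -1):          # coarse to fine
        g0, g1 = pyr0[lvl], pyr1[lvl]
        if c.shape != g0.shape:                    # expand the depth map
            c = c.repeat(2, axis=0).repeat(2, axis=1)[:g0.shape[0], :g0.shape[1]]
        T, Om, c = refine_one_resolution(g0, g1, f / 2**lvl, T, Om, c,
                                         n_iters=iters_per_level)
    return T, Om, c    # global motion (port 319), local depth (port 318)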
The image sensor motion method was tested on both natural and computer-rendered image sequences. The motion in the image sequences ranges from about 4 to 8 pixels at the original resolution, so that analysis at only the original resolution would be inaccurate, since the motion would be outside the range of the incremental motion estimator. In the results presented here, four resolutions are used. T = (0, 0, 1)^T and Ω = (0, 0, 0)^T are used as the initial image sensor motion estimate, unless stated otherwise. All local inverse depth estimates are initialized to zero.
Results of the method are shown on computer-rendered images that have size 256 x 256 pixels, and also on natural images that have size 256 x 240 pixels. A Laplacian pyramid was used to produce reduced-resolution images of size 128 x 128, 64 x 64 and 32 x 32 pixels for the computer-rendered images, and of size 128 x 120, 64 x 60 and 32 x 30 pixels for the natural images. We fit the local surface models to 5 x 5 pixel windows centered on each point in the image, and the global image sensor motion model is fitted to each point in the image. Alternatively, as part of a vehicle navigation system for example, analysis could be restricted to a number of larger local windows directed purposively at image regions like the road ahead or an oncoming object.
We have found that the method can converge to an incorrect solution or fail to converge when analysis begins at a very coarse resolution (corresponding to 16 x 16 pixels for the image sizes presented here). This behavior may result from excessive blurring of the image intensities at very coarse scales, and also from the limited number of sample points at very coarse resolutions.
Fig. 4a shows an image pair which has been synthesized and resampled from a known depth map and known image sensor motion parameters. For this image pair, an initial image sensor motion estimate was recovered by sampling 17 x 17 = 289 translation values at the coarsest resolution. Fig. 4b shows the difference image between the original image pair. Fig. 4c shows the absolute value of the percentage difference between the recovered depth map and the actual depth map. Fig. 4c illustrates the percentage depth error measure, [100 (real depth - computed depth) / (real depth)], demonstrating that the errors in the recovered depth map in the foreground portion of the scene are fairly uniform, and actual measurement in a 180 x 100 window in the foreground gives an rms error of approximately 1%. In the background portion of the scene (just over the ridge) the error is much larger, and measurement in a 100 x 15 window gives an rms error of approximately 8%. This difference is explained by observing that in both regions the difference between the actual motion and the recovered motion is approximately 0.05 - 0.1 pixels, whereas the actual motion is approximately 4 - 8 pixels in the foreground, and approximately 1 pixel in the background. We expect such accuracy in the recovered motion in the foreground and background portions of the image since the image is heavily textured there, but there are large errors in the recovered depth and motion at the very top of the image where there is no texture at all.
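For concreteness, the quoted error measures can be computed as follows (a trivial sketch; the window coordinates stand in for whichever regions are measured):

import numpy as np

def percent_depth_error(real_depth, computed_depth):
    # Fig. 4c's measure: 100 * (real depth - computed depth) / real depth.
    return 100.0 * (real_depth - computed_depth) / real_depth

def rms_in_window(err, row, col, h, w):
    # RMS of an error image over an h x w window with top-left (row, col),
    # as used for the 180 x 100 foreground and 100 x 15 background windows.
    patch = err[row:row + h, col:col + w]
    return float(np.sqrt(np.mean(patch ** 2)))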
Fig. 4d shows an image of the local surface parameters (inverse depths) such that bright points are nearer the image sensor than dark points. The bottom portion of the image shows a surface sloping away from the "camera" towards a ridge, at which point the depth changes rapidly. The very top of the image shows the parameters recovered at a blank portion of the image where there is no texture.
Fig. 4e shows the recovered image sensor motion at each resolution and also the actual image sensor motion. The estimate of the image sensor motion component at the final resolution is very close to the actual image sensor motion of the "camera", despite an occlusion boundary across the center of the image where the brightness constancy assumption is violated. In general, the least-squares minimization technique should be sensitive to measurement outliers that might be introduced by such deviations in the model. Similar robustness to measurement outliers has also been observed in other motion-fitting techniques that use the same incremental motion estimator (Eq. 1) within the same coarse-fine analysis framework.
Fig. 5a shows the second image of a natural image pair where the image center has been estimated, and where the precise image sensor motion is unknown. The image motion in the foreground is approximately 5 pixels towards the image sensor. Fig. 5b shows the inverse depth image recovered at the finest resolution. The recovered depths are plausible almost everywhere except at the image border and near the recovered focus of expansion (near the gate at the image center). The bright dot at the bottom right hand side of the inverse depth map corresponds to a leaf in the original image that is blowing across the ground towards the image sensor. We might expect such plausible results from a scene that is heavily textured almost everywhere. Fig. 5c shows the computed image sensor motion at each resolution. The initial image sensor motion estimate is close to the estimates recovered at the two finest scales, yet the recovered estimates are different at the two coarsest resolutions. At these coarse resolutions, the minimization procedure followed a low-gradient, incorrect direction in the error surface that led to the incorrect estimates. While this shows how the estimation procedure can recover from following the incorrect minimization path, it also shows how the error surfaces can differ very slightly between resolutions due to differences introduced by image blurring. Fig. 5e shows the recovered image sensor motion at each resolution and also the actual image sensor motion.
Fig. 6a shows the second image of a road sequence where there is less image texture. The image motion in the foreground is approximately 9 pixels towards the image sensor. Fig. 6b shows the inverse depth image recovered at the finest resolution. In this case the inverse depth parameters corresponding to the top portion of the image (sky) are clearly incorrect, and in fact the local surface parameters should probably not be refined in image regions containing such small image gradients; but for the same reason, such regions have minimal effect on the recovered image sensor motion estimate. Fig. 6c shows that in this case, the recovered solutions remain close to the initial image sensor motion estimate at each resolution. We computed the focus of expansion to lie at the end of the visible portion of the road, at the road center. Fig. 6e shows the recovered image sensor motion at each resolution and also the actual image sensor motion.
Fig. 7 presents the results for another road sequence. In this case, the recovered solutions remain close to the initial image sensor motion estimate. The inverse depth parameters corresponding to the top portion of the image (sky) are clearly incorrect, and in fact the local surface parameters should probably not be refined in image regions containing such small gradients; but for the same reason, such regions have minimal effect on the recovered image sensor motion estimate. We determined the focus of expansion to lie at the end of the visible portion of the road, at the road center.
An iterative, multi-resolution method that estimates image sensor motion directly from image gradients in two images, and how constraints from different local image structures interact with the image sensor motion constraint, is disclosed. The image sensor motion and depths were recovered quite accurately in the computer-rendered example, and the recovered image sensor motion and depths appeared plausible in the natural image sequences where ground truth was unavailable. The main advantages of the multi-resolution analysis disclosed here over existing single-resolution analyses are: a) Increased range of motion: At the resolution of the original image, limitations on the fitting model mean that motions of greater than approximately 1 image pixel cannot be measured accurately. Using multi-resolution analysis, our range of motion is increased to approximately 16 pixels (or more) at the resolution of the original image. This allows the image sensor motion and scene-structure recovery methods to be used for applications in which single-resolution analysis would not work. In fact, image disparity or image motion is greater than a pixel in many, if not most, applications. b) Accuracy of motion estimates: Because results from previous resolutions are used to update image sensor motion and scene-structure estimates at the next finer resolution, the estimates of image sensor motion and scene structure at the finest resolution (the resolution of the original images) are significantly more accurate than in single-resolution analysis, where estimates are computed only at the resolution of the original images without refining previous estimates. c) Efficiency of method: At coarse resolutions, the representation of the image is small, so the method runs very quickly at such resolutions.
We can stop processing at a coarse resolution if our refined estimates of image sensor motion and scene-structure are sufficiently accurate for a particular task. Compared to single-resolution analysis, therefore, there is flexibility in trading hardware resources and/or computing power versus accuracy of the image sensor motion and scene-structure estimates.
In summary, because of multi-resolution refinement (as well as single-resolution refinement) of the depth and image sensor motion estimates, the warped depth approach is much more accurate than alternative approaches; because of multi-resolution analysis, the allowable range of motion over which the method works is much greater than in many alternative methods. As a result, the warped depth method can be used in many applications where alternative methods cannot be used, and the method is efficient; real-time implementation in hardware and software is therefore relatively simple. The image sensor motion and scene-structure models do not have to be fitted to the entire images; specific regions can be selected, and processing is only performed in those regions. The coarse-fine resolution refinement of image sensor motion and scene-structure estimates can be extended to include refinement over time; that is, the estimates are refined over an image sequence rather than just an image pair. This method can be applied to many problems that require estimation of scene structure and/or image sensor motion from two or more images. Applications include vehicle navigation, obstacle detection, depth recovery and image stabilization. The method disclosed here made use of local image brightness and image brightness derivatives as the local scene characteristic or constraint. It is understood that other image characteristics can also be used in the method of the invention. It is also understood that methods other than pyramid processing can be used for expanding the local scene characteristics into a higher resolution.

Claims

I CLAIM:
1. A method for determining a model for image sensor motion through a scene and a scene-structure model of the scene from two or more images of the scene at a given image resolution comprising the steps of:
(a) setting initial estimates of local scene models and an image sensor motion model;
(b) determining a new value of one of said models by minimizing the difference between the measured error in the images and the error predicted by the model;
(c) resetting the initial estimates of the local scene models and the image sensor motion model using the new value of the one of said models determined in step (b);
(d) determining a new value of the second of said models using th estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
(e) warping one of the images towards the other image using the current estimates of the models at the given image resolution; and

(f) repeating steps (b), (c), (d) and (e) until the differences between the new values of the models and the values determined in the previous iteration are less than a certain value or until a fixed number of iterations have occurred.
2. The method of Claim 1 further comprising the steps of: (g) expanding the images to a higher resolution; and
(h) repeating steps (b), (c), (d), (e) and (f) at the higher resolution using the current estimates of the models as the initial starting values.
3. The method of Claim 1 wherein step (b) comprises determining a new value of one of said models by minimizing with respect to the first of said models the difference between the change in the scene characteristic measured between two or more images and the changes predicted by the models.
4. The method of Claim 1 wherein step (d) comprises determining a new value of one of said models by minimizing with respect to the second of said models the difference between the change in the scene characteristic measured between two or more images and the changes predicted by the models.
5. A method for determining a scene-structure model of the scene from two or more images of the scene at a given image resolution comprising the steps of:
(a) setting initial estimates of local scene models and an image sensor motion model;
(b) determining a new value of one of said scene-structure models or image sensor motion model by minimizing the difference between the measured error in the images and the error predicted by the model;
(c) resetting the initial estimates of the local scene models and the image sensor motion model using the new value of the one of said models determined in step (b);
(d) determining a new value of the second of said models using the estimates of the models determined in step (b) by minimizing the difference between the measured error in the images and the error predicted by the model;
(e) warping one of the images towards the other image using the current estimates of the models at the given image resolution; and
(f) repeating steps (b), (c), (d) and (e) until the differences between the new values of the models and the values determined in the previous iteration are less than a certain value or until a fixed number of iterations have occurred.
6. The method of Claim 5 further comprising the steps of:
(g) expanding the images into a higher resolution; and
(h) repeating steps (b), (c), (d), (e) and (f) at the higher resolution using the current estimates of the models as the initial starting values.
7. An image processing system for determining the image sensor motion and structure of a scene comprising: image sensor means for obtaining more than one image of a scene; means for setting the initial estimates of a local scene model and the image sensor motion model at a first image resolution; means for refining all local scene models and the image sensor motion model iteratively by performing minimization steps; means for warping the first image towards the second image using the current, refined estimates of the local scene models and image sensor motion model at the first image resolution; and iteration means for repeating the refining and warping steps.
8. The apparatus of Claim 7 further comprising: processing apparatus for expanding the image from a first resolution into a higher resolution.
PCT/US1992/008214 1991-10-04 1992-10-02 Method for determining sensor motion and scene structure and image processing system therefor WO1993007585A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP92921849A EP0606385B1 (en) 1991-10-04 1992-10-02 Method for determining sensor motion and scene structure and image processing system therefor
DE69231812T DE69231812T2 (en) 1991-10-04 1992-10-02 METHOD FOR DETERMINING PROBE MOVEMENT AND SCENE STRUCTURE AND IMAGE PROCESSING SYSTEM THEREFOR

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US07/771,715 US5259040A (en) 1991-10-04 1991-10-04 Method for determining sensor motion and scene structure and image processing system therefor
US771,715 1991-10-04

Publications (1)

Publication Number Publication Date
WO1993007585A1 (en) 1993-04-15

Family

ID=25092745

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1992/008214 WO1993007585A1 (en) 1991-10-04 1992-10-02 Method for determining sensor motion and scene structure and image processing system therefor

Country Status (5)

Country Link
US (1) US5259040A (en)
EP (1) EP0606385B1 (en)
JP (1) JP3253962B2 (en)
DE (1) DE69231812T2 (en)
WO (1) WO1993007585A1 (en)

Also Published As

Publication number Publication date
EP0606385B1 (en) 2001-05-02
US5259040A (en) 1993-11-02
JP3253962B2 (en) 2002-02-04
JPH06511339A (en) 1994-12-15
EP0606385A4 (en) 1994-10-19
DE69231812T2 (en) 2001-11-15
EP0606385A1 (en) 1994-07-20
DE69231812D1 (en) 2001-06-07

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL SE

WWE Wipo information: entry into national phase

Ref document number: 1992921849

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1992921849

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 1992921849

Country of ref document: EP