US20120129605A1 - Method and device for detecting and tracking non-rigid objects in movement, in real time, in a video stream, enabling a user to interact with a computer system - Google Patents

Method and device for detecting and tracking non-rigid objects in movement, in real time, in a video stream, enabling a user to interact with a computer system

Info

Publication number
US20120129605A1
Authority
US
United States
Prior art keywords
interest
image
points
region
movement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/300,509
Inventor
Nicolas Livet
Thomas Pasquier
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Total Immersion
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Total Immersion filed Critical Total Immersion
Assigned to TOTAL IMMERSION reassignment TOTAL IMMERSION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: Pasquier, Thomas, LIVET, NICOLAS
Publication of US20120129605A1 publication Critical patent/US20120129605A1/en
Assigned to QUALCOMM CONNECTED EXPERIENCES, INC. reassignment QUALCOMM CONNECTED EXPERIENCES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: TOTAL IMMERSION, SA
Assigned to QUALCOMM INCORPORATED reassignment QUALCOMM INCORPORATED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: QUALCOMM CONNECTED EXPERIENCES, INC.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/002 Specific input/output arrangements not covered by G06F 3/01 - G06F 3/16
    • G06F 3/005 Input arrangements through a video camera
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Definitions

  • In the example of FIG. 3, reference 310-1 designates a point of interest according to its position in the region of interest 200-1, whereas reference 310-2 designates that same point of interest according to its position in the region of interest 200-2.
  • FIG. 4 is a diagrammatic illustration of certain steps implemented in accordance with the invention to identify, in continuous operation, variations in arrangement of objects between two consecutive (or close) images of a sequence of images.
  • the images here are acquired via an image sensor such as a video camera, in particular a video camera of webcam type, connected to a computer system implementing the method described here.
  • a first step of initializing is executed.
  • An object of this step is in particular to define features of at least one region of interest, for example a shape, a size and an initial position.
  • It is also possible for a region of interest not to be defined in an initial state, the system being on standby for a triggering event, for example a particular movement of the user facing the video camera (the moving pixels in the image being analyzed in search of a particular movement), the location of a particular color such as the color of skin, or the recognition of a particular predetermined object whose position defines that of the region of interest.
  • the size and the shape of the region of interest may be predefined or be determined according to features of the detected event.
  • the initializing step 410 may thus take several forms depending on the object to track in the image sequence and depending on the application implemented.
  • the initial position of the region of interest is predetermined (off-line determination) and the tracking algorithm is on standby for a disturbance.
  • In step 415, a region of interest whose features have been determined beforehand (on initialization or in the preceding image) is positioned in the current image to extract the corresponding image part. If the current image is the first image of the video stream to be processed, that image becomes the preceding image, a new current image is acquired and step 415 is repeated.
  • Step 460 is only carried out if there are validated points of interest. As indicated earlier, this step consists in eliminating zones from the created mask, for example disks of a predetermined diameter, around points of interest validated beforehand.
  • Points of interest are then searched for in the region of the preceding image corresponding to the mask of interest so defined (step 435 ), the mask of interest here being the mask of interest created at step 430 or the mask of interest created at step 430 and modified during step 460 .
  • the search for points of interest is, for example, limited to the detection of twenty points of interest. Naturally, this number may be different and may be estimated according to the size of the mask of interest.
  • This search is advantageously carried out with the algorithm known by the name FAST.
  • a Bresenham circle, for example with a perimeter of 16 pixels, is constructed around each pixel of the image. If k contiguous pixels (k typically having a value of 9, 10, 11 or 12) contained in that circle all have a greater intensity than the central pixel, or all have a lower intensity than the central pixel, that central pixel is considered to be a point of interest. It is also possible to identify points of interest with an approach based on image gradients, as provided by the technique known as Harris corner detection.
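  • By way of illustration only, the following minimal sketch shows how such a detection step (step 435) could be realized with OpenCV's FAST detector, restricting detection to the mask of interest; the function name, the FAST threshold and the limit of twenty points are illustrative assumptions, not the exact implementation described in the text.

      import cv2
      import numpy as np

      def detect_points_in_mask(gray_roi, mask, max_points=20):
          # gray_roi: 8-bit grayscale region of interest of the preceding image
          # mask: 8-bit mask of interest, non-zero where detection is allowed
          fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
          keypoints = fast.detect(gray_roi, mask)
          # keep the strongest responses, limited here (arbitrarily) to twenty points
          keypoints = sorted(keypoints, key=lambda kp: kp.response, reverse=True)[:max_points]
          return np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)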
  • the points of interest detected in the preceding image according to the mask of interest as well as, where applicable, the points of interest detected and validated beforehand are used to identify the corresponding points of interest in the current image.
  • a search for corresponding points of interest in the current image is thus carried out (step 440 ), preferably using a method known under the name of optical flow.
  • This technique gives better robustness when the image is blurred, in particular thanks to the use of pyramids of images smoothed by a Gaussian filter. This is for example the approach implemented by Lucas, Kanade and Tomasi in the algorithm known under the name KLT.
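  • As a minimal sketch of this tracking step (step 440), and assuming 8-bit grayscale images, the pyramidal Lucas-Kanade implementation available in OpenCV could be used as follows; the window size and the number of pyramid levels are illustrative values, not values prescribed by the text.

      import cv2
      import numpy as np

      def track_points(prev_gray, curr_gray, prev_pts):
          # prev_pts: Nx1x2 float32 array of points of interest in the preceding image
          curr_pts, status, err = cv2.calcOpticalFlowPyrLK(
              prev_gray, curr_gray, prev_pts, None,
              winSize=(21, 21), maxLevel=3,  # pyramid of Gaussian-smoothed images
              criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
          status = status.reshape(-1).astype(bool)
          # keep only the points for which tracking succeeded
          return prev_pts[status], curr_pts[status]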
  • movement parameters are estimated for objects tracked in the region of interest of the preceding image relative to the region of interest of the current image (step 445 ).
  • Such parameters, also termed degrees of freedom, comprise, for example, a parameter of translation along the x-axis, a parameter of translation along the y-axis, a rotation parameter and/or a scale parameter; the transformation which maps a set of two-dimensional points from one plane to another and which groups together these four parameters is termed a similarity.
  • These movement parameters may, for example, be estimated by a nonlinear least squares error (NLSE) minimization approach such as the Gauss-Newton method. This method is directed to minimizing a re-projection error over the set of the tracked points of interest.
  • a threshold, typically expressed in pixels and having a predetermined value, is advantageously used to authorize a certain margin of error between the theoretical position of the point in the current image (obtained by applying the parameters estimated at step 445) and its real position (obtained by the tracking method of step 440).
  • the valid points of interest, here referenced 455, are considered as belonging to an object whose movement is tracked, whereas the non-valid points (also termed outliers) are considered as belonging to the image background or to portions of an object which are not visible in the image.
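  • The following sketch illustrates one possible way of estimating the movement parameters (step 445) and validating points of interest by re-projection error; it relies on OpenCV's RANSAC-based similarity estimator rather than the Gauss-Newton scheme mentioned above, and the pixel threshold is an illustrative value.

      import cv2
      import numpy as np

      def estimate_motion_and_validate(prev_pts, curr_pts, max_error=3.0):
          # prev_pts, curr_pts: Nx1x2 float32 matched points (preceding / current image)
          # Estimate a 4 degree-of-freedom similarity (tx, ty, rotation, scale).
          matrix, _ = cv2.estimateAffinePartial2D(prev_pts, curr_pts, method=cv2.RANSAC)
          if matrix is None:
              return None, np.zeros(len(prev_pts), dtype=bool)
          # Re-project the points of the preceding image with the estimated parameters.
          projected = cv2.transform(prev_pts, matrix)
          residuals = np.linalg.norm(projected - curr_pts, axis=2).reshape(-1)
          valid = residuals < max_error  # inliers; the others are treated as outliers
          # Scale and rotation could be recovered from the matrix if needed, e.g.
          # s = np.hypot(matrix[0, 0], matrix[1, 0]); theta = np.arctan2(matrix[1, 0], matrix[0, 0])
          return matrix, valid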
  • the valid points of interest are tracked in the following image and are used to modify the mask of interest created by comparison of a region of interest of the current image with the corresponding region of interest of the following image (step 460), in order to exclude portions from the mask of pixels in movement between the current and following images, as described with reference to FIG. 2 d.
  • This modified mask of interest makes it possible to eliminate portions of images in which points of interest are recursively tracked.
  • the valid points of interest are thus kept for several processing operations on successive images and in particular enable stabilization of the tracking of objects.
  • the new region of interest (or modified region of interest) which is used for processing the current image and the following image is then estimated thanks to the previously estimated degrees of freedom (step 445 ). For example, if the degrees of freedom are x and y translations, the new position of the region of interest is estimated according to the previous position of the region of interest, using those two items of information. If a change (or changes) of scale is estimated and considered in this step, it is possible, according to the scenario considered, also to modify the size of the new region of interest which is used in the current and following images of the video stream.
  • the estimation of a change (or changes) of scale is used for detecting the triggering of an action in similar manner to the click of a mouse.
  • It is also possible to use changes of orientation, particularly those around the viewing axis of the video camera (referred to as roll), in order, for example, to enable the rotation of a virtual element displayed in a scene or to control a button of “potentiometer” type, for instance to adjust the sound volume of an application.
  • if tracking is lost, the algorithm preferably returns to the initializing step. Such a loss of tracking, leading to the initializing step being re-executed, may be identified by measuring the movements of a user: it may thus be decided to reinitialize the method when those movements are stable or non-existent for a predetermined period, or when a tracked object leaves the field of view of the image sensor.
  • FIG. 5 illustrates more precisely certain aspects of the invention when four parameters characterize a movement of an object tracked in consecutive (or close) images of a sequence of images. These four parameters here are a translation denoted (T x , T y ), a rotation denoted θ around the optical axis of the image sensor and a scale factor denoted s. These four parameters represent a similarity, that is to say the transformation enabling a point M of a plane to be transformed into a point M′:
  • X M′ = s · (X M − X O ) · cos(θ) − s · (Y M − Y O ) · sin(θ) + T x + X O
  • Y M′ = s · (X M − X O ) · sin(θ) + s · (Y M − Y O ) · cos(θ) + T y + Y O
  • the points M s and M sθ represent the transformation of the point M according to the change of scale s alone and according to the change of scale s combined with the rotation θ, respectively.
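  • For reference, a direct transcription of the two relationships above (with (X O , Y O ) taken as the centre around which the similarity is expressed) could look as follows; it is only a numerical restatement of the formulas, with illustrative variable names.

      import math

      def apply_similarity(xm, ym, xo, yo, s, theta, tx, ty):
          # Transforms the point M into M' according to the similarity (Tx, Ty, theta, s)
          # expressed around the centre (Xo, Yo).
          xm2 = s * (xm - xo) * math.cos(theta) - s * (ym - yo) * math.sin(theta) + tx + xo
          ym2 = s * (xm - xo) * math.sin(theta) + s * (ym - yo) * math.cos(theta) + ty + yo
          return xm2, ym2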
  • the partial derivatives of each point considered, that is to say the movements associated with each of those points, are weighted according to the associated movement.
  • the points of interest moving the most have greater importance in the estimation of the parameters, which avoids the points of interest linked to the background disturbing the tracking of objects.
  • Y O′ = Y O + W GC · (Y GC − Y O ) + W T · T y
  • (X GC , Y GC ) represents the center of gravity of the points of interest in the current image, W GC represents the weight given to the influence of the current center of gravity, and W T the weight given to the influence of the translation.
  • the parameter W GC is positively correlated here with the velocity of movement of the tracked object whereas the parameter W T may be fixed depending on the desired influence of the translation.
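  • A minimal sketch of this update of the centre of the region of interest is given below; the symmetric update of the X coordinate and the choice of weights are assumptions made for the illustration, since only the Y relationship is given above.

      def update_roi_center(xo, yo, x_gc, y_gc, tx, ty, w_gc, w_t):
          # (xo, yo): previous centre of the region of interest
          # (x_gc, y_gc): centre of gravity of the points of interest in the current image
          # (tx, ty): estimated translation; w_gc, w_t: weighting parameters
          xo_new = xo + w_gc * (x_gc - xo) + w_t * tx  # assumed analogous to the Y relationship
          yo_new = yo + w_gc * (y_gc - yo) + w_t * ty
          return xo_new, yo_new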
  • FIG. 6 comprising FIGS. 6 a , 6 b and 6 c , illustrates an example of implementation of the invention in the context of a driving simulation game in which two regions of interest enable the tracking of a user's hands in real time, characterizing a vehicle steering wheel movement, in a sequence of images.
  • FIG. 6 a is a pictorial presentation of the context of the game
  • FIG. 6 b represents the display of the game as perceived by a user
  • FIG. 6 c illustrates the estimation of the movement parameters, or degrees of freedom, of the tracked objects to deduce therefrom a movement of a vehicle steering wheel.
  • FIG. 6 a comprises an image 600 extracted from the sequence of images provided by the image sensor used. The latter is placed facing the user, as if it were fastened to the windshield of the vehicle driven by the user.
  • This image 600 here contains a zone 605 comprising two circular regions of interest 610 and 615 associated with a steering wheel 620 drawn in overlay by computer graphics.
  • the image 600 also comprises elements of the real scene in which the user is situated.
  • the frame of reference Ow here corresponds to an overall frame of reference (“world” frame of reference)
  • the frame of reference Owh is a local frame of reference linked to the steering wheel 620
  • the frames of reference Oa 1 and Oa 2 are two local frames of reference linked to the regions of interest 610 and 615 .
  • the vectors Va 1 (Xva 1 , Yva 1 ) and Va 2 (Xva 2 , Yva 2 ) are the movement vectors resulting from the analysis of the movement of the user's hands in the regions of interest 610 and 615 , expressed in the frames of reference Oa 1 and Oa 2 , respectively.
  • ⁇ 1 and ⁇ 2 represent the rotation of the user's hands.
  • ⁇ 1 may be computed by the following relationship:
  • ⁇ 1 a tan2( Yva 1 wh, D/ 2)
  • ⁇ 2 may be computed in similar manner.
  • the new diameter D′ of the steering wheel is computed on the basis of its previous diameter D and of the movement of the user's hands (determined via the two regions of interest 610 and 615).
  • the game scenario may in particular compute a corresponding computer graphics image.
  • FIG. 7 illustrates an example of a device which may be used to identify the movements of objects represented in images provided by a video camera and to trigger particular actions according to identified movements.
  • the device 700 is for example a mobile telephone of smartphone type, a personal digital assistant, a micro-computer or a workstation.
  • the device 700 preferably comprises a communication bus 702 to which are connected:
  • the device 700 may also have the following items:
  • the communication bus allows communication and interoperability between the different elements included in the device 700 or connected to it.
  • the representation of the bus is non-limiting and, in particular, the central processing unit may communicate instructions to any element of the device 700 directly or by means of another element of the device 700 .
  • the executable code of the programs can be received by the intermediary of the communication network 728 , via the interface 726 , in order to be stored in an identical fashion to that described previously.
  • program or programs may be loaded into one of the storage means of the device 700 before being executed.
  • the central processing unit 704 will control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, these instructions being stored on the hard disk 720 or in the read-only memory 706 or in the other aforementioned storage elements.
  • the program or programs which are stored in a non-volatile memory for example the hard disk 720 or the read only memory 706 , are transferred into the random-access memory 708 , which then contains the executable code of the program or programs according to the invention, as well as registers for storing the variables and parameters necessary for implementation of the invention.
  • the communication apparatus comprising the device according to the invention can also be a programmed apparatus.
  • This apparatus then contains the code of the computer program or programs, for example fixed in an application-specific integrated circuit (ASIC).

Abstract

The invention relates in particular to the detection of interactions with a software application according to a movement of an object situated in the field of an image sensor. After having received a first and a second image and having identified a first region of interest in the first image, a second region of interest, corresponding to the first region of interest, is identified in the second image. The first and second regions of interest are compared and a mask of interest characterizing a variation of at least one feature of corresponding points in the first and second regions of interest is determined. A movement of the object is then determined from said mask of interest. The movement is analyzed and, in response, a predetermined action is triggered or not triggered.

Description

  • The present invention concerns the detection of objects by the analysis of images, and their tracking, in a video stream representing a sequence of images and more particularly a method and a device for detecting and tracking non-rigid objects in movement, in real time, in a video stream, enabling a user to interact with a computer system.
  • Augmented reality in particular seeks to insert one or more virtual objects in images of a video stream representing a sequence of images. According to the type of application, the position and orientation of those virtual objects may be determined by data that are external to the scene represented by the images, for example coordinates obtained directly from a game scenario, or by data linked to certain elements of that scene, for example coordinates of a particular point in the scene such as the hand of a player. When the nature of the objects present in the real scene has been identified and the position and the orientation have been determined by data linked to certain elements of that scene, it may be necessary to track those elements according to movements of the video camera or movements of those elements themselves in the scene. The operations of tracking elements and embedding virtual objects in the real images may be executed by different computers or by the same computer.
  • Furthermore, in such applications, it may be proposed to users to interact, in the real scene represented, at least partially, by the stream of images, with a computer system in order in particular to trigger particular actions or scenarios which for example enable the interaction with virtual elements superposed on the images.
  • The same applies in numerous other types of applications, for example in video game applications.
  • With these aims, it is necessary to detect particular movements, such as hand movements, in order to identify one or more predetermined commands. Such commands are comparable to those initiated by a computer pointing device such as a mouse.
  • The applicant has developed algorithms for visual tracking of textured objects, having varied geometries, not using any marker and whose originality lies in the matching of particular points between a current image of a video stream and a set of key images which are automatically obtained on initializing the system. However, such algorithms, described in French patent applications 0753482, 0752810, 0902764, 0752809 and 0957353, do not enable the detection of movements of objects that are not textured or that have a practically uniform texture such as the hands of a user. Furthermore, they are essentially directed to the tracking of rigid objects.
  • Although solutions are known enabling a user to interact with a computer system, in a scene represented by a sequence of images, those solutions are generally complex to implement.
  • More particularly, a first solution consists in using tactile sensors which are associated, for example, with the joints of a user or actor. Although this approach is often dedicated to movement tracking applications, in particular for cinematographic special effects, it is also possible to track the position and the orientation of an actor and, in particular, of his hands and feet to enable him to interact with a computer system in a virtual scene. However, the use of this technique proves to be costly since it requires the insertion, in the scene represented by the stream of images analyzed, of cumbersome sensors which may furthermore suffer from disturbance linked to their environment (for example electromagnetic interference).
  • Another solution, developed in particular in the European projects “OCETRE” and “HOLONICS” consists in using several image sources, for example several video cameras, to enable real time three dimensional reconstruction of the environment and of the spatial movements of the users. An example of such approaches is in particular described in the document entitled “Holographic and action capture techniques”, T. Rodriguez, A. Cabo de Leon, B. Uzzan, N. Livet, E. Boyer, F. Geffray, T. Balogh, Z. Megyesi and A. Barsi, August 2007, SIGGRAPH '07, ACM SIGGRAPH 2007, Emerging Technologies. It is to be noted that these applications may enable the geometry of the real scene to be reproduced but do not currently enable precise movements to be identified. Furthermore, to meet real time constraints, it is necessary to set up complex and costly hardware architectures.
  • Touch screens are also known for viewing augmented reality scenes which enable interactions of a user with a computer system to be determined. However, these screens are costly and poorly adapted to the applications of augmented reality.
  • As regards the interactions of users in the field of video games, an image is typically captured from a webcam type video camera connected to a computer or to a console. After having been stored in a memory of the system to which the video camera is connected, this image is generally analyzed by an object tracking algorithm, also referred to as blobs tracking, to compute in real time the contours of certain elements of the user who is moving in the image by using, in particular, an optical flow algorithm. The position of those shapes in the image enables certain parts of the displayed image to be modified or deformed. This solution thus enables the disturbance in a zone of the image to be located in two degrees of freedom.
  • However, the limits of this approach are mainly a lack of precision, since it is not possible to maintain the proper execution of the process during a displacement of the video camera, and a lack of semantics, since it is not possible to distinguish between movements in the foreground and in the background. Furthermore, this solution uses optical flow image analysis which, in particular, does not provide robustness to changes in lighting or to noise.
  • Also known is an approach to real time detection of an interaction between a user and a computer system in an augmented reality scene, based on an image of a sequence of images, the interaction resulting from the modification of the appearance of the representation of an object present in the image. However, this method, described in particular in French patent application No. 0854382, does not enable precise movements of the user to be identified and only applies to sufficiently textured zones of the image.
  • The invention enables at least one of the problems set forth above to be solved.
  • The invention is thus directed to a computer method for detecting interactions with a software application according to a movement of at least one object situated in the field of an image sensor connected to a computer implementing the method, said image sensor providing a stream of images to said computer, the method comprising the following steps,
      • receiving at least one first image from said image sensor;
      • identifying at least one first region of interest in said first image, said at least one first region of interest corresponding to a part of said at least one first image;
      • receiving at least one second image from said image sensor;
      • identifying at least one second region of interest of said at least one second image, said at least one second region of interest corresponding to said at least one first region of interest of said at least one first image;
      • comparing said at least one first and second regions of interest and determining a mask of interest characterizing a variation of at least one feature of corresponding points in said at least one first and second regions of interest;
      • determining a movement of said at least one object from said mask of interest, said at least one object being at least partially represented in at least one of said at least one first and second regions of interest; and
      • analyzing said movement and, in response to said analyzing step, triggering or not triggering a predetermined action.
  • The method according to the invention thus enables objects to be tracked, in particular deformable objects with little texture, in particular for augmented reality applications. Furthermore, the limited quantity of processing enables the method to be implemented in devices having limited resources (in particular in terms of computation) such as mobile platforms. Moreover, the method may be used with an image sensor of low quality.
  • The method according to the invention enables fast movements of objects to be tracked, even in the presence of blur in the images acquired by the image sensor. In addition, the processing according to the method of the invention does not depend on specific color properties of the moving objects, and it is thus possible to track objects such as a hand or a textured object in movement in front of the image sensor used.
  • The number of degrees of freedom defining the movements of each tracked object may be set for each region of interest.
  • It is possible to track several zones of interest simultaneously, in particular in order to enable multiple control. Thus, for example, the tracking of two hands enables the number of possible interactions between a user and a software application to be increased.
  • Advantageously, said step of determining a movement comprises a step of determining and matching at least one pair of points of interest in said at least one first and second images, at least one point of said at least one pair of points of interest belonging to said mask of interest. The method according to the invention thus enables the advantages linked to the tracking of points of interest to be combined while limiting the zones where those points are located in order to limit the processing and to concentrate on the tracked object.
  • According to a particular embodiment, said step of determining a movement comprises a step of determining and matching a plurality of pairs of points of interest in said at least one first and second images, at least one point of each of said pairs of points of interest belonging to said mask of interest, said movement being estimated on the basis of a transformation of a first set of points of interest into a second set of points of interest, the points of interest of said first and second sets belonging to said plurality of pairs of points of interest, the points of interest of said first set of points of interest furthermore belonging to said at least one first image and the points of interest of said second set of points of interest furthermore belonging to said at least one second image. The general movement of a part of an object may thus be determined from the movements of a set of points of interest.
  • Said transformation preferably implements a weighting function based on a distance between two points of interest from the same pairs of points of interest of said plurality of pairs of points of interest in order to improve the estimation of the movement of the tracked object.
  • Still according to a particular embodiment, the method further comprises a step of validating at least one point of interest of said at least one first image, belonging to said at least one pair of points of interest, according to said determined movement, said at least one validated point of interest being used to track said object in at least one third image following said at least one second image and said at least one validated point of interest being used for modifying a mask of interest created on the basis of said at least one second and third images. It is thus possible to use points of interest which are the same from image to image if they efficiently contribute to the general movement estimation of the tracked object. Furthermore, the validated points of interest are used to select new points of interest in order to avoid an excessive accumulation of points of interest in a limited region.
  • Said step of comparing said at least one first and second regions of interest comprises a step of performing subtraction, point by point, of values of corresponding points of said at least one first and second regions of interest and a step of comparing a result of said subtraction to a predetermined threshold. Such an embodiment makes it possible to combine the effectiveness of the method and limiting processing resources.
  • According to a particular embodiment, the method further comprises a step of detecting at least one predetermined feature in said at least one first image, said at least one first region of interest being at least partially identified in response to said detecting step. The method according to the invention may thus be automatically initialized or re-initialized according to elements of the content of the processed image. Such a predetermined feature is, for example, a predetermined shape and/or a predetermined color.
  • Advantageously, the method further comprises a step of estimating at least one modified second region of interest in said at least one second image, said at least one modified second region of interest of said at least one second image being estimated according to said at least one first region of interest of said at least one first image and of said at least one second region of interest of said at least one second image. The method according to the invention thus makes it possible to anticipate the processing of the following image for the object tracking. Said estimation of said at least one modified second region of interest of said at least one second image for example implements an object tracking algorithm of KLT type.
  • Said movement may in particular be characterized by a translation, a rotation and/or a scale factor.
  • When said movement is characterized by a scale factor, whether or not said predetermined action is triggered may be determined on the basis of said scale factor. Thus, a scale factor may, for example, characterize a mouse click.
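  • For instance, a minimal sketch of such a trigger could be written as follows; the threshold value and the symmetric test for movements away from the camera are illustrative assumptions.

      def is_click(scale_factor, threshold=1.25):
          # A marked change of scale between two images (for example a hand moved
          # quickly towards or away from the camera) is interpreted as a "mouse click".
          return scale_factor > threshold or scale_factor < 1.0 / threshold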
  • According to a particular embodiment, the movements of at least two objects situated in the field of said image sensor are determined, whether or not said predetermined action is triggered being determined according to a combination of the movements associated with said at least two objects. It is thus possible to determine a movement of an object on the basis of movements of other objects, in particular other objects subjected to constraints of relative position.
  • The invention is also directed to a computer program comprising instructions adapted to the implementation of each of the steps of the method described earlier when said program is executed on a computer as well as a device comprising means adapted to the implementation of each of the steps of the method described earlier. The advantages of this computer program and of this method are similar to those referred to earlier.
  • Other advantages, objects and features of the present invention will emerge from the following detailed description, given by way of non-limiting example, relative to the accompanying drawings in which:
  • FIG. 1, comprising FIGS. 1 a and 1 b, illustrates two successive images of a stream of images that may be used to determine the movement of objects and the interaction of a user;
  • FIG. 2, comprising FIGS. 2 a to 2 d, illustrates examples of variation in a region of interest of an image with the corresponding region of interest of a following image;
  • FIG. 3 is a diagrammatic illustration of the determination of a movement of an object of which at least one part is represented in a region and in a mask of interest of two consecutive images;
  • FIG. 4 is a diagrammatic illustration of certain steps implemented in accordance with the invention to identify, in continuous operation, variations in position of objects between two consecutive (or close) images of a sequence of images;
  • FIG. 5 illustrates certain aspects of the invention when four parameters characterize a movement of an object tracked in consecutive (or close) images of a sequence of images;
  • FIG. 6, comprising FIGS. 6 a, 6 b and 6 c, illustrates an example of implementation of the invention in the context of a driving simulation game in which two regions of interest enable the tracking of a user's hands in real time, characterizing a vehicle steering wheel movement, in a sequence of images; and,
  • FIG. 7 illustrates an example of a device adapted to implement the invention.
  • In general terms, the invention concerns the tracking of objects in particular regions of images in a stream of images, those regions, termed regions of interest, comprising a part of the tracked objects and a part of the scene represented in the images. It has been observed that the analysis of regions of interest makes it possible to speed up the processing time and to improve the movement detection of objects.
  • The regions of interest are, preferably, defined as two-dimensional shapes, in an image. These shapes are, for example, rectangles or circles. They are preferably constant and predetermined. The regions of interest may be characterized by points of interest, that is to say singular points, such as points having a high luminance gradient, and the initial position of the regions of interest may be predetermined, be determined by a user, by an event such as the appearance of a shape or a color or according to predefined features, for example using key images. These regions may also be moved depending on the movement of tracked objects or have a fixed position and orientation in the image. The use of several regions of interest makes it possible, for example, to observe several concomitant interactions of a user (a region of interest may correspond to each of his hands) and/or several concomitant interactions of several users.
  • The points of interest are used in order to find the variation of the regions of interest, in a stream of images, from one image to a following (or close) image, according to techniques of tracking points of interest based, for example, on algorithms known under the name of FAST, for the detection, and KLT (initials of Kanade, Lucas and Tomasi), for tracking in the following image. The points of interest of a region of interest may vary over the images analyzed, in particular according to the distortion of the objects tracked and their movements which may mask parts of the scene represented in the images and/or make parts of those objects leave the zones of interest.
  • Furthermore, the objects whose movements may create an interaction are tracked in each region of interest according to a mechanism for tracking points of interest in masks defined in the regions of interest.
  • FIGS. 1 and 2 illustrate the general principle of the invention.
  • FIG. 1, comprising FIGS. 1 a and 1 b, illustrates two successive images of a stream of images that may be used to determine the movement of objects and the interaction of a user.
  • As illustrated in FIG. 1 a, image 100-1 represents a scene having fixed elements (not represented), such as elements of decor, and mobile elements, here linked to animate characters (real or virtual). The image 100-1 here comprises a region of interest 105-1. As indicated previously, several regions of interest may be processed simultaneously; however, in the interest of clarity, a single region of interest is represented here, the processing of the regions of interest being similar for each of them. It is considered that the shape of the region of interest 105-1 as well as its initial position are predetermined.
  • Image 100-2 of FIG. 1 b represents an image following the image 100-1 of FIG. 1 a in a sequence of images. It is possible to define, in the image 100-2, a region of interest 105-2, corresponding to the position and to the dimensions of the region of interest 105-1 defined in the preceding image, in which disturbances may be estimated. The region of interest 105-1 is thus compared to the region of interest 105-2 of FIG. 1 b, for example by subtracting those image parts one from another, pixel by pixel (pixel being an acronym for PICture ELement), in order to extract therefrom a map of pixels that are considered to be in movement. These pixels in movement constitute a mask of pixels of interest (presented in FIG. 2).
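  • A minimal sketch of this comparison, assuming 8-bit grayscale regions of interest of identical size and an illustrative threshold value, could be written as follows.

      import cv2

      def motion_mask(roi_prev, roi_curr, threshold=25):
          # roi_prev, roi_curr: grayscale regions of interest 105-1 and 105-2
          diff = cv2.absdiff(roi_prev, roi_curr)               # pixel-by-pixel subtraction
          _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
          return mask                                          # non-zero pixels are "in movement"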
  • Points of interest, generically referenced 110 in FIG. 1 a, may be determined in the image 100-1, in particular in the region of interest 105-1, according to standard algorithms for image analysis. These points of interest may be advantageously detected at positions in the region of interest which belong to the mask of pixels of interest.
  • The points of interest 110 defined in the region of interest 105-1 are tracked in the image 100-2, preferably in the region of interest 105-2, for example using the KLT tracking principles by comparing portions of the images 100-1 and 100-2 that are associated with the neighborhoods of the points of interest.
  • These matches denoted 115 between the image 100-1 and the image 100-2 make it possible to estimate the movements of the hand represented with the reference 120-1 in image 100-1 and the reference 120-2 in image 100-2. It is thus possible to obtain the new position of the hand in the image 100-2.
  • The movement of the hand may next be advantageously used to move the region of interest 105-2 from the image 100-2 to the modified region of interest 125 which may be used for estimating the movement of the hand in an image following the image 100-2 of the image stream. The method of tracking objects may thus continue recursively.
  • It is to be noted here that, as stated earlier, certain points of interest present in the image 100-1 have disappeared from the image 100-2 due, in particular, to the presence and movements of the hand.
  • The determination of points of interest in an image is, preferably, limited to the zone corresponding to the region of interest as located in the current image, or to a zone comprising all or part thereof when a mask of interest of pixels in movement is defined in that region of interest.
  • According to a particular embodiment, estimation is made of information characterizing the relative positions and orientations of the objects to track (for example the hand referenced 120-1 in FIG. 1 a) in relation to a reference linked to the video camera from which the images come. Such information is, for example, two-dimensional position information (x, y), orientation information (θ) and information on the distance to the video camera, that is to say the scale (s) of the objects to track.
  • Similarly, it is possible to track the modifications that have occurred in the region of interest 125 that is defined in the image 100-2 relative to the region of interest 105-1 of the image 100-1 according to a movement estimated between the image 100-2 and the following image of the stream of images. For these purposes, a new region of interest is first of all identified in the following image on the basis of the region of interest 125. When the region of interest has been identified, it is compared with the region of interest 125 in order to determine the modified elements, forming a mask comprising parts of objects whose movements must be determined.
  • FIG. 2, comprising FIGS. 2 a to 2 c, illustrates the variation of a region of interest of one image in comparison with the corresponding region of interest, at the same position, of a following image, as described with reference to FIG. 1. The image resulting from this comparison, having the same shape as the region of interest, is formed of pixels which here may take two states, a first state being associated, by default, with each pixel. A second state is associated with the pixels corresponding to the pixels of the regions of interest whose variation exceeds a predetermined threshold. This second state forms a mask used here to limit the search for points of interest to zones which are situated on tracked objects or that are close to those tracked objects in order to characterize the movement of the tracked objects and, possibly, to trigger particular actions.
  • FIG. 2 a represents a region of interest of a first image whereas FIG. 2 b represents the corresponding region of interest of a following image, at the same position. As illustrated in FIG. 2 a, the region of interest 200-1 comprises a hand 205-1 as well as another object 210-1. Similarly, the corresponding region of interest, referenced 200-2 and illustrated in FIG. 2 b, comprises the hand and the object, here referenced 205-2 and 210-2, respectively. The hand, generically referenced 205, has moved substantially whereas the object, generically referenced 210, has only moved slightly.
  • FIG. 2 c illustrates the image 215 resulting from the comparison of the regions of interest 200-1 and 200-2. The black part, forming a mask of interest, represents the pixels whose difference is greater than a predetermined threshold whereas the white part represents the pixels whose difference is less than that threshold. The black part comprises in particular the part referenced 220 corresponding to the difference in position of the hand 205 between the regions of interest 200-1 and 200-2. It also comprises the part 225 corresponding to the difference in position of the object 210 between those regions of interest. The part 230 corresponds to the part of the hand 205 present in both these regions of interest.
  • The image 215 represented in FIG. 2 c may be analyzed to deduce therefrom an interaction between the user who moved his hand in the field of the video camera from which come the images from which are extracted the regions of interest 200-1 and 200-2 and a computer system processing those images. Such an analysis may in particular consist in identifying the movement of points of interest belonging to the mask of interest so formed, the search for points of interest then preferably being limited to the mask of interest.
  • However, a skeletonizing step making it possible in particular to eliminate adjoining movements such as the movement referenced 225 is, preferably, carried out before analyzing the movement of the points of interest belonging to the mask of interest. This skeletonizing step may take the form of a morphological processing operation such as for example operations of opening or closing applied to the mask of interest.
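  • By way of illustration only, such a morphological clean-up of the mask of interest could be sketched as follows; the use of OpenCV, the elliptical structuring element and the kernel size are assumptions, not part of the method described above.

```python
import cv2

def clean_mask(mask, kernel_size=5):
    """Morphological clean-up of an 8-bit binary mask of interest
    (moving pixels assumed to be 255)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    # Opening removes small isolated blobs such as the adjoining movement 225.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Closing fills small holes inside the remaining moving regions.
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
```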
  • Furthermore, advantageously, the mask of interest obtained is modified in order to eliminate the parts situated around the points of interest identified recursively between the image from which is extracted the region of interest 200-1 and the image preceding it.
  • FIG. 2 d thus illustrates the mask of interest represented in FIG. 2 c, here referenced 235, from which the parts 240 situated around the points of interest identified by 245 have been eliminated. The parts 240 are, for example, circular and here have a predetermined radius.
  • The mask of interest 235 thus has cropped from it the zones in which already detected points of interest are situated and where it is therefore not necessary to detect new ones. In other words, this modified mask of interest 235 excludes a part of the mask of interest 220 in order to avoid the accumulation of points of interest in the same zone of the region of interest.
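  • The exclusion of zones around already validated points of interest may, for example, be sketched as follows; the helper name, the use of OpenCV and the default radius of 10 pixels are illustrative assumptions.

```python
import cv2

def exclude_tracked_points(mask, validated_points, radius=10):
    """Zero out disks of a predetermined radius around already validated
    points of interest so that no new points are detected there."""
    out = mask.copy()
    for (x, y) in validated_points:
        cv2.circle(out, (int(x), int(y)), radius, color=0, thickness=-1)
    return out
```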
  • Again, the mask of interest 235 may be used to identify points of interest whose movements may be analyzed in order to trigger, where applicable, a particular action.
  • FIG. 3 is again a diagrammatic illustration of the determination of a movement of an object of which at least one part is represented in a region and a mask of interest of two consecutive (or close) images. The image 300 here corresponds to the mask of interest resulting from the comparison of the regions of interest 200-1 and 200-2 as described with reference to FIG. 2 d. However, a skeletonizing step has been carried out to eliminate the disturbances (in particular the disturbance 225). Thus, the image 300 comprises a mask 305 which may be used for identifying new points of interest whose movements characterize the movement of objects in that region of interest.
  • By way of illustration, the point of interest corresponding to the end of the user's index finger is shown. Reference 310-1 designates this point of interest according to its position in the region of interest 200-1 and reference 310-2 designates that point of interest according to its position in the region of interest 200-2. Thus, by using standard techniques for tracking points of interest, for example an algorithm for tracking by optical flow, it is possible, on the basis of the point of interest 310-1 of the region of interest 200-1, to find the corresponding point of interest 310-2 of the region of interest 200-2 and, consequently, to find the corresponding translation.
  • The analysis of the movements of several points of interest, in particular of the point of interest 310-1 and of the points of interest detected and validated beforehand, for example the points of interest 245, makes it possible to determine a set of movement parameters for the tracked object, in particular which are linked to a translation, a rotation and/or a change of scale.
  • FIG. 4 is a diagrammatic illustration of certain steps implemented in accordance with the invention to identify, in continuous operation, variations in arrangement of objects between two consecutive (or close) images of a sequence of images.
  • The images here are acquired via an image sensor such as a video camera, in particular a video camera of webcam type, connected to a computer system implementing the method described here.
  • After having acquired a current image 400 and if that image is the first to be processed, that is to say if a preceding image 405 from the same video stream has not been processed beforehand, a first step of initializing (step 410) is executed. An object of this step is in particular to define features of at least one region of interest, for example a shape, a size and an initial position.
  • As described earlier, a region of interest may be defined relative to a corresponding region of interest determined in a preceding image (in recursive phase of tracking, in this case the initializing 410 is not necessary) or according to predetermined features and/or particular events (corresponding to the initializing phase).
  • Thus, by way of illustration, it is possible for a region of interest not to be defined in an initial state, the system being on standby for a triggering event, for example a particular movement of the user facing the video camera (the moving pixels in the image are analyzed in search for a particular movement), the location of a particular color such as the color of skin or the recognition of a particular predetermined object whose position defines that of the region of interest. Like the position, the size and the shape of the region of interest may be predefined or be determined according to features of the detected event.
  • The initializing step 410 may thus take several forms depending on the object to track in the image sequence and depending on the application implemented.
  • It may in particular be a static initialization. In this case, the initial position of the region of interest is predetermined (off-line determination) and the tracking algorithm is on standby for a disturbance.
  • The initializing phase may also comprise a step of recognizing objects of a specific type. For example, the principles of detecting descriptors of Haar wavelet type may be implemented. The principle of these descriptors is described in particular in the paper by Viola and Jones, “Rapid object detection using a boosted cascade of simple features”, Computer Vision and Pattern Recognition, 2001. These descriptors in particular enable the detection of a face, the eyes or a hand in an image or a part of an image. During the initializing phase, it is thus possible to search for particular objects either in the whole image, in order to position the region of interest on the detected object, or in a region of interest itself to trigger the tracking of the recognized object.
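  • As an illustration of such an initialization by object recognition, a possible sketch using the Haar cascades distributed with OpenCV is given below; the frontal-face cascade and the detection parameters are assumptions standing in for the face, eye or hand detector mentioned above.

```python
import cv2

# A frontal-face cascade shipped with OpenCV, used here as a stand-in
# for the object detector that positions the region of interest.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_initial_roi(gray_image):
    detections = cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    if len(detections) == 0:
        return None              # no object found: stay in standby
    x, y, w, h = detections[0]   # position the region of interest on the detected object
    return (x, y, w, h)
```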
  • Another approach consists in segmenting an image and in identifying certain color properties and certain predefined shapes. When a shape and/or a segmented region of the processed image is similar to the object searched for, for example the color of the skin and the outline of the hand, the tracking process is initialized as described earlier.
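  • A minimal sketch of this color-based initialization is given below, assuming OpenCV and purely illustrative HSV bounds for skin color; the bounds and the choice of the largest segmented blob as the hand are assumptions.

```python
import cv2
import numpy as np

def detect_hand_by_color(bgr_image):
    """Very rough skin segmentation in HSV space; the bounds are illustrative."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))
    contours, _ = cv2.findContours(skin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)   # assume the largest skin blob is the hand
    return cv2.boundingRect(hand)               # (x, y, w, h) of the region of interest
```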
  • In a following step (step 415), a region of interest whose features have been determined beforehand (on initialization or in the preceding image) is positioned in the current image to extract the corresponding image part. If the current image is the first image of the video stream to be processed, that image becomes the preceding image, a new current image is acquired and step 415 is repeated.
  • This image part thus extracted is then compared with the corresponding region of interest of the preceding image (step 420). Such a comparison may in particular consist in subtracting, from each pixel of the region of interest considered of the current image, the corresponding pixel of the corresponding region of interest of the preceding image.
  • The detection of the points in movement is thus carried out, according to this example, by the absolute difference of parts of the current image and of the preceding image. This difference makes it possible to create a mask of interest capable of being used to distinguish a moving object from the decor, which is essentially static. However, as the object/decor segmentation is not expected to be perfect, it is possible to update such a mask of interest recursively on the basis of the movements in order to identify the movements of the pixels of the tracked object and the movements of the pixels which belong to the background of the image.
  • Thresholding is then preferably carried out on the difference between pixels according to a predetermined threshold value (step 425). Such thresholding may, for example, be carried out on the luminance. If coding over 8 bits is used, its value is, for example, 100. It makes it possible to isolate the pixels having a movement considered to be sufficiently great between two consecutive (or close) images. The difference between the pixels of the current and preceding images is then binary coded, for example black if the difference exceeds the predetermined threshold characterizing the movement and white in the opposite case. The binary image formed by the pixels whose difference exceeds the predetermined threshold forms a mask of interest or tracking in the region of interest considered (step 430).
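  • A possible implementation of steps 420 to 430, assuming 8-bit grayscale regions of interest and OpenCV, could look like the following sketch; the threshold value of 100 is the example value given above.

```python
import cv2

def mask_of_interest(prev_roi_gray, curr_roi_gray, threshold=100):
    """Binary mask of the pixels whose luminance varies by more than the
    predetermined threshold between two corresponding regions of interest."""
    diff = cv2.absdiff(curr_roi_gray, prev_roi_gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask   # 255 where the movement is considered significant
```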
  • If points of interest have been validated beforehand, the mask is modified (step 460) in order to exclude from the mask zones in which points of interest are recursively tracked. Thus, as represented by the use of a dashed line, step 460 is only carried out if there are validated points of interest. As indicated earlier, this step consists in eliminating zones, for example disks of a predetermined diameter, from the created mask around points of interest validated beforehand.
  • Points of interest are then searched for in the region of the preceding image corresponding to the mask of interest so defined (step 435), the mask of interest here being the mask of interest created at step 430 or the mask of interest created at step 430 and modified during step 460.
  • The search for points of interest is, for example, limited to the detection of twenty points of interest. Naturally, this number may be different and may be estimated according to the size of the mask of interest.
  • This search is advantageously carried out with the algorithm known by the name FAST. According to this algorithm, a Bresenham circle, for example with a perimeter of 16 pixels, is constructed around each pixel of the image. If k contiguous pixels (k typically having a value of 9, 10, 11 or 12) contained in that circle all have an intensity greater than that of the central pixel, or all have an intensity lower than that of the central pixel, that central pixel is considered as a point of interest. It is also possible to identify points of interest with an approach based on image gradients, as provided in the approach known by the name of Harris corner detection.
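  • By way of illustration, the detection of points of interest restricted to the mask of interest might be sketched as follows with the FAST detector available in OpenCV; the limit of twenty points follows the preceding paragraph, whereas the FAST threshold value is an assumption.

```python
import cv2

fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)

def detect_points(prev_roi_gray, mask, max_points=20):
    """Detect FAST points of interest only inside the mask of interest."""
    keypoints = fast.detect(prev_roi_gray, mask)
    # Keep the strongest responses, up to the chosen limit.
    keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:max_points]
    return [k.pt for k in keypoints]
```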
  • The points of interest detected in the preceding image according to the mask of interest as well as, where applicable, the points of interest detected and validated beforehand are used to identify the corresponding points of interest in the current image.
  • A search for corresponding points of interest in the current image is thus carried out (step 440), preferably using a method known under the name of optical flow. The use of this technique gives better robustness when the image is blurred, in particular thanks to the use of pyramids of images smoothed by a Gaussian filter. This is for example the approach implemented by Lucas, Kanade and Tomasi in the algorithm known under the name KLT.
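  • A corresponding sketch of the pyramidal Lucas-Kanade (KLT) tracking of those points, assuming OpenCV, is given below; the window size, pyramid depth and termination criteria are illustrative assumptions.

```python
import cv2
import numpy as np

def track_points(prev_gray, curr_gray, prev_points):
    """Pyramidal Lucas-Kanade tracking of the points of interest of the
    preceding image into the current image."""
    p0 = np.float32(prev_points).reshape(-1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, p0, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    matched = [(tuple(a.ravel()), tuple(b.ravel()))
               for a, b, ok in zip(p0, p1, status.ravel()) if ok]
    return matched   # list of (position in preceding image, position in current image)
```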
  • When the points of interest of the current image, corresponding to the points of interest of the preceding image (which are determined according to the mask of interest or by recursive tracking), have been identified, movement parameters are estimated for objects tracked in the region of interest of the preceding image relative to the region of interest of the current image (step 445). Such parameters, also termed degrees of freedom, comprise, for example, a parameter of translation along the x-axis, a parameter of translation along the y-axis, a rotation parameter and/or a scale parameter; the transformation grouping together these four parameters and making a set of two-dimensional points pass from one plane to another is termed a similarity. These parameters are, preferably, estimated using the Nonlinear Least Squares Error (NLSE) or Gauss-Newton method. This method is directed to minimizing a re-projection error over the set of the tracked points of interest. In order to improve the estimation of the parameters of the model (position and orientation), it is advantageous, in a specific embodiment, to estimate those parameters separately. Thus, for example, it is relevant to apply the least squares error, in a first phase, in order to estimate only the translation parameters (x, y), these being easier to identify, then, during a second iteration, to compute the parameters of scale change and/or of rotation (possibly less precisely).
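  • As a simplified alternative to the iterative Gauss-Newton estimation described above, the four parameters of the similarity can also be obtained by a direct linear least-squares fit once the rotation and scale are grouped into a = s·cos(θ) and b = s·sin(θ); the following sketch, which assumes NumPy, illustrates that variant rather than the exact method of the embodiment.

```python
import numpy as np

def estimate_similarity(prev_pts, curr_pts, origin):
    """Least-squares estimate of (Tx, Ty, theta, s) for the similarity
    M' = s * R(theta) * (M - O) + T + O, linearised with
    a = s*cos(theta) and b = s*sin(theta)."""
    ox, oy = origin
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(prev_pts, curr_pts):
        dx, dy = x - ox, y - oy
        rows.append([dx, -dy, 1.0, 0.0]); rhs.append(xp - ox)   # X equation
        rows.append([dy,  dx, 0.0, 1.0]); rhs.append(yp - oy)   # Y equation
    (a, b, tx, ty), *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    s = float(np.hypot(a, b))
    theta = float(np.arctan2(b, a))
    return tx, ty, theta, s
```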
  • In a following step, the points of interest of the preceding image for which a match has been found in the current image are preferably analyzed in order to recursively determine valid points of interest relative to the movement estimated in the preceding step. For these purposes, it is verified, for each previously determined point of interest of the preceding image (determined according to the mask of interest or by recursive tracking), whether the movement of the corresponding point of interest of the current image, relative to that point of interest, is in accordance with the identified movement. In the affirmative, the point of interest is considered as valid whereas in the opposite case, it is considered as not valid. A threshold, typically expressed in pixels and having a predetermined value, is advantageously used to authorize a certain margin of error between the theoretical position of the point in the current image (obtained by applying the parameters of step 445) and its real position (obtained by the tracking method of step 440).
  • The valid points of interest, here referenced 455, are considered as belonging to an object whose movement is tracked whereas the non-valid points (also termed outliers) are considered as belonging to the image background or to portions of an object which are not visible in the image.
  • As indicated earlier, the valid points of interest are tracked in the following image and are used to modify the mask of interest created by comparison of a region of interest of the current image with the corresponding region of interest of the following image (step 460), in order to exclude from that mask of pixels in movement between the current and following images the portions situated around those points, as described with reference to FIG. 2 d. This modified mask of interest makes it possible to eliminate portions of images in which points of interest are recursively tracked. The valid points of interest are thus kept for several processing operations on successive images and in particular enable stabilization of the tracking of objects.
  • The new region of interest (or modified region of interest) which is used for processing the current image and the following image is then estimated thanks to the previously estimated degrees of freedom (step 445). For example, if the degrees of freedom are x and y translations, the new position of the region of interest is estimated according to the previous position of the region of interest, using those two items of information. If a change (or changes) of scale is estimated and considered in this step, it is possible, according to the scenario considered, also to modify the size of the new region of interest which is used in the current and following images of the video stream.
  • In parallel, when the different degrees of freedom have been computed, it is possible to estimate a particular interaction according to those parameters (step 470).
  • According to a particular embodiment, the estimation of a change (or changes) of scale is used for detecting the triggering of an action in similar manner to the click of a mouse. Similarly, it is possible to use changes of orientation, particularly those around the viewing axis of the video camera (referred to as roll) in order, for example, to enable the rotation of a virtual element displayed in a scene or to control a button of “potentiometer” type in order, for example, to adjust a volume of sound of an application.
  • This detection of an interaction according to the scale factor, for example to detect an action such as a mouse click, may be implemented in the following manner: the number of images over which the norm of the movement vector (translation) and the scale factor (determined according to corresponding regions of interest) are less than certain predetermined values is counted. Such a number characterizes a stability in the movement of the tracked objects. If the number of images over which the movement is stable exceeds a certain threshold, the system enters a state of standby for the detection of a click. A click is then detected by measuring the average of the absolute differences of the scale factors between current and preceding images, this being performed over a given number of images. If the average thus computed exceeds a certain threshold, the click is validated.
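  • A rough sketch of such a click detector is given below; all the numerical values (number of stable images, observation window, movement and scale tolerances, click threshold) are illustrative assumptions standing in for the predetermined thresholds mentioned above.

```python
class ClickDetector:
    """Scale-based 'click': wait for a run of stable frames, then validate a
    click when the average absolute scale variation over a window of frames
    exceeds a threshold. All numerical values are illustrative."""

    def __init__(self, stable_frames=15, window=10,
                 motion_eps=2.0, scale_eps=0.02, click_threshold=0.08):
        self.stable_frames = stable_frames
        self.window = window
        self.motion_eps = motion_eps
        self.scale_eps = scale_eps
        self.click_threshold = click_threshold
        self.stable_count = 0
        self.scale_history = []
        self.armed = False

    def update(self, translation_norm, scale_factor):
        if not self.armed:
            # Count images over which translation and scale change stay small.
            if translation_norm < self.motion_eps and abs(scale_factor - 1.0) < self.scale_eps:
                self.stable_count += 1
            else:
                self.stable_count = 0
            if self.stable_count >= self.stable_frames:
                self.armed = True            # enter standby for the detection of a click
                self.scale_history = []
            return False
        # Standby state: observe the scale factor over a window of images.
        self.scale_history.append(scale_factor)
        self.scale_history = self.scale_history[-self.window:]
        if len(self.scale_history) < self.window:
            return False
        diffs = [abs(b - a) for a, b in zip(self.scale_history, self.scale_history[1:])]
        clicked = sum(diffs) / len(diffs) > self.click_threshold
        if clicked:
            self.armed = False
            self.stable_count = 0
        return clicked
```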
  • When an object is no longer tracked in a sequence of images (either because it disappears from the image, or because it has been lost), the algorithm preferably returns to the initializing step. Furthermore, loss of tracking leading to the initializing step being re-executed may be identified by measuring the movements of a user. Thus, it may be decided to reinitialize the method when those movements are stable or non-existent for a predetermined period or when a tracked object leaves the field of view of the image sensor.
  • FIG. 5 illustrates more precisely certain aspects of the invention when four parameters characterize a movement of an object tracked in consecutive (or close) images of a sequence of images. These four parameters here are a translation denoted (Tx, Ty), a rotation denoted θ around the optical axis of the image sensor and a scale factor denoted s. These four parameters represent a similarity, which is the transformation enabling a point M of a plane to be transformed into a point M′.
  • As illustrated in FIG. 5, O represents the origin of a frame of reference 505 for the object in the preceding image and O′ represents the origin of a frame of reference 510 of the object in the current image, the frame of reference 510 being obtained in accordance with the object tracking method, the image frame of reference here bearing the reference 500. It is then possible to express the transformation of the point M to the point M′ by the following system of non-linear equations:

  • X M′ =s·(X M −X O)·cos(θ)−s·(Y M −Y O)·sin(θ)+T x +X O

  • Y M′ =s·(X M −X O)·sin(θ)+s·(Y M −Y O)·cos(θ)+T y +Y O
  • where (XM, YM) are the coordinates of the point M expressed in the image frame of reference, (X0, Y0) are the coordinates of the point O in the image frame of reference and (XM′, YM′) are the coordinates of the point M′ in the image frame of reference.
  • The points Ms and M respectively represent the transformation of the point M according to the change in scale s alone and according to the change of scale s combined with the rotation θ.
  • As described earlier, it is possible to use the nonlinear least squares error approach to solve this system by using all the points of interest tracked in step 440 described with reference to FIG. 4.
  • To compute the new position of the object in the current image (step 465 of FIG. 4), it suffices theoretically to apply the estimated translation (Tx,Ty) to the previous position of the object in the following manner:

  • X 0′ =X 0 +T x

  • Y 0′ =Y 0 +T y
  • where (X0′, Y0′) are the coordinates of the point O′ in the image frame of reference.
  • Advantageously, the partial derivatives of each point considered, that is to say the movements associated with each of those points, are weighted according to the associated movement. Thus, the points of interest moving the most have greater importance in the estimation of the parameters, which avoids the points of interest linked to the background disturbing the tracking of objects.
  • It has thus been observed that it is advantageous to add an influence of the center of gravity of the points of interest tracked in the current image to the preceding equation. This center of gravity approximately corresponds to the local center of gravity of the movement (the points tracked in the current image come from moving points in the preceding image). The center of the region of interest thus tends to translate to the center of the movement so long as the distance of the object to the center of gravity is greater than the estimated translation movement. The origin of the frame of reference in the current image, characterizing the movement of the tracked object, is advantageously computed according to the following relationship:

  • X O′ =X O +W GC·(X GC −X O)+W T ·T x

  • Y O′ =Y O +W GC·(Y GC −Y O)+W T ·T y
  • where (XGC, YGC) represent the center of gravity of the points of interest in the current image and WGC represents the weight on the influence of the current center of gravity and WT the weight on the influence of the translation. The parameter WGC is positively correlated here with the velocity of movement of the tracked object whereas the parameter WT may be fixed depending on the desired influence of the translation.
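  • The update of the origin of the region of interest according to this relationship may be sketched as follows; the function name and the default weight on the translation are assumptions.

```python
def update_roi_origin(origin, translation, gravity_center, w_gc, w_t=1.0):
    """New origin O' combining the estimated translation with the attraction
    towards the center of gravity of the tracked points of interest,
    following the relationship given above."""
    xo, yo = origin
    tx, ty = translation
    xgc, ygc = gravity_center
    xo_new = xo + w_gc * (xgc - xo) + w_t * tx
    yo_new = yo + w_gc * (ygc - yo) + w_t * ty
    return xo_new, yo_new
```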
  • FIG. 6, comprising FIGS. 6 a, 6 b and 6 c, illustrates an example of implementation of the invention in the context of a driving simulation game in which two regions of interest enable the tracking of a user's hands in real time, characterizing a vehicle steering wheel movement, in a sequence of images.
  • More specifically, FIG. 6 a is a pictorial presentation of the context of the game, whereas FIG. 6 b represents the display of the game as perceived by a user. FIG. 6 c illustrates the estimation of the movement parameters, or degrees of freedom, of the tracked objects to deduce therefrom a movement of a vehicle steering wheel.
  • FIG. 6 a comprises an image 600 extracted from the sequence of images provided by the image sensor used. The latter is placed facing the user, as if it were fastened to the windshield of the vehicle driven by the user. This image 600 here contains a zone 605 comprising two circular regions of interest 610 and 615 associated with a steering wheel 620 drawn in overlay by computer graphics. The image 600 also comprises elements of the real scene in which the user is situated.
  • The initial position of the regions 610 and 615 is fixed on a predetermined horizontal line, at equal distances on respective opposite sides of a point representing the center of the steering wheel, while awaiting a disturbance. When the user positions his hands in these two regions, he is able to turn the steering wheel either to the left, or to the right. The movement of the regions 610 and 615 is here constrained by the radius of the circle corresponding to the steering wheel 620. The image representing the steering wheel moves with the hands of the user, for example according to the average movement of both hands.
  • The radius of the circle corresponding to the steering wheel 620 may also vary when the user moves his hands towards or away from the center of that circle.
  • These two degrees of freedom are next advantageously used to control the orientation of a vehicle (position of the hands on the circle corresponding to the steering wheel 620) and its velocity (scale factor linked to the position of the hands relative to the center of the circle corresponding to the steering wheel 620).
  • FIG. 6 b, illustrating the display 625 of the application, comprises the image portion 605 extracted from the image 600. This display enables the user to observe and control his movements. The image portion 605 may advantageously be represented as a car rear-view mirror in which the driver may observe his actions.
  • The regions 610 and 615 of the image 600 enable the movements of the steering wheel 620 to be controlled, that is to say to control the direction of the vehicle referenced 630 on the display 625 as well as its velocity relative to the elements 635 of the decor, the vehicle 630 and the elements 635 of the decor being created here by computer graphics. In accordance with the standard driving applications, the vehicle may move in the decor and hit certain elements.
  • FIG. 6 c describes more precisely the estimation of the degrees of freedom linked to each of the regions of interest, from which the degrees of freedom of the steering wheel are deduced. In this implementation, the parameters to estimate are the orientation θ of the steering wheel and its diameter D.
  • In order to analyze the components of the movement, several frames of reference are defined. The frame of reference Ow here corresponds to an overall frame of reference (“world” frame of reference), the frame of reference Owh is a local frame of reference linked to the steering wheel 620 and the frames of reference Oa1 and Oa2 are two local frames of reference linked to the regions of interest 610 and 615. The vectors Va1(Xva1, Yva1) and Va2(Xva2, Yva2) are the movement vectors resulting from the analysis of the movement of the user's hands in the regions of interest 610 and 615, expressed in the frames of reference Oa1 and Oa2, respectively.
  • The new orientation θ′ of the steering wheel is computed relative to its previous orientation θ and on the basis of the movement of the user's hands (determined via the two regions of interest 610 and 615). The movement of the steering wheel is thus a constrained movement linked to the movement of several regions of interest. The new orientation θ′ may be computed in the following manner:

  • θ′=θ+((Δθ1+Δθ2)/2)
  • where Δθ1 and Δθ2 represent the rotations of the user's hands.
  • Δθ1 may be computed by the following relationship:

  • Δθ1=atan2(Yva1wh, D/2)
  • with Yva1 wh=Xva1*sin(−(θ+π))+Yva1*cos(−(θ+π)) characterizing the translation along the y-axis in the frame of reference Owh.
  • Δθ2 may be computed in similar manner.
  • Similarly, the new diameter D′ of the steering wheel is computed on the basis of its previous diameter D and on the basis of the movement of the user's hands (determined via the two regions of interest 610 and 615). It may be computed in the following manner:

  • D′=D+((Xva1wh+Xva2wh)/2)
  • with Xva1 wh=Xva1*cos(−(θ+π))−Yva1*sin(−(θ+π)) and
  • Xva2 wh=Xva2*cos(−θ)−Yva2*sin(−θ)
  • Thus, knowing the angular position of the steering wheel and its diameter, the game scenario may in particular compute a corresponding computer graphics image.
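  • The relationships above may be combined, for example, in the following sketch which updates the orientation and the diameter of the steering wheel from the two hand movement vectors; the helper names are assumptions, and Δθ2 is computed here with the same form as Δθ1, as suggested above.

```python
import math

def update_wheel(theta, diameter, va1, va2):
    """Combine the hand movement vectors Va1 and Va2 (expressed in the local
    frames Oa1 and Oa2) into the new wheel orientation and diameter,
    following the relationships given above."""
    def to_wheel_frame(vx, vy, angle):
        # Rotate a local movement vector into the steering wheel frame Owh.
        return (vx * math.cos(angle) - vy * math.sin(angle),
                vx * math.sin(angle) + vy * math.cos(angle))

    x1, y1 = to_wheel_frame(va1[0], va1[1], -(theta + math.pi))
    x2, y2 = to_wheel_frame(va2[0], va2[1], -theta)

    d_theta1 = math.atan2(y1, diameter / 2.0)
    d_theta2 = math.atan2(y2, diameter / 2.0)

    new_theta = theta + (d_theta1 + d_theta2) / 2.0
    new_diameter = diameter + (x1 + x2) / 2.0
    return new_theta, new_diameter
```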
  • FIG. 7 illustrates an example of a device which may be used to identify the movements of objects represented in images provided by a video camera and to trigger particular actions according to identified movements. The device 700 is for example a mobile telephone of smartphone type, a personal digital assistant, a micro-computer or a workstation.
  • The device 700 preferably comprises a communication bus 702 to which are connected:
      • a central processing unit or microprocessor 704 (CPU);
      • a read only memory 706 (ROM) able to include the operating system and programs such as “Prog”;
      • a random access memory or cache memory (RAM) 708, comprising registers adapted to record variables and parameters created and modified during the execution of the aforementioned programs;
      • a video acquisition card 710 connected to a video camera 712; and
      • a graphics card 714 connected to a screen or a projector 716.
  • Optionally, the device 700 may also have the following items:
      • a hard disk 720 able to contain the aforesaid programs “Prog” and data processed or to be processed according to the invention;
      • a keyboard 722 and a mouse 724 or any other pointing device such as an optical stylus, a touch screen or a remote control enabling the user to interact with the programs according to the invention, in particular during the phases of installation and/or initialization;
      • a communication interface 726 connected to a distributed communication network 728, for example the Internet, the interface being able to transmit and receive data; and,
      • a reader for memory cards (not shown) adapted to read or write thereon data processed or to be processed according to the invention.
  • The communication bus allows communication and interoperability between the different elements included in the device 700 or connected to it. The representation of the bus is non-limiting and, in particular, the central processing unit may communicate instructions to any element of the device 700 directly or by means of another element of the device 700.
  • The executable code of each program enabling the programmable apparatus to implement the processes according to the invention may be stored, for example, on the hard disk 720 or in read only memory 706.
  • According to a variant, the executable code of the programs can be received by the intermediary of the communication network 728, via the interface 726, in order to be stored in an identical fashion to that described previously.
  • More generally, the program or programs may be loaded into one of the storage means of the device 700 before being executed.
  • The central processing unit 704 will control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, these instructions being stored on the hard disk 720 or in the read-only memory 706 or in the other aforementioned storage elements. On powering up, the program or programs which are stored in a non-volatile memory, for example the hard disk 720 or the read only memory 706, are transferred into the random-access memory 708, which then contains the executable code of the program or programs according to the invention, as well as registers for storing the variables and parameters necessary for implementation of the invention.
  • It should be noted that the communication apparatus comprising the device according to the invention can also be a programmed apparatus. This apparatus then contains the code of the computer program or programs for example fixed in an application specific integrated circuit (ASIC).
  • Naturally, to satisfy specific needs, a person skilled in the art will be able to make amendments to the preceding description.

Claims (17)

1. A computer implemented method of detecting movement of at least one object situated in a field of an image sensor, the image sensor providing a stream of images to the computer, the method comprising:
receiving at least one first image from the image sensor;
identifying at least one first region of interest in the first image, wherein the at least one first region of interest corresponds to a part of the at least one first image;
receiving at least one second image from the image sensor;
identifying at least one second region of interest in the at least one second image, wherein the at least one second region of interest corresponds to the at least one first region of interest of the at least one first image;
comparing the at least one first and second regions of interest and determining a mask of interest characterizing a variation of at least one feature of corresponding points in the at least one first and second regions of interest;
determining a movement of the at least one object from the mask of interest, wherein the at least one object is at least partially represented in at least one of the at least one first and second regions of interest;
analyzing the movement; and
determining whether to trigger an action.
2. The method according to claim 1, wherein determining the movement comprises determining and matching at least one pair of points of interest in the at least one first and second images, wherein at least one point of the at least one pair of points of interest belong to the mask of interest.
3. The method according to claim 2, wherein determining the movement comprises determining and matching a plurality of pairs of points of interest in the at least one first and second images, wherein at least one point of each of the pairs of points of interest belong to the mask of interest, wherein the movement is estimated on the basis of a transformation of a first set of points of interest into a second set of points of interest, wherein the points of interest of the first and second sets belong to the plurality of pairs of points of interest, wherein the points of interest of the first set of points of interest additionally belong to at least one first image, and wherein the points of interest of the second set of points of interest additionally belong to at least one second image.
4. The method according to claim 3, wherein the transformation implements a weighting function based on a distance between two points of interest from the same pairs of points of interest of the plurality of pairs of points of interest.
5. The method according to claim 3, further comprising validating at least one point of interest of the at least one first image, belonging to the at least one pair of points of interest, according to the determined movement, wherein the at least one validated point of interest is used to track the object in at least one third image following the at least one second image and the at least one validated point of interest is used for modifying a mask of interest created on the basis of the at least one second and third images.
6. The method according to claim 1, wherein comparing the at least one first and second regions of interest comprises performing subtraction, point by point, of values of corresponding points of the at least one first and second regions of interest and comparing a result of the subtraction to a threshold.
7. The method according to claim 1, further comprising detecting at least one feature in the at least one first image, wherein the at least one first region of interest is at least partially identified in response to the detecting.
8. The method according to claim 7, wherein the at least one feature includes at least one of a shape and a color.
9. The method according to claim 1, further comprising estimating at least one modified second region of interest in the at least one second image, wherein the at least one modified second region of interest of the at least one second image is estimated according to the at least one first region of interest of the at least one first image and of the at least one second region of interest of the at least one second image.
10. The method according to claim 9, wherein the estimating comprises performing an object tracking algorithm of KLT type.
11. The method according to claim 1, wherein the movement comprises at least one of a translation, a rotation, a scale factor.
12. The method according to claim 11, wherein the movement comprises a scale factor and wherein whether the action is triggered is determined based at least in part on the scale factor.
13. The method according to claim 1, wherein movements of at least two objects situated in the field of the image sensor are determined, and wherein whether the action is triggered is determined based at least in part on a combination of the movements associated with the at least two objects.
14. (canceled)
15. (canceled)
16. A non-transitory computer readable medium having instructions, which, when executed cause the computer to perform the method of claim 1.
17. A device configured to perform the method of claim 1.

