WO2016183380A1 - Facial signature methods, systems and software - Google Patents

Facial signature methods, systems and software

Info

Publication number
WO2016183380A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
camera
image
disparity
images
Prior art date
Application number
PCT/US2016/032213
Other languages
French (fr)
Inventor
James A. MCCOMBE
Rolf Herken
Brian W. Smith
Original Assignee
Mine One Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/US2016/023433 external-priority patent/WO2016154123A2/en
Application filed by Mine One Gmbh filed Critical Mine One Gmbh
Priority to US15/573,475 priority Critical patent/US10853625B2/en
Priority to EP16793565.9A priority patent/EP3295372A4/en
Publication of WO2016183380A1 publication Critical patent/WO2016183380A1/en
Priority to US17/107,413 priority patent/US20210192188A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/758: Involving statistics of pixels or of feature values, e.g. histogram matching

Definitions

  • the present invention relates generally to methods, systems and computer program products ("software") for enabling a virtual three-dimensional visual experience (referred to herein as "V3D") in videoconferencing and other applications; for capturing, processing and displaying images and image streams; and for generating a facial signature, based on images of a given human user's or subject's face, for enabling accurate, reliable identification or authentication of a human user or subject in a secure, difficult-to-forge manner.
  • V3D virtual three-dimensional visual experience
  • V3D: virtual 3D experience
  • (10) generate a facial signature, based on images of a given human user's or subject's face, or face and head, for enabling accurate, reliable identification, authentication or matching of a human user or subject, in a secure, difficult-to-forge manner.
  • the present invention provides methods, systems, devices and computer software/program code products that enable the foregoing aspects and others.
  • Some embodiments and practices of the invention are collectively referred to herein as V3D.
  • Certain other embodiments and practices of the invention are collectively referred to as Facial Signature aspects of the invention.
  • Facial Signature aspects of the invention may utilize certain V3D aspects of the invention.
  • the present invention provides methods, systems, devices, and computer software/program code products for, among other aspects and possible applications, facilitating video communications and presentation of image and video content, and generating image input streams for a control system of autonomous vehicles; and for generating a facial signature, based on images of a human user's or subject's face, for enabling accurate, reliable identification or authentication of a human user or subject, in a secure, difficult-to-forge manner.
  • Methods, systems, devices, and computer software/program code products in accordance with the invention are suitable for implementation or execution in, or in conjunction with, commercially available computer graphics processor configurations and systems including one or more display screens for displaying images, cameras for capturing images, and graphics processors for rendering images for storage or for display, such as on a display screen, and for processing data values for pixels in an image representation.
  • the cameras, graphics processors and display screens can be of a form provided in commercially available smartphones, tablets and other mobile telecommunications devices, as well as in commercially available laptop and desktop computers, which may communicate using commercially available network architectures including client/server and client network/cloud architectures.
  • digital processors, which can include graphics processor units, including GPGPUs such as those commercially available on cellphones, smartphones, tablets and other commercially available telecommunications and computing devices, as well as in digital display devices and digital cameras.
  • GPGPUs: graphics processor units
  • Those skilled in the art to which this invention pertains will understand the structure and operation of digital processors, GPGPUs and similar digital graphics processor units.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable a first user to view a second user with direct virtual eye contact with the second user.
  • This aspect of the invention comprises capturing images of the second user, utilizing at least one camera having a view of the second user's face; generating a data representation, representative of the captured images; reconstructing a synthetic view of the second user, based on the representation; and displaying the synthetic view to the first user on a display screen used by the first user; the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye-contact gaze vector to the second user.
  • Another aspect includes executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; wherein the data representation is representative of the captured images and the corresponding disparity values; the capturing, detecting, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen.
  • the capturing includes utilizing at least two cameras, each having a view of the second user's face; and executing a feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
  • the capturing comprises utilizing at least one camera having a view of the second user's face, and which is an infrared time-of-flight camera that directly provides depth information.
  • the data representation is representative of the captured images and corresponding depth information.
  • the capturing includes utilizing a single camera having a view of the second user's face; and executing a feature correspondence function comprises detecting common features between sequential images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
  • the captured images of the second user comprise visual information of the scene surrounding the second user; and the capturing, detecting, generating, reconstructing and displaying are executed such that: (a) the first user is provided the visual impression of looking through his display screen as a physical window to the second user and the visual scene surrounding the second user, and (b) the first user is provided an immersive visual experience of the second user and the scene surrounding the second user.
  • Another practice of the invention includes executing image rectification to compensate for optical distortion of each camera and relative misalignment of the cameras.
  • executing image rectification comprises applying a 2D image space transform; and applying a 2D image space transform comprises utilizing a GPGPU processor running a shader program.
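  • For illustration, the following is a minimal Python sketch of rectification expressed as a precomputed 2D image-space transform, using OpenCV's remap on the CPU as a stand-in for the GPGPU shader pass described above. The camera matrix, distortion coefficients and rectification rotation shown are hypothetical placeholders, not values from the patent.

```python
# Sketch: per-camera rectification as a 2D image-space transform (CPU stand-in
# for the GPGPU shader pass described above). Camera parameters are hypothetical.
import cv2
import numpy as np

def build_rectify_maps(K, dist, R, K_new, size):
    """Precompute the 2D remapping that undistorts and rectifies one camera."""
    return cv2.initUndistortRectifyMap(K, dist, R, K_new, size, cv2.CV_32FC1)

def rectify(image, maps):
    """Apply the precomputed 2D transform to a captured frame."""
    map_x, map_y = maps
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)

if __name__ == "__main__":
    h, w = 480, 640
    K = np.array([[500.0, 0, w / 2], [0, 500.0, h / 2], [0, 0, 1]])
    dist = np.array([-0.1, 0.01, 0, 0, 0])   # assumed polynomial lens distortion
    R = np.eye(3)                            # assumed stereo rectification rotation
    maps = build_rectify_maps(K, dist, R, K, (w, h))
    frame = np.zeros((h, w, 3), np.uint8)
    rectified = rectify(frame, maps)
```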
  • the cameras for capturing images of the second user are located at or near the periphery or edges of a display device used by the second user, the display device used by the second user having a display screen viewable by the second user and having a geometric center; and the synthetic view of the second user corresponds to a selected virtual camera location, the selected virtual camera location corresponding to a point at or proximate to the geometric center.
  • the cameras for capturing images of the second user are located at a selected position outside the periphery or edges of a display device used by the second user.
  • respective camera view vectors are directed in non-coplanar orientations.
  • the cameras for capturing images of the second user, or of a remote scene, are located in selected positions and positioned with selected orientations around the second user or the remote scene.
  • Another aspect includes estimating a location of the first user's head or eyes, thereby generating tracking information; and the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information.
  • camera shake effects are inherently eliminated, in that the capturing, detecting, generating, reconstructing and displaying are executed such that the first user has a virtual direct view through his display screen to the second user and the visual scene surrounding the second user; and scale and perspective of the image of the second user and objects in the visual scene surrounding the second user are accurately represented to the first user regardless of user view distance and angle.
  • This aspect of the invention provides, on the user's display screen, the visual impression of a frame without glass: a window into a 3D scene of the second user and the scene surrounding the second user.
  • the invention is adapted for implementation on a mobile telephone device.
  • the cameras for capturing images of the second user are located at or near the periphery or edges of a mobile telephone device used by the second user.
  • the invention is adapted for implementation on a laptop or desktop computer, and the cameras for capturing images of the second user are located at or near the periphery or edges of a display device of a laptop or desktop computer used by the second user.
  • the invention is adapted for implementation on computing or telecommunications devices comprising any of tablet computing devices, computer-driven television displays or computer-driven image projection devices, and wherein the cameras for capturing images of the second user are located at or near the periphery or edges of a computing or telecommunications device used by the second user.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable a user to view a remote scene in a manner that gives the user a visual impression of being present at the remote scene.
  • This aspect of the invention includes capturing images of the remote scene, utilizing at least two cameras each having a view of the remote scene; executing a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values; generating a data representation, representative of the captured images and the corresponding disparity values; reconstructing a synthetic view of the remote scene, based on the representation; and displaying the synthetic view to the user on a display screen.
  • the capturing, detecting, generating, reconstructing and displaying being executed such that: (a) the user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and (b) the user is provided an immersive visual experience of the remote scene.
  • the capturing of images includes using at least one color camera. In another practice of the invention, the capturing includes using at least one infrared structured light emitter.
  • the capturing comprises utilizing a view-vector-rotated camera configuration wherein the locations of first and second cameras define a line; and the line defined by the first and second camera locations is rotated by a selected amount from a selected horizontal or vertical axis; thereby increasing the number of valid feature correspondences identified in typical real-world settings by the feature correspondence function.
  • the first and second cameras are positioned relative to each other along epipolar lines.
  • disparity values are rotated back to a selected horizontal or vertical orientation along with the captured images.
  • the synthetic view is rotated back to a selected horizontal or vertical orientation.
  • the capturing comprises exposure cycling, comprising dynamically adjusting the exposure of the cameras on a frame-by-frame basis to improve disparity estimation in regions outside the exposed region viewed by the user; wherein a series of exposures are taken, including exposures lighter than and exposures darker than a visibility-optimal exposure, disparity values are calculated for each exposure, and the disparity values are integrated into an overall disparity solution over time, so as to improve disparity estimation.
  • the exposure cycling comprises dynamically adjusting the exposure of the cameras on a frame-by-frame basis to improve disparity estimation in regions outside the exposed region viewed by the user; wherein a series of exposures are taken, including exposures lighter than and exposures darker than a visibility-optimal exposure, disparity values are calculated for each exposure, and the disparity values are integrated in a disparity histogram, the disparity histogram being converged over time, so as to improve disparity estimation.
  • a further aspect of the invention comprises analyzing the quality of the overall disparity solution on respective dark, mid-range and light pixels to generate variance information used to control the exposure settings of the cameras, thereby to form a closed loop between the quality of the disparity estimate and the set of exposures requested from the cameras.
  • Another aspect includes analyzing variance of the disparity histograms on respective dark, mid-range and light pixels to generate variance information used to control the exposure settings of the cameras, thereby to form a closed loop between the quality of the disparity estimate and the set of exposures requested from the cameras.
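  • A minimal Python sketch of such a closed exposure loop follows: disparity confidence is tallied separately for dark, mid-range and light pixels, and the worst-sampled band drives the next requested exposure. The band thresholds, EV offsets and the confidence measure used here are illustrative assumptions, not values from the patent.

```python
# Sketch: closed-loop exposure cycling. Disparity confidence is summarized per
# luminance band, and the band with the poorest statistics selects the next
# requested exposure. Thresholds and EV offsets are illustrative assumptions.
import numpy as np

EXPOSURES = {"dark": -2.0, "mid": 0.0, "light": +2.0}   # EV offsets (assumed)

def band_masks(gray):
    return {
        "dark":  gray < 64,
        "mid":   (gray >= 64) & (gray < 192),
        "light": gray >= 192,
    }

def next_exposure(gray, disparity_confidence):
    """Pick the next exposure so the worst-sampled luminance band is re-exposed."""
    worst_band, worst_score = "mid", np.inf
    for band, mask in band_masks(gray).items():
        if not mask.any():
            continue
        score = disparity_confidence[mask].mean()   # low score = poorly sampled
        if score < worst_score:
            worst_band, worst_score = band, score
    return EXPOSURES[worst_band]

# Usage: confidence could be, e.g., the inverse interquartile width of each
# pixel's disparity histogram, accumulated over the last few frames.
gray = (np.random.rand(120, 160) * 255).astype(np.uint8)
confidence = np.random.rand(120, 160)
print("request EV offset:", next_exposure(gray, confidence))
```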
  • the feature correspondence function includes evaluating and combining vertical- and horizontal-axis correspondence information.
  • the feature correspondence function further comprises applying, to image pixels containing a disparity solution, a coordinate transformation to a unified coordinate system.
  • the unified coordinate system can be the unrectified coordinate system of the captured images.
  • Another aspect of the invention includes using at least three cameras arranged in a substantially "L"-shaped configuration, such that a pair of cameras is presented along a first axis and a second pair of cameras is presented along a second axis substantially perpendicular to the first axis.
  • the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence.
  • the refining can include retaining a disparity solution over a time interval, and continuing to integrate disparity solution values for each image frame over the time interval, so as to converge on an improved disparity solution by sampling over time.
  • the feature correspondence function comprises filling unknowns in a correspondence information set with historical data obtained from previously captured images.
  • the filling of unknowns can include the following: if a given image feature is detected in an image captured by one of the cameras, and no corresponding image feature is found in a corresponding image captured by another of the cameras, then utilizing data for a pixel corresponding to the given image feature, from a corresponding, previously captured image.
  • the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence.
  • This aspect of the invention can include constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel.
  • the disparity histogram functions as a Probability Density Function (PDF) of disparity for a given pixel.
  • PDF: Probability Density Function
  • one axis of the disparity histogram indicates a given disparity range
  • a second axis of the histogram indicates the number of pixels in a kernel surrounding the central pixel in question that are voting for the given disparity range.
  • votes indicated by the disparity histogram are initially generated utilizing a Sum of Square Differences (SSD) method, which can comprise executing an SSD method with a relatively small kernel to produce a fast dense disparity map in which each pixel has a selected disparity that represents the lowest error;
  • SSD: Sum of Square Differences
  • processing a plurality of pixels to accumulate into the disparity histogram a tally of the number of votes for a given disparity in a relatively larger kernel surrounding the pixel in question.
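  • The following Python sketch illustrates this two-stage idea under simplifying assumptions: a brute-force small-kernel SSD pass picks a lowest-error disparity per pixel, and each pixel then accumulates a histogram of the votes cast by a larger surrounding kernel. Window sizes, the disparity range and the wrap-around border handling are illustrative simplifications, not the patent's implementation.

```python
# Sketch: small-kernel SSD produces a fast dense disparity map, then each
# pixel's histogram tallies the winning disparities of a larger surrounding
# kernel. Border handling (np.roll wrap-around) is a simplification.
import numpy as np

def ssd_disparity(left, right, max_d, k=1):
    """Per-pixel disparity with the lowest SSD over a (2k+1)^2 kernel."""
    h, w = left.shape
    best_err = np.full((h, w), np.inf)
    best_d = np.zeros((h, w), np.int32)
    pad = lambda img: np.pad(img.astype(np.float32), k, mode="edge")
    L = pad(left)
    for d in range(max_d + 1):
        R = pad(np.roll(right, d, axis=1))       # shift right image by candidate d
        diff2 = (L - R) ** 2
        # box-filter the squared difference to get the SSD per center pixel
        ssd = sum(np.roll(np.roll(diff2, dy, 0), dx, 1)
                  for dy in range(-k, k + 1) for dx in range(-k, k + 1))
        ssd = ssd[k:-k, k:-k]
        better = ssd < best_err
        best_err[better] = ssd[better]
        best_d[better] = d
    return best_d

def vote_histograms(disp, max_d, kernel=7):
    """Accumulate, for each pixel, votes from the surrounding kernel."""
    h, w = disp.shape
    r = kernel // 2
    hist = np.zeros((h, w, max_d + 1), np.int32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            idx = np.roll(np.roll(disp, dy, 0), dx, 1)[..., None]
            np.put_along_axis(hist, idx,
                              np.take_along_axis(hist, idx, axis=-1) + 1, axis=-1)
    return hist

left = (np.random.rand(32, 48) * 255).astype(np.float32)
right = np.roll(left, 3, axis=1)                 # synthetic shift of 3 pixels
hist = vote_histograms(ssd_disparity(left, right, max_d=8), max_d=8)
```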
  • Another aspect of the invention includes transforming the disparity histogram into a Cumulative Distribution Function (CDF) from which the width of a corresponding interquartile range can be determined, thereby to establish a confidence level in the corresponding disparity solution.
  • CDF: Cumulative Distribution Function
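  • A minimal Python sketch of the confidence measure described above: one pixel's disparity histogram is converted to a CDF and the width of the interquartile range is mapped to a confidence value (a narrow range indicating high confidence). The normalization used is an assumption for illustration.

```python
# Sketch: turn one pixel's disparity histogram into a CDF and use the width
# of the interquartile range as a confidence measure (narrow = confident).
import numpy as np

def iqr_confidence(hist):
    """Return (iqr_width_in_bins, confidence in [0, 1]) for one histogram."""
    total = hist.sum()
    if total == 0:
        return len(hist), 0.0
    cdf = np.cumsum(hist) / total
    q1 = np.searchsorted(cdf, 0.25)
    q3 = np.searchsorted(cdf, 0.75)
    width = max(q3 - q1, 0) + 1
    return width, 1.0 - (width - 1) / (len(hist) - 1)

# A unimodal histogram is confident, a flat (ambiguous) one is not.
print(iqr_confidence(np.array([0, 1, 12, 3, 0, 0, 0, 0])))   # width 1, conf 1.0
print(iqr_confidence(np.ones(8, dtype=int)))                  # wide, low conf
```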
  • a further aspect includes maintaining a count of the number of statistically significant modes in the histogram, thereby to indicate modality. In accordance with the invention, modality can be used as an input to the above-described reconstruction aspect, to control application of a stretch vs. slide reconstruction method.
  • Still another aspect of the invention includes maintaining a disparity histogram over a selected time interval and accumulating samples into the histogram, thereby to compensate for camera noise or other sources of motion or error.
  • Another aspect includes generating fast disparity estimates for multiple independent axes; and then combining the corresponding, respective disparity histograms to produce a statistically more robust disparity solution.
  • Another aspect includes evaluating the interquartile range of a CDF of a given disparity histogram to produce an interquartile result; and if the interquartile result is indicative of an area of poor sampling signal-to-noise ratio, due to camera over- or underexposure, then controlling camera exposure based on the interquartile result to improve a poorly sampled area of a given disparity histogram.
  • Yet another practice of the invention includes testing for only a small set of disparity values using a small-kernel SSD method to generate initial results; populating a corresponding disparity histogram with the initial results; and then using histogram votes to drive further SSD testing within a given range to improve disparity resolution over time.
  • Another aspect includes extracting sub-pixel disparity information from the disparity histogram, the extracting including the following: where the histogram indicates a maximum-vote disparity range and an adjacent, runner-up disparity range, calculating a weighted average disparity value based on the ratio between the number of votes for each of the adjacent disparity ranges.
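  • A short Python sketch of that sub-pixel extraction: the winning bin and the larger of its two neighbors are combined as a vote-weighted average. The tie-breaking at the histogram edges is an illustrative assumption.

```python
# Sketch: sub-pixel disparity from a per-pixel histogram, as a vote-weighted
# average of the maximum-vote bin and the adjacent runner-up bin.
import numpy as np

def subpixel_disparity(hist):
    best = int(np.argmax(hist))
    # pick the larger of the two adjacent bins as the runner-up
    left = hist[best - 1] if best > 0 else -1
    right = hist[best + 1] if best < len(hist) - 1 else -1
    runner = best - 1 if left >= right else best + 1
    v_best, v_runner = hist[best], max(hist[runner], 0)
    if v_best + v_runner == 0:
        return float(best)
    # the weighted average pulls the solution toward the runner-up bin
    return (best * v_best + runner * v_runner) / (v_best + v_runner)

print(subpixel_disparity(np.array([0, 2, 10, 6, 0])))   # 2.375, between bins 2 and 3
```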
  • the feature correspondence function comprises weighting toward a center pixel in a Sum of Squared Differences (SSD) approach, wherein the weighting includes applying a higher weight to the center pixel for which a disparity solution is sought, and a lesser weight outside the center pixel, the lesser weight being proportional to the distance of a given kernel sample from the center.
  • the feature correspondence function comprises optimizing generation of disparity values on GPGPU computing structures.
  • GPGPU computing structures are commercially available, and are contained in commercially available forms of smartphones and tablet computers.
  • generating a data representation includes generating a data structure representing 2D coordinates of a control point in image space, and containing a disparity value treated as a pixel velocity in screen space with respect to a given movement of a given view vector; and using the disparity value in combination with a movement vector to slide a pixel in a given source image in selected directions, in 2D, to enable a reconstruction of 3D image movement.
  • each camera generates a respective camera stream; and the data structure representing 2D coordinates of a control point further contains a sample buffer index, stored in association with the control point coordinates, which indicates which camera stream to sample in association with the given control point.
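  • A minimal Python sketch of such a control-point record and its slide operation follows; the field names and the unit of the movement vector are illustrative assumptions, not the patent's data format.

```python
# Sketch: a control-point record holding 2D image coordinates, a disparity
# value treated as a screen-space pixel velocity, and the index of the camera
# stream to sample. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ControlPoint:
    x: float                 # 2D coordinates in image space
    y: float
    disparity: float         # pixels of slide per unit of view-vector movement
    sample_buffer: int       # which camera stream to sample for this point

    def slide(self, move: Tuple[float, float]) -> Tuple[float, float]:
        """Slide the point along the view movement vector, scaled by disparity."""
        mx, my = move
        return self.x + mx * self.disparity, self.y + my * self.disparity

# Usage: the same view movement slides near points farther than far ones.
near = ControlPoint(x=100, y=80, disparity=12.0, sample_buffer=0)
far = ControlPoint(x=100, y=80, disparity=1.5, sample_buffer=0)
print(near.slide((0.5, 0.0)), far.slide((0.5, 0.0)))
```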
  • Another aspect includes determining whether a given pixel should be assigned a control point.
  • a related practice of the invention includes assigning control points along image edges, wherein assigning control points along image edges comprises executing computations enabling identification of image edges.
  • generating a data representation includes flagging a given image feature with a reference count indicating how many samples reference the given image feature, thereby to differentiate a uniquely referenced image feature, and a sample corresponding to the uniquely referenced image feature, from repeatedly referenced image features; and, using the reference count, extracting unique samples, so as to enable a reduction in bandwidth requirements.
  • generating a data representation further includes using the reference count to encode and transmit a given sample exactly once, even if a pixel or image feature corresponding to the sample is repeated in multiple camera views, so as to enable a reduction in bandwidth requirements.
  • Yet another aspect of the invention includes estimating a location of the first user's head or eyes, thereby generating tracking information; wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein 3D image reconstruction is executed by warping a 2D image by utilizing the control points, by sliding a given pixel along a head movement vector at a displacement rate proportional to disparity, based on the tracking information and disparity values.
  • the disparity values are acquired from the feature correspondence function or from a control point data stream.
  • reconstructing a synthetic view comprises utilizing the tracking information to control a 2D crop box, such that the synthetic view is reconstructed based on the view origin, and then cropped and scaled so as to fill the first user's display screen view window; and the minima and maxima of the crop box are defined as a function of the first user's head location with respect to the display screen, and the dimensions of the display screen view window.
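  • For illustration, a minimal Python sketch of one possible head-to-crop-box mapping is given below: the crop center counter-moves against the head offset and its size scales with head distance, with the crop aspect ratio matched to the display window. The gains and the specific mapping are illustrative assumptions, not the patent's formula.

```python
# Sketch: derive a 2D crop box from the tracked head position. The box slides
# opposite to the head offset and scales with head distance, keeping the
# display window's aspect ratio. All gains are illustrative assumptions.
def crop_box(head_xyz, screen_wh, source_wh, parallax_gain=0.5, zoom_gain=0.4):
    hx, hy, hz = head_xyz            # head offset (m) from screen center, hz > 0
    sw, sh = screen_wh               # display view window, pixels
    src_w, src_h = source_wh         # reconstructed synthetic view, pixels
    scale = 1.0 / (1.0 + zoom_gain * hz)          # nearer head -> larger crop
    ch = src_h * scale
    cw = ch * (sw / sh)                           # match the window's aspect ratio
    cx = src_w / 2 - parallax_gain * hx * src_w   # crop center counter-moves
    cy = src_h / 2 + parallax_gain * hy * src_h
    x0 = min(max(cx - cw / 2, 0), src_w - cw)
    y0 = min(max(cy - ch / 2, 0), src_h - ch)
    return x0, y0, cw, ch            # crop, then scale to (sw, sh) for display

print(crop_box(head_xyz=(0.05, 0.0, 0.5), screen_wh=(1920, 1080),
               source_wh=(2560, 1440)))
```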
  • reconstructing a synthetic view comprises executing a 2D warping reconstruction of a selected view based on selected control points, wherein the 2D warping reconstruction includes designating a set of control points, respective control points corresponding to respective, selected pixels in a source image; sliding the control points in selected directions in 2D space, wherein the control points are slid proportionally to respective disparity values; and interpolating data values for pixels between the selected pixels corresponding to the control points; so as to create a synthetic view of the image from a selected new perspective in 3D space.
  • the invention can further include rotating the source image and control point coordinates such that rows or columns of image pixels are parallel to the vector between the original source image center and the new view vector defined by the selected new perspective.
  • a related practice of the invention further includes rotating the source image and control point coordinates so as to align the view vector to image scanlines; iterating through each scanline and each control point for a given scanline, generating a line element beginning and ending at each control point in 2D image space, with the addition of the corresponding disparity value, multiplied by the corresponding view vector magnitude, to the corresponding x-axis coordinate; assigning a texture coordinate to the beginning and ending points of each generated line element, equal to their respective, original 2D location in the source image; and interpolating texture coordinates linearly along each line element; thereby to create a resulting image in which image data between the control points is linearly stretched.
  • the invention can also include rotating the resulting image back by the inverse of the rotation applied to align the view vector with the scanlines.
  • Another practice of the invention includes linking the control points between scanlines, as well as along scanlines, to create polygon elements defined by the control points, across which interpolation is executed.
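  • A simplified Python sketch of the per-scanline stretch described above follows: control points are displaced along x by disparity times the view-vector magnitude, and source pixels between consecutive control points are linearly resampled via interpolated texture coordinates. The rotation that aligns the view vector with scanlines, and the cross-scanline linking, are omitted; the interface is an illustrative assumption.

```python
# Sketch of the per-scanline stretch: control points slide along x by
# disparity * view-vector magnitude, and the image data between consecutive
# control points is linearly resampled (interpolated texture coordinates).
import numpy as np

def warp_scanline(row, ctrl_x, ctrl_disp, view_mag):
    """row: 1D pixel array; ctrl_x: sorted control-point x coords (including 0
    and len(row)-1); ctrl_disp: disparity per control point."""
    out = np.array(row, dtype=np.float32)
    dst_x = np.asarray(ctrl_x, np.float32) + view_mag * np.asarray(ctrl_disp, np.float32)
    for i in range(len(ctrl_x) - 1):
        a, b = dst_x[i], dst_x[i + 1]
        x0, x1 = int(round(min(a, b))), int(round(max(a, b)))
        x0, x1 = max(x0, 0), min(x1, len(row) - 1)
        if x1 <= x0:
            continue
        # texture coordinates run linearly from one control point to the next
        tex = np.linspace(ctrl_x[i], ctrl_x[i + 1], x1 - x0 + 1)
        out[x0:x1 + 1] = np.interp(tex, np.arange(len(row)), row)
    return out

row = np.arange(16, dtype=np.float32)
print(warp_scanline(row, ctrl_x=[0, 8, 15], ctrl_disp=[0.0, 3.0, 0.0], view_mag=1.0))
```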
  • reconstructing a synthetic view further comprises, for a given source image, selectively sliding image foreground and image background independently of each other.
  • sliding is utilized in regions of large disparity or depth change.
  • a determination of whether to utilize sliding includes evaluating a disparity histogram to detect multi-modal behavior indicating that a given control point is on an image boundary for which allowing foreground and background to slide independently of each other presents a better solution than interpolating depth between foreground and background; wherein the disparity histogram functions as a Probability Density Function (PDF) of disparity for a given pixel, in which higher values indicate a higher probability of the corresponding disparity range being valid for the given pixel.
  • PDF: Probability Density Function
  • reconstructing a synthetic view includes using at least one Sample Integration Function Table (SIFT), the SIFT comprising a table of sample integration functions for one or more pixels in a desired output resolution of an image to be displayed to the user, wherein a given sample integration function maps an input view origin vector to at least one known, weighted 2D image sample location in at least one input image buffer.
  • SIFT: Sample Integration Function Table
  • displaying the synthetic view to the first user on a display screen used by the first user includes displaying the synthetic view to the first user on a 2D display screen; and updating the display in real time, based on the tracking information, so that the display appears to the first user to be a window into a 3D scene responsive to the first user's head or eye location.
  • Displaying the synthetic view to the first user on a display screen used by the first user can include displaying the synthetic view to the first user on a binocular stereo display device; or, among other alternatives, on a lenticular display that enables auto-stereoscopic viewing.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that facilitate self-portraiture of a user utilizing a handheld device to take the self-portrait, the handheld mobile device having a display screen for displaying images to the user.
  • This aspect includes providing at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self-portrait; capturing images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen; estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; generating a data representation, representative of the captured images; reconstructing a synthetic view of the user, based on the generated data representation and the generated tracking information; and displaying to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with real-time visual feedback.
  • the capturing, estimating, generating, reconstructing and displaying are executed such that in the self-portrait the user can appear to be looking directly into the camera, even if the camera does not have a direct eye-contact gaze vector to the user.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that facilitate composition of a photograph of a scene by a user utilizing a handheld device to take the photograph, the handheld device having a display screen on a first side for displaying images to the user, and at least one camera on a second, opposite side of the handheld device, for capturing images.
  • This aspect includes capturing images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph; estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; generating a data representation, representative of the captured images; reconstructing a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view have a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and displaying to the user, on the display screen during the setup time, the synthetic view of the scene; thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with real-time visual feedback.
  • the user can control the scale and perspective of the synthetic view by changing the position of the handheld device relative to the position of the user's head.
  • estimating a location of the user's head or eyes relative to the handheld device includes using at least one camera on the first, display side of the handheld device, having a view of the user's head or eyes during photograph setup time.
  • the invention enables the features described herein to be provided in a manner that fits within the form factors of modern mobile devices such as tablets and smartphones, as well as the form factors of laptops, PCs, computer-driven televisions, computer-driven projector devices, and the like; does not dramatically alter the economics of building such devices; and is viable within current or near-current communications network/connectivity architectures.
  • One aspect of the present invention relates to methods, systems and computer software/program code products for displaying images to a user utilizing a binocular stereo head-mounted display (HMD).
  • This aspect includes capturing at least two image streams using at least one camera attached or mounted on or proximate to an external portion or surface of the HMD, the captured image streams containing images of a scene; generating a data representation, representative of captured images contained in the captured image streams; reconstructing two synthetic views, based on the representation; and displaying the synthetic views to the user, via the HMD; the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
  • Another aspect of the present invention relates to methods, systems and computer software/program code products for displaying image content to a user via a binocular stereo head-mounted display (HMD).
  • the image content can include pre-recorded image content, which can be stored, transmitted, broadcast, downloaded, streamed or otherwise made available.
  • This aspect includes capturing or generating at least two image streams using at least one camera, the captured image streams containing images of a scene; generating a data representation, representative of captured images contained in the captured image streams; reconstructing two synthetic views, based on the representation; and displaying the synthetic views to a user, via the HMD, the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
  • the data representation can be pre-recorded, and stored, transmitted, broadcast, downloaded, streamed or otherwise made available.
  • Another aspect of the invention includes tracking the location or position of the user's head or eyes to generate a motion vector usable in the reconstructing of synthetic views.
  • the motion vector can be used to modify the respective view origins, during the reconstructing of synthetic views, so as to produce intermediate image frames to be interposed between captured image frames in the captured image streams; and interposing the intermediate image frames between the captured image frames so as to reduce apparent latency.
  • At least one camera is a panoramic camera, night-vision camera, or thermal imaging camera.
  • One aspect of the invention relates to methods, systems and computer software/program code products for generating an image data stream for use by a control system of an autonomous vehicle.
  • This aspect includes capturing images of a scene around at least a portion of the vehicle, the capturing comprising utilizing at least one camera having a view of the scene; executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; calculating corresponding depth information based on the disparity values; and generating, from the images and corresponding depth information, an image data stream for use by the control system.
  • the capturing can comprise utilizing at least two cameras, each having a view of the scene; and executing a feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
  • the capturing can include using a single camera having a view of the scene; and executing a feature correspondence function comprises detecting common features between sequential images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis; and (2) executing a feature correspondence function, by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising:
  • each level can be assigned a level number, and each successively higher level can be characterized by a lower image resolution.
  • the downsampling can include reducing image resolution via low-pass filtering.
  • the downsampling can include using a weighted summation of a kernel in level [n-1] to produce a pixel value in level [n], and the normalized kernel center position remains the same across all levels.
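  • A minimal Python sketch of such a pyramid construction follows: each level is produced by low-pass filtering the previous level with a normalized separable kernel and decimating by two. The specific kernel and the number of levels are illustrative assumptions.

```python
# Sketch: build the multi-resolution levels by low-pass filtering level [n-1]
# with a normalized kernel and decimating to produce level [n].
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
KERNEL /= KERNEL.sum()                      # normalized; center stays aligned

def downsample(img):
    """One pyramid step: separable low-pass filter, then drop every other sample."""
    blurred = np.apply_along_axis(lambda r: np.convolve(r, KERNEL, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, KERNEL, mode="same"), 0, blurred)
    return blurred[::2, ::2]

def build_levels(img, n_levels=4):
    levels = [img.astype(np.float32)]
    for _ in range(n_levels - 1):
        levels.append(downsample(levels[-1]))
    return levels   # level 0 = full resolution, then successively lower resolutions

levels = build_levels(np.random.rand(64, 64))
print([lv.shape for lv in levels])   # (64, 64), (32, 32), (16, 16), (8, 8)
```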
  • the FDDE votes for every image level are included in the disparity solution.
  • Another aspect of the invention includes generating a multi-level histogram comprising a set of initially independent histograms at different levels of resolution.
  • each histogram bin in a given level represents votes for disparity determined by the FDDE at that level.
  • each histogram bin in a given level has an associated disparity uncertainty range, and the disparity uncertainty range represented by each histogram bin is a selected multiple wider than the disparity uncertainty range of a bin in the preceding level.
  • a further aspect of the invention includes applying a sub-pixel shift to the disparity values at each level during downsampling, to negate rounding error effects. In a related aspect, applying a sub-pixel shift comprises applying a half-pixel shift to only one of the images in a stereo pair at each level of downsampling. In a further aspect, applying a sub-pixel shift is implemented inline, within the weights of the filter kernel utilized to implement the downsampling from level to level.
  • Another aspect of the invention includes executing histogram integration, the histogram integration comprising summing the votes from the histograms at the respective levels into an overall disparity solution.
  • a related aspect includes, during summation, modifying the weighting of each level to control the amplitude of the effect of lower levels in overall voting, by applying selected weighting coefficients to selected levels.
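  • A short Python sketch of this integration step follows: each coarser level's bin spans a multiple of the fine-level disparity range, so its votes are expanded to fine-bin resolution, scaled by a per-level weighting coefficient, and summed. The expansion factor and the weights shown are illustrative assumptions.

```python
# Sketch: integrate a multi-level disparity histogram. Each bin at level n
# spans `factor` times the disparity range of a bin at level n-1, so coarser
# votes are expanded to fine-bin resolution, weighted, and summed.
import numpy as np

def integrate_levels(level_hists, factor=2, level_weights=None):
    """level_hists[0] is the finest level; returns a fine-resolution histogram."""
    n_fine = len(level_hists[0])
    if level_weights is None:
        level_weights = [1.0] * len(level_hists)
    combined = np.zeros(n_fine, dtype=np.float32)
    for lvl, (hist, w) in enumerate(zip(level_hists, level_weights)):
        expanded = np.repeat(np.asarray(hist, np.float32), factor ** lvl)
        n = min(n_fine, len(expanded))
        combined[:n] += w * expanded[:n]     # lower weights damp coarse levels
    return combined

fine   = np.array([0, 1, 6, 2, 0, 0, 0, 0])
mid    = np.array([1, 7, 1, 0])              # each bin covers 2 fine bins
coarse = np.array([5, 1])                    # each bin covers 4 fine bins
print(integrate_levels([fine, mid, coarse], level_weights=[1.0, 0.5, 0.25]))
```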
  • Yet another aspect of the invention includes inferring a sub-pixel disparity solution from the disparity histogram, by calculating a sub-pixel offset based on the number of votes for the maximum-vote disparity range and the number of votes for an adjacent, runner-up disparity range.
  • a summation stack can be maintained in a memory unit.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable capturing of images using at least two stereo camera pairs, each pair being arranged along a respective camera pair axis, and, for each camera pair axis: executing image capture utilizing the camera pair to generate image data; executing rectification and undistorting transformations to transform the image data into RUD image space; iteratively downsampling to produce multiple, successively lower resolution levels; executing FDDE calculations for each level to compile FDDE votes for each level; gathering FDDE disparity range votes into a multi-level histogram; determining the highest ranked disparity range in the multi-level histogram; and processing the multi-level histogram disparity data to generate a final disparity result.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair; and (2) executing a feature correspondence function.
  • the feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function further comprising: generating a disparity solution based on the disparity values; and applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
  • applying an injective constraint comprises: maintaining a reference count for each pixel in the co-domain, and checking whether the reference count for the pixels in the co-domain exceeds "1", and if the count exceeds "1" then designating a violation and responding to the violation with a selected error correction approach.
  • the selected error correction approach can include any of: (a) first come, first served; (b) best match wins; (c) smallest disparity wins; or (d) seek alternative candidates.
  • The first come, first served approach can include assigning priority to the first element in the domain to claim an element in the co-domain, and if a second element in the domain subsequently claims the same element in the co-domain, designating the second element's claim as invalid.
  • the best match wins approach can include: comparing the actual image matching error or corresponding histogram vote count between the two possible candidate elements in the domain against the contested element in the co-domain, and designating as winner the domain candidate with the best match.
  • the smallest disparity wins approach can include: if there is a contest between candidate elements in the domain for a given co-domain element wherein each candidate element has a corresponding disparity, selecting the domain candidate with the smallest disparity and designating the others as invalid.
  • the seek alternative candidates approach can include: selecting and testing the next best domain candidate, based on a selected criterion, and iterating the selecting and testing until the violation is eliminated or a computational time limit is reached.
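  • A minimal Python sketch of the injective check for one scanline follows, using a reference-count style ownership map and resolving double claims with the "best match wins" rule listed above; any of the other listed policies could be substituted. The array layout and the invalid marker (-1) are illustrative assumptions.

```python
# Sketch: enforce the injective constraint for one scanline. Each domain pixel
# x claims co-domain pixel x - disparity[x]; an ownership map (the reference
# count) detects double claims, resolved here with "best match wins".
import numpy as np

def enforce_injective(disparity, match_error):
    """disparity, match_error: 1D arrays over the domain scanline.
    Returns a copy of disparity with losing claims marked invalid (-1)."""
    disp = disparity.astype(np.int32).copy()
    owner = {}                                  # co-domain index -> winning domain index
    for x in range(len(disp)):
        target = x - disp[x]                    # claimed co-domain pixel
        if target not in owner:
            owner[target] = x                   # first claim: reference count becomes 1
            continue
        # reference count would exceed 1: keep the better match, invalidate the other
        prev = owner[target]
        if match_error[x] < match_error[prev]:
            disp[prev] = -1
            owner[target] = x
        else:
            disp[x] = -1
    return disp

d = np.array([0, 1, 2, 1])                      # pixels 0, 1 and 2 all claim co-domain 0
err = np.array([0.5, 0.2, 0.9, 0.1])
print(enforce_injective(d, err))                # [-1, 1, -1, 1]
```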
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture in which a first user is able to view a second user with direct virtual eye contact with the second user, including: (1) capturing images of the second user, the capturing comprising utilizing at least one camera having a view of the second user's face; (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; (3) generating a data representation, representative of the captured images and the corresponding disparity values; and (4) estimating a three-dimensional (3D) location of the first user's head, face or eyes.
  • the location estimating comprises: (a) passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane; and (b) utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and (5) reconstructing a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information.
  • A related aspect of the invention includes downsampling the captured image before passing it to the 2D facial feature detector. Another aspect includes interpolating image data from video frame to video frame, based on the time that has passed from a given video frame to a previous video frame. Another aspect includes converting image data to luminance values.
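  • A minimal Python sketch of this location-estimating pipeline follows: downsample the frame, convert to luminance, run a 2D facial-feature detector, then lift the 2D face center to 3D using an assumed head depth. The detector interface, camera intrinsics and depth prior are illustrative assumptions, not the patent's implementation.

```python
# Sketch of the head-location pipeline: downsample the captured frame, convert
# to luminance, run a 2D facial-feature detector, then back-project the 2D face
# center to 3D with an assumed head depth (a depth range would refine it).
import numpy as np

def to_luminance(rgb):
    return rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

def downsample2x(img):
    return img[::2, ::2]

def estimate_head_3d(rgb_frame, detect_face_2d, fx=500.0, fy=500.0,
                     cx=320.0, cy=240.0, assumed_depth_m=0.6):
    """detect_face_2d(luma) -> (u, v, roll_deg) in the downsampled image, or None."""
    luma = to_luminance(downsample2x(rgb_frame))
    result = detect_face_2d(luma)
    if result is None:
        return None
    u, v, roll = result
    u, v = u * 2.0, v * 2.0                    # back to full-resolution coordinates
    z = assumed_depth_m                        # placeholder for the head depth range
    x = (u - cx) * z / fx                      # pinhole back-projection
    y = (v - cy) * z / fy
    return np.array([x, y, z]), roll

# Usage with a stand-in detector that always reports the image center:
fake_detector = lambda luma: (luma.shape[1] / 2, luma.shape[0] / 2, 0.0)
frame = np.zeros((480, 640, 3), np.float32)
print(estimate_head_3d(frame, fake_detector))
```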
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least three cameras having a view of the scene, the cameras being arranged in a substantially "L"-shaped configuration wherein a first pair of cameras is disposed along a first axis and a second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axes, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera; (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values; (3) generating a data representation, representative of the captured images and the corresponding disparity values; and (4) utilizing an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes.
  • a related aspect includes executing a stereo correspondence operation on the image data in a rectified, undistorted (RUD) image space, and storing resultant disparity data in a RUD space coordinate system.
  • the resultant disparity data is stored in a URUD space coordinate system.
  • Another aspect includes generating disparity histograms from the disparity data in either RUD or URUD space, and storing the disparity histograms in a unified URUD space coordinate system.
  • a further aspect includes applying a URUD-to-RUD coordinate transformation to obtain per-axis disparity values.
  • One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least one camera having a view of the scene; and (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values.
  • the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method comprising: (a) constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and (b) optimizing generation of disparity values on a GPU computing structure, the optimizing comprising: generating, in the GPU computing structure, a plurality of output pixel threads; and, for each output pixel thread, maintaining a private disparity histogram in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
  • the private disparity histogram is stored such that each pixel thread writes to and reads from the corresponding private disparity histogram in a dedicated portion of shared local memory in the GPU.
  • shared local memory in the GPU is organized at least in part into memory words; the private disparity histogram is characterized by a series of histogram bins indicating the number of votes for a given disparity range; and if a maximum possible number of votes in the private disparity histogram is known, multiple histogram bins can be packed into a single word of the shared local memory, and accessed using bitwise GPU access operations.
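  • A small Python sketch of that bin-packing idea follows, mimicking what a GPU kernel might do with bitwise operations on a 32-bit word of shared local memory when the maximum vote count per bin is known. The bit width per bin and the saturating update are illustrative assumptions.

```python
# Sketch: pack several small histogram bins into one 32-bit word and update a
# bin with bitwise operations, as a GPU kernel might do in shared local memory
# when the maximum possible vote count per bin is known.
BITS_PER_BIN = 8                     # supports up to 255 votes per bin (assumed)
BINS_PER_WORD = 32 // BITS_PER_BIN
MASK = (1 << BITS_PER_BIN) - 1

def add_vote(words, bin_index, votes=1):
    """words: list of 32-bit ints holding the packed histogram."""
    word_i, slot = divmod(bin_index, BINS_PER_WORD)
    shift = slot * BITS_PER_BIN
    current = (words[word_i] >> shift) & MASK
    new = min(current + votes, MASK)             # saturate instead of overflowing
    words[word_i] = (words[word_i] & ~(MASK << shift)) | (new << shift)

def read_bin(words, bin_index):
    word_i, slot = divmod(bin_index, BINS_PER_WORD)
    return (words[word_i] >> (slot * BITS_PER_BIN)) & MASK

# A 16-bin private histogram fits in four 32-bit words.
hist_words = [0, 0, 0, 0]
add_vote(hist_words, 5)
add_vote(hist_words, 5)
add_vote(hist_words, 12, votes=3)
print(read_bin(hist_words, 5), read_bin(hist_words, 12))   # 2 3
```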
  • One aspect of the invention includes a program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least first and second cameras; and (2) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising iteratively downsampling to produce multiple, successively lower resolution levels.
  • the digital processing system comprises at least two stereo camera pairs, each pair being arranged along a respective camera pair axis, and the digital processor-executable program instructions further comprise instructions which, when executed in the digital processing resource, cause the digital processing resource to execute, for each camera pair axis, the following: (1) image capture utilizing the camera pair to generate image data; (2) rectification and undistorting transformations to transform the image data into RUD image space; and (3) iterative downsampling to produce multiple, successively lower resolution levels.
  • Another aspect of the invention includes a program product for use with a digital processing system, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least first and second cameras; and (2) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: (a) generating a disparity solution based on the disparity values; and (b) applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera.
  • the digital processor-executable program instructions further comprise instructions which when executed in the digital processing resource cause the digital processing resource to: maintain a reference count for each pixel in the co-domain, and check whether the reference count for the pixels in the co-domain exceeds "1", and if the count exceeds "1" then designate a violation and respond to the violation with a selected error correction approach.
  • Another aspect of the invention includes a program product for use with a digital processing system, for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising at least one camera having a view of the second user's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the second user, utilizing the at least one camera; and (2) execute a feature correspondence function.
  • the feature correspondence function detects common features between corresponding images captured by the at least one camera and measures a relative distance in image space between the common features, to generate disparity values; the program instructions further cause the digital processing resource to: (3) generate a data representation, representative of the captured images and the corresponding disparity values; (4) estimate a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and (5) reconstruct a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; wherein the 3D location estimating comprises: (a) passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane.
  • Yet another aspect of the invention includes a program product for use with a digital processing system, for enabling capture and processing of images of a scene, the digital processing system comprising at least three cameras having a view of the scene, arranged to define first and second stereo axes, and
  • a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least three cameras; (2) execute a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values; (3) generate a data representation, representative of the captured images and the corresponding disparity values; and (4) utilize an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed.
  • the digital processor-executable program instructions further comprise instructions which, when executed in the digital processing resource, cause the digital processing resource to execute a stereo correspondence operation on the image data in a rectified, undistorted (RUD) image space, and store resultant disparity data in a RUD space coordinate system.
  • RUD: rectified, undistorted
  • Another aspect of the invention includes a program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least one camera having a view of a scene, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least one camera; and (2) execute a feature correspondence function.
  • the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method, comprising: (a) constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and (b) optimizing generation of disparity values on a GPU computing structure, die optimizing comprising: generating,, in the GPU computing structure, a plurality of output pixel threads; and for each output pixel thread, maintaining a. private disparity histogram, in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
• One aspect of the invention includes a digital processing system for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising: (1) at least one camera having a view of the second user's face; (2) a display screen for use by the first user; and (3) a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: (a) capture images of the second user, utilizing the at least one camera; (b) generate a data representation, representative of the captured images; (c) reconstruct a synthetic view of the second user, based on the representation; and (d) display the synthetic view to the first user on the display screen for use by the first user; the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
• Another aspect of the invention includes a digital processing system for enabling a first user to view a remote scene with the visual impression of being present with respect to the remote scene, the digital processing system comprising: (1) at least two cameras, each having a view of the remote scene; (2) a display screen for use by the first user; and (3) a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: (a) capture images of the remote scene, utilizing the at least two cameras; (b) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values; (c) generate a data representation, representative of the captured images and the corresponding disparity values; (d) reconstruct a synthetic view of the remote scene, based on the representation; and (e) display the synthetic view to the first user on the display screen; the capturing, detecting, generating, reconstructing and displaying being executed such that the first user is provided the visual impression of being present with respect to the remote scene.
• Another aspect of the invention includes a system operable in a handheld digital processing device, for facilitating self-portraiture of a user utilizing the handheld device to take the self-portrait, the system comprising: (1) a digital processor; (2) a display screen for displaying images to the user; and (3) at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a setup time during which the user is setting up the self-portrait; the system being operable to: (a) capture images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen; (b) estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; (c) generate a data representation, representative of the captured images; (d) reconstruct a synthetic view of the user, based on the generated data representation and the generated tracking information; and (e) display to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with real-time visual feedback.
• One aspect of the invention includes a system operable in a handheld digital processing device, for facilitating composition of a photograph of a scene by a user utilizing the handheld device to take the photograph, the system comprising: (1) a digital processor; (2) a display screen on a first side of the handheld device for displaying images to the user; and (3) at least one camera on a second, opposite side of the handheld device, for capturing images; the system being operable to: (a) capture images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph; (b) estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; (c) generate a data representation, representative of the captured images; (d) reconstruct a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has
• Another aspect of the invention includes a system for enabling display of images to a user utilizing a binocular stereo head-mounted display (HMD), the system comprising: (1) at least one camera attached or mounted on or proximate to an external portion or surface of the HMD; and (2) a digital processing resource comprising at least one digital processor; the system being operable to: (a) capture at least two image streams using the at least one camera, the captured image streams containing images of a scene; (b) generate a data representation, representative of captured images contained in the captured image streams; (c) reconstruct two synthetic views, based on the representation; and (d) display the synthetic views to the user, via the HMD; the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the scene.
• Another aspect of the invention includes an image processing system for enabling the generation of an image data stream for use by a control system of an autonomous vehicle, the image processing system comprising: (1) at least one camera with a view of a scene around at least a portion of the vehicle; and (2) a digital processing resource comprising at least one digital processor; the system being operable to: (a) capture images of the scene around at least a portion of the vehicle, using the at least one camera; (b) execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; (c) calculate corresponding depth information based on the disparity values; and (d) generate from the images and corresponding depth information an image data stream for use by the control system.
• Another aspect of the invention includes generating a facial signature, based on images of a human user's or subject's face, for enabling accurate, reliable identification or authentication of a human subject or user of a system or resource, in a secure, difficult to forge manner.
• The invention relates to methods, systems and computer software/program code products that enable generating a facial signature for use in identifying a given human user.
• generating a facial signature includes capturing images of the user's face, using at least one camera having a view of the user's face; executing an image rectification function to compensate for optical distortion and alignment of the at least one camera; executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilizing the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable in accurately identifying the user or subject in a secure, difficult to forge manner.
  • the capturing can utilize at least two cameras, each having a view of the user's face; and the feature correspondence function can include detecting common features between corresponding images captured by the respective cameras.
• the capturing can utilize at least one camera having a view of the user's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation can be representative of the captured images and corresponding depth information.
  • the capturing utilizes a single camera having a view of the second user's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
• the identifying aspect of the invention uses stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
• the identifying takes into account the actual 3D coordinates of facial features with respect to other facial features, and the feature correspondence function or depth detection function includes computing distances between facial features from multiple perspectives.
  • the facial signature is a combination of 3D facial contour information and a 2D image from one or more of the cameras.
• the 3D contour data can be stored in the facial signature data representation.
• the facial signature is utilized as a security factor in an authentication system, either as the sole security factor or in combination with other security factors.
• the other security factors can include a passcode, a fingerprint or other biometric information.
• the 3D facial contour data is combined with a 2D image from one or more cameras in a conventional 2D face identification system to create a hybrid 3D/2D face identification system.
• 3D facial contour data is used solely to confirm that a face having credible 3D human facial proportions was presented to the cameras at an overlapping spatial location of the captured 2D image.
• a further aspect of the invention includes using a 2D bounding rectangle, defining the 2D extent of the face location, to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption.
• Still another aspect of the invention includes prompting the user to present multiple distinct facial poses or head positions, and utilizing a depth detection system to scan the multiple facial poses or head positions across a series of image frames, so as to increase protection against forgery of the facial signature.
• generating a unique facial signature further includes executing an enrollment phase, which includes prompting the user to present to the cameras a plurality of selected head movements or positions, or a series of selected facial poses, and collecting image frames from a plurality of head positions or facial poses for use in generating the unique facial signature representative of the user.
• the invention further includes a matching phase, which includes using the cameras to capture, over an interval of time, a plurality of frames of 3D and 2D data representative of the user's face; correlating the captured data with the facial signature generated during the enrollment phase, thereby to generate a probability of match score; and comparing the probability of match score with a selected threshold value, thereby to confirm or deny an identity match.
• the enrollment phase can include generating an enrolled facial signature containing data corresponding to multiple image scans of a user's face, the multiple image scans corresponding to a plurality of the user's head positions or facial poses; and the matching phase can include requiring at least a minimum number of captured image frames corresponding to different facial or head positions matching the multiple scans within the enrolled signature.
• Another aspect of the invention relates to generating a histogram-based facial signature representation, whereby a facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth.
• the histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
• The X-axis of the histogram represents a given disparity or depth range
  • the Y-axis of the histogram represents the normalized count of image samples that fall within the given range.
  • a conventional 2D face detector provides a face rectangle and location of basic facial features, and only samples within the face rectangle are accumulated into the histogram.
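• By way of illustration only, the following Python sketch accumulates such a normalized facial-depth histogram using only the disparity samples that fall inside the 2D face rectangle returned by a conventional face detector. The function name, bin count and disparity range shown here are illustrative assumptions, not values specified by the invention.

    import numpy as np

    def facial_depth_histogram(disparity_map, face_rect, num_bins=32, disp_range=(0.0, 64.0)):
        """Accumulate a normalized histogram of disparity (depth) samples that
        fall inside the detected 2D face rectangle (x, y, width, height)."""
        x, y, w, h = face_rect
        samples = disparity_map[y:y + h, x:x + w].ravel()
        # Keep only samples inside the disparity range of interest.
        samples = samples[(samples >= disp_range[0]) & (samples < disp_range[1])]
        hist, _ = np.histogram(samples, bins=num_bins, range=disp_range)
        total = hist.sum()
        # Normalize so histograms from different frames/poses are comparable.
        return hist / total if total > 0 else hist.astype(np.float64)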
• Another aspect includes rejecting samples falling outside the statistical majority of samples within the face rectangle.
• Another aspect includes projecting disparity and depth points into a canonical coordinate system defined by a plane constructed from the 3D coordinates of the basic facial features.
• a histogram is accumulated over multiple captured image frames over a period of time.
• each set of samples of the captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram.
  • multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses.
  • a candidate histogram is accumulated over multiple captured image frames over a period of time.
• once a candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate a vector distance constituting a degree-of-match score.
• a further aspect of the invention includes comparing the degree-of-match score to a selected threshold to confirm or deny a match with each enrolled signature in a set of enrolled signatures.
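• A minimal sketch of this matching step, assuming the enrolled and candidate signatures are normalized histograms of equal length; the Euclidean vector distance and the threshold value used here are illustrative choices, not values specified by the invention.

    import numpy as np

    def match_facial_signature(candidate_hist, enrolled_hists, threshold=0.15):
        """Compare a candidate facial-depth histogram against a set of enrolled
        histograms; a smaller vector distance means a better match."""
        candidate = np.asarray(candidate_hist, dtype=np.float64)
        distances = [np.linalg.norm(candidate - np.asarray(e, dtype=np.float64))
                     for e in enrolled_hists]
        best_index = int(np.argmin(distances))
        # The degree-of-match score is compared to a selected threshold
        # to confirm or deny the identity match.
        is_match = distances[best_index] <= threshold
        return best_index, distances[best_index], is_match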
  • the histogram representation is used in combination with conventional 2D face matching to provide an additional authentication factor.
• the facial signature is utilized as a factor in an authentication process in which a human subject or user of a system or resource is successfully authenticated if selected criteria are met, and the facial signature aspect further includes updating the facial signature on every successful match, or on every nth successful match, where n is a selected integer.
• Another aspect of the invention includes a program product for use with a digital processing system, for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium which when executed in the digital processing resource cause the digital processing resource to: capture images of the user's or subject's face, utilizing the at least one camera; execute an image rectification function to compensate for optical distortion and alignment of the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject in a secure, difficult to forge manner.
  • the capturing can include using at least two cameras, each having a view of the user's or subject's face; and executing a feature correspondence function can include detecting common features between corresponding images captured by the respective cameras.
• the capturing can include using at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation is representative of the captured images and corresponding depth information.
• the capturing includes using a single camera having a view of the user's or subject's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
  • the identifying of a human user or subject utilizes stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
  • the identifying can take into account the actual 3D coordinates of facial features with respect to other facial features.
• Another aspect of the invention includes a digital processing system for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: capture images of the user's or subject's face, utilizing the at least one camera; execute an image rectification function to compensate for optical distortion and alignment of the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject in a secure, difficult to forge manner.
• the system can include at least two cameras having a view of the user's or subject's face; the capturing can include utilizing the at least two cameras; and executing a feature correspondence function can include detecting common features between corresponding images captured by the respective cameras.
• the capturing includes using at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation is representative of the captured images and corresponding depth information.
• the capturing includes utilizing a single camera having a view of the user's or subject's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
• the identifying of a human subject or user includes utilizing stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
  • the identifying can take into account the actual 3D coordinates of facial features with respect to other facial features.
• FIG. 1 shows a camera configuration useful in an exemplary practice of the invention.
• FIGS. 2-6 are schematic diagrams illustrating exemplary practices of the invention.
• FIG. 7 is a flowchart showing an exemplary practice of the invention.
• FIG. 8 is a block diagram depicting an exemplary embodiment of the invention.
• FIGS. 9-18 are schematic diagrams illustrating exemplary practices of the invention.
• FIG. 19 is a graph in accordance with an aspect of the invention.
• FIGS. 20-45 are schematic diagrams illustrating exemplary practices of the invention.
• FIG. 46 is a graph in accordance with an aspect of the invention.
• FIGS. 47-54 are schematic diagrams illustrating exemplary practices of the invention.
• FIGS. 55-80 are flowcharts depicting exemplary practices of the invention.
• FIG. 81 is a schematic flow diagram depicting processing of images to generate a Facial Signature in accordance with an exemplary practice of the invention.
• FIGS. 82-83 show an exemplary image processed in accordance with an exemplary practice of the Facial Signature aspects of the invention, where FIG. 82 is an example of an image of a human user or subject captured by at least one camera, and FIG. 83 is an example of a representation of image data corresponding to the image of FIG. 82, processed in accordance with an exemplary practice of the invention.
  • FIG. 84 shows a histogram representation corresponding to the image(s) of FIGS. 82-83, generated in accordance with an exemplary practice of the Facial Signature aspects of the invention.
  • FIGS. 85-88 are flowcharts depicting exemplary practices of the Facial Signature aspects of the invention.
• V3D aims to address and radically improve the visual aspect of sensory engagement in teleconferencing and other video capture settings, while doing so with low latency.
• the visual aspect of conducting a video conference is conventionally achieved via a camera pointing at each user, transmitting the video stream captured by each camera, and then projecting the video stream(s) onto the two-dimensional (2D) display of the other user in a different location.
• Both users have a camera and display, and thus a full-duplex connection is formed where both users can see each other and their respective environments.
• the V3D of the present invention aims to deliver a significant enhancement to this particular aspect by creating a "portal" where each user would look "through" their respective displays, as if there were a "magic" sheet of glass in a frame to the other side in the remote location.
• This approach enables a number of important improvements for the user (assuming robust implementation):
  • Each user can form direct eye contact with the other,
• Each user can move his or her head in any direction and look through the portal to the other side. They can even look "around" and see the
• V3D aspects of the invention can be configured to deliver these advantages in a manner that fits within the highly optimized form factors of today's modern mobile devices, does not dramatically alter the economics of building such devices, and is viable within the current connectivity performance levels available to most users.
• FIG. 1 shows a perspective view of an exemplary prototype device 10, which includes a display 12 and three cameras: a top right camera 14, a bottom right camera 16, and a bottom left camera 18. In connection with this example, there will next be described various aspects of the invention relating to the unique user experience provided by the V3D invention.
• the V3D system of the invention enables immersive communication between people (and in various embodiments, between sites and places). In exemplary practices of the invention, each person can look "through" their screen and see the other place. Eye contact is greatly improved. Perspective and scale are matched to the viewer's natural view. Device shaking is inherently eliminated.
• embodiments of the V3D system can be implemented in mobile configurations as well as traditional stationary devices.
• FIGS. 2A-B, 3A-B, and 5A-B are images illustrating an aspect of the invention, in which the V3D system is used in conjunction with a smartphone 20, or like device.
• Smartphone 20 includes a display 22, on which is displayed an image of a face 24.
• the image may be, for example, part of a video/telephone conversation, in which a video image and sound conversation is being conducted with someone in a remote location, who is looking into the camera of their own smartphone.
• FIGS. 2A and 2B illustrate a feature of the V3D system for improving eye contact.
• FIG. 2A shows the face image prior to correction. It will be seen that the woman appears to be looking down, so that there can be no eye contact with the other user or participant.
• FIG. 2B shows the face image after correction. It will be seen that in the corrected image, the woman appears to be making eye contact with the smartphone user.
• FIGS. 3A-3B are a pair of diagrams illustrating the V3D system's "move left" (FIG. 3A) and "move right" (FIG. 3B) corrections.
• FIGS. 4A-4B are a pair of diagrams of the light pathways 26a, 26b in the scene shown respectively on display 22 in FIGS. 3A-3B (shown from above, with the background at the top) leading from face 24 and surrounding objects to viewpoints 28a, 28b through the "window" defined by display 22.
  • FIGS. 5A-5B are a pair of diagrams illustrating the V3D system's "move in” (FIG. 5A) and “move out” (FIG. 5B) corrections.
• FIGS. 6A-6B are a pair of diagrams of the light pathways 26c, 26d in the scene shown respectively on display 22 in FIGS. 5A-5B (shown from above, with the background at the top) leading from face 24 and surrounding objects to viewpoints 28c, 28d through the "window" defined by display 22.
• Another embodiment of the invention utilizes the invention's ability to synthesize a virtual camera view of the user to aid in solving the problem of "where to look" when taking a self-portrait on a mobile device. This aspect of the invention operates by image-capturing the user per the overall V3D method of the invention described herein, tracking the position and orientation of the user's face, eyes or head, and by using a display, presenting an image of the user back to themselves with a synthesized virtual camera viewpoint as if the user were looking in a mirror.
• Another embodiment of the invention makes it easier to compose a photograph using a rear-facing camera on a mobile device. It works like the overall V3D method of the invention described herein, except that the scene is captured through the rear-facing camera(s) and then, using the user's head location, a view is constructed such that the scale and perspective of the image matches the view of the user, such that the device display frame becomes like a picture frame. This results in a user experience where the photographer does not have to manipulate zoom controls or perform cropping, since they can simply frame the subject as they like within the frame of the display, and take the photo.
• Another embodiment of the invention enables the creation of cylindrical or spherical panoramic photographs, by processing a series of photographs taken with a device using the camera(s) running the V3D system of the invention.
• the user can then enjoy viewing the panoramic view thus created, with an immersive sense of depth.
• the panorama can either be viewed on a 2D display with head tracking, a multi-view display, or a binocular virtual reality (VR) headset with a unique perspective shown for each eye. If the binocular VR headset has a facility to track head location, the V3D system can re-project the view accurately.
  • FIG. 7 is a flow diagram illustrating the overall V3D digital processing pipeline 70, which includes the following aspects:
• Image Capture: One or more images of a scene, which may include a human user, are collected instantaneously or over time via one or more cameras and fed into the system. Wide-angle lenses are generally preferred due to the ability to get greater stereo overlap between images, although this depends on the application and can in principle work with any focal length.
• Image Rectification: In order to compensate for optical lens distortion from each camera and relative misalignment between the cameras in the multi-view system, image processing is performed to apply an inverse transform to eliminate distortion, and an affine transform to correct misalignment between the cameras. In order to perform efficiently and in real-time, this process can be performed using a custom imaging pipeline or implemented using the shading hardware present in many conventional graphics processing units (GPUs) today, including GPU hardware present in devices such as iPhones and other commercially available smartphones. Additional detail and other variations of these operations will be discussed in greater detail herein.
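• By way of illustration only, the following Python sketch shows the rectification step in outline, using OpenCV as a stand-in for the custom or GPU-shader implementation described above; the calibration inputs (intrinsics K1/K2, distortion coefficients dist1/dist2, and the rotation/translation R, T between the two cameras of one stereo axis) are assumed to come from a prior stereo calibration.

    import cv2
    import numpy as np

    def rectify_stereo_pair(img_left, img_right, K1, dist1, K2, dist2, R, T):
        """Undistort and rectify one stereo pair so that epipolar lines become
        image rows (illustrative sketch only)."""
        h, w = img_left.shape[:2]
        # Rectifying rotations/projections for the two cameras.
        R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, (w, h), R, T)
        # Per-pixel lookup maps that remove lens distortion and apply the
        # rectifying (affine/rotational) transform in a single resampling pass.
        map1x, map1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, (w, h), cv2.CV_32FC1)
        map2x, map2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, (w, h), cv2.CV_32FC1)
        rect_left = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
        rect_right = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)
        return rect_left, rect_right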
• Feature Correspondence: With the exception of using time-of-flight type sensors in the Image Capture phase that provide depth information directly, this process is used in order to extract parallax information present in the stereo images from the camera views. This process involves detecting common features between multi-view images and measuring their relative distance in image space to produce a disparity measurement. This disparity measurement can either be used directly or converted to actual depth based on knowledge of the camera field-of-view, relative positioning, sensor size and image resolution. Additional detail and other variations of these operations will be discussed in greater detail herein.
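• As a hedged illustration of the disparity-to-depth conversion mentioned above (assuming a rectified stereo pair and the standard pinhole model, with illustrative parameter names):

    import numpy as np

    def disparity_to_depth(disparity_px, fov_horizontal_deg, image_width_px, baseline_m):
        """Convert disparity (pixels) to depth (meters) for a rectified stereo
        pair under the standard pinhole model."""
        # Focal length in pixels, derived from the horizontal field of view.
        focal_px = (image_width_px / 2.0) / np.tan(np.radians(fov_horizontal_deg) / 2.0)
        disparity = np.asarray(disparity_px, dtype=np.float64)
        depth = np.full(disparity.shape, np.inf)
        valid = disparity > 0                      # zero disparity => point at infinity
        depth[valid] = focal_px * baseline_m / disparity[valid]
        return depth

    # Example: a 70-degree lens, 1280-pixel-wide image, 6 cm baseline and
    # 32-pixel disparity gives a depth of roughly 1.7 m.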
• Reconstruction: Using the previously established representation, whether stored locally on the device or received over a network, a series of synthetic views into the originally captured scene can be generated. For example, in a video chat the physical image inputs may have come from cameras surrounding the head of the user, in which no one view has a direct eye contact gaze vector to the user. Using reconstruction, a synthetic camera view, placed potentially within the bounds of the device display, enabling the visual appearance of eye contact, can be produced.
• Head Tracking: Using the image capture data as an input, many different methods exist to establish an estimate of the viewer's head or eye location. This information can be used to drive the reconstruction and generate a synthetic view which looks valid from the user's established head location. Additional detail and various forms of these operations will be discussed in greater detail herein.
• Display: Several types of display can be used with the V3D pipeline in different ways. The currently employed method involves a conventional 2D display combined with head tracking to update the display projection in real-time so as to give the visual impression of being three-dimensional (3D) or a look into a 3D environment. However, binocular stereo displays (such as the commercially available Oculus Rift) can be employed, or, still further, a lenticular type display can be employed to allow auto-stereoscopic viewing.
• FIGS. 7 and 8 can also be used to enable the Facial Signature aspects of the invention, to enable a "signature" of a user's or subject's face, or face and head, to be generated from the Feature Correspondence phase for purposes such as user identification, authentication or matching.
• FIG. 8 is a diagram of an exemplary V3D pipeline 80 configured in accordance with the invention, for immersive communication with eye contact.
• the depicted pipeline is full-duplex, meaning that it allows simultaneous two-way communication in both directions.
• Pipeline 80 comprises a pair of communication devices 81A-B (for example, commercially available smartphones such as iPhones) that are linked to each other through a network 82.
• Each communication device includes a decoder end 83A-B for receiving and decoding communications from the other device and an encoder end 84A-B for encoding and sending communications to the other device 81A-B.
• the decoder end 83A-B includes the following components:
• the View Reconstruction module 833A-B receives data 835A-B from a Head Tracking Module 836A-B, which provides x-, y-, and z-coordinate data with respect to the user's head that is generated by cameras 841A-B.
• the encoder end 84A-B comprises a multi-camera array that includes cameras 841A-B, cameras 84 A-B, and additional camera(s) 842A-B. (As noted herein, it is possible to practice various aspects of the invention using only two cameras.)
• the camera array provides data in the form of color camera streams 843A-B that are fed into a Color Image Redundancy Elimination module 844A-B and an Encode module.
• the output of the camera array is also fed into a Passive Feature Disparity Estimation module 845A-B that provides disparity estimation data to the Color Image Redundancy Elimination module 846A-B and the Encode module 847A-B.
• the encoded output of the device is then transmitted over network 82 to the Receive module 831A-B in the second device 81A-B.
  • the V3D system requires an input of images in order to capture the user and the world around the user.
• the V3D system can be configured to operate with a wide range of input imaging devices.
  • Some devices, such as normal color cameras, are inherently passive and thus require extensive image processing to extract depth information, whereas non-passive systems can get depth directly, although they have the disadvantages of requiring reflected IR to work, and thus do not perform well in strongly naturally lit environments or large spaces.
• Those skilled in the art will understand that a wide range of color cameras and other passive imaging devices, as well as non-passive image capture devices, are commercially available from a variety of manufacturers.
• This descriptor is intended to cover the use of visible light or infrared specific cameras coupled with an active infrared emitter that beams one of many potential patterns onto the surfaces of objects, to aid in computing distance.
• IR structured light devices are known in the art.
• This descriptor covers the use of time-of-flight cameras that work by emitting a pulse of light, and then measuring the time taken for reflected light to reach each of the camera's sensor elements. This is a more direct method of measuring depth, but has currently not reached the cost and resolution levels useful for significant consumer adoption. Using this type of sensor, in some practices of the invention the feature correspondence operation noted above could be omitted, since accurate depth information is already provided directly from the sensor.
• the V3D system of the invention can be configured to operate with multiple cameras positioned in a fixed relative position as part of a device. It is also possible to use a single camera, by taking images over time and with accurate tracking, so that the relative position of the camera between frames can be estimated with sufficient accuracy. With sufficiently accurate positional data, feature correspondence algorithms such as those described herein could continue to be used.
• A practice of the V3D invention relates to the positioning of the cameras within the multi-camera configuration, to significantly increase the number of valid feature correspondences between images captured in real world settings. This approach is based on three observations:
• Many features in man-made indoor or urban environments consist of edges aligned with the three orthogonal axes (x, y, z).
  • feature correspondence algorithms typically perform their search along horizontal or vertical epipolar lines in image space.
• FIGS. 9, 10, and 11 show three exemplary sensor configurations 90, 100, 110.
• FIG. 9 shows a handheld device 90 comprising a display screen 91 surrounded by a bezel 92.
• Sensors 93, 94, and 95 are located at the corners of bezel 92, and define a pair of perpendicular axes: a first axis 96 between sensors 93 and 94, and a second axis 97 between cameras 94 and 95.
• FIG. 10 shows a handheld device 100 comprising display 101, bezel 102, and sensors 103, 104, 105.
• each of sensors 103, 104, 105 is rotated by an angle θ relative to bezel 102.
• the position of the sensors 103, 104, and 105 on bezel 102 has been configured so that the three sensors define a pair of perpendicular axes 106, 107.
• FIG. 11 shows a handheld device 110 comprising display 111, bezel 112, and sensors 113, 114, 115.
• the sensors 113, 114, 115 are not rotated.
• the sensors 113, 114, 115 are positioned to define perpendicular axes 116, 117 that are angled with respect to bezel 112.
• the data from sensors 113, 114, 115 are then rotated in software such that the correspondence continues to be performed along the epipolar lines.
• V3D uses 3 sensors to enable vertical and horizontal cross correspondence.
• the methods and practices described above are also applicable in a 2-camera stereo system.
• FIGS. 12 and 13 next highlight advantages of a "rotated configuration" in accordance with the invention (FIGS. 12A-12D being referred to collectively as "FIG. 12"). In particular, FIG. 12A shows a "non-rotated" device configuration 120, with sensors 121, 122, 123 located in three corners, similar to configuration 90 shown in FIG. 9. FIGS. 12B-12D show the respective scene image data collected at sensors 121, 122, 123.
• Sensors 121 and 122 define a horizontal axis between them, and generate a pair of images with horizontally displaced viewpoints. For certain features, e.g., features H1, H2, there is a strong correspondence, i.e., the horizontally-displaced scene data provides a high level of certainty with respect to the correspondence of these features.
• For other features, the correspondence is weak, as shown in FIG. 12 (i.e., the horizontally-displaced scene data provides a low level of certainty with respect to the correspondence of these features).
• Sensors 122 and 123 define a vertical axis that is perpendicular to the axis defined by sensors 121 and 122. Again, for certain features, e.g., feature V1 in FIG. 12, there is strong correspondence. For other features, e.g., feature V2 in FIG. 12, the correspondence is weak.
• FIG. 13A shows a device configuration 130, similar to configuration 100 shown in FIG. 10, with sensors 131, 132, 133 positioned and rotated to define an angled horizontal axis and an angled vertical axis.
• As shown in FIGS. 13B, 13C, and 13D, the use of an angled sensor configuration eliminates the weakly corresponding features shown in FIGS. 12B, 12C, and 12D.
• As illustrated by FIGS. 12 and 13, a rotated configuration of sensors in accordance with an exemplary practice of the invention enables strong correspondence for certain scene features where the non-rotated configuration did not.
• During the process of calculating feature correspondence, a feature is selected in one image and then scanned for a corresponding feature in another image. During this process, there can often be several possible matches found, and various methods are used to establish which match (if any) has the highest likelihood of being the correct one.
  • the correspondence errors in the excessively dark or light areas of the image can cause large-scale visible errors in the image by causing the computing of radically incorrect disparity or depth estimates.
• another practice of the invention involves dynamically adjusting the exposure of the multi-view camera system, on a frame-by-frame basis, in order to improve the disparity estimation in areas outside of the exposed region viewed by the user.
• exposures at darker and lighter exposure settings surrounding the visibility-optimal exposure would be taken, have their disparity calculated, and then get integrated into the overall pixel histograms which are being retained and converged over time.
• the dark and light images could be, but are not required to be, presented to the user and would serve only to improve the disparity estimation.
• Another aspect of this approach is to analyze the variance of the disparity histograms on "dark" pixels, "mid-range" pixels and "light" pixels, and use this to drive the exposure setting of the cameras, thus forming a closed-loop system between the quality of the disparity estimate and the set of exposures which are requested from the input multi-view camera system.
• If the cameras are viewing a purely indoor environment, such as an interior room, with limited dynamic range due to indirect lighting, only one exposure may be needed.
• An exemplary practice of the V3D system executes image rectification in real-time using the GPU hardware of the device on which it is operating, such as a conventional smartphone, to facilitate and improve an overall solution.
  • a search must be performed between two cameras arranged in a stereo configuration in order to detect the relative movement of features in the image due to parallax. This relative movement is measured in pixels and is referred to as "the disparity",
• FIG. 14 shows an exemplary pair of unrectified and distorted (URD) source camera images 140A and 140B for left and right stereo.
• the image pair includes a matched feature, i.e., the subject's right eye 141A, 141B.
• the matching feature has largely been shifted horizontally, but there is also a vertical shift because of slight misalignment of the cameras and the fact that there is a polynomial term resulting from lens distortion.
• the matching process can be optimized by measuring the lens distortion polynomial terms, and by inferring the affine transform required to apply to the images such that they are rectified to appear perfectly horizontally aligned and co-planar. When this is done, what would otherwise be a freeform 2D search for a feature match can now be simplified by simply searching along the same horizontal row on the source image to find the match.
  • this is done in one step, in which the lens distortion and then affine transform
• URD (Unrectified, Distorted) space: This is the image space in which the source camera images are captured. There is both polynomial distortion due to the lens shape and an affine transform that makes the image not perfectly co-planar and axis-aligned with the other stereo image. The number of URD images in the system is equal to the number of cameras in the system.
• URUD (Unrectified, Undistorted) space: This is a space in which the polynomial lens distortion is removed from the image but the images remain unrectified.
• the number of URUD images in the system is equal to the number of URD images, and therefore cameras, in the system.
• RUD (Rectified, Undistorted) space: This is a space in which both the polynomial lens distortion is removed from the image and an affine transform is applied to make the image perfectly co-planar and axis-aligned with the other stereo image on the respective axis.
• RUD images always exist in pairs. As such, for example, in a 3-camera system where the cameras are arranged in a substantially L-shaped configuration (having two axes intersecting at a selected point), there would be two stereo axes, and thus four RUD images.
• FIG. 15 is a flow diagram 150 providing various examples of possible transforms in a 4-camera V3D system. Note that there are 4 stereo axes. Diagonal axes (not shown) would also be possible.
  • the typical transform when sampling the source camera images in a stereo correspondence system is to transform from RUD space (the desired space for feature correspondence on a stereo axis) to URD space (the source camera images).
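• Purely as an illustrative sketch of the per-pixel math behind this RUD-to-URD sampling transform (assuming a simple two-term radial distortion polynomial and a known rectifying rotation R; all names are illustrative):

    import numpy as np

    def rud_to_urd(px, py, K_rect, R, K_orig, k1, k2):
        """Map a pixel coordinate in RUD (rectified, undistorted) space to the
        coordinate at which the URD (unrectified, distorted) source image
        should be sampled. Assumes a two-term radial distortion model."""
        # Back-project the rectified pixel onto a normalized ray.
        x = (px - K_rect[0, 2]) / K_rect[0, 0]
        y = (py - K_rect[1, 2]) / K_rect[1, 1]
        ray = R.T @ np.array([x, y, 1.0])        # undo the rectifying rotation
        xn, yn = ray[0] / ray[2], ray[1] / ray[2]
        # Re-apply the polynomial lens distortion (the URUD -> URD step).
        r2 = xn * xn + yn * yn
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        xd, yd = xn * scale, yn * scale
        # Project with the original camera intrinsics to get URD coordinates.
        u = K_orig[0, 0] * xd + K_orig[0, 2]
        v = K_orig[1, 1] * yd + K_orig[1, 2]
        return u, v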
• an exemplary practice of the invention makes substantial use of the URUD image space to connect the stereo axes disparity values together.
  • FIG. 16 sets forth a flow diagram 160
• FIGS. 17A-C are a series of images that illustrate the appearance and purpose of the various transforms on a single camera image.
• This practice of the invention works by extending the feature correspondence algorithm to include one or more additional axes of correspondence and integrating the results to improve the quality of the solution.
• FIGS. 18A-D illustrate an example of this approach.
• FIG. 18A is a diagram of sensor configuration 180 having 3 cameras 181, 182, 183 in a substantially L-shaped configuration such that a stereo pair exists on both the horizontal axis 185 and vertical axis 186, with one camera in common between the axes, similar to the configuration 90 shown in FIG. 9.
• Provided the overall system contains a suitable representation to integrate the multiple disparity solutions (one such representation being the "Disparity Histograms" practice of the invention discussed herein), this configuration will allow for uncertain correspondences in one stereo pair to be either corroborated or discarded through the additional information found by performing correspondence on the other axis.
• certain features which have no correspondence on one axis may find a correspondence on the other axis.
• FIGS. 18B, 18C, and 18D are depictions of three simultaneous images received respectively by sensors 181, 182, 183.
• The three-image set is illustrative of all the points mentioned above.
• Feature (A), i.e., the subject's nose, is found to correspond both on the horizontal stereo pair and on the vertical stereo pair. This correspondence helps eliminate correspondence errors by improving the signal-to-noise ratio, since the likelihood of the same erroneous correspondence being found in both axes is low.
• Feature (B), i.e., the spool of twine, is found to correspond only on the horizontal stereo pair. Had the system only included a vertical pair, this feature would not have had a depth estimate because it is entirely out of view on the upper image.
• Feature (C), i.e., the cushion on the couch, is only possible to correspond on the vertical axis. Had the system only included a horizontal stereo pair, the cushion would have been entirely occluded in the left image, meaning no valid disparity estimate could have been established.
  • the stereo pair on a particular axis will have undergone a calibration process such that the epipolar lines are aligned to the rows or columns of the images.
• Each stereo axis will have its own unique camera alignment properties and hence the coordinate systems of the features will be incompatible. In order to integrate disparity information on pixels between multiple axes, the pixels containing the disparity solutions will need to undergo coordinate transformation to a unified coordinate system.
• This aspect of the invention involves retaining a representation of disparity in the form of the error function or, as described elsewhere herein, the disparity histogram, and continuing to integrate disparity solutions for each frame in time to converge on a better solution through additional sampling.
• This aspect of the invention is a variation of the correspondence refinement over time aspect. In cases where a given feature is detected but for which no correspondence can be found in another camera, if there was a prior solution for that pixel from a previous frame, this can be used instead. Histogram-Based Disparity Representation Method
• This aspect of the invention provides a representation to allow multiple disparity measuring techniques to be combined to produce a higher quality estimate of image disparity, potentially even over time. It also permits a more efficient method of estimating disparity, taking into account more global context in the images, without the significant cost of large per-pixel kernels and image differencing.
• disparity estimation methods for a given pixel in an image in the stereo pair involve sliding a region of pixels (known as a kernel) surrounding the pixel in question from one image over the other in the stereo pair, computing the difference for each pixel in the kernel, and reducing this to a scalar value for each disparity being tested.
• FIG. 19 is a graph 190 of cumulative error for a 5x5 block of pixels for disparity values between 0 and 128 pixels. In this example, it can be seen that there is a single global minimum that is likely to be the best solution.
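• As an illustrative sketch only (not the invention's optimized implementation), the following Python fragment computes such a cumulative SSD error curve for one pixel over a range of candidate disparities; the kernel size and disparity range mirror the example of FIG. 19, and all names are illustrative:

    import numpy as np

    def ssd_error_curve(left, right, row, col, max_disparity=128, half_kernel=2):
        """Cumulative SSD error for one left-image pixel over a range of
        candidate disparities (5x5 kernel by default, as in the example graph).
        Assumes rectified grayscale images and a pixel away from the border."""
        k = half_kernel
        patch_left = left[row - k:row + k + 1, col - k:col + k + 1].astype(np.float64)
        errors = np.full(max_disparity, np.inf)
        for d in range(max_disparity):
            c = col - d                          # candidate match shifts left in the right image
            if c - k < 0:
                break
            patch_right = right[row - k:row + k + 1, c - k:c + k + 1].astype(np.float64)
            errors[d] = np.sum((patch_left - patch_right) ** 2)
        return errors

    # best_disparity = int(np.argmin(ssd_error_curve(left_img, right_img, 240, 400)))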
• FIGS. 20A-B are two horizontal stereo images. FIGS. 21A and 21B, which correspond to FIGS. 20A and 20B, show a selected kernel of pixels around the solution point for which we are trying to compute the disparity. It can be seen that for the kernel at its current size, the cumulative error function will have two minima, one representing the features that have a small disparity since they are in the image background, and those on the wall which are in the foreground and will have a larger disparity. In the ideal situation, the minima would flip from the background to the foreground disparity value as close to the edge of the wall as possible. In practice, due to the high intensity of the wall pixels, many of the background pixels snap to the disparity of the foreground, resulting in a serious quality issue forming a border near the wall. Lack of Meaningful Units
• the units of measure of "error", i.e., the Y-axis on the example graph, are unscaled and may not be compatible between multiple cameras, each with its own color and luminance response. This introduces difficulty in applying statistical methods or combining error estimates produced through other methods. For example, computing the error function from a different stereo axis would be incompatible in scale, and thus the terms could not be easily integrated to produce a better error function.
  • One practice of the disparity histogram solution method of the invention works by maintaining a histogram showing the relative likelihood of a particular disparity being valid for a given pixel.
• the disparity histogram behaves as a probability density function (PDF) of disparity for a given pixel, higher values indicating a higher likelihood that the disparity range is the "truth".
• FIG. 22 shows an example of a typical disparity histogram for a pixel.
• the x-axis indicates a particular disparity range and the y-axis is the number of pixels in the kernel surrounding the central pixel that are "voting" for that given disparity range.
  • FIGS. 23 and 24 show a pair of images and associated histograms.
  • the votes can be generated by using a relatively fast and low-quality estimate of disparity produced using small kernels and standard SSD type methods.
• the SSD method is used to produce a "fast dense disparity estimate" (FDDE), wherein each pixel has a selected disparity that is the lowest error. Then, the algorithm would go through each pixel, accumulating into the histogram a tally of the number of votes for a given disparity in a larger kernel surrounding the pixel.
• With a given disparity histogram, many forms of analysis can be performed to establish the most likely disparity for the pixel, confidence in the solution validity, and even identify cases where there are multiple highly likely solutions. For example, if there is a single dominant mode in the histogram, the x coordinate of that peak denotes the most likely disparity solution.
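• A minimal sketch of this voting scheme, assuming a precomputed FDDE disparity map: each pixel's fast disparity estimate votes into the histograms of the pixels in a larger kernel around it, and the winning disparity per pixel is the dominant mode of its histogram. The kernel size and bin count here are illustrative assumptions.

    import numpy as np

    def disparity_histograms_from_fdde(fdde_disparity, num_bins, vote_half_kernel=4):
        """Accumulate per-pixel disparity histograms: each pixel's fast (FDDE)
        disparity estimate votes into the histograms of the pixels in a larger
        kernel around it. Returns an (H, W, num_bins) array of vote counts."""
        h, w = fdde_disparity.shape
        hist = np.zeros((h, w, num_bins), dtype=np.uint32)
        k = vote_half_kernel
        for y in range(h):
            for x in range(w):
                d = int(fdde_disparity[y, x])
                if 0 <= d < num_bins:
                    y0, y1 = max(0, y - k), min(h, y + k + 1)
                    x0, x1 = max(0, x - k), min(w, x + k + 1)
                    hist[y0:y1, x0:x1, d] += 1   # vote into the neighbours' histograms
        return hist

    # The most likely disparity per pixel is the dominant mode of its histogram:
    # best = np.argmax(hist, axis=2)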
  • FIG. 25 shows an example of a bi-modal disparity histogram with 2 equally probable disparity possibilities.
• FIG. 26 is a diagram of an example showing the disparity histogram and associated cumulative distribution function (CDF). The interquartile range is narrow, indicating high confidence.
• FIG. 27 is a contrasting example showing a wide interquartile range in the CDF and thus a low confidence in any peak within that range.
  • the width of the interquartile range can be established. This range can then be used to establish a confidence level in the solution.
• a narrow interquartile range indicates that the vast majority of the samples agree with the solution, whereas a wide interquartile range (as in FIG. 27) indicates that the solution confidence is low because many other disparity values could be the truth.
• a count of the number of statistically significant modes in the histogram can be used to indicate "modality." For example, if there are two strong modes in the histogram (as in FIG. 25), it is highly likely that the point in question is right on the edge of a feature that demarks a background versus foreground transition in depth. This can be used to control the reconstruction later in the pipeline to control stretch versus slide (discussed in greater detail elsewhere herein).
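• By way of illustration, a small sketch of deriving these confidence measures (interquartile range of the CDF, and a count of significant modes) from one pixel's disparity histogram; the significance threshold used for mode counting is an illustrative assumption:

    import numpy as np

    def histogram_confidence(hist, significance=0.5):
        """Return (best_disparity_bin, interquartile_range, num_modes) for a
        single per-pixel disparity histogram (a 1-D array of vote counts)."""
        hist = np.asarray(hist, dtype=np.float64)
        total = hist.sum()
        if total == 0:
            return None, None, 0
        pdf = hist / total
        cdf = np.cumsum(pdf)
        # Interquartile range of the CDF: narrow => high confidence.
        q1 = int(np.searchsorted(cdf, 0.25))
        q3 = int(np.searchsorted(cdf, 0.75))
        iqr = q3 - q1
        # Count statistically significant local maxima ("modes").
        peak = pdf.max()
        num_modes = 0
        for i in range(len(pdf)):
            left = pdf[i - 1] if i > 0 else 0.0
            right = pdf[i + 1] if i < len(pdf) - 1 else 0.0
            if pdf[i] >= significance * peak and pdf[i] >= left and pdf[i] > right:
                num_modes += 1
        return int(np.argmax(hist)), iqr, num_modes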
• the histogram is not biased by variation in image intensity at all, allowing for high quality disparity edges on depth discontinuities. In addition, this permits other methods of estimating disparity for the given pixel to be easily integrated into a combined histogram.
• SSD performance is proportional to the square of its kernel size multiplied by the number of disparity values being tested for. Even though the small SSD kernel output is a noisy disparity solution, the subsequent voting, which is done by a larger kernel of the pixels to produce the histograms, filters out so much of the noise that it is, in practice, better than the SSD approach, even with very large kernels.
• the histogram accumulation is only an addition function, need only be done once per pixel per frame, and does not increase in cost with additional disparity resolution.
  • Another useful practice of the invention involves testing only for a small set of disparity values with SSD, populating the histogram, and then using the histogram votes to drive further SSD testing within that range to improve disparity resolution over time.
• each output pixel thread having a respective "private histogram" maintained in on-chip storage close to the computation units (e.g., GPUs).
• This private histogram can be stored such that each pixel thread will be reading and writing to the histogram on a single dedicated bank of shared local memory on a modern programmable GPU.
  • multiple histogram bins can be packed into a single word of the shared local memory and accessed using bitwise operations.
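• To illustrate the bin-packing idea only (shown in Python for readability; in practice this would live in GPU shared local memory and use the GPU's native integer operations), several small histogram bins can share one 32-bit word and be updated with bitwise operations. The 8-bit bin width and the saturation behavior are illustrative assumptions:

    BINS_PER_WORD = 4          # four 8-bit bins packed into one 32-bit word
    BIN_MASK = 0xFF

    def packed_increment(words, bin_index):
        """Increment one histogram bin stored packed inside a word array."""
        word_index = bin_index // BINS_PER_WORD
        shift = (bin_index % BINS_PER_WORD) * 8
        count = (words[word_index] >> shift) & BIN_MASK
        if count < BIN_MASK:   # saturate rather than overflow into the neighbouring bin
            words[word_index] = (words[word_index] & ~(BIN_MASK << shift)) | ((count + 1) << shift)

    def packed_read(words, bin_index):
        word_index = bin_index // BINS_PER_WORD
        shift = (bin_index % BINS_PER_WORD) * 8
        return (words[word_index] >> shift) & BIN_MASK

    # A 64-bin private histogram fits in 16 packed words:
    histogram_words = [0] * (64 // BINS_PER_WORD)
    packed_increment(histogram_words, 17)
    assert packed_read(histogram_words, 17) == 1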
  • This practice of the invention is an. extension of the disparity histogram aspect of the invention, and has proven to be an highly useful pint of reducing error in the resulting disparity values, while still preserving important detail, on depth discontinuities in the scene.
  • error in the disparity values can come from many sources.
  • Multi-level disparity histograms reduce the contri bution from several of these error sources, including:
  • the multi-level voting scheme applies that same concept, but across descending frequencies in the image space.
  • FIGS. 28A and 28B show an example of a horizontal stereo image pair
  • FIGS. 28C and 28D show, respectively, the resulting disparity data before and after application of the described multi-level histogram technique.
  • This aspect of the invention works by performing the image pattern matching (FDDE) at several successively low-pass filtered versions of the input stereo images.
  • level is used herein to define a level of detail in the image, where higher level numbers imply a lower level of detail.
  • the peak image frequencies at level[n] will be half those of level[n-1].
  • FIGS. 30A-E are a series of exemplary left and right multi-level input images.
  • Each level [n] is a downsampled version of level [n-1].
  • the downsampling kernel is a 2x2 kernel with equal weights of 0.25 for each pixel. The kernel remains centered at each successive level of downsampling.
  • the FDDE votes for every image level are included.
  • such as the white wooden beams on the cabinets shown in the background of the example of FIG. 30.
  • Level [0] is the full image resolution
  • several possible matches may be found by the FDDE image comparisons, since each of the wooden beams looks rather similar to the others, given the limited kernel size used for the FDDE.
  • FIG. 31 depicts an example of an image pair and disparity histogram, illustrating an incorrect matching scenario and its associated disparity histogram (see the notation "Winning candidate: incorrect" in the histogram of FIG. 31).
  • FIG. 32 shows the same scenario, but with the support of 4 lower levels of FDDE votes in the histogram, resulting in a correct winning candidate (see the notation "Winning candidate: correct" in the histogram of FIG. 32).
  • the lower levels provide support for the true candidate at the higher levels.
  • a lower level, i.e., a level characterized by reduced image resolution via low-pass filtering
  • the individual wooden beams shown in the image become less pronounced, and the overall form of the broader context of that image region begins to dominate the pattern matching.
  • FIG. 33 is a schematic diagram of an exemplary practice of the invention.
  • FIG. 33 depicts a processing pipeline showing the series of operations between the input camera images, through FDDE calculation and multi-level histogram voting, into a final disparity result.
  • multiple stereo axes e.g., 0 through n
  • multi-level disparity histogram representations in accordance with the invention: the following describes how the multi-level histogram is represented, and how to reliably integrate its results to locate the final most likely disparity solution.
  • FIG. 34 shows a logical representation of the multi-level histogram after votes have been placed at each level.
  • FIG. 35 shows a physical representation of the same multi-level histogram in numeric arrays in device memory, such as the digital memory units in a conventional smartphone architecture. In an exemplary practice of the invention, the multi-level histogram consists of a series of initially independent histograms at each level of detail. Each histogram bin in a given level represents the votes for a disparity found by the FDDE at that level. Since level[n] has a fraction of the resolution of level[n-1], each calculated disparity value at that level represents a correspondingly wider disparity uncertainty range. For example, in FIG. 34, each level is half the resolution of the one above it. As such, the disparity uncertainty range represented by each histogram bin is twice as wide as at the level before it.
  • a significant detail to render the multi-level histogram integration correct involves applying a sub-pixel shift to the disparity values at each level during downsampling.
  • Referring to FIG. 34, if we look at the votes in level[0], disparity bin 8, these represent votes for disparity values 8-9. At level [1], the disparity bins are twice as wide. As such, we want to ensure that the histograms remain centered under the level above. Level[1] shows that the same bin represents 7.5 through 9.5. This half-pixel offset is highly significant, because image error can cause the disparity to be rounded to the neighbor bin and then fail to receive support from the level below.
  • an exemplary practice of the invention applies a half pixel shift to only one of the images in the stereo pair at each level of downsampling. This can be done inline within the weights of the filter kernel used to do the downsampling between levels. While it is possible to omit the half pixel shift and use more complex weighting during multi-level histogram summation, it is very inefficient. Performing the half pixel shift during downsampling only involves modifying the filter weights and adding two extra taps, making it almost "free" from a computational standpoint.
  • FIG. 36 shows an example of per-level downsampling according to the invention, using a 2x2 box filter.
  • On the left is illustrated a method without a half pixel shift.
  • On the right of FIG. 36 is illustrated the modified filter with a half pixel shift, in accordance with an exemplary practice of the invention. Note that this half pixel shift should only be applied to one of the images in the stereo pair. This has the effect of disparity values remaining centered at each level in the multi-level histogram during voting, resulting in the configuration shown in FIG. 34.
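  • A minimal sketch of this modified downsampling filter follows (Python/NumPy; grayscale images with even dimensions and a horizontal stereo axis are assumed; the exact tap weights follow from shifting the 2x2 box by half a pixel, which adds one extra column of taps):

    import numpy as np

    def downsample_2x2(img):
        """Plain 2x2 box filter with equal weights of 0.25 (no shift)."""
        return 0.25 * (img[0::2, 0::2] + img[0::2, 1::2] +
                       img[1::2, 0::2] + img[1::2, 1::2])

    def downsample_2x2_half_shift(img):
        """2x2 box filter widened to 2x3 taps (weights 0.125 / 0.25 / 0.125 per row)
        to fold a half-pixel shift along the stereo axis into the downsampling;
        applied to only one image of the stereo pair."""
        img = np.pad(img, ((0, 0), (0, 1)), mode='edge')   # room for the extra column of taps
        return (0.125 * img[0::2, 0:-1:2] + 0.25 * img[0::2, 1::2] + 0.125 * img[0::2, 2::2] +
                0.125 * img[1::2, 0:-1:2] + 0.25 * img[1::2, 1::2] + 0.125 * img[1::2, 2::2])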
  • FIG. 37 illustrates an exemplary practice of the invention, showing an example of the summation of the multi-level histogram to produce a combined histogram in which the peak can be found.
  • the histogram integration involves performing a recursive summation across all of the levels as shown in FIG. 37.
  • the peak disparity index and number of votes for that peak are needed and thus the combined histogram does not need to be actually stored in memory.
  • maintaining a summation stack can reduce summation operations and multi-level histogram memory access.
  • the weighting of each level can be modified to control the amount of effect that the lower levels have on the overall voting.
  • the current value at level [n] gets added to two of the bins above it in level [n-1] with a weight of ½ each.
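  • The following is an illustrative sketch of that recursive summation (Python; histos[0] is the finest level and each coarser level is assumed to have half as many bins; the simple two-bin spreading below is a simplification of the exact bin alignment produced by the half-pixel-shifted downsampling described above, and the per-level weight is an assumed tuning parameter):

    def combine_multilevel(histos, level_weight=1.0):
        """Fold coarser-level votes into the finest level and return the combined
        histogram together with its peak disparity index."""
        levels = [list(h) for h in histos]
        # walk from the coarsest level upward, spreading each bin's votes into the
        # two finer bins above it with a weight of 1/2 each
        for n in range(len(levels) - 1, 0, -1):
            for b, votes in enumerate(levels[n]):
                contribution = 0.5 * level_weight * votes
                levels[n - 1][2 * b] += contribution
                if 2 * b + 1 < len(levels[n - 1]):
                    levels[n - 1][2 * b + 1] += contribution
        combined = levels[0]
        peak = max(range(len(combined)), key=lambda i: combined[i])
        return combined, peak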
  • An exemplary practice of the invention, illustrated in FIGS. 39-40, builds on the disparity histograms and allows for a higher accuracy disparity estimate to be acquired without requiring any additional SSD steps to be performed, and for only a small amount of incremental math when selecting the optimal disparity from the histogram.
  • FIG. 38 is a disparity histogram for a typical pixel.
  • FIG. 39 is a histogram in a situation in which a sub-pixel disparity solution can be inferred from the disparity histogram. We can see that an equal number of votes exists in the 3rd and 4th bins. As such, we can say that the true disparity range lies between 3.5 and 4.5 with a center point of 4.0.
  • FIG. 40 is a histogram that reveals another case in which a sub-pixel disparity solution can be inferred. In this case, the 3rd bin is the peak with 10 votes, and its directly adjacent neighbor is at 5 votes. As such, we can state that the sub-pixel disparity is between these two and closer to the 3rd bin, ranging from 3.25 to 4.25, using the following equation:
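  • The equation referred to above is not reproduced in this text. The sketch below (Python) implements one reading that is consistent with both worked examples: the peak bin's range is shifted toward the runner-up neighbor by 0.5 x (neighbor votes / peak votes), which yields a center of 4.0 for the equal-vote case and a range of 3.25-4.25 (center 3.75) for the 10-vote / 5-vote case. This is an interpretation, not the verbatim formula.

    def subpixel_disparity(hist):
        """Refine the peak disparity using the larger adjacent bin; bin i is assumed
        to span disparities [i, i+1), so its unshifted center is i + 0.5."""
        peak = max(range(len(hist)), key=lambda i: hist[i])
        left = hist[peak - 1] if peak > 0 else 0
        right = hist[peak + 1] if peak + 1 < len(hist) else 0
        neighbor, direction = (right, +1) if right >= left else (left, -1)
        if hist[peak] == 0:
            return float(peak)
        shift = 0.5 * neighbor / hist[peak]       # 0.5 when the two bins hold equal votes
        return peak + 0.5 + direction * shift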
  • Another practice of the invention provides a further method of solving the problem wherein larger kernels in the SSD method tend to favor larger intensity differences within the overall kernel, rather than for the pixel being solved.
  • This method of the invention involves applying a higher weight to the center pixel, with a decreasing weight proportional to the distance of the given kernel sample from the center. By doing this, the error function minimum will tend to be found closer to the valid solution for the pixel being solved. Injective Constraint
  • Yet another aspect of the invention involves the use of an "injective constraint", as illustrated in FIGS. 41-45.
  • the goal is to produce the most correct results possible.
  • incorrect disparity values will get computed, especially if only using the FDDE data, produced via image comparison using SSD, SAD or one of the many image comparison error measurement techniques.
  • FIG. 41 shows an exemplary pair of stereoscopic images and the disparity data resulting from the FDDE using SAD with a 3x3 kernel.
  • Warmer colors represent closer objects.
  • a close look at FIG. 41 reveals occasional values which look obviously incorrect.
  • Some of the factors causing these errors include camera sensor noise, image color response differences between sensors, and lack of visibility of common features between cameras.
  • one way of reducing these errors is by applying "constraints" to the solution which reduce the set of possible solutions to a more realistic set of possibilities.
  • solving the disparity across multiple stereo axes is a form of constraint, by using the solution on one axis to reinforce or contradict that of another axis.
  • the disparity histograms are another form of constraint, limiting the set of possible solutions by filtering out spurious results in 2D space.
  • Multi-level histograms constrain the solution by ensuring agreement of the solution across multiple frequencies in the image.
  • the injective constraint aspect of the invention uses geometric rules about how features must correspond between images in the stereo pair to eliminate false disparity solutions. It maps these geometric rules onto the concept of an injective function in set theory.
  • the domain and co-domain are pixels from each of the stereo cameras on an axis.
  • the references between the sets are the disparity values. For example, if every pixel in the domain (image A) had a disparity value of "0", then this means that a perfect bijection exists between the two images, since every pixel in the domain maps to the same pixel in the co-domain.
  • FIG. 42 shows an example of a bijection where every pixel in the domain maps to a unique pixel in the co-domain. In this case, the image features are all at infinity distance and thus do not appear to shift between the camera images.
  • FIG. 43 shows another example of a bijection. In this case all the image features are closer to the cameras, but are all at the same depth and hence shift together.
  • FIG. 44 shows an example of an image with a foreground and background. Note that the foreground moves substantially between images. This causes new features to be revealed in the co-domain that will have no valid reference in the domain. This is still an injective function, but not a bijection.
  • Best match wins: The actual image matching error or histogram vote count is compared between the two possible candidate elements in the domain against the contested element in the co-domain. The one with the best match wins.
  • Smallest disparity wins: During image reconstruction, errors caused by too small a disparity are typically less noticeable than errors with too high a disparity. As such, if there is a contest for a given co-domain element, select the one with the smallest disparity and invalidate the others.
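  • A simplified, single-scanline sketch of applying the injective constraint with the resolution strategies listed above (Python; the sign convention that a domain pixel x references co-domain element x + disparity is an assumption, not specified in the text):

    def enforce_injective(disparities, matching_error, strategy="smallest_disparity"):
        """Invalidate disparity solutions that cause two domain pixels to reference
        the same co-domain pixel; disparities may contain None for already-invalid pixels."""
        claimed = {}                                  # co-domain element -> domain pixel holding it
        result = list(disparities)
        for x, d in enumerate(disparities):
            if d is None:
                continue
            target = x + d                            # referenced co-domain element
            if target not in claimed:
                claimed[target] = x                   # first claimant keeps it for now
                continue
            prev = claimed[target]
            if strategy == "first_come":
                result[x] = None                      # first-come, first-served
            elif strategy == "best_match":
                if matching_error[x] < matching_error[prev]:
                    result[prev], claimed[target] = None, x
                else:
                    result[x] = None
            elif strategy == "smallest_disparity":
                if d < result[prev]:
                    result[prev], claimed[target] = None, x
                else:
                    result[x] = None
        return result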
  • An exemplary practice of the invention involves the use of a disparity value and a sample buffer index at 2D control points. This aspect works by defining a data structure representing a 2D coordinate in image space and containing a disparity value, which is treated as a "pixel velocity" in screen space with respect to a given movement of the view vector.
  • control points can contain a sample buffer index that indicates which of the camera streams to take the samples from. For example, a given feature may be visible in only one of the cameras, in which case we will want to change the source that the samples are taken from when reconstructing the final reconstructed image.
  • This aspect of the invention is based on the observation that many of the samples in the multiple camera streams are of the same feature and are thus redundant. With a valid disparity estimate, it can be calculated that a feature is either redundant or is a unique feature from a specific camera, and
  • features/samples can be flagged with a reference count of how many of the views "reference" that feature.
  • Compression Method for Streaming with Video
  • a system in accordance with the invention can choose to only encode and transmit samples exactly one time. For example, if the system is capturing 4 camera streams to produce the disparity and control points and have produced reference counts, the system will be able to determine whether a pixel is repeated in all the camera views, or only visible in one. As such, the system need only transmit to the encoder the chunk of pixels from each camera that are actually unique. This allows for a bandwidth reduction in a video streaming session.
  • a system in accordance with the invention can establish an estimate of the viewer head or eye location and/or orientation. With this information and the disparity values acquired from feature correspondence or within the transmitted control point stream, the system can slide the pixels along the head movement vector at a rate that is proportional to the disparity. As such, the disparity forms the radius of a "sphere" of motion for a given feature.
  • This aspect allows a 3D reconstruction to be performed simply by warping a 2D image, provided the control points are positioned along important feature edges and have a sufficiently high quality disparity estimate.
  • no 3D geometry in the form of polygons or higher order surfaces is required.
  • a shortcut to estimate this behavior is to reconstruct the synthetic view based on the view origin and then crop the 2D image and scale it up to fill the view window before presentation, the minima and maxima of the crop box being defined as a function of the viewer head location with respect to the display and the display dimensions.
  • An exemplary practice of the V3D invention contains a hybrid 2D/3D head detection component that combines a fast 2D head detector with the 3D disparity data from the multi-view solver to obtain an accurate viewpoint position in 3D space relative to the camera system.
  • FIGS. 47A-B provide a flow diagram that illustrates the operation of the hybrid markerless head tracking system.
  • the system optionally converts to luminance and downsamples the image, and then passes it to a basic 2D facial feature detector.
  • the 2D feature detector uses the image to extract an estimate of the head and eye position as well as the face's rotation angle relative to the image plane. These extracted 2D feature positions are extremely noisy from frame to frame which, if taken alone as a 3D viewpoint, would not be sufficiently stable for the intended purposes of the invention. Accordingly, the 2D feature detection is used as a starting estimate of a head position.
  • the system uses this 2D feature estimate to extract 3D points from the disparity data that exists in the same coordinate system as the original 2D image.
  • the system first determines an average depth for the face by extracting 3D points via the disparity data for a small area located in the center of the face. This average depth is used to determine a reasonable valid depth range that would encompass the entire head.
  • the system uses the estimated center of the face, the face's rotation angle, and the depth range to determine a best-fit rectangle that includes the head. For both the horizontal and vertical axis, the system calculates multiple vectors that are perpendicular to the axis but spaced at different intervals. For each of these vectors, the system tests the 3D points starting from outside the head and working towards the inside, to the horizontal or vertical axis. When a 3D point is encountered that falls within the previously designated valid depth range, the system considers that a valid extent of the head rectangle.
  • the system can determine a best-fit rectangle for the head, from which the system then extracts all 3D points that lie within this best-fit rectangle and calculates a weighted average. If the number of valid 3D points extracted from this region passes a threshold in relation to the maximum number of possible 3D points in the region, then there is designated a valid 3D head position result.
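  • A minimal sketch of this rectangle-based 3D head position step follows (Python/NumPy; the equal-weight average and the coverage threshold value are simplifications of the weighted average and threshold described above, and all names are illustrative):

    import numpy as np

    def head_position_3d(points_3d, depth_valid, rect, min_coverage=0.3):
        """Average the 3D points inside the best-fit head rectangle, and report a valid
        result only if enough of the rectangle contains points in the valid depth range.

        points_3d:   HxWx3 array of 3D points in the same coordinate system as the 2D image
        depth_valid: HxW boolean mask of points whose depth lies in the valid head depth range
        rect:        (x0, y0, x1, y1) best-fit head rectangle from the 2D estimate
        """
        x0, y0, x1, y1 = rect
        valid = depth_valid[y0:y1, x0:x1]
        pts = points_3d[y0:y1, x0:x1]
        n_valid, n_total = int(valid.sum()), valid.size
        if n_total == 0 or n_valid / n_total < min_coverage:
            return None                               # not enough support for a 3D head position
        return pts[valid].mean(axis=0)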
  • FIG. 48 is a diagram depicting this technique for calculating the disparity extraction
  • the system can interpolate from frame-to-frame based on the time delta that has passed since the previous frame.
  • This method of the invention works by taking one or more source images and a set of control points as described previously.
  • the control points denote "handles" on the image which we can then move around in 2D space and interpolate the pixels in between.
  • the system can therefore slide the control points around in 2D image space proportionally to their disparity value and create the appearance of an image taken from a different 3D perspective.
  • the following are details of how the interpolation can be accomplished in accordance with exemplary practices of the invention.
  • This implementation of 2D warping uses the line drawing hardware and texture filtering available on modern GPU hardware, such as in a conventional smartphone or other mobile device. It has the advantages of being easy to implement, fast to calculate, and avoiding the need to construct complex connectivity meshes between the control points in multiple dimensions. It works by first rotating the source images and control point coordinates such that the rows or columns of pixels are parallel to the vector between the original image center and the new view vector. For purposes of this explanation, assume the view vector is aligned to image scanlines. Next, the system iterates through each scanline and goes through all the control points for that scanline.
  • the system draws a line beginning and ending at each control point in 2D image space, but adds the disparity multiplied by the view vector magnitude to the x coordinate.
  • the system assigns a texture coordinate to the beginning and end points that is equal to their original 2D location in the source image.
  • the GPU will draw the line and will interpolate the texture coordinates linearly along the line. As such, image data between the control points will be stretched linearly. Provided control points are placed on edge features, the interpolation will not be visually obvious.
  • the result is a re-projected image, which is then rotated back by the inverse of the rotation originally applied to align the view vector with the scanlines.
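  • A software-only sketch of the per-scanline warp described above (Python/NumPy; the GPU implementation would instead issue textured line-draw calls, and the rotation to align the view vector with the scanlines is assumed to have already been applied):

    import numpy as np

    def warp_scanline(src_row, control_points, view_magnitude):
        """Slide control points by disparity * view-vector magnitude and linearly
        interpolate the pixels between them.

        control_points: list of (x, disparity) pairs for this scanline, sorted by x
        """
        width = src_row.shape[0]
        out = np.zeros_like(src_row)
        for (x0, d0), (x1, d1) in zip(control_points, control_points[1:]):
            dst0 = x0 + d0 * view_magnitude           # displaced endpoints of the "line"
            dst1 = x1 + d1 * view_magnitude
            n = max(int(round(abs(dst1 - dst0))), 1)
            dst_x = np.linspace(dst0, dst1, n)
            src_x = np.linspace(x0, x1, n)            # acts as the texture coordinate
            samples = np.interp(src_x, np.arange(width), src_row)
            xi = np.clip(np.round(dst_x).astype(int), 0, width - 1)
            out[xi] = samples
        return out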
  • This approach is related to the lines approach, but works by linking control points not only along a scanline but also between scanlines. In certain cases, this may provide a higher quality interpolation than lines alone.
  • the determination of when it is appropriate to slide versus the default stretching behavior can be made by analyzing the disparity histogram and checking for multi-modal behavior. If two strong modes are present, this indicates the control point is on a boundary where it would be better to allow the foreground and background to move independently rather than interpolating depth between them.
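  • As an illustration of that check, the following sketch (Python/NumPy) flags a control point as a candidate for sliding when its disparity histogram shows two strong, well-separated modes; the peak-strength and separation thresholds are assumed values, not taken from the text above:

    import numpy as np

    def should_slide(hist, min_peak_fraction=0.25, min_separation=3):
        """Return True when the histogram is strongly bi-modal, i.e. the control point
        likely sits on a foreground/background depth boundary."""
        total = hist.sum()
        if total == 0:
            return False
        peaks = [i for i in range(1, len(hist) - 1)
                 if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]
                 and hist[i] >= min_peak_fraction * total]
        strongest = sorted(peaks, key=lambda i: hist[i], reverse=True)[:2]
        return len(strongest) == 2 and abs(strongest[0] - strongest[1]) >= min_separation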
  • HMDs head-mounted stereo displays
  • the telecommunications devices can include known forms of cellphones, smartphones, and other known forms of mobile devices, tablet computers, desktop and laptop computers, and known forms of digital network components and server, cloud, network and client
  • Computer software can encompass any set of computer-readable program instructions encoded on a non-transitory computer readable medium.
  • a computer readable medium can encompass any form of computer readable element, including, but not limited to, a computer hard disk, computer floppy disk, computer-readable flash drive, computer-readable RAM or ROM element, or any other known means of encoding, storing or providing digital information, whether local to or remote from the cellphone, smartphone, tablet computer, PC, laptop, computer-driven television, or other digital processing device or system.
  • Various forms of computer readable elements and media are well known in the computing arts, and their selection is left to the implementer.
  • modules can be implemented using computer program modules and digital processing hardware elements, including memory units and other data storage units, and including commercially available processing units, memory units, computers, servers, smartphones and other computing and telecommunications devices.
  • modules include computer program instructions, objects, components, data structures, and the like that can be executed to perform selected tasks or achieve selected outcomes.
  • modules shown in the drawings and discussed in the description herein refer to computer-based or digital processor-based elements that can be implemented as software, hardware, firmware and/or other suitable components, taken separately or in combination, that provide the functions described herein, and which may be read from computer storage or memory, loaded into the memory of a digital processor or set of digital processors, connected via a bus.
  • data storage element can refer to any appropriate memory element usable for storing program instructions, machine readable files, databases, and other data structures.
  • the various digital processing, memory and storage elements described herein can be implemented to operate on a single computing device or system, such as a server or collection of servers, or they can be implemented and inter-operated on various devices across a network, whether in a server-client arrangement, server-cloud-client arrangement, or other configuration in which client devices can communicate with allocated resources, functions or applications programs, or with a server, via a communications network.
  • CM3-U3-13S2C-CS Three PointGrey Chameleon3 (CM3-U3-13S2C-CS) 1.3 Megapixel camera modules with 1/3" sensor size assembled on a polycarbonate plate with shutter synchronization circuit.
  • An Intel Core i7-4650U processor which includes on-chip the following:
  • FIGS. 49-54 depict system aspects of the invention, including digital processing devices and architectures in which the invention can be implemented.
  • FIG. 49 depicts a digital processing device, such as a commercially available smartphone, in which the invention can be implemented
  • FIG. 50 shows a full-duplex, bi-directional practice of the invention between two users and their corresponding devices:
  • FIG. 51 shows the use of a system in accordance with the invention to enable a first user to view a remote scene;
  • FIG. 52 shows a one-to-many configuration in which multiple users (e.g., audience members) can view either simultaneously or asynchronously using a variety of different viewing elements in accordance with the invention;
  • FIG. 53 shows an embodiment of the invention in connection with generating an image data stream for the control system of an autonomous or self-driving vehicle; and
  • FIG. 54 shows the use of a head-mounted display (HMD) in connection with the invention, either in a pass-through mode to view an actual, external scene (shown on the right side of FIG. 54), or to view prerecorded image content.
  • HMD head-mounted display
  • the commercially available smartphone, tablet computer or other digital processing device 492 communicates with a conventional digital communications network 494 via a communications pathway 495 of known form (the collective combination of device 492, network 494 and communications pathway(s) 495 forming configuration 490), and the device 492 includes one or more digital processors 496, cameras 4910 and 4912, digital memory or storage element(s) 4914 containing, among other items, digital processor-readable and processor-executable computer program instructions (programs) 4916, and a display element 498.
  • the processor 496 can execute programs 4916 to carry out various operations, including operations in accordance with the present invention.
  • the full-duplex, bi-directional practice of the invention between two users and their corresponding devices includes first user and scene 503, second user and scene 505, smartphones, tablet computers or other digital processing devices 502, 504, network 506 and communications pathways 508, 5010.
  • the devices 502, 504 respectively include cameras 5012, 5014, 5022, 5024, displays 5016, 5026, processors 5018, 5028, and digital memory or storage elements 5020, 5030 (which may store processor-executable computer program instructions, and which may be separate from the processors).
  • the configuration 510 of FIG. 51 for enabling a first user 514 to view a remote scene 515 containing objects 5022, includes smartphone or other digital processing device 5038, which can contain cameras 5030, 5032, a display 5034, one or more processor(s) 5036 and storage 5038 (which can contain computer program instructions and which can be separate from processor 5036).
  • Configuration 510 also includes network 5024, communications pathways 5026, 5028, remote cameras 516, 518 with a view of the remote scene 515, processor(s) 5020, and digital memory or storage element(s) 5040 (which can contain computer program instructions, and which can be separate from processor 5020).
  • the one-to-many configuration 520 of FIG. 52 in which multiple users (e.g., audience members) using smartphones, tablet computers or other devices 526.1, 526.2, 526.3 can view a remote scene or remote first user 522, either simultaneously or asynchronously, in accordance with the invention, includes digital processing device 524, network 5212 and communications pathways 5214, 5216.1, 5216.2, 5216.3.
  • the smartphone or other digital processing device 524 used to capture images of the remote scene or first user 522, and the smartphones or other digital processing devices 526.1, 526.2, 526.3 used by respective viewers/audience members, include respective cameras, digital processors, digital memory or storage element(s) (which may store computer program instructions executable by the respective processor, and which may be separate from the processor), and displays.
  • the embodiment or configuration 530 of the invention, illustrated in FIG. 53, for generating an image data stream for the control system 5312 of an autonomous or self-driving vehicle 532, can include camera(s) 5310 having a view of scene 534 containing objects 536, and processor(s) 538 (which may include, or have in communication therewith, digital memory or storage elements for storing data and/or processor-executable computer program instructions) in communication with vehicle control system 5312.
  • vehicle control system 5312 may also include digital storage or memory element(s) 5314, which may include executable program instructions, and which may be separate from vehicle control system 5312.
  • HMD-related embodiment or configuration 540 of the invention can include the use of a head-mounted display (HMD) 542 in connection with the invention, either in a pass-through mode to view an actual, external scene 544 containing objects 545 (shown on the right side of FIG. 54), or to view prerecorded image content or data representation 5410.
  • HMD head-mounted display
  • the HMD 542, which can be a purpose-built HMD or an adaptation of a smartphone or other digital processing device, can be in communication with an external processor 546, external digital memory or storage element(s) 548 that can contain computer program instructions 549, and/or in communication with a source of prerecorded content or data representation 5410.
  • the HMD 542 shown in FIG. 54 includes cameras 5414 and 5416 which can have a view of actual scene 544; left and right displays 5418 and 5420 for respectively displaying to a user's left and right eyes 5424 and 5426; digital processor(s) 5412; and a head/eye/face tracking element 5422.
  • the tracking element 5422 can consist of a combination of hardware and software elements and algorithms, described in greater detail elsewhere herein, in accordance with the present invention.
  • the processor element(s) 5412 of the HMD can also contain, or have proximate thereto, digital memory or storage elements, which may store processor-executable computer program instructions.
  • digital memory or storage elements can contain digital processor-executable computer program instructions, which, when executed by a digital processor, cause the processor to execute operations in accordance with various aspects of the present invention.
  • FIGS. 55-80 are flowcharts illustrating method aspects and exemplary practices of the invention
  • the methods depicted in these flowcharts are examples only; the organization, order and number of operations in the exemplary practices can be varied; and the exemplary practices and methods can be arranged or ordered differently, and include different functions, whether singly or in combination, while still being within the spirit and scope of the present invention. Items described below in parentheses are, among other aspects, optional in a given practice of the invention.
  • FIG. 55 is a flowchart of a V3D method 550 according to an exemplary practice of the invention, including the following operations:
  • FIG. 56 is a flowchart of another V3D method 560 according to an exemplary practice of the invention, including the following operations:
  • 561: Capture images of remote scene; 562: Execute image rectification;
  • 567 Display synthetic view to first user (on display screen used by first user);
  • FIG. 57 is a flowchart of a self-portraiture V3D method 570 according to an exemplary practice of the invention, including the following operations: 571: Capture images of user during setup time (use camera provided on or around periphery of display screen of user's handheld device with view of user's face during self-portrait setup time);
  • 573 Generate data representation representative of captured images
  • 574 Reconstruct synthetic view of user, based on the generated data representation and generated tracking information
  • 575 Display to user the synthetic view of user (on the display screen during the setup time) (thereby enabling user, while setting up self-portrait, to selectively orient or position his gaze or head, or handheld device and its camera, with real-time visual feedback); 576: Execute capturing, estimating, generating, reconstructing and displaying such that, in self-portrait, user can appear to be looking directly into camera, even if camera does not have direct eye contact gaze vector to user.
  • FIG. 58 is a flowchart of a photo composition V3D method 580 according to an exemplary practice of the invention, including the following operations: 581: At photograph setup time, capture images of scene to be photographed (use camera provided on a side of user's handheld device opposite display screen side of user's device);
  • tracking information; synthetic view reconstructed such that scale and perspective of synthetic view have selected correspondence to user's viewpoint relative to handheld device and scene;
  • 585 Display to user the synthetic view of the scene (on display screen during setup time) (thereby enabling user, while setting up photograph, to frame scene to be photographed, with selected scale and perspective within display frame, with real-time visual feedback) (wherein user can control scale and perspective of synthetic view by changing position of handheld device relative to position of user's head).
  • FIG. 59 is a flowchart of an HMD-related V3D method 590 according to an exemplary practice of the invention, including the following operations:
  • captured image streams contain images of a scene
  • at least one camera is panoramic, night-vision, or thermal imaging camera
  • 592 Execute feature correspondence function
  • each of the synthetic views has respective view origin corresponding to respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of user's left and right eyes, so as to provide user with substantially natural visual experience of perspective, binocular stereo and occlusion of the scene, substantially as if user were directly viewing scene without an HMD.
  • FIG. 60 is a flowchart of another HMD-related V 3D method 600 according to an exemplary practice of the invention, including the following operations:
  • 601 Capture or generate at least two image streams (using at least one camera);
  • (wherein captured image streams can contain images of a scene); (wherein captured image streams can be pre-recorded image content); (wherein at least one camera is panoramic, night-vision, or thermal imaging); (wherein at least one IR TOF sensor directly provides depth); 602: Execute feature correspondence function;
  • 603 Generate data representation representative of captured images contained in captured image streams
  • 605 Display synthetic views to the user, via HMD;
  • each of the synthetic views has respective view origin corresponding to respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of user's left and right eyes, so as to provide user with substantially natural visual experience of perspective, binocular stereo and occlusion of the scene, substantially as if user were directly viewing scene without an HMD.
  • FIG. 61 is a flowchart of a vehicle control system-related method 610 according to an exemplary practice of the invention, including the following operations:
  • 611 Capture images of scene around at least a portion of vehicle (using at least one camera having a view of scene)
  • FIG. 62 is a flowchart of another V3D method 620 according to an exemplary practice of the invention, which can utilize a view vector rotated camera configuration and/or a number of the following operations:
  • cameras define a line; rotate the line defined by first and second camera locations by a selected amount from selected horizontal or vertical axis to increase number of valid feature
  • FIG. 63 is a flowchart of an exposure cycling method 630 according to an exemplary practice of the invention, including the following operations: 631: Dynamically adjust exposure of camera(s) on frame-by-frame basis to improve disparity estimation in regions outside exposed region: take series of exposures, including exposures lighter than and exposures darker than a visibility-optimal exposure; calculate disparity values for each exposure; and integrate disparity values into an overall disparity solution over time, to improve disparity estimation;
  • the overall disparity solution includes a disparity histogram into which disparity values are
  • the disparity histogram being converged over time, so as to improve disparity estimation.
  • disparity solution includes disparity histogram: analyze variance of disparity histograms on respective dark, mid-range and light pixels to generate variance information used to control exposure settings of camera(s), thereby to form a closed loop between quality of disparity estimate and set of exposures requested from camera(s).
  • FIG. 64 is a flowchart of an image rectification method 640 according to an exemplary practice of the invention, including the following operations:
  • FIGS. 65A-B show a flowchart of a feature correspondence method 650 according to an exemplary practice of the invention, which can include a number of the following operations:
  • (Votes indicated by disparity histogram initially generated utilizing sum of squared differences (SSD): executing SSD method with relatively small kernel to produce fast dense disparity map in which each pixel has selected disparity that represents lowest error; then, processing plurality of pixels to accumulate into disparity histogram a tally of number of votes for given disparity in relatively larger kernel surrounding pixel in question);
  • 651.5 (Test for only a small set of disparity values using small-kernel SSD method to generate initial results; populate corresponding disparity histogram with initial results; then use histogram votes to drive further SSD testing within given range to improve disparity resolution over time)
  • 651.6 (Extract sub-pixel disparity information from disparity histogram: where histogram indicates a maximum-vote disparity range and an adjacent, runner-up disparity range, calculate a weighted average disparity value based on ratio between number of votes for each of the adjacent disparity ranges);
  • the feature correspondence function includes weighting toward a center pixel in a sum-of
  • SSD squared differences
  • the feature correspondence function includes optimising generation of disparity values on
  • FIG. 66 is a flowchart of a method 660 for generating a data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
  • a disparity value treated as a pixel velocity in screen space with respect to a given movement of a given view vector; and utilize the disparity value in combination with movement vector to slide a pixel in a given source image in selected directions, in 2D, to enable a
  • each camera generates a respective camera stream; and the data structure contains a sample buffer index, stored in association with control point coordinates, that indicates which camera stream to sample in association with given control point);
  • FIGS. 67A-B show a flowchart of an image reconstruction method 670, according to an exemplary practice of the invention, which can include a number of the following operations:
  • sliding is utilized in regions of large disparity or depth change
  • integration functions for one or more pixels in a desired output resolution of an image to be displayed to the user maps an input view origin vector to at least one known, weighted 2D image sample location in at least one input image buffer).
  • FIG. 68 is a flowchart of a display method 680, according to an exemplary practice of the invention, which can include a number of the following operations:
  • FIG. 69 is a flowchart of a method 690 according to an exemplary practice of the invention, utilizing a multi-level disparity histogram, and which can also include the following: 691 : Capture images of scene, using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis;
  • Each level is assigned a level number, and each successively higher level is characterized by a lower image resolution:
  • Each histogram bin in a given level represents votes for a disparity determined by the FDDE at that level
  • Each histogram bin in a given level has an associated disparity uncertainty range, and the disparity uncertainty range represented by each histogram bin is a selected multiple wider than the disparity uncertainty range of a bin in the preceding level;
  • rounding error effect apply half pixel shift to only one of the images in a stereo pair at each level of downsampling
  • 694.1 Apply sub-pixel shift implemented inline, within the weights of the filter kernel utilized to implement the downsampling from level to level. 695: Execute histogram integration, including executing a recursive summation across all the FDDE levels;
  • FIG. 70 is a flowchart of a method 700 according to an exemplary practice of the invention, utilizing RUD image space and including the following operations:
  • Capture images of scene using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and for each camera pair axis, execute image capture using the camera pair to generate image data;
  • FIG. 71 is a flowchart of a method 710 according to an exemplary practice of the invention, utilizing an injective constraint aspect and including the following operations:
  • 711 Capture images of a scene, using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair;
  • a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values
  • the feature correspondence function including: generating a disparity solution based on the disparity values, and applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera, and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, and wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
  • FIG. 72 is a flowchart of a method 720 for applying an injective constraint, according to an exemplary practice of the invention, including the following operations:
  • FIG. 73 is a flowchart of a method 730 relating to error correction approaches based on injective constraint, according to an exemplary practice of the invention, including one or more of the following:
  • First-come, first-served: assign priority to the first element in the domain to claim an element in the co-domain, and if a second element in the domain claims the same co-domain element, invalidating that subsequent match and designating that subsequent match to be invalid;
  • Best match wins: compare the actual image matching error or corresponding histogram vote count between the two possible candidate elements in the domain against the contested element in the co-domain, and designate as winner the domain candidate with the best match;
  • Smallest disparity wins if there is a contest between candidate elements in the domain for a given co-domain element, wherein each candidate element has a corresponding disparity, selecting the domain candidate with the smallest disparity and designating the others as invalid;
  • Seek alternative candidates: select and test the next best domain candidate, based on a selected criterion, and iterating the selecting and testing until the violation is eliminated or a computational time limit is reached.
  • FIG. 74 is a flowchart of a head/eye/face location estimation method 740 according to an exemplary practice of the invention, including the following operations:
  • tracking information which can include the following:
  • 744.1 Pass a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane;
  • 2D two-dimensional
  • 744.2 Use an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head; 744.3: Extract from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculate therefrom a representative 3D head position;
  • FIG. 75 is a flowchart of a method 750 providing further optional operations relating to the 3D location estimation shown in FIG. 74, according to an exemplary practice of the invention, including the following:
  • FIG. 76 is a flowchart of optional sub-operations 760 relating to 3D location estimation, according to an exemplary practice of the invention, which can include a number of the following:
  • FIG. 77 is a flowchart of a method 770 according to an exemplary practice of the invention, utilizing URUD image space and including the following operations:
  • Capture images of a scene using at least three cameras having a view of the scene, the cameras being arranged in a substantially T-shaped configuration wherein a first pair of cameras is disposed along a first axis and a second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axes, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera;
  • FIG. 78 is a flowchart of a method 780 relating to optional operations in RUD/URUD image space according to an exemplary practice of the invention, including the following operations:
  • FIG. 79 is a flowchart of a method 790 relating to private disparity histograms according to an exemplary practice of the invention, including the following operations:
  • a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values, using a disparity histogram method to integrate data and determine correspondence, which can include:
  • FIG. 80 is a flowchart of a method 800 further relating to private disparity histograms according to an exemplary practice of the invention, including the following operations:
  • disparity histogram is characterized by a series of histogram bins indicating the number of votes for a given disparity range; and if a maximum possible number of votes in the private disparity histogram is known, multiple histogram bins can be packed into a single word of the shared local memory and accessed using bitwise GPU access operations.
  • Identification, authentication or matching of a user or subject by the user's facial features can be useful in a wide range of settings. These may include controlling or limiting access to systems, enabling rapid or simplified access to systems or to a particular use account or use profile on a system, or other security purposes. Exemplary practices and embodiments of the invention enable such identification, authentication or matching by generating a Facial Signature based on images of the user's or subject's face, or face and head, as described in greater detail below.
  • the digital processor elements of the embodiments of the invention depicted in the accompanying drawing figures can be employed to execute the Facial Signature functions of exemplary practices and embodiments of the i nvention described herein, including image capture, image rectification, feature correlation/disparity value processing, and Facial Signature generation functions.
  • the Facial Signature aspects of the invention can be executed on otherwise conventional processing elements and platforms provided by or associated with known forms of desktop computers, laptop computers, tablet computers, smartphones, and associated additional or peripheral hardware elements, such as cameras, suitably configured in accordance with exemplary practices of the invention.
  • the front-end aspects of the V3D processing pipeline described above, i.e., aspects of Image Capture, Image Rectification and Feature Correspondence, are employed, but instead of constructing a representation intended for 3D streaming of a scene for visualizing it from different views (see, e.g., FIGS. 7 and 8, depicting exemplary practices and embodiments of the V3D invention), the V3D process front-end can be configured to construct a "Facial Signature" for the purposes of subsequently identifying an individual person, or user of a system or resource, in a secure manner that is substantially more difficult to forge than a regular 2D facial image.
  • a "Facial Signature" for the purposes of subsequently identifying an individual person, or user of a system or resource
  • FIG. 85 is a flowchart of an exemplary practice of the Facial Signature aspects using V3D process elements of the invention, including capturing images of the user's or subject's face (851), executing image rectification to compensate for camera optical distortion and alignment (852), executing feature correspondence to produce disparity/depth values (853), eliminating the image background (854), and generating a facial signature data representation (855).
  • the enhanced level of security provided by the Facial Signature aspect of the invention is enabled in part because the depth stereo estimation of the V3D method of the invention described in this document requires all of the facial features to be presented to the camera(s) at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor. Creating a forgery would require an accurate physical model of the face in the real world. By requiring multiple poses, the forger's challenge of constructing an accurate 3D model becomes highly impractical.
  • FIG. 81 illustrates an exemplary practice of the Facial Signature aspect of the invention, including obtaining images from the camera(s) (81.1), generating rectified images (81.2), executing disparity/depth estimation (81.3), executing background elimination (81.4), and combining with 2D color information (81.5a, 81.5b, 81.5c), which can occur using multiple poses of the human user/subject, as described in greater detail below.
  • the facial signature could be a combination of the 3D facial contour information and the regular 2D image from one or more of the cameras.
  • the Facial Signature aspect of the invention could either store the 3D contour data in the signature, or simply use the 2D image of the face but use the 3D facial contours just to confirm that the image(s) depict an actual human face with credible 3D proportions that was viewed by the cameras at the same location as the 2D image.
  • a method in accordance with the Facial Signature aspect can also include an enrollment phase in which the human user or subject would be requested, by the system, to move his or her head into different orientations, and, optionally, strike a number of alternative facial poses, such as "smile" or "wink", so that the system can establish a robust scan (or multiple scans) of the human subject's facial proportions.
  • an enrolled Facial Signature is generated from these scans.
  • a few seconds of images of the user's or subject's head can be captured in real-time, resulting in hundreds of individual captures, each slightly different, and then correlated with the facial signature to confirm a match.
  • Exemplary practices of the invention can be configured for a variety of purposes, including, but not limited to, the following:
  • the facial signature generated from the depth information extracted from the V3D front-end can be used to identify a specific individual. Such an identification would be more reliable than a
  • the facial signature aspects of the invention can be combined with other security factors, such as a fingerprint or a pass-code, to provide a high level of security for accessing a user account on a device or system, or for other authentication purposes. In a hybrid configuration with a conventional 2D face identification system:
  • the 3D contour data could alone be used to identify a face; combining it with the existing 2D image from one or more cameras would add further security.
  • Existing 2D facial detection systems return one or more rectangles or "boxes" defining the 2D extent of a human subject's face location.
  • a 2D facial detection operation could be executed, and then used to minimize the amount of processing required for the 3D calculation by limiting the calculations to within that box. Utilizing the 2D facial detection operation in this manner can reduce the system's power consumption, and reduce the time required for facial recognition on a given device.
  • a user would train the device by generating a unique facial signature.
  • the system would request the user to present a series of desired head movements relative to the device, or a series of facial expressions, such as "smile" or "wink."
  • an enrolled Facial Signature is generated from these scans. By collecting a series of possible poses, the signature would have higher dimensionality than would a single pose. Matching process:
  • the matching process could simply observe or capture images of the user for a few seconds, and with a sufficiently high-performance depth detection system, would capture many frames of 3D and 2D data. In accordance with the invention, this would be correlated with the facial signature captured during the enrollment process, and a probability of match score would be generated. This score would then be compared with a threshold to confirm or deny an identity match.
  • an exemplary practice of the facial signature method of the invention includes updating or evolving the facial signature itself on every successful match, or on every nth successful match, where n is a selected integer, in order to accommodate gradual changes in the user's facial features over time.
  • one method of representing the facial signature is in the form of one or more combined histograms taken directly from the summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth. These combined histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
  • FIGS. 82-83 show an exemplary image processed in accordance with an exemplary practice of the Facial Signature aspects of the invention.
  • FIG. 82 is an example of an image of a human user or subject captured by at least one camera
  • FIG. 83 is an example of a representation of image data corresponding to the image of FIG. 82, processed in accordance with an exemplary practice of the invention.
  • FIG. 84 shows a histogram representation corresponding to the image(s) of FIGS. 82-83, generated in accordance with an exemplary practice of the Facial Signature aspects of the invention.
  • the X-axis of the histogram would represent a disparity (or depth) range
  • the Y-axis would represent the normalized count of image samples that fell within that range.
  • a conventional 2D face detector can be employed to provide a face rectangle and location of the basic facial features, such as eyes, nose and mouth. See, e.g., FIG. 83, which indicates, among other aspects, a rectangle surrounding the human subject's face.
  • when a candidate identification histogram is captured, it would be subtracted from the set of enrolled histograms and the vector distance would constitute a matching score. By comparing the matching score against a programmable threshold, access could be granted or denied.
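  • As an illustrative sketch of that matching step (Python/NumPy; the distance metric, normalization and threshold value are assumed choices, not specified in the text above):

    import numpy as np

    def match_score(candidate_hist, enrolled_hists):
        """Smallest vector distance between the candidate depth histogram and the
        set of enrolled histograms; lower means a closer match."""
        c = candidate_hist / max(candidate_hist.sum(), 1e-9)      # normalize to unit total
        best = np.inf
        for enrolled in enrolled_hists:
            e = enrolled / max(enrolled.sum(), 1e-9)
            best = min(best, float(np.linalg.norm(c - e)))
        return best

    def authenticate(candidate_hist, enrolled_hists, threshold=0.15):
        """Grant access when the matching score falls below a programmable threshold."""
        return match_score(candidate_hist, enrolled_hists) < threshold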
  • this method could be used in isolation or paired in a hybrid configuration with conventional 2D image matching of the face to provide a further authentication factor.
  • FIGS. 85-88 are flowcharts illustrating method aspects and exemplary practices of the invention.
  • the methods depicted in these flowcharts are examples only; the organization, order and number of operations in the exemplary practices can be varied; and the exemplary practices and methods can be arranged or ordered differently, and include different functions, whether singly or in combination, while still being within the spirit and scope of the present invention. Items described below in parentheses are, among other aspects, optional in a given practice of the invention.
  • FIG. 85 is a flowchart of a method 850 for generating a facial signature data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
  • FIG. 86 is a flowchart of further 'method aspects 860 for generating a faciai signature data representation, according to an exemplary practice of the invention, which can incl ude a number of the following operations:
  • the metiiod or system can uiibze stereo depth estimation t verify that human facial features are presented to eanierai s) at correct distance ratios between camera(s) Or front structured light or time-of-flight sensor);
  • the feature correspondence function or depth detection function includes computing distances between facial features from multiple perspectives
  • the facial signature can be combination, of 3D faciai contour information and. 2D image data, from one or more camera(s));
  • (3D contour data can be stored in facial signature data representation);
  • (a facial signature generated in accordance with the invention can be utilized as a security factor in an authentication system, either alone or in combination with other security factors);
  • (3D facial contour data can be combined with 2D image data from one or more cameras in a conventional 2D face identification system, to create a hybrid 3D/2D face identification system);
  • (3D facial contour data can be used to confirm that a face having credible 3D human facial proportions was presented to the camera(s) at an overlapping spatial location of captured 2D image(s));
  • (a 2D bounding rectangle defining a 2D extent of the human user's or subject's face location can be used to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption);
  • (the facial signature data representation can be a histogram-based facial signature data representation);
  • FIG. 87 is a flowchart of method aspects 870 for generating a histogram-based facial signature data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
  • (the facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth);
  • (the histogram represents the normalized relative proportion of facial feature depths across a plane parallel to the user's or subject's face);
  • (the Y-axis represents the normalized count of image samples that fall within a given range);
  • (a conventional 2D face detector can provide a face rectangle and location of the basic facial features);
  • 870.6: (Disparity and depth points can be projected into a canonical coordinate system defined by a plane geometrically constructed from or defined by basic facial features such as eyes, nose, mouth); 870.7: (The histogram evaluation can be used in combination with conventional 2D face matching to provide an additional authentication factor).
  • FIG. 88 is a flowchart of a facial signature method aspect 880, including enrollment and matching phases of an exemplary facial signature method in accordance with the invention, which can include a number of the following operations:
  • Capture images (using at least one camera) for the enrollment phase (can utilize and require a selected number (n) of poses of the human user or subject);
  • Capture images (using at least one camera) for the matching phase (can utilize and require a selected number (n) of poses of the human user or subject);
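The histogram-based enrollment and matching operations outlined in the list above can be sketched in code. The following Python fragment is an illustrative sketch only, not an implementation taken from this disclosure; the function names, bin count, disparity range and threshold value are assumptions chosen for clarity.

    import numpy as np

    def face_disparity_histogram(disparity_map, face_rect, bins=32, disp_range=(0.0, 64.0)):
        # Build a normalized histogram of disparity (or depth) values inside the
        # 2D face rectangle reported by a conventional 2D face detector.
        x, y, w, h = face_rect
        region = disparity_map[y:y + h, x:x + w]
        valid = region[np.isfinite(region)]          # skip pixels with no disparity solution
        hist, _ = np.histogram(valid, bins=bins, range=disp_range)
        total = hist.sum()
        return hist / total if total > 0 else hist.astype(float)

    def matching_score(candidate, enrolled_histograms):
        # Vector distance between the candidate histogram and the closest histogram
        # in the enrolled set (one histogram per enrolled pose).
        return min(np.linalg.norm(candidate - h) for h in enrolled_histograms)

    def is_match(candidate, enrolled_histograms, threshold=0.15):
        # Compare the matching score against a programmable threshold.
        return matching_score(candidate, enrolled_histograms) < threshold

In such a sketch, a lower matching score indicates a closer match; the threshold trades off false accepts against false rejects and would be tuned empirically.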

Abstract

Methods, systems and computer program products ("software") enable a virtual three-dimensional visual experience (referred to herein as "V3D") in videoconferencing and other applications; the capturing, processing and displaying of images and image streams; and generation of a facial signature based on images of a given human user's or subject's face, or face and head, for accurate, reliable identification or authentication of a human user or subject, in a secure, difficult to forge manner.

Description

FACIAL SIGNATURE METHODS, SYSTEMS AND SOFTWARE
Cross-Reference to Related Applications / Incorporation by Reference
This application for patent claims the priority benefit of commonly-owned U.S. Provisional Application for Patent Serial No. 62/160,563, filed May 12, 2015, entitled "Facial Signature Methods, Systems and Software" (Attorney Dkt. MNE-113-PR), which is incorporated by reference herein as if set forth herein in its entirety.
This application for patent is also a Continuation-in-Part (CIP) of commonly-owned PCT Patent Application No. PCT/US16/23433, filed March 21, 2016, entitled "Virtual 3D Methods, Systems and Software" (Attorney Dkt. MNE-111-PCT), which is incorporated by reference herein as if set forth herein in its entirety.
Also incorporated by reference herein as if set forth herein in their entireties are the following: U.S. Pat. App. Pub. No. 2013/0101160, Woodfill et al.;
Carranza et al., "Free-Viewpoint Video of Human Actors," ACM Transactions on Graphics, vol. 22, no. 3, pp. 569-577, July 2003;
Chu et al., "OpenCV and TYZX: Video Surveillance for Tracking," Sandia Report SAND2008-5776, August 2008;
Gordon et al., "Person and Gesture Tracking with Smart Stereo Cameras," Proc. SPIE, vol. 6805, Jan. 2008;
Hannah, "Computer Matching of Areas in Stereo Images," Thesis, Stanford University Computer Science Department Report STAN-CS-74-438, July 1974;
Harrison et al., "Pseudo-3D Video Conferencing with a Generic Webcam," 2008 IEEE Int'l Symposium on Multimedia, pp. 236-241;
Luo et al., "Hierarchical Genetic Disparity Estimation Algorithm for Multiview Image Synthesis," 2000 IEEE Int'l Conf. on Image Processing, vol. 2, pp. 768-771;
Zabih et al., "Non-parametric Local Transforms for Computing Visual Correspondence," Proc. European Conf. on Computer Vision, May 1994, pp. 151-158.
Field of the Invention
The present invention relates generally to methods, systems and computer program products ("software") for enabling a virtual three-dimensional visual experience (referred to herein as "V3D") in videoconferencing and other applications; for capturing, processing and displaying of images and image streams; and for generating a facial signature, based on images of a given human user's or subject's face, for enabling accurate, reliable identification or authentication of a human user or subject in a secure, difficult to forge manner.
Background of the Invention
It would be desirable to provide methods, systems, devices and computer software program code products that:
(1) enable improved visual aspects of videoconferencing over otherwise conventional
telecommunications networks and devices;
(2) enable a first user in a videoconference to view a second, remote user in the videoconference with direct virtual eye contact with the second user, even if no camera used in the videoconference set-up has a direct eye contact gaze vector to the second user;
(3) enable a virtual 3D experience (referred to herein as V3D), not only in videoconferencing but in other applications such as viewing of remote scenes, and in virtual reality (VR) applications;
(4) facilitate self-portraiture of a user utilizing a handheld device to take the self-portrait;
(5) facilitate composition of a photograph of a scene, by a user utilizing a handheld device to take the photograph;
(6) provide such features in a manner that fits within the form factors of modern mobile devices such as tablet computers and smartphones, as well as the form factors of laptops, PCs, computer-driven televisions, computer-driven projector devices, and the like, does not dramatically alter the economics of building such devices, and is viable within current, or near-current, communications network/connectivity architectures;
(7) improve the capturing and displaying of images to a user utilizing a binocular stereo head-mounted display (HMD) in a pass-through mode;
(8) improve the capturing and displaying of image content on a binocular stereo head-mounted display (HMD), wherein the captured content is prerecorded content;
(9) generate an input image stream adapted for use in the control system of an autonomous or self-driving vehicle;
(10) generate a facial signature, based on images of a given human user's or subject's face, or face and head, for enabling accurate, reliable identification, authentication or matching of a human user or subject, in a secure, difficult to forge manner.
The present invention provides methods, systems, devices and computer software/program code products that enable the foregoing aspects and others. Some embodiments and practices of the invention are collectively referred to herein as V3D. Certain other embodiments and practices of the invention are collectively referred to as Facial Signature aspects of the invention. As described in greater detail below, certain Facial Signature aspects of the invention may utilize certain V3D aspects of the invention.
Aspects, examples, embodiments and practices of the invention, whether in the form of methods, devices, systems or computer software/program code products, will next be described in greater detail in the following Summary of the Invention and Detailed Description of the Invention sections, in conjunction with the attached drawing figures.
Summary of the Invention
The present invention provides methods, systems, devices, and computer software/program code products for, among other aspects and possible applications, facilitating video communications and presentation of image and video content, and generating image input streams for a control system of autonomous vehicles; and for generating a facial signature, based on images of a human user's or subject's face, for enabling accurate, reliable identification or authentication of a human user or subject, in a secure, difficult to forge manner.
Methods, systems, devices, and computer software/program code products in accordance with the invention are suitable for implementation or execution in, or in conjunction with, commercially available computer graphics processor configurations and systems including one or more display screens for displaying images, cameras for capturing images, and graphics processors for rendering images for storage or for display, such as on a display screen, and for processing data values for pixels in an image representation. The cameras, graphics processors and display screens can be of a form provided in commercially available smartphones, tablets and other mobile telecommunications devices, as well as in commercially available laptop and desktop computers, which may communicate using commercially available network architectures including client/server and client network/cloud architectures.
In the aspects of the invention described below and hereinafter, the algorithmic image processing methods described are executed by digital processors, which can include graphics processor units, including GPGPUs such as those commercially available on cellphones, smartphones, tablets and other commercially available telecommunications and computing devices, as well as in digital display devices and digital cameras. Those skilled in the art to which this invention pertains will understand the structure and operation of digital processors, GPGPUs and similar digital graphics processor units.
While a number of the following aspects are described in the context of one-directional ("half-duplex") configurations, those skilled in the art will understand that the invention further relates to and encompasses providing bi-directional, full-duplex configurations of the claimed subject matter.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable a first user to view a second user with direct virtual eye contact with the second user. This aspect of the invention comprises capturing images of the second user, utilizing at least one camera having a view of the second user's face; generating a data representation, representative of the captured images; reconstructing a synthetic view of the second user, based on the representation; and displaying the synthetic view to the first user on a display screen used by the first user; the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
Another aspect includes executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; wherein the data representation is representative of the captured images and the corresponding disparity values; the capturing, detecting, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen.
In another aspect, the capturing includes utilizing at least two cameras, each having a view of the second user's face; and executing a feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
In yet another aspect, the capturing comprises utilizing at least one camera having a view of the second user's face, and which is an infrared time-of-flight camera that directly provides depth
information; and the data representation is representative of the captured images and corresponding depth information.
In a further practice of the invention, the capturing includes utilizing a single camera having a view of the second user's face; and executing a feature correspondence function comprises detecting common features between sequential images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
In another aspect, the captured images of the second user comprise visual information of the scene surrounding the second user; and the capturing, detecting, generating, reconstructing and displaying are executed such that: (a) the first user is provided the visual impression of looking through his display screen as a physical window to the second user and the visual scene surrounding the second user, and (b) the first user is provided an immersive visual experience of the second user and the scene surrounding the second user.
Another practice of the invention includes executing image rectification to compensate for optical distortion of each camera and relative misalignment of the cameras.
In another aspect, executing image rectification comprises applying a 2D image space transform; and applying a 2D image space transform comprises utilizing a GPGPU processor running a shader program.
In one practice of the invention, the cameras for capturing images of the second user are located at or near the periphery or edges of a display device used by the second user, the display device used by the second user having a display screen viewable by the second user and having a geometric center; and the synthetic view of the second user corresponds to a selected virtual camera location, the selected virtual camera location corresponding to a point at or proximate to the geometric center.
In another practice of the invention, the cameras for capturing images of the second user are located at a selected position outside the periphery or edges of a display device used by the second user.
In still another aspect of the invention, respective camera view vectors are directed in non-coplanar orientations.
In another aspect, the cameras for capturing images of the second user, or of a remote scene, are located in selected positions and positioned with selected orientations around the second user or the remote scene. Another aspect includes estimating a location of the first user's head or eyes, thereby generating tracking information; and the reconstructing of a synthetic view of the second user comprises
reconstructing the synthetic view based on the generated data representation and the generated tracking information.
In one aspect of the invention, camera shake effects are inherently eliminated, in that the capturing, detecting, generating, reconstructing and displaying are executed such that the first user has a virtual direct view through his display screen to the second user and the visual scene surrounding the second user; and scale and perspective of the image of the second user and objects in the visual scene surrounding the second user are accurately represented to the first user regardless of user view distance and angle.
This aspect of the invention provides, on the user's display screen, the visual impression of a frame without glass; a window into a 3D scene of the second user and the scene surrounding the second user.
In one aspect, the invention is adapted for implementation on a mobile telephone device, and the cameras for capturing images of the second user are located at or near the periphery or edges of a mobile telephone device used by the second user.
In another practice of the invention, the invention is adapted for implementation on a laptop or desktop computer, and the cameras for capturing images of the second user are located at or near the periphery or edges of a display device of a laptop or desktop computer used by the second user.
In another aspect, the invention is adapted for implementation on computing or telecommunications devices comprising any of tablet computing devices, computer-driven television displays or computer-driven image projection devices, and wherein the cameras for capturing images of the second user are located at or near the periphery or edges of a computing or telecommunications device used by the second user.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable a user to view a remote scene in a manner that gives the user a visual impression of being present with respect to the remote scene. This aspect of the invention includes capturing images of the remote scene, utilizing at least two cameras each having a view of the remote scene; executing a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values; generating a data representation, representative of the captured images and the corresponding disparity values; reconstructing a synthetic view of the remote scene, based on the representation; and displaying the synthetic view to the first user on a display screen used by the first user; the capturing, detecting, generating, reconstructing and displaying being executed such that: (a) the user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and (b) the user is provided an immersive visual experience of the remote scene.
In one aspect of the invention, the capturing of images includes using at least one color camera. In another practice of the invention, the capturing includes using at least one infrared structured light emitter.
In yet another aspect, the capturing comprises utilizing a view vector rotated camera configuration wherein the locations of first and second cameras define a line; and the line defined by the first and second camera locations is rotated by a selected amount from a selected horizontal or vertical axis; thereby increasing the number of valid feature correspondences identified in typical real-world settings by the feature correspondence function.
In another aspect of the invention, the first and second cameras are positioned relative to each other along epipolar lines.
In a further aspect, subsequent to the capturing of images, disparity values are rotated back to a selected horizontal or vertical orientation along with the captured images.
In another aspect, subsequent to the reconstructing of a synthetic view, the synthetic view is rotated back to a selected horizontal or vertical orientation.
In yet another practice of the invention, the capturing comprises exposure cycling, comprising dynamically adjusting the exposure of the cameras on a frame-by-frame basis to improve disparity estimation in regions outside the exposed region viewed by the user; wherein a series of exposures are taken, including exposures lighter than and exposures darker than a visibility-optimal exposure, disparity values are calculated for each exposure, and the disparity values are integrated into an overall disparity solution over time, so as to improve disparity estimation.
In another aspect, the exposure cycling comprises dynamically adjusting the exposure of the cameras on a frame-by-frame basis to improve disparity estimation in regions outside the exposed region viewed by the user; wherein a series of exposures are taken, including exposures lighter than and exposures darker than a visibility-optimal exposure, disparity values are calculated for each exposure, and the disparity values are integrated in a disparity histogram, the disparity histogram being converged over time, so as to improve disparity estimation.
A further aspect of the invention comprises analyzing the quality of the overall disparity solution on respective dark, mid-range and light pixels to generate variance information used to control the exposure settings of the cameras, thereby to form a closed loop between the quality of the disparity estimate and the set of exposures requested from the cameras.
Another aspect includes analyzing variance of the disparity histograms on respective dark, mid-range and light pixels to generate variance information used to control the exposure settings of the cameras, thereby to form a closed loop between the quality of the disparity estimate and the set of exposures requested from the cameras.
In one practice of the invention, the feature correspondence function includes evaluating and combining vertical- and horizontal-axis correspondence information.
In another aspect, the feature correspondence function further comprises applying, to image pixels containing a disparity solution, a coordinate transformation to a unified coordinate system. The unified coordinate system can be the un-rectified coordinate system of the captured images. Another aspect of the invention includes using at least three cameras arranged in a substantially "L"-shaped configuration, such that a pair of cameras is presented along a first axis and a second pair of cameras is presented along a second axis substantially perpendicular to the first axis.
In a further aspect, the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence.
In accordance with another aspect of the invention, the feature correspondence function comprises refining correspondence information over time. The refining can include retaining a disparity solution over a time interval, and continuing to integrate disparity solution values for each image frame over the time interval, so as to converge on an improved disparity solution by sampling over time.
In another aspect, the feature correspondence function comprises filling unknowns in a correspondence information set with historical data obtained from previously captured images. The filling of unknowns can include the following: if a given image feature is detected in an image captured by one of the cameras, and no corresponding image feature is found in a corresponding image captured by another of the cameras, then utilizing data for a pixel corresponding to the given image feature, from a corresponding, previously captured image.
In a further aspect of the invention, the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence. This aspect of the invention can include constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel. The disparity histogram functions as a Probability Density Function (PDF) of disparity for the given pixel, in which higher values indicate a higher probability of the corresponding disparity range being valid for the given pixel.
In another practice of the invention, one axis of the disparity histogram indicates a given disparity range, and a second axis of the histogram indicates the number of pixels in a kernel surrounding the central pixel in question that are voting for the given disparity range.
In one aspect of the invention, votes indicated by the disparity histogram are initially generated utilizing a Sum of Squared Differences (SSD) method, which can comprise executing an SSD method with a relatively small kernel to produce a fast dense disparity map in which each pixel has a selected disparity that represents the lowest error; then, processing a plurality of pixels to accumulate into the disparity histogram a tally of the number of votes for a given disparity in a relatively larger kernel surrounding the pixel in question.
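As a non-authoritative illustration of this two-stage voting scheme, the following Python sketch computes a small-kernel SSD disparity map and then tallies, for one pixel, the votes cast by the surrounding larger kernel; the kernel sizes, disparity range and use of scipy are assumptions made for brevity.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def ssd_fast_disparity(left, right, max_disp=32, kernel=3):
        # Small-kernel SSD: for each pixel keep the disparity with the lowest error,
        # producing a fast dense disparity map.
        left = left.astype(np.float32)
        right = right.astype(np.float32)
        best = np.zeros(left.shape, dtype=np.int32)
        best_err = np.full(left.shape, np.inf, dtype=np.float32)
        for d in range(max_disp):
            shifted = np.roll(right, d, axis=1)
            err = uniform_filter((left - shifted) ** 2, size=kernel)  # SSD over the small kernel
            mask = err < best_err
            best[mask] = d
            best_err[mask] = err[mask]
        return best

    def disparity_histogram(fast_disp, y, x, max_disp=32, kernel=7):
        # Tally how many pixels in the larger kernel surrounding (y, x) vote for each
        # disparity range; the result approximates a per-pixel PDF of disparity.
        half = kernel // 2
        votes = fast_disp[y - half:y + half + 1, x - half:x + half + 1]
        return np.bincount(votes.ravel(), minlength=max_disp)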
Another aspect of the invention includes transforming the disparity histogram into a Cumulative Distribution Function (CDF) from which the width of a corresponding interquartile range can be determined, thereby to establish a confidence level in the corresponding disparity solution.
A further aspect includes maintaining a count of the number of statistically significant modes in the histogram, thereby to indicate modality. In accordance with the invention, modality can be used as an input to the above-described reconstruction aspect, to control application of a stretch vs. slide reconstruction method. Still another aspect of the invention includes maintaining a disparity histogram over a selected time interval and accumulating samples into the histogram, thereby to compensate for camera noise or other sources of motion or error.
Another aspect includes generating fast disparity estimates for multiple independent axes; and then combining the corresponding, respective disparity histograms to produce a statistically more robust disparity solution.
Another aspect includes evaluating the interquartile range of a CDF of a given disparity histogram to produce an interquartile result; and if the interquartile result is indicative of an area of poor sampling signal-to-noise ratio, due to camera over- or underexposure, then controlling camera exposure based on the interquartile result to improve a poorly sampled area of a given disparity histogram.
Yet another practice of the invention includes testing for only a small set of disparity values using a small-kernel SSD method to generate initial results; populating a corresponding disparity histogram with the initial results; and then, using histogram votes to drive further SSD testing within a given range to improve disparity resolution over time.
Another aspect includes extracting sub-pixel disparity information from the disparity histogram, the extracting including the following: where the histogram indicates a maximum-vote disparity range and an adjacent, runner-up disparity range, calculating a weighted average disparity value based on the ratio between the number of votes for each of the adjacent disparity ranges.
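A minimal sketch of this sub-pixel refinement, with assumed names and integer disparity bins, follows.

    import numpy as np

    def subpixel_disparity(histogram):
        # histogram[i] holds the number of votes for disparity range i.
        peak = int(np.argmax(histogram))
        if peak == 0:
            runner_up = 1
        elif peak == len(histogram) - 1:
            runner_up = peak - 1
        else:
            # The adjacent bin with the larger vote count is the runner-up.
            runner_up = peak - 1 if histogram[peak - 1] >= histogram[peak + 1] else peak + 1
        total = histogram[peak] + histogram[runner_up]
        if total == 0:
            return float(peak)
        # Weighted average of the two adjacent ranges, based on their vote ratio.
        return (peak * histogram[peak] + runner_up * histogram[runner_up]) / total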
In another practice of the invention, the feature correspondence function comprises weighting toward a center pixel in a Sum of Squared Differences (SSD) approach, wherein the weighting includes applying a higher weight to the center pixel for which a disparity solution is sought, and a lesser weight outside the center pixel, the lesser weight being proportional to the distance of a given kernel sample from the center.
In another aspect, the feature correspondence function comprises optimizing generation of disparity values on GPGPU computing structures. Such GPGPU computing structures are commercially available, and are contained in commercially available forms of smartphones and tablet computers.
In one practice of the invention, generating a data representation includes generating a data structure representing 2D coordinates of a control point in image space, and containing a disparity value treated as a pixel velocity in screen space with respect to a given movement of a given view vector; and using the disparity value in combination with a movement vector to slide a pixel in a given source image in selected directions, in 2D, to enable a reconstruction of 3D image movement.
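By way of illustration only, one possible data layout for such a control point, with the disparity value applied as a screen-space velocity, is sketched below; the field and function names are hypothetical and not taken from this disclosure.

    from dataclasses import dataclass

    @dataclass
    class ControlPoint:
        x: float          # 2D image-space coordinate of the control point
        y: float
        disparity: float  # treated as a pixel velocity in screen space

    def slide(point: ControlPoint, move_x: float, move_y: float):
        # Slide the control point along the 2D movement vector, scaled by its
        # disparity, to approximate 3D image movement for a new view vector.
        return (point.x + move_x * point.disparity,
                point.y + move_y * point.disparity)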
In another aspect of the invention, each camera generates a respective camera stream; and the data structure representing 2D coordinates of a control point further contains a sample buffer index, stored in association with the control point coordinates, which indicates which camera stream to sample in association with the given control point.
Another aspect includes determining whether a given pixel should be assigned a control point. A related practice of the invention includes assigning control points along image edges, wherein assigning control points along image edges comprises executing computations enabling identification of image edges.
In another practice of the invention, generating a data representation includes flagging a given image feature with a reference count indicating how many samples reference the given image feature, thereby to differentiate a uniquely referenced image feature, and a sample corresponding to the uniquely referenced image feature, from repeatedly referenced image features; and using the reference count, extracting unique samples, so as to enable a reduction in bandwidth requirements.
In a further aspect, generating a data representation further includes using the reference count to encode and transmit a given sample exactly once, even if a pixel or image feature corresponding to the sample is repeated in multiple camera views, so as to enable a reduction in bandwidth requirements.
Yet another aspect of the invention includes estimating a location of the first user's head or eyes, thereby generating tracking information; wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein 3D image reconstruction is executed by warping a 2D image by utilizing the control points, by sliding a given pixel along a head movement vector at a displacement rate proportional to disparity, based on the tracking information and disparity values.
In another aspect, the disparity values are acquired from the feature correspondence function or from a control point data stream.
In another practice of the invention, reconstructing a synthetic view comprises utilizing the tracking information to control a 2D crop box, such that the synthetic view is reconstructed based on the view origin, and then cropped and scaled so as to fill the first user's display screen view window; and the minima and maxima of the crop box are defined as a function of the first user's head location with respect to the display screen, and the dimensions of the display screen view window.
In a further aspect, reconstructing a synthetic view comprises executing a 2D warping reconstruction of a selected view based on selected control points, wherein the 2D warping reconstruction includes designating a set of control points, respective control points corresponding to respective, selected pixels in a source image; sliding the control points in selected directions in 2D space, wherein the control points are slid proportionally to respective disparity values; and interpolating data values for pixels between the selected pixels corresponding to the control points; so as to create a synthetic view of the image from a selected new perspective in 3D space.
The invention can further include rotating the source image and control point coordinates such that rows or columns of image pixels are parallel to the vector between the original source image center and the new view vector defined by the selected new perspective.
A related practice of the invention further includes rotating the source image and control point coordinates so as to align the view vector to image scanlines; iterating through each scanline and each control point for a given scanline, generating a line element beginning and ending at each control point in 2D image space, with the addition of the corresponding disparity value multiplied by the corresponding view vector magnitude with the corresponding x-axis coordinate; assigning a texture coordinate to the beginning and ending points of each generated line element, equal to their respective, original 2D location in the source image; and interpolating texture coordinates linearly along each line element; thereby to create a resulting image in which image data between the control points is linearly stretched.
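The per-scanline stretch just described can be sketched as follows, assuming the source image and control points have already been rotated so that the view vector is aligned with scanlines; the names and nearest-neighbor sampling are simplifications for illustration only.

    import numpy as np

    def warp_scanline(row, control_xs, disparities, view_magnitude):
        # row: one scanline of the source image; control_xs: sorted x coordinates of
        # the control points on this scanline; disparities: one value per control point.
        out = np.copy(row)
        width = len(row)
        for i in range(len(control_xs) - 1):
            x0, x1 = control_xs[i], control_xs[i + 1]
            # Each end of the line element slides by its disparity times the view magnitude.
            new_x0 = x0 + disparities[i] * view_magnitude
            new_x1 = x1 + disparities[i + 1] * view_magnitude
            start, end = int(round(new_x0)), int(round(new_x1))
            for x in range(max(start, 0), min(end, width)):
                # Texture coordinate interpolated linearly along the line element,
                # linearly stretching the image data between the control points.
                t = (x - new_x0) / (new_x1 - new_x0)
                src = x0 + t * (x1 - x0)
                out[x] = row[int(round(np.clip(src, 0, width - 1)))]
        return out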
The invention can also include rotating the resulting image back by the inverse of the rotation applied to align the view vector with the scanlines.
Another practice of the invention includes linking the control points between scanlines, as well as along scanlines, to create polygon elements defined by the control points, across which interpolation is executed.
In another aspect of the invention, reconstructing a synthetic view further comprises, for a given source image, selectively sliding image foreground and image background independently of each other. In a related aspect, sliding is utilized in regions of large disparity or depth change.
In another practice of the invention, a determination of whether to utilize sliding includes evaluating a disparity histogram to detect multi-modal behavior indicating that a given control point is on an image boundary for which allowing foreground and background to slide independent of each other presents a better solution than interpolating depth between foreground and background; wherein the disparity histogram functions as a Probability Density Function (PDF) of disparity for a given pixel, in which higher values indicate a higher probability of the corresponding disparity range being valid for the given pixel.
In yet another aspect of the invention, reconstructing a synthetic view includes using at least one Sample Integration Function Table (SIFT), the SIFT comprising a table of sample integration functions for one or more pixels in a desired output resolution of an image to be displayed to the user, wherein a given sample integration function maps an input view origin vector to at least one known, weighted 2D image sample location in at least one input image buffer.
In another aspect, displaying the synthetic view to the first user on a display screen used by the first user includes displaying the synthetic view to the first user on a 2D display screen; and updating the display in real-time, based on the tracking information, so that the display appears to the first user to be a window into a 3D scene responsive to the first user's head or eye location.
Displaying the synthetic view to the first user on a display screen used by the first user can include displaying the synthetic view to the first user on a binocular stereo display device; or, among other alternatives, on a lenticular display that enables auto-stereoscopic viewing.
One aspect of the present invention relates to methods, systems and computer software/program code products that facilitate self-portraiture of a user utilizing a handheld device to take the self-portrait, the handheld mobile device having a display screen for displaying images to the user. This aspect includes providing at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self-portrait; capturing images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen; estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; generating a data representation, representative of the captured images; reconstructing a synthetic view of the user, based on the generated data representation and the generated tracking information; displaying to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with realtime visual feedback.
In another aspect of the invention, the capturing, estimating, generating, reconstructing and displaying are executed such that in the self-portrait the user can appear to be looking directly into the camera, even if the camera does not have a direct eye contact gaze vector to the user.
One aspect of the present invention relates to methods, systems and computer software/program code products that facilitate composition of a photograph of a scene, by a user utilizing a handheld device to take the photograph, the handheld device having a display screen on a first side for displaying images to the user, and at least one camera on a second, opposite side of the handheld device, for capturing images. This aspect includes capturing images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph; estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; generating a data representation, representative of the captured images; reconstructing a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and displaying to the user, on the display screen during the setup time, the synthetic view of the scene; thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with realtime visual feedback.
In another aspect of the invention, the user can control the scale and perspective of the synthetic view by changing the position of the handheld device relative to the position of the user's head.
In another practice of the invention, estimating a location of the user's head or eyes relative to the handheld device includes using at least one camera on the first, display side of the handheld device, having a view of the user's head or eyes during photograph setup time.
The invention enables the features described herein to be provided in a manner that fits within the form factors of modern mobile devices such as tablets and smartphones, as well as the form factors of laptops, PCs, computer-driven televisions, computer-driven projector devices, and the like, does not dramatically alter the economics of building such devices, and is viable within current or near-current communications network/connectivity architectures.
One aspect of the present invention relates to methods, systems and computer software/program code products for displaying images to a user utilizing a binocular stereo head-mounted display (HMD). This aspect includes capturing at least two image streams using at least one camera attached or mounted on or proximate to an external portion or surface of the HMD, the captured image streams containing images of a scene; generating a data representation, representative of captured images contained in the captured image streams; reconstructing two synthetic views, based on the representation; and displaying the synthetic views to the user, via the HMD; the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
Another aspect of the present invention relates to methods, systems and computer software/program code products for capturing and displaying image content on a binocular stereo head-mounted display (HMD). The image content can include pre-recorded image content, which can be stored, transmitted, broadcast, downloaded, streamed or otherwise made available. This aspect includes capturing or generating at least two image streams using at least one camera, the captured image streams containing images of a scene; generating a data representation, representative of captured images contained in the captured image streams; reconstructing two synthetic views, based on the representation; and displaying the synthetic views to a user, via the HMD, the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
In another aspect, the data representation can be pre-recorded, and stored, transmitted, broadcast, downloaded, streamed or otherwise made available.
Another aspect of the invention includes tracking the location or position of the user's head or eyes to generate a motion vector usable in the reconstructing of synthetic views. The motion vector can be used to modify the respective view origins, during the reconstructing of synthetic views, so as to produce intermediate image frames to be interposed between captured image frames in the captured image streams; and interposing the intermediate image frames between the captured image frames so as to reduce apparent latency.
In another aspect, at least one camera is a panoramic camera, night-vision camera, or thermal imaging camera.
One aspect of the invention relates to methods, systems and computer software/program code products for generating an image data stream for use by a control system of an autonomous vehicle. This aspect includes capturing images of a scene around at least a portion of the vehicle, the capturing comprising utilizing at least one camera having a view of the scene; executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; calculating corresponding depth information based on the disparity values; and generating from the images and corresponding depth information an image data stream for use by the control system. The capturing can comprise utilizing at least two cameras, each having a view of the scene; and executing a feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
Alternatively, the capturing can include using a single camera having a view of the scene; and executing a feature correspondence function comprises detecting common features between sequential images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis; and (2) executing a feature correspondence function, by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising:
executing a Fast Dense Disparity Estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes. In this aspect of the invention, each level can be assigned a level number, and each successively higher level can be characterized by a lower image resolution. In one aspect, the downsampling can include reducing image resolution via low-pass filtering. In another aspect, the downsampling can include using a weighted summation of a kernel in level [n-1] to produce a pixel value in level [n], and the normalized kernel center position remains the same across all levels.
In one aspect of the invention, for a given desired disparity solution at full image resolution, the FDDE votes for every image level are included in the disparity solution.
Another aspect of the invention includes generating a multi-level histogram comprising a set of initially independent histograms at different levels of resolution. In a related aspect, each histogram bin in a given level represents votes for disparity determined by the FDDE at that level. In another related aspect, each histogram bin in a given level has an associated disparity uncertainty range, and the disparity uncertainty range represented by each histogram bin is a selected multiple wider than the disparity uncertainty range of a bin in the preceding level.
A further aspect of the invention includes applying a sub-pixel shift to the disparity values at each level during downsampling, to negate rounding error effects. In a related aspect, applying a sub-pixel shift comprises applying a half-pixel shift to only one of the images in a stereo pair at each level of downsampling. In a further aspect, applying a sub-pixel shift is implemented inline, within the weights of the filter kernel utilized to implement the downsampling from level to level.
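One way the half-pixel shift might be folded into the downsampling filter weights is sketched below; the particular kernels and level count are assumptions for illustration only, not parameters taken from this disclosure.

    import numpy as np
    from scipy.ndimage import convolve

    def downsample(image, shift_half_pixel=False):
        # Low-pass filter, then decimate by 2 to produce the next, lower-resolution level.
        # When shift_half_pixel is True the horizontal weights are made asymmetric so
        # that a half-pixel shift is applied inline, within the filter kernel itself.
        if shift_half_pixel:
            kernel = np.outer([0.25, 0.5, 0.25], [0.125, 0.375, 0.375, 0.125])
        else:
            kernel = np.outer([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])
        filtered = convolve(image.astype(np.float32), kernel, mode="nearest")
        return filtered[::2, ::2]

    def build_levels(left, right, num_levels=4):
        # Successively lower-frequency versions of the stereo pair; the half-pixel
        # shift is applied to only one image of the pair at each level.
        levels = [(left, right)]
        for _ in range(num_levels - 1):
            left = downsample(left, shift_half_pixel=True)
            right = downsample(right, shift_half_pixel=False)
            levels.append((left, right))
        return levels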
Another aspect of the invention includes executing histogram integration, the histogram integration comprising: executing a recursive summation across all the FDDE levels. A related aspect includes, during summation, modifying the weighting of each level to control the amplitude of the effect of lower levels in overall voting, by applying selected weighting coefficients to selected levels.
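A sketch of this weighted, recursive summation across levels follows; the bin layout and weights are assumed for illustration, with each bin at level n taken to cover twice the disparity range of a bin at level n-1.

    import numpy as np

    def integrate_levels(level_histograms, level_weights):
        # level_histograms[n]: FDDE votes at level n for one pixel (coarser bins at higher n).
        # Accumulate from the coarsest level down to level 0, applying a per-level
        # weighting coefficient to control the influence of lower levels on the vote.
        combined = np.zeros(len(level_histograms[0]), dtype=np.float64)
        for n in range(len(level_histograms) - 1, -1, -1):
            # Spread each coarse bin across the finer bins it covers at level 0.
            expanded = np.repeat(level_histograms[n], 2 ** n)[:len(combined)]
            combined += level_weights[n] * expanded
        return combined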
Yet another aspect of the invention includes inferring a sub-pixel disparity solution from the disparity histogram, by calculating a sub-pixel offset based on the number of votes for the maximum-vote disparity range and the number of votes for an adjacent, runner-up disparity range. In a related aspect, a summation stack can be maintained in a memory unit.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable capturing of images using at least two stereo camera pairs, each pair being arranged along a respective camera pair axis, and for each camera pair axis: executing image capture utilizing the camera pair to generate image data; executing rectification and undistorting transformations to transform the image data into RUD image space; iteratively downsampling to produce multiple, successively lower resolution levels; executing FDDE calculations for each level to compile FDDE votes for each level; gathering FDDE disparity range votes into a multi-level histogram; determining the highest ranked disparity range in the multi-level histogram; and processing the multi-level histogram disparity data to generate a final disparity result.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair; and (2) executing a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function further comprising: generating a disparity solution based on the disparity values; and applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
In a related aspect, applying an injective constraint comprises: maintaining a reference count for each pixel in the co-domain, and checking whether the reference count for the pixels in the co-domain exceeds "1", and if the count exceeds "1" then designating a violation and responding to the violation with a selected error correction approach. In another related aspect, the selected error correction approach can include any of (a) first come, first served, (b) best match wins, (c) smallest disparity wins, or (d) seek alternative candidates. The first come, first served approach can include assigning priority to the first element in the domain to claim an element in the co-domain, and if a second element in the domain claims the same co-domain element, invalidating that subsequent match and designating that subsequent match to be invalid. The best match wins approach can include: comparing the actual image matching error or corresponding histogram vote count between the two possible candidate elements in the domain against the contested element in the co-domain, and designating as winner the domain candidate with the best match. The smallest disparity wins approach can include: if there is a contest between candidate elements in the domain for a given co-domain element wherein each candidate element has a corresponding disparity, selecting the domain candidate with the smallest disparity and designating as invalid the others. The seek alternative candidates approach can include: selecting and testing the next best domain candidate, based on a selected criterion, and iterating the selecting and testing until the violation is eliminated or a computational time limit is reached.
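A simple, non-authoritative sketch of the reference-count check, using the "first come, first served" correction, follows; here a disparity is assumed to map domain pixel x to co-domain pixel x minus disparity.

    def enforce_injective(disparity, width):
        # disparity[x]: disparity assigned to domain pixel x (first-camera image),
        # referencing co-domain pixel x - disparity[x] (second-camera image). The
        # injective constraint is that no co-domain pixel is referenced more than once.
        ref_count = [0] * width
        valid = [True] * len(disparity)
        for x, d in enumerate(disparity):
            target = x - d
            if target < 0 or target >= width:
                valid[x] = False
                continue
            ref_count[target] += 1
            if ref_count[target] > 1:
                # Violation: the first claimant keeps the co-domain element and this
                # subsequent match is designated invalid ("first come, first served").
                valid[x] = False
        return valid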
One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture in which a first user is able to view a second user with direct virtual eye contact with the second user, including: (1) capturing images of the second user, the capturing comprising utilizing at least one camera having a view of the second user's face; (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; (3) generating a data representation, representative of the captured images and the corresponding disparity values; (4) estimating a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and (5) reconstructing a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein the location estimating comprises: (a) passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane; (b) utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head; (c) extracting from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculating therefrom a representative 3D head position; and (d) if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result. In a related aspect, the location estimating includes (1) determining, from the first estimate of head and eye location and face rotation angle, an estimated center-of-face position; (2) determining an average depth value for the face by extracting three-dimensional (3D) points via the disparity values for a selected, small area located around the estimated center-of-face position; (3) utilizing the average depth value to determine a depth range that is likely to encompass the entire head; (4) utilizing the estimated center-of-face position, face rotation angle, and depth range to execute a 2D ray march to determine a best-fit rectangle that includes the head; (5) calculating, for both horizontal and vertical axes, vectors that are perpendicular to each respective axis but spaced at different intervals; (6) for each of the calculated vectors, testing the corresponding 3D points starting from a position outside the head region and working inwards, to the horizontal or vertical axis; (7) when a 3D point is encountered that falls within the determined depth range, denominating that point as a valid extent of a best-fit head rectangle; (8) from each ray march along each axis, determining a best-fit rectangle for the head, and extracting therefrom all 3D points that lie within the best-fit rectangle,
and calculating therefrom a weighted average; and (9) if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result.
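A sketch of the final steps of this estimate, extracting the 3D points inside the best-fit rectangle and validating the result against a coverage threshold, is given below; the array layout, threshold and use of an unweighted mean are assumptions for illustration.

    import numpy as np

    def head_position_from_rect(points_3d, valid_mask, rect, depth_range, min_coverage=0.3):
        # points_3d: H x W x 3 array of 3D points derived from the disparity values.
        # valid_mask: H x W booleans marking pixels with a valid disparity solution.
        # rect: best-fit head rectangle (x, y, w, h) from the 2D ray march.
        # depth_range: (near, far) interval expected to encompass the entire head.
        x, y, w, h = rect
        region = points_3d[y:y + h, x:x + w]
        mask = valid_mask[y:y + h, x:x + w].copy()
        depth = region[..., 2]
        mask &= (depth >= depth_range[0]) & (depth <= depth_range[1])
        if mask.sum() < min_coverage * mask.size:
            return None   # too few valid 3D points relative to the region: no reliable result
        # Representative 3D head position (an unweighted mean here; a weighted
        # average could be substituted as described above).
        return region[mask].mean(axis=0)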
A related aspect of the invention includes downsampling the captured image before passing it to the 2D facial feature detector. Another aspect includes interpolating image data from video frame to video frame, based on the time that has passed from a previous video frame to a given video frame. Another aspect includes converting image data to luminance values.
One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1 ) capturing images of . scene, the capturing comprising utilizing at least three cameras having a view of the scene, the cameras being arranged in a substantially "L" -shaped configuration wherein a first pair of cameras is disposed along a first axis and second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axis, so that the first and second pairs of cameras represent respecti ve first and second independent stereo axes that share a common camera; (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values; (3) generating a data .representation, representative of the captured images and the corresponding disparity values: and (4) utilizing an unrecttfiecL undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second, axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified.
A related aspect includes executing a stereo correspondence operation on the image data in a rectified, undistorted (RUD) image space, and storing resultant disparity data in a RUD space coordinate system. In another aspect, the resultant disparity data is stored in a URUD space coordinate system. Another aspect includes generating disparity histograms from the disparity data in either RUD or URUD space, and storing the disparity histograms in a unified URUD space coordinate system. A further aspect includes applying a URUD to RUD coordinate transformation to obtain per-axis disparity values.
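The following Python/NumPy sketch is one hedged way to picture the URUD-space integration described above. It assumes that, for each stereo axis, per-pixel disparity-histogram votes have already been computed in that axis's RUD space, that a 3x3 homography approximating the URUD-to-RUD transform is available for each axis, and that the histogram bins of the two axes have been parameterized compatibly so that their votes can be summed; the names urud_to_rud, combine_axis_histograms and the dictionary keys 'H' and 'hist' are illustrative only.
```python
import numpy as np

def urud_to_rud(pt, H):
    """Apply a 3x3 homography H (URUD -> RUD) to a 2D point."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return x / w, y / w

def combine_axis_histograms(urud_shape, axes, n_bins=64):
    """Merge per-axis disparity-histogram votes into one histogram volume in URUD space.

    axes: list of dicts, one per stereo axis, each with
          'H'    : 3x3 homography from URUD space to that axis's RUD space
          'hist' : HxWxB array of per-pixel histogram votes in that axis's RUD space
    """
    h, w = urud_shape
    merged = np.zeros((h, w, n_bins), dtype=np.float32)
    for axis in axes:
        H, hist = axis['H'], axis['hist']
        for y in range(h):
            for x in range(w):
                rx, ry = urud_to_rud((x, y), H)
                ix, iy = int(round(rx)), int(round(ry))
                if 0 <= iy < hist.shape[0] and 0 <= ix < hist.shape[1]:
                    # Votes from both axes accumulate in the shared URUD bins.
                    merged[y, x] += hist[iy, ix, :n_bins]
    return merged
```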
One aspect of the present invention relates to methods, systems and computer software/program code products that enable video capture and processing, including: (1) capturing images of a scene, the capturing comprising utilizing at least one camera having a view of the scene; (2) executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; and (3) generating a data representation, representative of the captured images and the corresponding disparity values; wherein the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method comprising: (a) constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and (b) optimizing generation of disparity values on a GPU computing structure, the optimizing comprising: generating, in the GPU computing structure, a plurality of output pixel threads; and, for each output pixel thread, maintaining a private disparity histogram, in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
In a related aspect, the private disparity histogram is stored such that each pixel thread writes to and reads from the corresponding private disparity histogram in a dedicated portion of shared local memory in the GPU. In another related aspect, shared local memory in the GPU is organized at least in part into memory words; the private disparity histogram is characterized by a series of histogram bins indicating the number of votes for a given disparity range; and if a maximum possible number of votes in the private disparity histogram is known, multiple histogram bins can be packed into a single word of the shared local memory, and accessed using bitwise GPU access operations.
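As a hedged illustration of the bin-packing idea, the following Python sketch shows the bitwise arithmetic that packs several histogram bins into a single 32-bit word when the maximum possible vote count is known to fit in 8 bits; on a GPU the same arithmetic would operate on words of shared local memory, and the names add_vote and read_bin are hypothetical.
```python
BITS_PER_BIN = 8                      # assumes the maximum vote count fits in 8 bits
BINS_PER_WORD = 32 // BITS_PER_BIN    # four bins packed into each 32-bit word
BIN_MASK = (1 << BITS_PER_BIN) - 1

def add_vote(words, bin_index, amount=1):
    """Add a vote to a packed histogram stored as a list of 32-bit words."""
    word, slot = divmod(bin_index, BINS_PER_WORD)
    shift = slot * BITS_PER_BIN
    current = (words[word] >> shift) & BIN_MASK
    new_val = min(current + amount, BIN_MASK)            # saturate at the bin maximum
    words[word] = (words[word] & ~(BIN_MASK << shift)) | (new_val << shift)

def read_bin(words, bin_index):
    """Read the vote count stored in one packed histogram bin."""
    word, slot = divmod(bin_index, BINS_PER_WORD)
    return (words[word] >> (slot * BITS_PER_BIN)) & BIN_MASK

# Example: a 64-bin private histogram fits in 16 packed words per pixel thread.
histogram_words = [0] * (64 // BINS_PER_WORD)
add_vote(histogram_words, bin_index=10)
assert read_bin(histogram_words, 10) == 1
```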
One aspect of the invention includes a program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least first and second cameras; and (2) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising: executing a Fast Dense Disparity Estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes.
In another aspect of the invention, the digital processing system comprises at least two stereo camera pairs, each pair being arranged along a respective camera pair axis, and the digital processor-executable program instructions further comprise instructions which when executed in the digital processing resource cause the digital processing resource to execute, for each camera pair axis, the following: (1) image capture utilizing the camera pair to generate image data; (2) rectification and undistorting transformations to transform the image data into RUD image space; (3) iteratively downsample to produce multiple, successively lower resolution levels; (4) execute FDDE calculations for each level to compile FDDE votes for each level; (5) gather FDDE disparity range votes into a multi-level histogram; (6) determine the highest ranked disparity range in the multi-level histogram; and (7) process the multi-level histogram disparity data to generate a final disparity result.
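The following Python/NumPy sketch is a simplified rendering of the multi-level voting flow enumerated above, not the FDDE algorithm itself: per-pixel absolute difference stands in for the image pattern matching, np.roll stands in for a properly bounded disparity search, and the function names (downsample, level_votes, multilevel_disparity) are hypothetical.
```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 averaging (a simple low-pass downsample)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def level_votes(left, right, max_disp):
    """Best disparity per pixel at one pyramid level (crude stand-in for FDDE matching)."""
    best = np.zeros(left.shape, dtype=np.int32)
    best_cost = np.full(left.shape, np.inf)
    for d in range(max_disp):
        cost = np.abs(left - np.roll(right, d, axis=1))
        better = cost < best_cost
        best[better] = d
        best_cost[better] = cost[better]
    return best

def multilevel_disparity(left, right, levels=3, max_disp=64):
    """Gather votes from successively downsampled levels into a multi-level
    histogram and return the highest-ranked disparity for each full-res pixel."""
    h, w = left.shape
    hist = np.zeros((h, w, max_disp), dtype=np.float32)
    L, R = left.astype(np.float32), right.astype(np.float32)
    scale = 1
    for _ in range(levels):
        votes = level_votes(L, R, max(max_disp // scale, 1))
        # Upsample the votes back to full resolution, rescaling disparity magnitudes.
        up = np.kron(votes * scale, np.ones((scale, scale), dtype=np.int32))[:h, :w]
        if up.shape != (h, w):
            up = np.pad(up, ((0, h - up.shape[0]), (0, w - up.shape[1])), mode='edge')
        hist[np.arange(h)[:, None], np.arange(w)[None, :],
             np.clip(up, 0, max_disp - 1)] += 1
        L, R = downsample(L), downsample(R)
        scale *= 2
    return hist.argmax(axis=2)          # highest-ranked disparity range per pixel
```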
Another aspect of the invention includes a program product for use with a digital processing system, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least first and second cameras; and (2) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: (a) generating a disparity solution based on the disparity values; and (b) applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain. In a related aspect, the digital processor-executable program instructions further comprise instructions which when executed in the digital processing resource cause the digital processing resource to: maintain a reference count for each pixel in the co-domain, and check whether the reference count for the pixels in the co-domain exceeds "1", and if the count exceeds "1", then designate a violation and respond to the violation with a selected error correction approach.
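To make the injective constraint concrete, here is a minimal Python/NumPy sketch, under the assumption that the disparity solution maps each domain pixel (x, y) in the first image to co-domain pixel (x - disparity, y) in the second image; the function name injective_violations is hypothetical, and the choice of error-correction response (for example, keeping only the lowest-cost reference and re-solving the rest) is left open, as in the text.
```python
import numpy as np

def injective_violations(disparity):
    """Count how many domain pixels reference each co-domain pixel and flag
    co-domain pixels whose reference count exceeds 1 (an injective violation)."""
    h, w = disparity.shape
    refs = np.zeros((h, w), dtype=np.int32)
    # Column referenced in the co-domain image by each domain pixel.
    xs = np.clip(np.arange(w)[None, :] - np.rint(disparity).astype(int), 0, w - 1)
    for y in range(h):
        np.add.at(refs[y], xs[y], 1)    # accumulate reference counts per row
    return refs, refs > 1               # counts, plus a mask of violations
```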
Another aspect of the invention includes a program product for use with a digital processing system, for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising at least one camera having a view of the second user's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the second user, utilizing the at least one camera; (2) execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; (3) generate a data representation, representative of the captured images and the corresponding disparity values; (4) estimate a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and (5) reconstruct a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; wherein the 3D location estimating comprises: (a) passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane; (b) utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head; (c) extracting from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculating therefrom a representative 3D head position; and (d) if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result.
Yet another aspect of the invention includes a program product for use with a digital processing system, for enabling capture and processing of images of a scene, the digital processing system comprising: (i) at least three cameras having a view of the scene, the cameras being arranged in a substantially "L"-shaped configuration wherein a first pair of cameras is disposed along a first axis and a second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axes, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera, and (ii) a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least three cameras; (2) execute a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values; (3) generate a data representation, representative of the captured images and the corresponding disparity values; and (4) utilize an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified. In a related aspect of the invention, the digital processor-executable program instructions further comprise instructions which, when executed in the digital processing resource, cause the digital processing resource to execute a stereo correspondence operation on the image data in a rectified, undistorted (RUD) image space, and store resultant disparity data in a RUD space coordinate system.
Another aspect of the invention includes a program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least one camera having a view of a scene, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: (1) capture images of the scene, utilizing the at least one camera; (2) execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; and (3) generate a data representation, representative of the captured images and the corresponding disparity values; wherein the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method comprising: (a) constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and (b) optimizing generation of disparity values on a GPU computing structure, the optimizing comprising: generating, in the GPU computing structure, a plurality of output pixel threads; and for each output pixel thread, maintaining a private disparity histogram, in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
One aspect of the invention includes a digital processing system for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising: (1) at least one camera having a view of the second user's face; (2) a display screen for use by the first user; and (3) a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: (a) capture images of the second user, utilizing the at least one camera; (b) generate a data representation, representative of the captured images; (c) reconstruct a synthetic view of the second user, based on the representation; and (d) display the synthetic view to the first user on the display screen for use by the first user; the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
Another aspect of the invention includes a digital processing system for enabling a first user to view a remote scene with the visual impression of being present with respect to the remote scene, the digital processing system comprising: (1) at least two cameras, each having a view of the remote scene; (2) a display screen for use by the first user; and (3) a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: (a) capture images of the remote scene, utilizing the at least two cameras; (b) execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values; (c) generate a data representation, representative of the captured images and the corresponding disparity values; (d) reconstruct a synthetic view of the remote scene, based on the representation; and (e) display the synthetic view to the first user on the display screen; the capturing, detecting, generating, reconstructing and displaying being executed such that the first user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and the first user is provided an immersive visual experience of the remote scene.
Another aspect of the invention includes a system operable in a handheld digital processing device, for facilitating self-portraiture of a user utilizing the handheld device to take the self-portrait, the system comprising: (1) a digital processor; (2) a display screen for displaying images to the user; and (3) at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self-portrait, the system being operable to: (a) capture images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen; (b) estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; (c) generate a data representation, representative of the captured images; (d) reconstruct a synthetic view of the user, based on the generated data representation and the generated tracking information; and (e) display to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with realtime visual feedback.
One aspect of the invention includes a system operable in a handheld digital processing device, for facilitating composition of a photograph of a scene by a user utilizing the handheld device to take the photograph, the system comprising: (1) a digital processor; (2) a display screen on a first side of the handheld device for displaying images to the user; and (3) at least one camera on a second, opposite side of the handheld device, for capturing images; the system being operable to: (a) capture images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph; (b) estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information; (c) generate a data representation, representative of the captured images; (d) reconstruct a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and (e) display to the user, on the display screen during the setup time, the synthetic view of the scene; thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with realtime visual feedback.
Another aspect of the invention includes a system for enabling display of images to a user utilizing a binocular stereo head-mounted display (HMD), the system comprising: (1) at least one camera attached or mounted on or proximate to an external portion or surface of the HMD; and (2) a digital processing resource comprising at least one digital processor; the system being operable to: (a) capture at least two image streams using the at least one camera, the captured image streams containing images of a scene; (b) generate a data representation, representative of captured images contained in the captured image streams; (c) reconstruct two synthetic views, based on the representation; and (d) display the synthetic views to the user, via the HMD; the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
Another aspect of the invention includes an image processing system for enabling the generation of an image data stream for use by a control system of an autonomous vehicle, the image processing system comprising: (1) at least one camera with a view of a scene around at least a portion of the vehicle; and (2) a digital processing resource comprising at least one digital processor; the system being operable to: (a) capture images of the scene around at least a portion of the vehicle, using the at least one camera; (b) execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; (c) calculate corresponding depth information based on the disparity values; and (d) generate from the images and corresponding depth information an image data stream for use by the control system.
Another aspect of the invention includes generating a facial signature, based on images of a human user's or subject's face, for enabling accurate, reliable identification or authentication of a human subject or user of a system or resource, in a secure, difficult to forge manner. This aspect of the invention relates to methods, systems and computer software/program code products that enable generating a facial signature for use in identifying a given human user.
In accordance with this aspect, generating a facial signature includes capturing images of the user's face, using at least one camera having a view of the user's face; executing an image rectification function to compensate for optical distortion and alignment of the at least one camera; executing a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilizing the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable in accurately identifying the user or subject in a secure, difficult to forge manner.
The capturing can utilize at least two cameras, each having a view of the user's face; and the feature correspondence function can include detecting common features between corresponding images captured by the respective cameras. Alternatively, the capturing can utilize at least one camera having a view of the user's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation can be representative of the captured images and corresponding depth information.
In another aspect of the invention, the capturing utilizes a single camera having a view of the second user's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
The identifying aspect of the invention uses stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor. The identifying takes into account the actual 3D coordinates of facial features with respect to other facial features, and the feature correspondence function or depth detection function includes computing distances between facial features from multiple perspectives.
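As a hedged sketch of the distance-ratio verification described above, the following Python/NumPy fragment back-projects 2D facial landmarks to 3D using the standard stereo relation Z = f·B/d and compares pairwise distance ratios against an enrolled set. It assumes a pinhole model with the principal point at the image origin, and the names (feature_points_3d, ratio_check, focal_px, baseline_m, tol) are illustrative placeholders rather than part of the invention's described implementation.
```python
import numpy as np

def feature_points_3d(landmarks_px, disparity, focal_px, baseline_m):
    """Back-project 2D facial landmarks (u, v) to 3D points using Z = f*B/d."""
    pts = []
    for (u, v) in landmarks_px:
        d = disparity[int(v), int(u)]
        z = focal_px * baseline_m / max(float(d), 1e-6)
        pts.append((u * z / focal_px, v * z / focal_px, z))
    return np.array(pts)

def ratio_check(candidate_pts, enrolled_pts, tol=0.08):
    """Verify that pairwise 3D distance ratios between facial features are close
    to the enrolled ratios (a simple proxy for 'correct distance ratios')."""
    def ratios(p):
        d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
        d = d[np.triu_indices(len(p), k=1)]
        return d / d.mean()
    return bool(np.max(np.abs(ratios(candidate_pts) - ratios(enrolled_pts))) < tol)
```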
In one aspect of the invention, the facial signature is a combination of 3D facial contour information and a 2D image from one or more of the cameras. The 3D contour data can be stored in the facial signature data representation.
In another aspect, the facial signature is utilized as a security factor in an authentication system, either as the sole security factor or in combination with other security factors. The other security factors can include a passcode, a fingerprint or other biometric information.
In another aspect, the 3D facial contour data is combined with a 2D image from one or more cameras in a conventional 2D face identification system to create a hybrid 3D/2D face identification system.
In another aspect, 3D facial contour data is used solely to confirm that a face having credible 3D human facial proportions was presented to the cameras at an overlapping spatial location of the captured 2D image.
A further aspect of the invention includes using a 2D bounding rectangle, defining the 2D extent of the face location, to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption.
Still another aspect of the invention includes prompting the user to present multiple distinct facial poses or head positions, and utilizing a depth detection system to scan the multiple facial poses or head positions across a series of image frames, so as to increase protection against forgery of the facial signature.
In another aspect, generating a unique facial signature further includes executing an enrollment phase, which includes prompting the user to present to the cameras a plurality of selected head movements or positions, or a series of selected facial poses, and collecting image frames from a plurality of head positions or facial poses for use in generating the unique facial signature representative of the user.
In another aspect, the invention further includes a matching phase, which includes using the cameras to capture, over an interval of time, a plurality of frames of 3D and 2D data representative of the user's face; correlating the captured data with the facial signature generated during the enrollment phase, thereby to generate a probability of match score; and comparing the probability of match score with a selected threshold value, thereby to confirm or deny an identity match. The enrollment phase can include generating an enrolled facial signature containing data corresponding to multiple image scans of a user's face, the multiple image scans corresponding to a plurality of the user's head positions or facial poses; and the matching phase can include requiring at least a minimum number of captured image frames corresponding to different facial or head positions matching the multiple scans within the enrolled signature.
Another aspect of the invention relates to generating a histogram-based facial signature representation, whereby a facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth. The histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face. The X-axis of the histogram represents a given disparity or depth range, and the Y-axis of the histogram represents the normalized count of image samples that fall within the given range.
In another aspect, during population of the histogram, a conventional 2D face detector provides a face rectangle and location of basic facial features, and only samples within the face rectangle are accumulated into the histogram.
Another aspect includes rejecting samples falling outside the statistical majority of samples within the face rectangle.
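A minimal Python/NumPy sketch of accumulating such a histogram is shown below, assuming a per-pixel depth map (or, equivalently, disparity) and a 2D face rectangle from a conventional face detector; samples far from the statistical center of the face are rejected, as described above, though the bin count, range and rejection rule used here (n_bins, rel_range, a median-centered window) are illustrative assumptions.
```python
import numpy as np

def face_depth_histogram(depth, face_rect, n_bins=32, rel_range=0.10):
    """Normalized histogram of facial-feature depths inside a face rectangle.
    X-axis: depth-range bins; Y-axis: normalized sample counts."""
    x, y, w, h = face_rect
    roi = depth[y:y + h, x:x + w]
    samples = roi[np.isfinite(roi) & (roi > 0)]
    if samples.size == 0:
        return np.zeros(n_bins)
    center = np.median(samples)                       # statistical majority center
    lo, hi = center * (1 - rel_range), center * (1 + rel_range)
    samples = samples[(samples >= lo) & (samples <= hi)]   # reject outlying samples
    hist, _ = np.histogram(samples, bins=n_bins, range=(lo, hi))
    return hist / max(hist.sum(), 1)                  # normalize the counts
```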
Another aspect includes projecting disparity and depth points into a canonical coordinate system defined by a plane constructed from the 3D coordinates of the basic facial features.
In yet another aspect of the invention, during the enrollment phase, a histogram is accumulated over multiple captured image frames over a period of time.
In another aspect, during the enrollment phase, each set of samples of the captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram.
In another aspect, during the enrollment phase, multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses.
In a further aspect of the invention, during the matching phase, a candidate histogram is accumulated over multiple captured image frames over a period of time.
In another aspect, during the matching phase, once a candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate a vector distance constituting a degree-of-match score.
A further aspect of the invention includes comparing the degree-of-match score to a selected threshold to confirm or deny a match with each enrolled signature in a set of enrolled signatures.
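The subtraction-and-threshold matching described above can be pictured with the following hedged Python/NumPy sketch, in which a candidate histogram is compared against a set of enrolled histograms by vector distance; the threshold value and the function names (match_scores, confirm_identity) are illustrative assumptions only.
```python
import numpy as np

def match_scores(candidate_hist, enrolled_hists):
    """Vector distance between the candidate histogram and each enrolled
    histogram; a smaller distance constitutes a stronger degree of match."""
    return [float(np.linalg.norm(candidate_hist - h)) for h in enrolled_hists]

def confirm_identity(candidate_hist, enrolled_hists, threshold=0.25):
    """Confirm a match if any enrolled signature falls within the selected threshold."""
    scores = match_scores(candidate_hist, enrolled_hists)
    return any(s <= threshold for s in scores), scores
```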
In another aspect, the histogram representation is used in combination with conventional 2D face matching to provide an additional authentication factor.
In one practice of the invention, the facial signature is utilized as a factor in an authentication process in which a human subject or user of a system or resource is successfully authenticated if selected criteria are met, and the facial signature aspect further includes updating the facial signature on every successful match, or on every nth successful match, where n is a selected integer.
Another aspect of the invention includes a program product for use with a digital processing system, for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: capture images of the user's or subject's face, utilizing the at least one camera; execute an image rectification function to compensate for optical distortion and alignment of the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject in a secure, difficult to forge manner.
The capturing can include using at least two cameras, each having a view of the user's or subject's face; and executing a feature correspondence function can include detecting common features between corresponding images captured by the respective cameras.
In another aspect, the capturing can include using at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation is representative of the captured images and corresponding depth information.
In another aspect, the capturing includes using a single camera having a view of the user's or subject's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
In another aspect, the identifying of a human user or subject utilizes stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor. The identifying can take into account the actual 3D coordinates of facial features with respect to other facial features.
Another aspect of the invention includes a digital processing system for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the digital processing resource being operable to: capture images of the user's or subject's face, utilizing the at least one camera; execute an image rectification function to compensate for optical distortion and alignment of the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject in a secure, difficult to forge manner.
In another aspect of the invention, the system can include at least two cameras having a view of the user's or subject's face; the capturing can include utilizing the at least two cameras; and executing a feature correspondence function can include detecting common features between corresponding images captured by the respective cameras.
In another aspect of the invention, the capturing includes using at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and the feature correspondence data representation is representative of the captured images and corresponding depth information.
In another aspect, the capturing includes utilizing a single camera having a view of the user's or subject's face; and executing a feature correspondence function includes detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
In another aspect of the invention, the identifying of a human subject or user includes utilizing stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor. The identifying can take into account the actual 3D coordinates of facial features with respect to other facial features.
These and other aspects, examples, embodiments and practices of the invention, whether in the form of methods, devices, systems or computer software/program code products, will be discussed in greater detail below in the following Detailed Description of the Invention and in connection with the attached drawing figures.
Those skilled in the art will appreciate that while the following detailed description provides sufficient detail to enable one skilled in the art to practice the present invention, the various examples, embodiments and practices of the present invention that are discussed and described below, in conjunction with the attached drawing figures, are provided by way of example, and not by way of limitation. Numerous variations, additions, and other modifications or different implementations of the present invention are possible, and are within the spirit and scope of the invention.
Brief Description Of The Drawings
FIG. 1 shows a camera configuration useful in an exemplary practice of the invention.
FIGS. 2-6 are schematic diagrams illustrating exemplary practices of the invention.
FIG. 7 is a flowchart showing an exemplary practice of the invention.
FIG. 8 is a block diagram depicting an exemplary embodiment of the invention.
FIGS. 9-18 are schematic diagrams illustrating exemplary practices of the invention.
FIG. 19 is a graph in accordance with an aspect of the invention.
FIGS. 20-45 are schematic diagrams illustrating exemplary practices of the invention.
FIG. 46 is a graph in accordance with an aspect of the invention.
FIGS. 47-54 are schematic diagrams illustrating exemplary practices of the invention.
FIGS. 55-80 are flowcharts depicting exemplary practices of the invention.
FIG. 81 is a schematic flow diagram depicting processing of images to generate a Facial Signature in accordance with an exemplary practice of the invention.
FIGS. 82-83 show an exemplary image processed in accordance with an exemplary practice of the Facial Signature aspects of the invention, where FIG. 82 is an example of an image of a human user or subject captured by at least one camera, and FIG. 83 is an example of a representation of image data, corresponding to the image of FIG. 82, processed in accordance with an exemplary practice of the invention.
FIG. 84 shows a histogram representation corresponding to the image(s) of FIGS. 82-83, generated in accordance with an exemplary practice of the Facial Signature aspects of the invention.
FIGS. 85-88 are flowcharts depicting exemplary practices of the Facial Signature aspects of the invention.
Detailed Description of the Invention
1. OVERVIEW
INTRODUCTION - V3D
Current video conferencing systems such as Apple's Facetime, Skype or Google Hangouts have a number of limitations which make the experience of each user's presence and environment significantly less engaging than being physically present on the other side. These limitations include (1) limited bandwidth between users, which typically results in poor video and audio quality; (2) higher than ideal latency between users (even if bandwidth is adequate, if latency is excessive, a first user's perception of the remote user's voice and visual actions will be delayed from when the remote user actually performed the action, resulting in difficult interaction between users); and (3) limited sensory engagement (of the five traditionally defined senses, even the senses of sight and sound are only partially served, and of course taste, smell and touch are unaccounted-for).
The first two issues can be addressed by using a higher-performing network connection and will likely continue to improve as the underlying communications infrastructure improves. As for the third issue, the present invention, referred to herein as "V3D", aims to address and radically improve the visual aspect of sensory engagement in teleconferencing and other video capture settings, while doing so with low latency.
The visual aspect of conducting a video conference is conventionally achieved via a camera pointing at each user, transmitting the video stream captured by each camera, and then projecting the video stream(s) onto the two-dimensional (2D) display of the other user in a different location. Both users have a camera and display, and thus a full-duplex connection is formed in which both users can see each other and their respective environments.
The V3D of the present invention aims to deliver a significant enhancement to this particular aspect by creating a "portal" where each user would look "through" their respective displays as if there were a "magic" sheet of glass in a frame to the other side in the remote location. This approach enables a number of important improvements for the user (assuming robust implementation):
1. Each user can form direct eye contact with the other.
2. Each user can move his or her head in any direction and look through the portal to the other side. They can even look "around" and see the environment as if looking through a window.
3. Device shaking is automatically corrected for since each user sees a view from their eye directly to the other side. Imagine if you looked through a window and shook the frame: there would be no change in the image seen through it.
4. Object size will be accurately represented regardless of view distance and angle.
The V3D aspects of the invention can be configured to deliver these advantages in a manner that fits within the highly optimized form factors of today's modern mobile devices, does not dramatically alter the economics of building such devices, and is viable within the current connectivity performance levels available to most users.
By way of example of the invention, FIG. 1 shows a perspective view of an exemplary prototype device 10, which includes a display 12 and three cameras: a top right camera 14, a bottom right camera 16, and a bottom left camera 18. In connection with this example, there will next be described various aspects of the invention relating to the unique user experience provided by the V3D invention.
OVERALL USER EXPERIENCE
Communication (including Video Conferencing) with Eye Contact
The V3D system of the invention enables immersive communication between people (and in various embodiments, between sites and places). In exemplary practices of the invention, each person can look "through" their screen and see the other place. Eye contact is greatly improved. Perspective and scale are matched to the viewer's natural view. Device shaking is inherently eliminated. As described herein, embodiments of the V3D system can be implemented in mobile configurations as well as traditional stationary devices.
FIGS. 2A-B, 3A-B, and 5A-B are images illustrating an aspect of the invention, in which the V3D system is used in conjunction with a smartphone 20, or like device. Smartphone 20 includes a display 22, on which is displayed an image of a face 24. The image may be, for example, part of a video/telephone conversation, in which a video image and sound conversation is being conducted with someone in a remote location, who is looking into the camera of their own smartphone.
FIGS. 2A and 2B illustrate a feature of the V3D system for improving eye contact. FIG. 2A shows the face image prior to correction. It will be seen that the woman appears to be looking down, so that there can be no eye contact with the other user or participant. FIG. 2B shows the face image after correction. It will be seen that in the corrected image, the woman appears to be making eye contact with the smartphone user.
FIGS. 3A-3B are a pair of diagrams illustrating the V3D system's "move left" (FIG. 3A) and "move right" (FIG. 3B) corrections. FIGS. 4A-4B are a pair of diagrams of the light pathways 26a, 26b in the scene shown respectively on display 22 in FIGS. 3A-3B (shown from above, with the background at the top) leading from face 24 and surrounding objects to viewpoints 28a, 28b through the "window" defined by display 22.
FIGS. 5A-5B are a pair of diagrams illustrating the V3D system's "move in" (FIG. 5A) and "move out" (FIG. 5B) corrections. FIGS. 6A-6B are a pair of diagrams of the light pathways 26c, 26d in the scene shown respectively on display 22 in FIGS. 5A-5B (shown from above, with the background at the top) leading from face 24 and surrounding objects to viewpoints 28c, 28d through the "window" defined by display 22.
Self Portraiture Example
Another embodiment of the invention utilizes the invention's ability to synthesize a virtual camera view of the user to aid in solving the problem of "where to look" when taking a self-portrait on a mobile device. This aspect of the invention operates by image-capturing the user per the overall V3D method of the invention described herein, tracking the position and orientation of the user's face, eyes or head, and, by using a display, presenting an image of the user back to themselves with a synthesized virtual camera viewpoint as if the user were looking in a mirror.
Photography Composition
Another embodiment of the invention makes it easier to compose a photograph using a rear-facing camera on a mobile device. It works like the overall V3D method of the invention described herein, except that the scene is captured through the rear-facing camera(s) and then, using the user's head location, a view is constructed such that the scale and perspective of the image matches the view of the user, such that the device display frame becomes like a picture frame. This results in a user experience where the photographer does not have to manipulate zoom controls or perform cropping, since they can simply frame the subject as they like within the frame of the display, and take the photo.
Panoramic Photography
Another embodiment of the invention enables the creation of cylindrical or spherical panoramic photographs, by processing a series of photographs taken with a device using the camera(s) running the V3D system of the invention. The user can then enjoy viewing the panoramic view thus created, with an immersive sense of depth. The panorama can either be viewed on a 2D display with head tracking, a multi-view display or a binocular virtual reality (VR) headset with a unique perspective shown for each eye. If the binocular VR headset has a facility to track head location, the V3D system can re-project the view accurately.
2. OVERALL V3D PROCESSING PIPELINE
FIG. 7 is a flow diagram illustrating the overall V3D digital processing pipeline 70, which includes the following aspects:
71: Image Capture: One or more images of a scene, which may include a human user, are collected instantaneously or over time via one or more cameras and fed into the system. Wide-angle lenses are generally preferred due to the ability to get greater stereo overlap between images, although this depends on the application and can in principle work with any focal length.
72: Image Rectification: In order to compensate for optical lens distortion from each camera and relative misalignment between the cameras in the multi-view system, image processing is performed to apply an inverse transform to eliminate distortion, and an affine transform to correct misalignment between the cameras. In order to perform efficiently and in real-time, this process can be performed using a custom imaging pipeline or implemented using the shading hardware present in many conventional graphics processing units (GPUs) today, including GPU hardware present in devices such as iPhones and other commercially available smartphones. Additional detail and other variations of these operations will be discussed in greater detail herein. (An illustrative code sketch of this step appears following this pipeline overview.)
73: Feature Correspondence: With the exception of using time-of-flight type sensors in the Image Capture phase that provide depth information directly, this process is used in order to extract parallax information present in the stereo images from the camera views. This process involves detecting common features between multi-view images and measuring their relative distance in image space to produce a disparity measurement. This disparity measurement can either be used directly or converted to actual depth based on knowledge of the camera field-of-view, relative positioning, sensor size and image resolution. Additional detail and other variations of these operations will be discussed in greater detail herein. (An illustrative code sketch of this step appears following this pipeline overview.)
74: Representation: Once disparity or depth information has been acquired, this information, combined with the original images, must be represented and potentially transmitted over a network to another user or stored. This could take several forms as discussed in greater detail herein.
75: Reconstruction: Using the previously established representation, whether stored locally on the device or received over a network, a series of synthetic views into the originally captured scene can be generated. For example, in a video chat the physical image inputs may have come from cameras surrounding the head of the user in which no one view has a direct eye contact gaze vector to the user. Using reconstruction, a synthetic camera view placed potentially within the bounds of the device display, enabling the visual appearance of eye contact, can be produced.
76: Head Tracking: Using the image capture data as an input, many different methods exist to establish an estimate of the viewer's head or eye location. This information can be used to drive the reconstruction and generate a synthetic view which looks valid from the user's established head location. Additional detail and various forms of these operations will be discussed in greater detail herein.
77: Display: Several types of display can be used with the V3D pipeline in different ways. The currently employed method involves a conventional 2D display combined with head tracking to update the display projection in real-time so as to give the visual impression of being three-dimensional (3D) or a look into a 3D environment. However, binocular stereo displays (such as the commercially available Oculus Rift) can be employed, or, still further, a lenticular-type display can be employed to allow auto-stereoscopic viewing.
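The sketches below illustrate, in Python with OpenCV and NumPy, the Image Rectification step (72) and the disparity-to-depth conversion mentioned under Feature Correspondence (73). They are hedged examples under assumed calibration values: the intrinsic matrix, distortion coefficients, rotation, baseline and focal length shown are placeholders that would in practice come from camera calibration, and the function names rectify_pair and disparity_to_depth are not part of the described system.
```python
import cv2
import numpy as np

# Placeholder calibration; real values come from calibrating the camera pair.
K1 = K2 = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
dist1 = dist2 = np.zeros(5)            # polynomial lens-distortion coefficients
R = np.eye(3)                          # relative rotation between the two cameras
T = np.array([0.06, 0.0, 0.0])         # assumed 6 cm baseline along x

def rectify_pair(img1, img2, image_size=(1280, 720)):
    """Step 72: undo lens distortion and apply the rectifying transform so that
    corresponding features lie along the same scanline in both images."""
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, image_size, R, T)
    m1 = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, image_size, cv2.CV_32FC1)
    m2 = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, image_size, cv2.CV_32FC1)
    return (cv2.remap(img1, m1[0], m1[1], cv2.INTER_LINEAR),
            cv2.remap(img2, m2[0], m2[1], cv2.INTER_LINEAR))

def disparity_to_depth(disparity_px, focal_px=800.0, baseline_m=0.06):
    """Step 73: convert a disparity measurement to depth from knowledge of the
    field of view (focal length in pixels) and relative positioning (baseline)."""
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)
```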
As described in greater detail below, portions of the V3D pipeline described herein and shown in
FIGS. 7 and 8 can also be used to enable the Facial Signature aspects of the invention, to enable a "signature" of a user's or subject's face, or face and head, to be generated from the Feature
Correspondence phase for purposes such as user identification, authentication or matching.
3. PIPELINE DETAILS
FIG. 8 is a diagram of an exemplary V3D pipeline 80 configured in accordance with the invention, for immersive communication with eye contact. The depicted pipeline is full-duplex, meaning that it allows simultaneous two-way communication in both directions. Pipeline 80 comprises a pair of communication devices 81A-B (for example, commercially available smartphones such as iPhones) that are linked to each other through a network 82. Each communication device includes a decoder end 83A-B for receiving and decoding communications from the other device and an encoder end 84A-B for encoding and sending communications to the other device 81A-B.
The decoder end 83A-B includes the following components:
a Receive module 831A-B;
a Decode module 832A-B;
a View Reconstruction module 833A-B; and
a Display 834A-B.
The View Reconstruction module 833A-B receives data 835A-B from a Head Tracking Module 836A-B, which provides x-, y-, and z-coordinate data with respect to the user's head that is generated by cameras 841A-B.
The encoder end 84A-B comprises a multi-camera array that includes cameras 841A-B and additional camera(s) 842A-B. (As noted herein, it is possible to practice various aspects of the invention using only two cameras.) The camera array provides data in the form of color camera streams 843A-B that are fed into a Color Image Redundancy Elimination module 844A-B and an Encode module. The output of the camera array is also fed into a Passive Feature Disparity Estimation module 845A-B that provides disparity estimation data to the Color Image Redundancy Elimination module 846A-B and the Encode module 847A-B. The encoded output of the device is then transmitted over network 82 to the Receive module 831A-B in the second device 81A-B.
These and other aspects of the invention are described in greater detail elsewhere herein.
IMAGE CAPTURE
The V3D system requires an input of images in order to capture the user and the world around the user. The V3D system can be configured to operate with a wide range of input imaging devices. Some devices, such as normal color cameras, are inherently passive and thus require extensive image processing to extract depth information, whereas non-passive systems can get depth directly, although they have the disadvantages of requiring reflected IR to work, and thus do not perform well in strongly naturally lit environments or large spaces. Those skilled in the art will understand that a wide range of color cameras and other passive imaging devices, as well as non-passive image capture devices, are commercially available from a variety of manufacturers.
Color Cameras
This descriptor is intended to cover the use of any visible light camera that can feed into a system in accordance with the V3D system.
IR-Structured Light
This descriptor is intended to cover the use of visible light or infrared specific cameras coupled with an active infrared emitter that beams one of many potential patterns onto the surfaces of objects, to aid in computing distance. IR-Structured Light devices are known in the art.
IR Time of Flight
This descriptor covers the use of time-of-flight cameras that work by emitting a pulse of light, and then measuring the time taken for reflected light to reach each of the camera's sensor elements. This is a more direct method of measuring depth, but has currently not reached the cost and resolution levels useful for significant consumer adoption. Using this type of sensor, in some practices of the invention the feature correspondence operation noted above could be omitted, since accurate depth information is already provided directly from the sensor.
Single Camera over Time
The V3D system of the invention can be configured to operate with multiple cameras positioned in a fixed relative position as part of a device. It is also possible to use a single camera, by taking images over time and with accurate tracking, so that the relative position of the camera between frames can be estimated with sufficient accuracy. With sufficiently accurate positional data, feature correspondence algorithms such as those described herein could continue to be used.
View-Vector Rotated Camera Configuration to improve Correspondence Quality
The following describes a practice of the V3D invention that relates to the positioning of the cameras within the multi-camera configuration, to significantly increase the number of valid feature correspondences between images captured in real-world settings. This approach is based on three observations:
1. Users typically orient their display, tablet or phone at a rotation that is level with their eyes.
2. Many features in man-made indoor or urban environments consist of edges aligned in the three orthogonal axes (x, y, z).
3. In order to have a practical search domain, feature correspondence algorithms typically perform their search along horizontal or vertical epipolar lines in image space.
Taken together, these observations lead to the conclusion that there are often large numbers of edges for which there is no definite correspondence. This situation can be significantly improved, while keeping the image processing overhead minimal, by applying a suitable rotation angle (or angular displacement) to the arrangement of the camera sensors, while also ensuring that the cameras are positioned relative to each other along epipolar lines. The amount of rotation angle can be relatively small. (See, for example, FIGS. 9, 10 and 11.) After the images are captured in this alternative "rotated" configuration, the disparity values can either be rotated along with the images, or the reconstruction phase can be run and the final image result rotated back to the correct orientation so that the user does not even perceive or see the rotated images.
There are a variety of spatial arrangements and orientations of the sensors that can accomplish a range of rotations while still fitting within many typical device form factors. FIGS. 9, 10, and 11 show three exemplary sensor configurations 90, 100, 1 10.
FIG. 9 shows a handheld device 90 comprising a display screen 91 surrotmded by a bezel 92. Sensors 93, 94, and 95 are located at the corners of bezel 92, arid define a pair of perpendicular axes: a first axis 96 between sensors 93 and 94, and a second axis 97 between cameras 94 and 95.
FIG. 10 shows a handheld device 100 comprising display 101, bezel 102, and sensors .103, 104. 105. In FIG . 1 , each of sensors 103. 104, 105 is rotated by an angle 0 relati ve to bezel 1 2. The position of the sensors .103, 104, and 105 on bezel 1 2 has been configured so that the three sensors define a pai r of perpendicular axes 1 6,. 1 7.
FIG. 11 shows a handheld device 110 comprising display 111, bezel 112, and sensors 113, 114, 115. In the alternative configuration shown in FIG. 11, the sensors 113, 114, 115 are not rotated. The sensors 113, 114, 115 are positioned to define perpendicular axes 116, 117 that are angled with respect to bezel 112. The data from sensors 113, 114, 115 are then rotated in software such that the correspondence continues to be performed along the epipolar lines.
Although an exemplary practice of the V3D invention uses 3 sensors to enable vertical and horizontal cross correspondence, the methods and practices described above are also applicable in a 2-camera stereo system.
FIGS. 12 and 13 next highlight advantages of a "rotated configuration" in accordance with the invention. In particular, FIG. 12A shows a "non-rotated" device configuration 120, with sensors 121, 122, 123 located in three corners, similar to configuration 90 shown in FIG. 9. FIGS. 12B, 12C, and 12D (collectively, FIGS. 12A-12D being referred to as "FIG. 12") show the respective scene image data collected at sensors 121, 122, 123.
Sensors 121 and 122 define a horizontal axis between them, and generate a pair of images with horizontally displaced viewpoints. For certain features, e.g., features H1 and H2, there is a strong correspondence (i.e., the horizontally-displaced scene data provides a high level of certainty with respect to the correspondence of these features). For other features, e.g., features H3 and H4, the correspondence is weak, as shown in FIG. 12 (i.e., the horizontally-displaced scene data provides a low level of certainty with respect to the correspondence of these features).
Sensors 122 and 123 define a vertical axis that is perpendicular to the axis defined by sensors 121 and 122. Again, for certain features, e.g., feature V1 in FIG. 12, there is strong correspondence. For other features, e.g., feature V2 in FIG. 12, the correspondence is weak.
FIG. 13A shows a device configuration 130, similar to configuration 100 shown in FIG. 10, with sensors 131, 132, 133 positioned and rotated to define an angled horizontal axis and an angled vertical axis. As shown in FIGS. 13B, 13C, and 13D, the use of an angled sensor configuration eliminates the weakly corresponding features shown in FIGS. 12B, 12C, and 12D. As shown by FIGS. 12 and 13, a rotated configuration of sensors in accordance with an exemplary practice of the invention enables strong correspondence for certain scene features where the non-rotated configuration did not.
Multi-Exposure Cycling
In accordance with the invention, during the process of calculating feature correspondence, a feature is selected in one image and then scanned for a corresponding feature in another image. During this process, there can often be several possible matches found, and various methods are used to establish which match (if any) has the highest likelihood of being the correct one.
As a general matter, when the input camera(s) capture an image, a choice is made to ensure that the camera exposure settings (such as gain and shutter speed) are selected according to various heuristics, with the goal of ensuring that a specific region or the majority of the image is within the dynamic range of the sensing element. Areas that are out of this dynamic range will either get clipped (overexposed regions) or suffer from a dominance of sensor noise rather than valid image signal.
During the process of feature correspondence and image reconstruction in an exemplary practice of the V3D invention, the correspondence errors in the excessively dark or light areas of the image can cause large-scale visible errors in the image by causing the computation of radically incorrect disparity or depth estimates.
Accordingly, another practice of the invention involves dynamically adjusting the exposure of the multi-view camera system on a frame-by-frame basis in order to improve the disparity estimation in areas outside the exposed region viewed by the user. Within the context of the histogram-based disparity method of the invention, described elsewhere herein, exposures would be taken at darker and lighter exposure settings surrounding the visibility-optimal exposure, have their disparity calculated, and then be integrated into the overall pixel histograms, which are retained and converged over time. The dark and light images could be, but are not required to be, presented to the user; they would serve only to improve the disparity estimation.
Another aspect of this approach, in accordance with the invention, is to analyze the variance of the disparity histograms on "dark" pixels, "mid-range" pixels and "light" pixels, and use this to drive the exposure settings of the cameras, thus forming a closed-loop system between the quality of the disparity estimate and the set of exposures which are requested from the input multi-view camera system. For example, if the cameras are viewing a purely indoor environment, such as an interior room, with limited dynamic range due to indirect lighting, only one exposure may be needed. If, however, the user were to (e.g.) open curtains or shades and allow direct sunlight to enter the room, the system would lack a strong disparity solution in those strongly lit areas and, in response to the closed-loop control described herein, would choose to take a reduced exposure sample on occasional video frames.
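The closed-loop behavior described above can be roughly sketched as follows. This is an illustrative sketch only, assuming that per-pixel interquartile ranges of the disparity histograms and a normalized luminance image are already available as NumPy arrays; the thresholds and exposure steps are hypothetical choices, not values taken from this specification.

```python
import numpy as np

def choose_extra_exposures(iqr, luminance, wide_iqr=8.0,
                           dark_thresh=0.2, light_thresh=0.8):
    """Decide whether darker or lighter exposures should be requested on
    upcoming frames, in addition to the visibility-optimal exposure.

    iqr       -- per-pixel interquartile range of the disparity histograms
    luminance -- per-pixel luminance of the reference image, in [0, 1]
    Returns a set of exposure adjustments (in EV steps, hypothetical units).
    """
    weak = iqr > wide_iqr                      # pixels with low-confidence disparity
    requests = set()
    # Weak disparity concentrated in bright regions -> occasionally sample
    # a reduced exposure to recover detail in the strongly lit areas.
    if np.mean(weak & (luminance > light_thresh)) > 0.02:
        requests.add(-2)
    # Weak disparity concentrated in dark regions -> occasionally sample an
    # increased exposure to lift those areas above the sensor noise floor.
    if np.mean(weak & (luminance < dark_thresh)) > 0.02:
        requests.add(+2)
    return requests
```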
IMAGE RECTIFICATION

An exemplary practice of the V3D system executes image rectification in real time using the GPU hardware of the device on which it is operating, such as a conventional smartphone, to facilitate and improve the overall solution.
Typically, within a feature correspondence system, a search must be performed between two cameras arranged in a stereo configuration in order to detect the relative movement of features in the image due to parallax. This relative movement is measured in pixels and is referred to as "the disparity".
FIG. 14 shows an exemplary pair of unrectified and distorted (URD) source camera images 140A and 140B for left and right stereo. As shown in FIG. 14, the image pair includes a matched feature, i.e., the subject's right eye 141A, 141B. The matching feature has largely been shifted horizontally, but there is also a vertical shift because of slight misalignment of the cameras and the fact that there is a polynomial term resulting from lens distortion. The matching process can be optimized by measuring the lens distortion polynomial terms, and by inferring the affine transform required to apply to the images such that they are rectified to appear perfectly horizontally aligned and co-planar. When this is done, what would otherwise be a freeform 2D search for a feature match can be simplified to a search along the same horizontal row of the source image to find the match.
Typically, this is done in one step, in which the lens distortion and affine transform coefficients are determined and applied together to produce the corrected images. One practice of the invention, however, uses a different approach, which will next be described. First, however, we define a number of terms used herein to describe this approach and the transforms used therein, as follows:
URD (Unrectified, Distorted) space: This is the image space in which the source camera images are captured. There is both polynomial distortion due to the lens shape and an affine transform that makes the image not perfectly co-planar and axis-aligned with the other stereo image. The number of URD images in the system is equal to the number of cameras in the system.
URUD (Unrectified, Undistorted) space: This is a space in which the polynomial lens distortion is removed from the image but the images remain unrectified. The number of URUD images in the system is equal to the number of URD images, and therefore cameras, in the system.
RUD (Rectified, Undistorted) space: This is a space in which the polynomial lens distortion is removed from the image and an affine transform is applied to make the image perfectly co-planar and axis-aligned with the other stereo image on the respective axis. RUD images always exist in pairs. As such, for example, in a 3-camera system where the cameras are arranged in a substantially L-shaped configuration (having two axes intersecting at a selected point), there would be two stereo axes, and thus 2 pairs of RUD images, and thus a total of 4 RUD images in the system.
FIG. 15 is a flow diagram 150 providing various examples of possible transforms in a 4-camera V3D system. Note that there are 4 stereo axes. Diagonal axes (not shown) would also be possible.
The typical transform when sampling the source camera images in a stereo correspondence system is to transform from RUD space (the desired space for feature correspondence on a stereo axis) to URD space (the source camera images). In an exemplary practice of the V3D invention, it is desirable to incorporate multiple stereo axes into the solution in order to compute more accurate disparity values. In order to do this, it is appropriate to combine the disparity solutions between independent stereo axes that share a common camera. As such, an exemplary practice of the invention makes substantial use of the URUD image space to connect the stereo axes' disparity values together. This is a significant observation, because of the trivial invertibility of the affine transform (which is simply, for example, a 3x3 matrix). We would not be able to use the URD space to combine disparities between stereo axes because the polynomial lens distortion is not invertible, due to the problem of multiple roots and general root finding. This process of combining axes in the V3D system is further described below, in "Combining Correspondences on Additional Axes".
FIG. 16 sets forth a flow diagram 160, and FIGS. 17A-C are a series of images that illustrate the appearance and purpose of the various transforms on a single camera image.
Feature Correspondence Algorithm
The "image correspondence problem" has been the subject of computer science research for many years. However, given the recent advent of the universal availability of low-cost cameras and massively parallel computing hardware (GPUs) contained in many smartphones and other common mobile devices, it is now possible to apply brute-force approaches and statistically based methods to feature correspondence, involving more than just a single stereo pair of images, involving images over the time dimension and at multiple spatial frequencies, to execute feature correspondence calculations at performance levels not previously possible.
Various exemplary practices of the invention will next be described, which are novel and represent significant improvements to the quality and reliability attainable in feature correspondence. A number of these approaches, in accordance with the invention, utilize a method of representation referred to herein as "Disparity Histograms" on a per-pixel (or pixel-group) basis, to integrate and make sense of collected data.
Combining Correspondences on Additional Axes
An exemplary practice of the invention addresses the following two problems:
Typical correspondence errors resulting from matching errors in a single stereo image pair.
Major correspondence errors that occur when a particular feature in one image within the stereo pair does not exist in the other image.
This practice of the invention works by extending the feature correspondence algorithm to include one or more additional axes of correspondence and integrating the results to improve the quality of the solution.
FIGS. 18A-D illustrate an example of this approach. FIG. 18A is a diagram of sensor configuration 180 having 3 cameras 181, 182, 183 in a substantially L-shaped configuration such that a stereo pair exists on both the horizontal axis 185 and vertical axis 186, with one camera in common between the axes, similar to the configuration 90 shown in FIG. 9. Provided the overall system contains a suitable representation to integrate the multiple disparity solutions (one such representation being the "Disparity Histograms" practice of the invention discussed herein), this configuration will allow uncertain correspondences in one stereo pair to be either corroborated or discarded through the additional information found by performing correspondence on the other axis. In addition, certain features which have no correspondence on one axis may find a correspondence on the other axis, allowing for a much more complete disparity solution for the overall image than would otherwise be possible.
FIGS. 18B, 18C, and 18D are depictions of three simultaneous images received respectively by sensors 181, 182, 183. The three-image set is illustrative of all the points mentioned above.
Feature (A), i.e., the subject's nose, is found to correspond both on the horizontal stereo pair (FIGS. 18B and 18C) and the vertical stereo pair (FIGS. 18C and 18D). Having the double correspondence helps eliminate correspondence errors by improving the signal-to-noise ratio, since the likelihood of the same erroneous correspondence being found in both axes is low.
Feature (B), i.e., the spool of twine, is found to correspond only on the horizontal stereo pair. Had the system only included a vertical pair, this feature would not have had a depth estimate, because it is entirely out of view in the upper image.
Feature (C), i.e., the cushion on the couch, is only possible to correspond on the vertical axis. Had the system only included a horizontal stereo pair, the cushion would have been entirely occluded in the left image, meaning no valid disparity estimate could have been established.
An important detail is that in many cases the stereo pair on a particular axis will have undergone a calibration process such that the epipolar lines are aligned to the rows or columns of the images. Each stereo axis will have its own unique camera alignment properties, and hence the coordinate systems of the features will be incompatible. In order to integrate disparity information on pixels between multiple axes, the pixels containing the disparity solutions will need to undergo coordinate transformation to a unified coordinate system. In an exemplary practice of the invention, this means that the stereo correspondence occurs in the RUD space, but the resultant disparity data and disparity histograms would be stored in the URUD (Unrectified, Undistorted) coordinate system, and a URUD to RUD transform would be performed to gather the per-axis disparity values.
Correspondence Refinement Over Time

This aspect of the invention involves retaining a representation of disparity in the form of the error function or, as described elsewhere herein, the disparity histogram, and continuing to integrate disparity solutions for each frame in time to converge on a better solution through additional sampling.
Filling Unknowns With Historic Solutions
This aspect of the invention is a variation of the correspondence-refinement-over-time aspect. In cases where a given feature is detected but for which no correspondence can be found in another camera, if there was a prior solution for that pixel from a previous frame, this can be used instead.

Histogram-Based Disparity Representation Method
This aspect of the invention provides a representation that allows multiple disparity-measuring techniques to be combined to produce a higher quality estimate of image disparity, potentially even over time. It also permits a more efficient method of estimating disparity, taking into account more global context in the images, without the significant cost of large per-pixel kernels and image differencing.
Most disparity estimation methods, for a given pixel in an image in the stereo pair, involve sliding a region of pixels (known as a kernel) surrounding the pixel in question from one image over the other in the stereo pair, computing the difference for each pixel in the kernel, and reducing this to a scalar value for each disparity being tested.
Given a kernel of reference pixels and a kernel of pixels to be compared with the reference, a number of methods exist to produce a scalar difference between them, including the following:
1. Sum of Absolute Differences (SAD)
2. Zero-mean Sum of Absolute Differences (ZSAD)
3. Locally scaled Sum of Absolute Differences (LSAD)
4. Sum of Squared Differences (SSD)
5. Zero-mean Sum of Squared Differences (ZSSD)
6. Locally scaled Sum of Squared Differences (LSSD)
7. Normalized Cross Correlation (NCC)
8. Zero-Mean Normalized Cross Correlation (ZNCC)
9. Sum of Hamming Distances (SHD)
This calculation is repeated as the kernel is slid over the image being compared.
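As a rough sketch of the kernel-sliding comparison just described, the following NumPy fragment computes a per-pixel disparity estimate for a rectified horizontal stereo pair using the SAD measure with a small kernel. It is illustrative only; the kernel size, disparity range and boundary handling are arbitrary choices, not values from this specification.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fdde_sad(left, right, max_disp=64, k=2):
    """Brute-force, left-referenced disparity estimate using the SAD measure
    over a (2k+1) x (2k+1) kernel on a rectified grayscale stereo pair."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    best_err = np.full(left.shape, np.inf, dtype=np.float32)
    best_disp = np.zeros(left.shape, dtype=np.int32)
    for d in range(max_disp):
        # For a rectified pair, a feature at disparity d sits d pixels away
        # along the same row of the other image; np.roll stands in for that
        # horizontal shift here (image edges are ignored in this sketch).
        shifted = np.roll(right, d, axis=1)
        # Mean absolute difference over the kernel (proportional to SAD).
        err = uniform_filter(np.abs(left - shifted), size=2 * k + 1)
        better = err < best_err
        best_err[better] = err[better]
        best_disp[better] = d
    return best_disp
```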
FIG. 19 is a graph 190 of cumulative error for a 5x5 block of pixels for disparity values between 0 and 128 pixels. In this example, it can be seen that there is a single global minimum that is likely to be the best solution.
In various portions of this description of the invention, reference may be made to a specific one of the image comparison methods, such as SSD (Sum of Squared Differences). Those skilled in the art will understand that in many instances, others of the above-listed image comparison error measurement techniques could be used, as could others known in the art. Accordingly, this aspect of the image processing technique is referred to herein as a "Fast Dense Disparity Estimate", or "FDDE".
Used by itself, this type of approach has some problems, as follows:
Computational Overhead
Every pixel for which a disparity solution is required must perform a large number of per-pixel memory accesses and math operations. This cost scales approximately with the square of the radius of the kernel multiplied by the number of possible disparity values to be tested for.
Non-Uniform Importance of Individual Features in the Kernel
With the exception of the normalized cross correlation methods, the error function is significantly biased based on image intensity similarity across the entire kernel. This means that subtle features with non-extreme intensity changes will fail to attain a match if they are surrounded by areas of high intensity change, since the error function will tend to "snap" to the high intensity regions. In addition, small differences in camera exposure will bias the disparity because of the "non-democratic" manner in which the optimal kernel position is chosen.
An example of this is shown in FIGS. 20A-B and 21A-B. FIGS. 20A and 20B are two horizontal stereo images. FIGS. 21A and 21B, which correspond to FIGS. 20A and 20B, show a selected kernel of pixels around the solution point for which we are trying to compute the disparity. It can be seen that for the kernel at its current size, the cumulative error function will have two minima: one representing the features that have a small disparity, since they are in the image background, and one representing those on the wall, which are in the foreground and will have a larger disparity. In the ideal situation, the minimum would flip from the background to the foreground disparity value as close to the edge of the wall as possible. In practice, due to the high intensity of the wall pixels, many of the background pixels snap to the disparity of the foreground, resulting in a serious quality issue forming a border near the wall.

Lack of Meaningful Units
The unit of measure of "error", i.e., the Y-axis on the example graph, is unscaled and may not be compatible between multiple cameras, each with its own color and luminance response. This introduces difficulty in applying statistical methods or combining error estimates produced through other methods. For example, computing the error function from a different stereo axis would be incompatible in scale, and thus the terms could not be easily integrated to produce a better error function.
This is an instance in which the disparity histogram method of the invention becomes highly useful, as will next be described.
Operation of the Disparity Histogram Representation
One practice of the disparity histogram solution method of the invention works by maintaining a histogram showing the relative likelihood of a particular disparity being valid for a given pixel. In other words, the disparity histogram behaves as a probability density function (PDF) of disparity for a given pixel, with higher values indicating a higher likelihood that the disparity range is the "truth".
FIG. 22 shows an example of a typical disparity histogram for a pixel. For each pixel histogram, the x-axis indicates a particular disparity range and the y-axis scale is the number of pixels in the kernel surrounding the central pixel that are "voting" for that given disparity range.
FIGS. 23 and 24 show a pair of images and associated histograms. As shown therein, the votes can be generated by using a relatively fast and low-quality estimate of disparity produced using small kernels and standard SSD-type methods. According to an aspect of the invention, the SSD method is used to produce a "fast dense disparity map" (FDDE), wherein each pixel has a selected disparity that is the lowest error. Then, the algorithm goes through each pixel, accumulating into the histogram a tally of the number of votes for a given disparity in a larger kernel surrounding the pixel.
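The voting step just described might be sketched as follows: each pixel's histogram accumulates one vote per FDDE disparity found in a larger kernel around it. The kernel radius and bin count are arbitrary, and the loop-based form is chosen for clarity rather than speed.

```python
import numpy as np

def build_disparity_histograms(fdde_disp, num_bins, vote_radius=7):
    """For every pixel, tally votes for each disparity bin using the fast
    (noisy) FDDE disparities found in the (2*vote_radius+1)^2 kernel of
    pixels surrounding it."""
    h, w = fdde_disp.shape
    hist = np.zeros((h, w, num_bins), dtype=np.uint16)
    rows, cols = np.indices((h, w))
    for dy in range(-vote_radius, vote_radius + 1):
        for dx in range(-vote_radius, vote_radius + 1):
            # Shift the FDDE map so each pixel "sees" one neighbour's
            # disparity, then add one vote to the corresponding bin.
            neighbour = np.roll(np.roll(fdde_disp, dy, axis=0), dx, axis=1)
            neighbour = np.clip(neighbour, 0, num_bins - 1).astype(np.intp)
            hist[rows, cols, neighbour] += 1
    # The most likely disparity for a pixel is the bin with the most votes.
    return hist, np.argmax(hist, axis=2)
```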
With a given disparity histogram, many forms of analysis can be performed to establish the most likely disparity for the pixel, confidence in the solution validity, and even identify cases where there are multiple highly likely solutions. For example, if there is a single dominant mode in the histogram, the x coordinate of that peak denotes the most likely disparity solution.
FIG. 25 shows an example of a bi-modal disparity histogram with 2 equally probable disparity possibilities.
FIG. 26 is a diagram of an example showing the disparity histogram and associated cumulative distribution function (CDF). The interquartile range is narrow, indicating high confidence.
FIG. 27 is a contrasting example showing a wide interquartile range in the CDF and thus a low confidence in any peak within that range.
By transforming the histogram into a cumulative distribution function (CDF), the width of the interquartile range can be established. This range can then be used to establish a confidence level in the solution. A narrow interquartile range (as in FIG. 26) indicates that the vast majority of the samples agree with the solution, whereas a wide interquartile range (as in FIG. 27) indicates that the solution confidence is low because many other disparity values could be the truth.
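A minimal sketch of this confidence measure, assuming the per-pixel histogram is a one-dimensional NumPy array of vote counts; what counts as a "narrow" range is left to a caller-chosen threshold.

```python
import numpy as np

def disparity_confidence(histogram):
    """Return (peak_bin, interquartile_range) for one pixel's disparity
    histogram. A narrow interquartile range means most votes agree with the
    peak (high confidence); a wide range means low confidence."""
    votes = np.asarray(histogram, dtype=np.float64)
    total = votes.sum()
    if total == 0:
        return None, np.inf
    cdf = np.cumsum(votes) / total
    q1 = np.searchsorted(cdf, 0.25)   # bin where the CDF crosses 25%
    q3 = np.searchsorted(cdf, 0.75)   # bin where the CDF crosses 75%
    return int(np.argmax(votes)), float(q3 - q1)
```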
A count of the number of statistically significant modes in the histogram can be used to indicate "modality." For example, if there are two strong modes in the histogram (as in FIG. 25), it is highly likely that the point in question is right on the edge of a feature that demarks a background-versus-foreground transition in depth. This can be used to control the reconstruction later in the pipeline to control stretch versus slide (discussed in greater detail elsewhere herein).
Due to the fact that the y-axis scale is now in terms of votes for a given disparity rather than the typical error functions, the histogram is not biased by variation in image intensity at all, allowing for high quality disparity edges on depth discontinuities. In addition, this permits other methods of estimating disparity for the given pixel to be easily integrated into a combined histogram.
If we are processing multiple frames of images temporally, we can preserve the disparity histograms over time and accumulate samples into them to account for camera noise or other spurious sources of motion or error.
If there are multiple cameras, it is possible to produce fast disparity estimates for multiple independent axes and combine the histograms to produce a much more statistically robust disparity solution. With a standard error function, this would be much more difficult because the scale would make the functions less compatible. With the histograms of the present invention, in contrast, everything is measured in pixel votes, meaning the results can simply be multiplied or added to allow agreeing disparity solutions to compound, and erroneous solutions to fall into the background noise.
Using the histograms, if we find the interquartile range of the CDF to be wide in areas of a particular image intensity, this may indicate an area of poor signal to noise, i.e., underexposed or overexposed areas. Using this, we can control the camera exposures to fill in poorly sampled areas of the histograms.
Computational performance is another major benefit of the histogram-based method. The SSD approach (which is an input to the histogram method) is computationally demanding due to the per-pixel math and memory access for every kernel pixel, for every disparity to be tested. With the histograms, a small SSD kernel is all that is needed to produce inputs to the histograms. This is highly significant, since SSD performance is proportional to the square of its kernel size multiplied by the number of disparity values being tested for. Even though the small SSD kernel output is a noisy disparity solution, the subsequent voting, which is done over a larger kernel of pixels to produce the histograms, filters out so much of the noise that it is, in practice, better than the SSD approach, even with very large kernels. The histogram accumulation is only an addition function, need only be done once per pixel per frame, and does not increase in cost with additional disparity resolution.
Another useful practice of the invention involves testing only for a small set of disparity values with SSD, populating the histogram, and then using the histogram votes to drive further SSD testing within that range to improve disparity resolution over time.
One implementation of the invention involves each output pixel thread having a respective "private histogram" maintained in on-chip storage close to the computation units (e.g., GPUs). This private histogram can be stored such that each pixel thread will be reading and writing to the histogram on a single dedicated bank of shared local memory on a modern programmable GPU. In addition, if the maximum possible number of votes is known, multiple histogram bins can be packed into a single word of the shared local memory and accessed using bitwise operations. These details can be useful to reduce the cost of dynamic indexing into an array during the voting and the summation.
Multi-Level Histogram Voting
This practice of the invention is an extension of the disparity histogram aspect of the invention, and has proven to be a highly useful means of reducing error in the resulting disparity values, while still preserving important detail on depth discontinuities in the scene.
Errors in the disparity values can come from many sources. Multi-level disparity histograms reduce the contribution from several of these error sources, including:
1. Image sensor noise.
2. Repetitive patterns at a given image frequency.
As with the idea of combining multiple stereo axes' histogram votes into the disparity histogram for the purpose of "tie-breaking" and reducing false matches, the multi-level voting scheme applies that same concept, but across descending frequencies in the image space.
FIGS. 28A and 28B show an example of a horizontal stereo image pair. FIGS. 28C and 28D show, respectively, the resulting disparity data before and after application of the described multi-level histogram technique.
This aspect of the invention works by performing the image pattern matching (FDDE) at several successively low-pass filtered versions of the input stereo images. The term "level" is used herein to define a level of detail in the image, where higher level numbers imply a lower level of detail. In one practice of the invention, the peak image frequencies at level[n] will be half those of level[n-1].
Many methods can be used to downsample, and such methods are known in the area of image processing. Many of these methods involve taking a weighted summation of a kernel in level[n-1] to produce a pixel in level[n]. In one practice of the invention, the approach would be for the normalized kernel center position to remain the same across all of the levels.
FIGS. 30A-E are a series of exemplary left and right multi-level input images. Each level[n] is a downsampled version of level[n-1]. In the example of FIG. 30, the downsampling kernel is a 2x2 kernel with equal weights of 0.25 for each pixel. The kernel remains centered at each successive level of downsampling.
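The construction of the levels can be sketched as below. The 2x2 box filter with equal weights follows the example of FIG. 30, while the number of levels is an arbitrary choice for this sketch.

```python
import numpy as np

def build_levels(image, num_levels=5):
    """Build successively half-resolution levels; level[0] is the input, and
    each following level is produced with an equal-weight 2x2 box filter so
    the normalized kernel center stays the same across levels."""
    levels = [image.astype(np.float32)]
    for _ in range(1, num_levels):
        prev = levels[-1]
        h, w = (prev.shape[0] // 2) * 2, (prev.shape[1] // 2) * 2
        prev = prev[:h, :w]
        # Average each 2x2 block (a weight of 0.25 per pixel).
        levels.append(0.25 * (prev[0::2, 0::2] + prev[1::2, 0::2] +
                              prev[0::2, 1::2] + prev[1::2, 1::2]))
    return levels
```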
In this practice of the invention, for a given desired disparity solution at the full image resolution, the FDDE votes for every image level are included. Imagine a repetitive image feature, such as the white wooden beams on the cabinets shown in the background of the example of FIG. 30. At level[0] (the full image resolution), several possible matches may be found by the FDDE image comparisons, since each of the wooden beams looks rather similar to the others, given the limited kernel size used for the FDDE.
FIG. 31 depicts an example of an image pair and disparity histogram, illustrating an incorrect matching scenario and its associated disparity histogram (see the notation "Winning candidate: incorrect" in the histogram of FIG. 31).
In contrast, and in accordance with an exemplary practice of the invention, FIG. 32 shows the same scenario, but with the support of 4 lower levels of FDDE votes in the histogram, resulting in a correct winning candidate (see the notation "Winning candidate: correct" in the histogram of FIG. 32). Note that the lower levels provide support for the true candidate at the higher levels. As shown in FIG. 32, if one looks at a lower level (i.e., a level characterized by reduced image resolution via low-pass filtering), the individual wooden beams shown in the image become less pronounced, and the overall form of the broader context of that image region begins to dominate the pattern matching. By combining together all the votes at each level, where there may have been multiple closely competing candidate matches at the lower levels, the higher levels will likely have fewer potential candidates, and thus cause the true matches at the lower levels to be emphasized and the erroneous matches to be suppressed. This is the "tie-breaking" effect that this practice of the invention provides, resulting in a higher probability of correct winning candidates.
FIG. 33 is a schematic diagram of an exemplary practice of the invention. In particular, FIG. 33 depicts a processing pipeline showing the series of operations between the input camera images, through FDDE calculation and multi-level histogram voting, to a final disparity result. As shown in FIG. 33, multiple stereo axes (e.g., 0 through n) can be included in the multi-level histogram.
Having described multi-level disparity histogram representations in accordance with the invention, the following describes how the multi-level histogram is represented, and how to reliably integrate its result to locate the final most likely disparity solution.
Representation of the Multi-Level Histogram
FIG. 34 shows a logical representation of the multi-level histogram after votes have been placed at each level. FIG. 35 shows a physical representation of the same multi-level histogram in numeric arrays in device memory, such as the digital memory units in a conventional smartphone architecture. In an exemplary practice of the invention, the multi-level histogram consists of a series of initially independent histograms at each level of detail. Each histogram bin in a given level represents the votes for a disparity found by the FDDE at that level. Since level[n] has a fraction of the resolution of level[n-1], each calculated disparity value represents a disparity uncertainty range which is that same fraction as wide. For example, in FIG. 34, each level is half the resolution of the one above it. As such, the disparity uncertainty range represented by each histogram bin is twice as wide as that of the level before it.
Sub-Pixel Shifting of Input Images to Enable Multi-Level Histogram Integration
In an exemplary practice of the invention, a significant detail needed to render the multi-level histogram integration correct involves applying a sub-pixel shift to the disparity values at each level during downsampling. As shown in FIG. 34, if we look at the votes in level[0], disparity bin 8, these represent votes for disparity values 8-9. At level[1], the disparity bins are twice as wide. As such, we want to ensure that the histograms remain centered under the level above. Level[1] shows that the same bin represents 7.5 through 9.5. This half-pixel offset is highly significant, because image error can cause the disparity to be rounded to the neighboring bin and then fail to receive support from the level below.
In order to ensure that the histograms remain centered under the level above, an exemplary practice of the invention applies a half-pixel shift to only one of the images in the stereo pair at each level of downsampling. This can be done inline within the weights of the filter kernel used to do the downsampling between levels. While it is possible to omit the half-pixel shift and use more complex weighting during multi-level histogram summation, it is very inefficient. Performing the half-pixel shift during downsampling only involves modifying the filter weights and adding two extra taps, making it almost "free" from a computational standpoint.
This practice of the invention is further illustrated in FIG. 36, which shows an example of per-level downsampling according to the invention, using a 2x2 box filter. On the left is illustrated a method without a half-pixel shift. On the right of FIG. 36 is illustrated the modified filter with a half-pixel shift, in accordance with an exemplary practice of the invention. Note that this half-pixel shift should only be applied to one of the images in the stereo pair. This has the effect of disparity values remaining centered at each level in the multi-level histogram during voting, resulting in the configuration shown in FIG. 34.
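One plausible reading of "modifying the filter weights and adding two extra taps" is to compose the 2x2 box filter with a [0.5, 0.5] shift along the stereo axis, giving a 2x3 kernel whose rows weight their three taps as [0.125, 0.25, 0.125]. The sketch below implements that reading; the exact tap weights are an assumption, not taken from FIG. 36, and the shift would be applied to only one image of the pair.

```python
import numpy as np

def downsample_half_pixel_shift(image):
    """Downsample by 2 with a half-pixel shift along x folded into the filter.

    The 2x2 box filter (0.25 per tap) composed with a [0.5, 0.5] horizontal
    shift yields a 2x3 kernel with row weights [0.125, 0.25, 0.125]."""
    img = image.astype(np.float32)
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = np.pad(img[:h, :w], ((0, 0), (0, 1)), mode="edge")  # one extra column
    rows0, rows1 = img[0::2], img[1::2]
    return (0.125 * (rows0[:, 0:w:2]     + rows1[:, 0:w:2]) +
            0.25  * (rows0[:, 1:w+1:2]   + rows1[:, 1:w+1:2]) +
            0.125 * (rows0[:, 2:w+2:2]   + rows1[:, 2:w+2:2]))
```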
Integration of the Multi-Level Histogram
FIG. 37 illustrates an exemplary practice of the invention, showing an example of the summation of the multi-level histogram to produce a combined histogram in which the peak can be found. Provided that the correct sub-pixel shifting has been applied, the histogram integration involves performing a recursive summation across all of the levels, as shown in FIG. 37. Typically, only the peak disparity index and the number of votes for that peak are needed, and thus the combined histogram does not need to be actually stored in memory. In addition, maintaining a summation stack can reduce summation operations and multi-level histogram memory accesses.
During the summation, the weighting of each level can be modified to control the amount of effect that the lower levels have in the overall voting. In the example shown in FIG. 37, the current value at level[n] gets added to two of the bins above it in level[n-1], with a weight of ½ each.
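One way to perform this recursive summation for a single pixel is sketched below, assuming each level's histogram has half as many bins as the level above it and that the half-pixel centering described earlier has already been applied; the per-level weight of 1/2 follows the example of FIG. 37.

```python
def combined_peak(levels):
    """Sum a multi-level disparity histogram for one pixel and return the
    winning full-resolution disparity bin.

    levels[0] is the full-resolution histogram (a list of vote counts);
    levels[n] has half the bins of levels[n-1]. Each coarse bin contributes,
    with weight 1/2 per recursion step, to the two finer bins it covers."""
    def support(level, bin_index):
        # Recursively gather the down-weighted votes from the coarser levels
        # underneath this bin.
        if level + 1 >= len(levels):
            return 0.0
        coarse = levels[level + 1]
        coarse_bin = bin_index // 2
        below = coarse[coarse_bin] if coarse_bin < len(coarse) else 0
        return 0.5 * (below + support(level + 1, coarse_bin))
    combined = [levels[0][b] + support(0, b) for b in range(len(levels[0]))]
    return max(range(len(combined)), key=lambda b: combined[b])
```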
Extraction of Sub-Pixel Disparity Information from Disparity Histograms

An exemplary practice of the invention, illustrated in FIGS. 39-40, builds on the disparity histograms and allows for a higher-accuracy disparity estimate to be acquired without requiring any additional SSD steps to be performed, and for only a small amount of incremental math when selecting the optimal disparity from the histogram.
FIG. 38 is a disparity histogram for a typical pixel. In the example, there are 10 possible disparity values, each tested using SSD and then accumulated into the histogram with 10 bins. In this example, there is a clear peak in the 4th bin, which means that the disparity lies between 3 and 4, with a center point of 3.5.
FIG. 39 is a histogram for a situation in which a sub-pixel disparity solution can be inferred from the disparity histogram. We can see that an equal number of votes exists in the 3rd and 4th bins. As such, we can say that the true disparity range lies between 3.5 and 4.5, with a center point of 4.0.
FIG. 40 is a histogram that reveals another case in which a sub-pixel disparity solution can be inferred. In this case, the 3rd bin is the peak with 10 votes; its directly adjacent neighbor is at 5 votes. As such, we can state that the sub-pixel disparity is between these two and closer to the 3rd bin, ranging from 3.25 to 4.25, using the following equation:
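The equation itself does not survive in this text. One reconstruction that is consistent with the numeric examples above (an equal-vote neighbor shifts the center by half a bin, and a half-strength neighbor by a quarter of a bin) would be, with c_peak the center of the peak bin, v_peak and v_adj the vote counts of the peak bin and its strongest adjacent bin, w the bin width, and s = ±1 indicating on which side that neighbor lies:

    d_subpixel ≈ c_peak + s · (w / 2) · (v_adj / v_peak)

This is offered as an assumption rather than as the formula of the specification.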
Center-Weighted SSD Method

Another practice of the invention provides a further method of solving the problem whereby larger kernels in the SSD method tend to favor larger intensity differences within the overall kernel, rather than for the pixel being solved. This method of the invention involves applying a higher weight to the center pixel, with a decreasing weight proportional to the distance of the given kernel sample from the center. By doing this, the error function minimum will tend to be found closer to the valid solution for the pixel being solved.
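A sketch of the center-weighted comparison follows, using a Gaussian falloff as one plausible choice of decreasing weight; the specification does not name a particular falloff, and sigma is an arbitrary parameter of this sketch.

```python
import numpy as np

def center_weighted_ssd(ref_kernel, cmp_kernel, sigma=1.5):
    """Sum of squared differences between two equally sized kernels, with
    each term weighted so that the center pixel counts most and the weight
    falls off with distance from the center."""
    kh, kw = ref_kernel.shape
    y, x = np.mgrid[0:kh, 0:kw]
    cy, cx = (kh - 1) / 2.0, (kw - 1) / 2.0
    dist2 = (y - cy) ** 2 + (x - cx) ** 2
    weights = np.exp(-dist2 / (2.0 * sigma ** 2))   # Gaussian falloff (assumed)
    diff = ref_kernel.astype(np.float32) - cmp_kernel.astype(np.float32)
    return float(np.sum(weights * diff ** 2))
```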
Injective Constraint

Yet another aspect of the invention involves the use of an "injective constraint", as illustrated in FIGS. 41-45. When producing a disparity solution for an image, the goal is to produce the most correct results possible. Unfortunately, due to imperfect input data from the stereo cameras, incorrect disparity values will get computed, especially if only using the FDDE data produced via image comparison using SSD, SAD or one of the many other image comparison error measurement techniques.
FIG. 41 shows an exemplary pair of stereoscopic images and the disparity data resulting from the FDDE using SAD with a 3x3 kernel. Warmer colors represent closer objects. A close look at FIG. 41 reveals occasional values which look obviously incorrect. Some of the factors causing these errors include camera sensor noise, image color response differences between sensors, and lack of visibility of common features between cameras. In accordance with the invention, one way of reducing these errors is by applying "constraints" to the solution which reduce the set of possible solutions to a more realistic set of possibilities. As discussed elsewhere herein, solving the disparity across multiple stereo axes is a form of constraint, using the solution on one axis to reinforce or contradict that of another axis. The disparity histograms are another form of constraint, limiting the set of possible solutions by filtering out spurious results in 2D space. Multi-level histograms constrain the solution by ensuring agreement of the solution across multiple frequencies in the image.
The injective constraint aspect of the invention uses geometric rules about how features must correspond between images in the stereo pair to eliminate false disparity solutions. It maps these geometric rules onto the concept of an injective function in set theory.
In set theory there are four major categories of function type that map one set of items (the domain) onto another set (the co-domain):
1. Injective, surjective function (also known as a bijection): All elements in the co-domain are referenced exactly once by elements in the domain.
2. Injective, non-surjective function: Some elements in the co-domain are referenced at most once by elements in the domain. This means that not all elements in the co-domain have to be referenced, but no element will be referenced more than once.
3. Non-injective, surjective function: All elements in the co-domain are referenced one or more times by elements in the domain.
4. Non-injective, non-surjective function: Some elements in the co-domain are referenced one or more times by elements in the domain. This means that not all elements in the co-domain have to be referenced.
In the context of feature correspondence, the domain and co-domain are the pixels from each of the stereo cameras on an axis. The references between the sets are the disparity values. For example, if every pixel in the domain (image A) had a disparity value of 0, then this means that a perfect bijection exists between the two images, since every pixel in the domain maps to the same pixel in the co-domain.
Given the way that features in an image are shifted between the two cameras, we know that elements in the co-domain (Image B) can only shift in one direction (i.e. disparity values are > 0) for diffuse features in the scene. When features exist at the same depth they will all shift together at the same rate, maintaining a bijection.
FIG. 42 shows an example of a bijection where every pixel in the domain maps to a unique pixel in the co-domain. In this case, the image features are all at infinite distance and thus do not appear to shift between the camera images.
FIG. 43 shows another example of a bijection. In this case all the image features are closer to the cameras, but are all at the same depth and hence shift together.
However, since features will exist at different depths, some features will shift more than others and will sometimes even cross over each other. In this situation, occlusions in the scene will be occurring, which means that sometimes a feature visible in image "A" will be totally occluded by another object in image "B".
In this situation, not every feature in the co-domain image will be referenced if it was occluded in the domain image. Even still, it is impossible for a feature in the co-domain to be referenced more than once by the domain. This means that while we cannot enforce a bijective function, we can assert that the function must be injective. This is where the name "injective constraint" is derived.
FIG. 44 shows an example of an image with a foreground and background. Note that the foreground moves substantially between images. This causes new features to be revealed in the co-domain that will have no valid reference in the domain. This is still an injective function, but not a bijection.
In accordance with the invention, now that we know we can enforce this constraint, we are able to use it as a form of error correction in the disparity solution. In an exemplary practice of the invention, a new stage would be inserted in the feature correspondence pipeline (either after the FDDE calculation but before histogram voting, or perhaps after histogram voting) that checks for violations of this constraint. By maintaining a reference count for each pixel in the co-domain and checking to ensure that the reference count never exceeds 1, we can determine that a violation exists. (See, e.g., FIG. 45, which shows an example of a detected violation of the injective constraint.)
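A minimal sketch of this violation check for one scanline of a rectified pair follows: each domain pixel claims the co-domain pixel it maps to via its disparity, and any co-domain pixel claimed more than once marks a violation. Resolution of the violation is left to the policies enumerated below; the sign convention for the disparity shift is an assumption of this sketch.

```python
import numpy as np

def injective_violations(disparity_row):
    """Given one row of left-referenced integer disparities on a rectified
    stereo axis, return a boolean mask of domain pixels whose target in the
    co-domain image is claimed by more than one domain pixel."""
    disparity_row = np.asarray(disparity_row)
    width = len(disparity_row)
    targets = np.arange(width) - disparity_row       # co-domain column hit by each pixel
    valid = (targets >= 0) & (targets < width)
    ref_count = np.zeros(width, dtype=np.int32)
    np.add.at(ref_count, targets[valid], 1)          # references per co-domain pixel
    violations = np.zeros(width, dtype=bool)
    violations[valid] = ref_count[targets[valid]] > 1
    return violations
```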
In accordance with the invention, if such a violation is detected, there are several ways of addressing it. These approaches have different performance levels, implementation complexity and memory overheads that will suggest which are appropriate in a given situation. They include the following:
1. First come, first served: The first element in the domain to claim an element in the co-domain gets priority. If a second element claims the same co-domain element, we invalidate that match and mark it as "invalid". Invalid disparities would be skipped over or interpolated across later in the pipeline.
2. Best match wins: The actual image matching error or histogram vote counts are compared between the two possible candidate elements in the domain against the contested element in the co-domain. The one with the best match wins.
3. Smallest disparity wins: During image reconstruction, errors caused by too small a disparity are typically less noticeable than errors with too high a disparity. As such, if there is a contest for a given co-domain element, select the one with the smallest disparity and invalidate the others.
4. Seek alternative candidates: Since each disparity value is the result of selecting a minimum in the image comparison error function, or a histogram peak vote count, there may be alternative possible matches which did not score as well. As such, if there is a contest for a given co-domain element, select the 2nd or 3rd best candidate, in that order. This approach may need to iterate several times in order to ensure that all violations are eliminated across the entire domain. If, after a given number of fall-back attempts, violations remain, the disparity value could be set to "invalid" as described in (1). This attempt threshold would be a tradeoff between finding the ideal solution and computation time. The concept of alternative match candidates is illustrated, by way of example, in FIG. 46, which shows a graph of an exemplary image comparison error function. As shown therein, in addition to the global minimum error point there are other error minima that could act as alternative match candidates.
THE REPRESENTATION STAGE
Disparity and sample buffer index at 2D control points
An exemplary practice of the invention involves the use of a disparity value and a sample buffer index at 2D control points. This aspect works by defining a data structure representing a 2D coordinate in image space and containing a disparity value, which is treated as a "pixel velocity" in screen space with respect to a given movement of the view vector.
With a strong disparity solution, that single scalar value can be modulated with a movement vector to slide around a pixel in the source image in any direction in 2D, and it will produce a credible reconstruction of 3D image movement as if it had been taken from that different location.
In addition, the control points can contain a sample buffer index that indicates which of the camera streams to take the samples from. For example, a given feature may be visible in only one of the cameras, in which case we will want to change the source that the samples are taken from when reconstructing the final reconstructed image.
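A control point of this kind might be represented roughly as follows; the field names and the helper function are illustrative assumptions, not structures defined by the specification.

```python
from dataclasses import dataclass

@dataclass
class ControlPoint:
    """A 2D control point used by the representation stage."""
    x: float                  # 2D image-space coordinate
    y: float
    disparity: float          # scalar "pixel velocity" with respect to view motion
    sample_buffer_index: int  # which camera stream to sample this feature from

def reprojected_position(cp: ControlPoint, view_dx: float, view_dy: float):
    """Slide the control point along the 2D view-movement vector at a rate
    proportional to its disparity."""
    return (cp.x + cp.disparity * view_dx, cp.y + cp.disparity * view_dy)
```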
Not every pixel must have a control point, since the movement of most pixels can be approximated by interpolating the movement of key surrounding pixels. As such, there are several methods that can be used to establish when a pixel should be given a control point. Given that the control points are used to denote an important depth change, the control points should typically be placed along edges in the image, since edges often correspond to depth changes.
Computing edges is a known technique already present in commercially available camera pipelines and image processing. Most conventional approaches are based on the use of image convolution kernels such as the Sobel filter, and its more complex variants and derivatives. These work by taking the first derivative of the image intensity to produce a gradient field indicating the rate of change of image intensity surrounding each pixel. From this a second derivative can be taken, thus locating the peaks of image intensity change and thus the edges as would be perceived by the human vision system.
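A compact sketch of such an edge measure, using the Sobel operator to approximate the first derivative of image intensity; control points would then be placed where the gradient magnitude peaks. The threshold is an arbitrary choice for this sketch.

```python
import numpy as np
from scipy.ndimage import sobel

def edge_mask(gray_image, threshold=50.0):
    """Gradient magnitude via Sobel filters; True where an edge is likely and
    a control point is therefore a good candidate."""
    img = gray_image.astype(np.float32)
    gx = sobel(img, axis=1)   # horizontal intensity gradient
    gy = sobel(img, axis=0)   # vertical intensity gradient
    return np.hypot(gx, gy) > threshold
```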
Extraction of Unique Samples for Streaming Bandwidth Reduction
This aspect of the invention is based on the observation that many of the samples in the multiple camera streams are of the same feature and are thus redundant. With a valid disparity estimate, it can be calculated whether a feature is redundant or is a unique feature from a specific camera, and features/samples can be flagged with a reference count of how many of the views "reference" that feature.

Compression Method for Streaming with Video
Using the reference count established above, a system in accordance with the invention can choose to encode and transmit samples exactly once. For example, if the system is capturing 4 camera streams to produce the disparity and control points, and has produced reference counts, the system will be able to determine whether a pixel is repeated in all the camera views, or only visible in one. As such, the system need only transmit to the encoder the chunk of pixels from each camera that are actually unique. This allows for a bandwidth reduction in a video streaming session.
HEAD TRACKING
Tracking to control modulation of disparity values
Using conventional head tracking methods, a system in accordance with the invention can establish an estimate of the viewer's head or eye location and/or orientation. With this information and the disparity values acquired from feature correspondence or within the transmitted control point stream, the system can slide the pixels along the head movement vector at a rate that is proportional to the disparity. As such, the disparity forms the radius of a "sphere" of motion for a given feature.
This aspect allows a 3D reconstruction to be performed simply by warping a 2D image, provided the control points are positioned along important feature edges and have a sufficiently high quality disparity estimate. In accordance with this method of the invention, no 3D geometry in the form of polygons or higher-order surfaces is required.
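The per-pixel version of this warp can be sketched as a simple forward warp, in which every pixel slides along the head-movement vector proportionally to its disparity. The head-movement vector is assumed here to already be expressed in image-space units; hole filling and overlap resolution are deliberately omitted.

```python
import numpy as np

def warp_by_disparity(image, disparity, head_dx, head_dy):
    """Reconstruct an approximate novel view by sliding each pixel along the
    viewer's head-movement vector at a rate proportional to its disparity.
    Simple forward warp; holes and overlaps are ignored in this sketch."""
    h, w = disparity.shape
    ys, xs = np.indices((h, w))
    new_x = np.clip(np.round(xs + disparity * head_dx), 0, w - 1).astype(np.int64)
    new_y = np.clip(np.round(ys + disparity * head_dy), 0, h - 1).astype(np.int64)
    out = np.zeros_like(image)
    out[new_y, new_x] = image[ys, xs]
    return out
```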
Tracking to control position of 2D crop box location and scale in reconstruction
In order to create the appearance of an invisible device display, the system of the invention must not only re-project the view from a different view origin, but must also account for the fact that as the viewer moves his or her head, they only see an aperture into the virtual scene defined by the perimeter of the device display. In accordance with a practice of the invention, a shortcut to estimate this behavior is to reconstruct the synthetic view based on the view origin and then crop the 2D image and scale it up to fill the view window before presentation, the minima and maxima of the crop box being defined as a function of the viewer head location with respect to the display and the display dimensions.
Hybrid Markerless Head Tracking
An exemplary practice of the V3D invention contains a hybrid 2D/3D head detection component that combines a fast 2D head detector with the 3D disparity data from the multi-view solver to obtain an accurate viewpoint position in 3D space relative to the camera system.
FIGS. 47A-B provide a flow diagram that illustrates the operation of the hybrid markerless head tracking system. As shown in FIGS. 47A-B, starting with an image captured by one of the color cameras, the system optionally converts to luminance and downsamples the image, and then passes it to a basic 2D facial feature detector. The 2D feature detector uses the image to extract an estimate of the head and eye position as well as the face's rotation angle relative to the image plane. These extracted 2D feature positions are extremely noisy from frame to frame and, if taken alone as a 3D viewpoint, would not be sufficiently stable for the intended purposes of the invention. Accordingly, the 2D feature detection is used as a starting estimate of the head position.
The system uses this 2D feature estimate to extract 3D points from the disparity data that exists in the same coordinate system as the original 2D image. The system first determines an average depth for the face by extracting 3D points via the disparity data for a small area located in the center of the face. This average depth is used to determine a reasonable valid depth range that would encompass the entire head.
Using the estimated center of the face, the face's rotation angle, and the depth range, the system then performs a 2D ray march to determine a best-fit rectangle that includes the head. For both the horizontal and vertical axes, the system calculates multiple vectors that are perpendicular to the axis but spaced at different intervals. For each of these vectors, the system tests the 3D points, starting from outside the head and working towards the inside, toward the horizontal or vertical axis. When a 3D point is encountered that falls within the previously designated valid depth range, the system considers that a valid extent of the head rectangle.
From each of these ray marches along each axis, the system can determine a best-fit rectangle for the head, from which the system then extracts all 3D points that lie within this best-fit rectangle and calculates a weighted average. If the number of valid 3D points extracted from this region passes a threshold in relation to the maximum number of possible 3D points in the region, then there is designated a valid 3D head position result.
FIG. 48 is a diagram depicting this technique for calculating the disparity extraction two-dimensional rectangle (i.e., the "best-fit rectangle").
To compensate for noise in the 3D position, the system can interpolate from frame to frame based on the time delta that has passed since the previous frame.
RECONSTRUCTION
2D warping reconstruction of specific view from samples and control points
This method of the invention works by taking one or more source images and a set of control points as described previously. The control points denote "handles" on the image which we can then move around in 2D space, interpolating the pixels in between. The system can therefore slide the control points around in 2D image space proportionally to their disparity value and create the appearance of an image taken from a different 3D perspective. The following are details of how the interpolation can be accomplished in accordance with exemplary practices of the invention.
Lines
This implementation of 2D warping uses the line drawing hardware and texture filtering available on modern GPU hardware, such as in a conventional smartphone or other mobile device. It has the advantages of being easy to implement, fast to calculate, and avoiding the need to construct complex connectivity meshes between the control points in multiple dimensions. It works by first rotating the source images and control point coordinates such that the rows or columns of pixels are parallel to the vector between the original image center and the new view vector. For purposes of this explanation, assume the view vector is aligned to image scanlines. Next, the system iterates through each scanline and goes through all the control points for that scanline. The system draws a line beginning and ending at each control point in 2D image space, but adds the disparity multiplied by the view vector magnitude to the x coordinate. The system assigns a texture coordinate to the beginning and end points that is equal to their original 2D location in the source image.
The GPU will draw the line and will interpolate the texture coordinates linearly along the line. As such, image data between the control points will be stretched linearly. Provided control points are placed on edge features, the interpolation will not be visually obvious.
After the system has drawn all the lines, the result is a re-projected image, which is then rotated back by the inverse of the rotation originally applied to align the view vector with the scanlines.
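A rough CPU-side sketch of the per-scanline interpolation that the GPU line drawing performs is given below, assuming the view vector has already been aligned with the scanlines and that control points carry texture coordinates equal to their original positions; hole handling, sliding behavior, and the final back-rotation are omitted.

```python
import numpy as np

def warp_scanline(src_row, control_xs, control_disps, view_magnitude):
    """Warp one scanline: each control point moves by disparity * view
    magnitude along x, and the pixels between consecutive control points are
    stretched linearly, mimicking GPU line drawing with linearly interpolated
    texture coordinates."""
    src_row = np.asarray(src_row, dtype=np.float32)
    w = len(src_row)
    out = src_row.copy()
    order = np.argsort(control_xs)
    xs = np.asarray(control_xs, dtype=np.float32)[order]
    ds = np.asarray(control_disps, dtype=np.float32)[order]
    dest = xs + ds * view_magnitude                  # where each control point lands
    for i in range(len(xs) - 1):
        x0, x1 = dest[i], dest[i + 1]
        t0, t1 = xs[i], xs[i + 1]                    # texture coords = original positions
        if x1 <= x0:
            continue
        span = np.arange(int(np.ceil(x0)), int(np.floor(x1)) + 1)
        span = span[(span >= 0) & (span < w)]
        if len(span) == 0:
            continue
        t = t0 + (span - x0) * (t1 - t0) / (x1 - x0)
        out[span] = np.interp(t, np.arange(w), src_row)
    return out
```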
Polygons
This approach is related to the lines approach, but works by linking control points not only along a scanline but also between scanlines. In certain cases, this may provide a higher quality interpolation than lines alone.
Stretch / Slide
This is an extension of the control points data structure, and affects the way the reconstruction interpolation is performed. It helps to improve the reconstruction quality in regions of large disparity/depth change. In such regions, for example on the boundary of a foreground and background object, it is not always ideal to interpolate pixels between control points, but rather to slide the foreground and background independently of each other. This will open up a void in the image, but the void gets filled with samples from another camera view.
The determination of when it is appropriate to slide versus the default stretching behavior can be made by analyzing the disparity histogram and checking for multi-modal behavior. If two strong modes are present, this indicates the control point is on a boundary where it would be better to allow the foreground and background to move independently rather than interpolating depth between them.
Other practices of the invention can include a 2D crop based on head location (see the discussion above relating to head tracking), and rectification transforms for texture coordinates. Those skilled in the art will understand that the invention can be practiced in connection with conventional 2D displays, or various forms of head-mounted stereo displays (HMDs), which may include binocular headsets or lenticular displays.
Digital Processing Environment In Which Invention Can Be Implemented
Those skilled in the art will understand that the above-described embodiments, practices and examples of the invention can be implemented using known network, computer processor and telecommunications devices, in which the telecommunications devices can include known forms of cellphones, smartphones, and other known forms of mobile devices, tablet computers, desktop and laptop computers, and known forms of digital network components and server/cloud/network/client architectures that enable communications between such devices.
Those skilled in the art will also understand that method aspects of the present invention can be executed in commercially available digital processing systems, such as servers, PCs, laptop computers, tablet computers, cellphones, smartphones and other forms of mobile devices, as well as known forms of digital networks, including architectures comprising server, cloud, network, and client aspects, for communications between such devices.
The terms "computer software," "computer code product," and "computer program product" as used herein can encompass any set of computer-readable programs instructions encoded on a non- transitory computer readable medium. A computer readable medium can encompass any form of computer readable element, including, but not limited to, a computer hard disk, computer floppy disk, computer-readable flash drive, computer-readable RAM or ROM element or any other known means of encoding, storing or pro viding digital information, whether local to or remote from the cellphone, smaxtphone, tablet computer, PC, laptop, computer-driven television, or oilier digital processing device or system. Various forms of computer readable elements and media are well known in the computing arts, and their selection is left to the implementer.
In addition, those skilled in the art will understand that the invention can be implemented using computer program modules and digital processing hardware elements, including memory units and other data storage units, and including commercially available processing units, memory units, computers, servers, smartphones and other computing and telecommunications devices. The terms "modules", "program modules", "components", and the like include computer program instructions, objects, components, data structures, and the like that can be executed to perform selected tasks or achieve selected outcomes. The various modules shown in the drawings and discussed in the description herein refer to computer-based or digital processor-based elements that can be implemented as software, hardware, firmware and/or other suitable components, taken separately or in combination, that provide the functions described herein, and which may be read from computer storage or memory, loaded into the memory of a digital processor or set of digital processors, connected via a bus, a communications network, or other communications pathways, which, taken together, constitute an embodiment of the present invention.
The terms "dat storage module", "data storage element", "memory element" and the like, as used herein, can refer to any appropriate memory element usable for storing program instructions, machine readable files, databases, and other data structures. The various digital processing, memory and storage elements described herein can be implemented to operate on a single computing device or system, such as a server or collection, of servers, or they can be implemented and inter-operated on various devices across a network, whether in a server-client arrangement, server-cloud-cl ient arrangement, or other configuration in which client devices can communicate with allocated resources, functions or applications programs, or with a server, via a communications network. It will also be understood that computer program instnretions suitable for a practice of the present invention can be written in any of a wide range of computer programming languages, incl uding Java, C' -f- , and the like. It will also be understood that method operations shown in the flowcharts can b executed in different orders, and that not all operations shown need be executed, and that many other combinations of method operations are wi thin the scope of the invention as defined by the attached claims. Moreover, the functions provided by the modules and dements shown in the drawings and described in the foregoing description can be combined or sub-divided in rious ways, and still be within the scope of the invention as defined by the attached claims.
The Applicant has implemented various aspects and exemplary practices of the present invention, using, among others, the following commercially available elements:
1. A 7" 1280x800 IPS display.
2. Three PointGrey Chameleon3 (CM3-U3-13S2C-CS) 1.3 Megapixel camera modules with 1/3" sensor size assembled on a polycarbonate plate with shutter synchronization circuit.
3. Sunex DSL377A-650-F/2.8 M12 wide-angle lenses.
4. An Intel Core i7-4650U processor which includes on-chip the following:
a. An Intel HD Graphics 5000 Integrated Graphics Processing Unit; and b. An Intel QuickSync video encode and decode hardware pipeline.
5. OpenCL API on an Apple Mac OS X operating system to implement, in accordance with exemplary practices of the invention described herein, Image Rectification, Fast Dense Disparity Estimate(s) (FDDE) and Multi-level Disparity Histogram aspects.
6. Apple Core Video and VideoToolbox APIs to access QuickSync video compression hardware.
7. OpenCL and OpenGL API(s) for V3D view reconstruction in accordance with exemplary practices of the invention described herein.
The attached schematic diagrams FIGS. 49-54 depict system aspects of the invention, including digital processing devices and architectures in which the invention can be implemented. By way of example, FIG. 49 depicts a digital processing device, such as a commercially available smartphone, in which the invention can be implemented; FIG. 50 shows a full-duplex, bi-directional practice of the invention between two users and their corresponding devices; FIG. 51 shows the use of a system in accordance with the invention to enable a first user to view a remote scene; FIG. 52 shows a one-to-many configuration in which multiple users (e.g., audience members) can view either simultaneously or asynchronously using a variety of different viewing elements in accordance with the invention; FIG. 53 shows an embodiment of the invention in connection with generating an image data stream for the control system of an autonomous or self-driving vehicle; and FIG. 54 shows the use of a head-mounted display (HMD) in connection with the invention, either in a pass-through mode to view an actual, external scene (shown on the right side of FIG. 54), or to view prerecorded image content.
Referring now to FIG. 49, the commercially available smartphone, tablet computer or other digital processing device 492 communicates with a conventional digital communications network 494 via a communications pathway 495 of known form (the collective combination of device 492, network 494 and communications pathway(s) 495 forming configuration 490), and the device 492 includes one or more digital processors 496, cameras 4910 and 4912, digital memory or storage element(s) 4914 containing, among other items, digital processor-readable and processor-executable computer program instructions (programs) 4916, and a display element 498. In accordance with known digital processing techniques, the processor 496 can execute programs 4916 to carry out various operations, including operations in accordance with the present invention.
Referring now to FIG. 50, the full-duplex, bi-directional practice of the invention between two users and their corresponding devices (collectively forming configuration 500) includes first user and scene 503, second user and scene 505, smartphones, tablet computers or other digital processing devices 502, 504, network 506 and communications pathways 508, 5010. The devices 502, 504 respectively include cameras 5012, 5014, 5022, 5024, displays 5016, 5026, processors 5018, 5028, and digital memory or storage elements 5020, 5030 (which may store processor-executable computer program instructions, and which may be separate from the processors).
The configuration 510 of FIG. 51, in accordance with the invention, for enabling a first user 514 to view a remote scene 515 containing objects 5022, includes smartphone or other digital processing device 5038, which can contain cameras 5030, 5032, a display 5034, one or more processor(s) 5036 and storage 5038 (which can contain computer program instructions and which can be separate from processor 5036). Configuration 510 also includes network 5024, communications pathways 5026, 5028, remote cameras 516, 518 with a view of the remote scene 515, processor(s) 5020, and digital memory or storage element(s) 5040 (which can contain computer program instructions, and which can be separate from processor 5020).
The one-to-many configuration 520 of FIG. 52, in which multiple users (e.g., audience members) using smartphones, tablet computers or other devices 526.1, 526.2, 526.3 can view a remote scene or remote first user 522, either simultaneously or asynchronously, in accordance with the invention, includes digital processing device 524, network 5212 and communications pathways 5214, 5216.1, 5216.2, 5216.3. The smartphone or other digital processing device 524 used to capture images of the remote scene or first user 522, and the smartphones or other digital processing devices 526.1, 526.2, 526.3 used by respective viewers/audience members, include respective cameras, digital processors, digital memory or storage element(s) (which may store computer program instructions executable by the respective processor, and which may be separate from the processor), and displays.
The embodiment or configuration 530 of the invention, illustrated in FIG. 53, for generating an image data stream for the control system 5312 of an autonomous or self-driving vehicle 532, can include camera(s) 5310 having a view of scene 534 containing objects 536, processor(s) 538 (which may include or have in communication therewith digital memory or storage elements for storing data and/or processor-executable computer program instructions) in communication with vehicle control system 5312. The vehicle control system 5312 may also include digital storage or memory element(s) 5314, which may include executable program instructions, and which may be separate from vehicle control system 5312.
HMD-related embodiment or configuration 540 of the invention, illustrated in FIG. 54, can include the use of a head-mounted display (HMD) 542 in connection with the invention, either in a pass-through mode to view an actual, external scene 544 containing objects 545 (shown on the right side of FIG. 54), or to view prerecorded image content or data representation 5410. The HMD 542, which can be a purpose-built HMD or an adaptation of a smartphone or other digital processing device, can be in communication with an external processor 546, external digital memory or storage element(s) 548 that can contain computer program instructions 549, and/or in communication with a source of prerecorded content or data representation 5410. The HMD 542 shown in FIG. 54 includes cameras 5414 and 5416 which can have a view of actual scene 544; left and right displays 5418 and 5420 for respectively displaying to a user's left and right eyes 5424 and 5426; digital processor(s) 5412, and a head/eye/face tracking element 5422. The tracking element 5422 can consist of a combination of hardware and software elements and algorithms, described in greater detail elsewhere herein, in accordance with the present invention. The processor element(s) 5412 of the HMD can also contain, or have proximate thereto, digital memory or storage elements, which may store processor-executable computer program instructions.
In each of these examples, illustrated in FIGS. 49-54, digital memory or storage elements can contain digital processor-executable computer program instructions, which, when executed by a digital processor, cause the processor to execute operations in accordance with various aspects of the present invention.
Flowcharts Of Exemplary Practices Of The Invention
FIGS. 55-80 are flowcharts illustrating method aspects and exemplary practices of the invention. The methods depicted in these flowcharts are examples only; the organization, order and number of operations in the exemplary practices can be varied; and the exemplary practices and methods can be arranged or ordered differently, and include different functions, whether singly or in combination, while still being within the spirit and scope of the present invention. Items described below in parentheses are, among other aspects, optional in a given practice of the invention. FIG. 55 is a flowchart of a V3D method 550 according to an exemplary practice of the invention, including the following operations:
551 : Capture images of second user;
552: Execute image rectification;
553: Execute feature correspondence, by detecting common features;
554: Generate data representation;
555: Reconstruct synthetic view of second user based on representation;
556: Use head tracking as input to reconstruction;
557: Estimate location of user's head/eyes;
558: Display synthetic view to first user on display screen used by first user; and
559: Execute capturing, generating, reconstructing, and displaying such that the first user can have direct, virtual eye contact with second user through first user's display screen, by reconstructing and displaying synthetic view of second user in which second user appears to be gazing directly at first user even if no camera has direct eye contact gaze vector to second user;
(Execute such that first user is provided visual impression of looking through display screen as a physical window to second user and visual scene surrounding second user, and first user is provided immersive visual experience of second user and scene surrounding the second user);
(Camera shake effects are inherently eliminated, in that capturing, detecting, generating, reconstructing and displaying are executed such that first user has virtual direct view through his display screen to second user and visual scene surrounding second user; and scale and perspective of image of second user and objects in visual scene surrounding second user are accurately represented to first user regardless of user view distance and angle).
FIG. 56 is a flowchart of another V3D method 560 according to an exemplary practice of the invention, including the following operations:
561: Capture images of remote scene; 562: Execute image rectification;
563: Execute feature correspondence function by detecting common features and measuring relative distance in image space between common features, to generate disparity values;
564: Generate data representation, representative of captured images and corresponding disparity values; 565: Reconstruct synthetic view of the remote scene, based on representation;
566: Use head tracking as input to reconstruction;
567: Display synthetic view to first user (on display screen used by first user);
(Estimate location of user's head/eyes); 568: Execute capturing, detecting, generating, reconstructing, and displaying such that user is provided visual impression of looking through display screen as physical window to remote scene, and user is provided an immersive visual experience of remote scene);
(Camera shake effects are inherently eliminated, in that capturing, detecting, generating, reconstructing and displaying are executed such that first user has virtual direct view through his display screen to remote visual scene; and scale and perspective of image of and objects in remote visual scene are accurately represented regardless of view distance and angle).
FIG. 57 is a flowchart of a self-portraiture V3D method 570 according to an exemplary practice of the invention, including the following operations: 571: Capture images of user during setup time (use camera provided on or around periphery of display screen of user's handheld device with view of user's face during self-portrait setup time);
572: Generate tracking information (by estimating location of user's head or eyes relative to handheld device during setup time);
573: Generate data representation representative of captured images; 574: Reconstruct synthetic view of user, based on the generated data representation and generated tracking information;
575: Display to user the synthetic view of user (on the display screen during the setup time) (thereby enabling user, while setting up self-portrait, to selectively orient or position his gaze or head, or handheld device and its camera, with real-time visual feedback); 576: Execute capturing, estimating, generating, reconstructing and displaying such that, in self-portrait, user can appear to be looking directly into camera, even if camera does not have direct eye contact gaze vector to user.
FIG. 58 is a flowchart of a photo composition V3D method 580 according to an exemplary practice of the invention, including the following operations: 581: At photograph setup time, capture images of scene to be photographed (use camera provided on a side of user's handheld device opposite display screen side of user's device);
582: Generate tracking information (by estimating location of user's head or eyes relative to handheld device during setup time) (wherein estimating a location of the user head or eyes relative to handheld device uses at least one camera on display side of handheld device, having a view of user's head or eyes during photograph setup time);
583: Generate data representation representative of captured images;
584: Reconstruct synthetic view of scene, based on generated data representation and generated tracking information (synthetic view reconstructed such that scale and perspective of synthetic view have selected correspondence to user's viewpoint relative to handheld device and scene);
585: Display to user the synthetic view of the scene (on display screen during setup time) (thereby enabling user, while setting up photograph, to frame scene to be photographed, with selected scale and perspective within display frame, with real-time visual feedback) (wherein user can control scale and perspective of synthetic view by changing position of handheld device relative to position of user's head).
FIG. 59 is a flowchart of an HMD-related V3D method 590 according to an exemplary practice of the invention, including the following operations:
591 : Capture or generate at least two image streams;
(using at least one camera attached or mounted on or proximate to external portion or surface of HMD);
(wherein captured image streams contain images of a scene); (wherein at least one camera is panoramic, night-vision, or thermal imaging camera); (at least one IR TOF camera or imaging device that directly provides depth); 592: Execute feature correspondence function;
593: Generate data representation representative of captured images contained in the captured image streams;
(Representation can also be representative of disparity values or depth information); 594: Reconstruct two synthetic views, based on representation;
(use motion vector to modify respective view origins, during reconstructing, so as to produce intermediate image frames to be interposed between captured image frames in the captured image streams and interpose the intermediate image frames between the captured image frames so as to reduce apparent latency);
595: Display synthetic views to the user, via HMD;
596: (Track location/position of user's head/eyes to generate motion vector usable in reconstructing synthetic views);
597: Execute reconstructing and displaying such that each of the synthetic views has respective view origin corresponding to respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of user's left and right eyes, so as to provide user with substantially natural visual experience of perspective, binocular stereo and occlusion aspects of the scene, substantially as if user were directly viewing scene without an HMD.
FIG. 60 is a flowchart of another HMD-related V3D method 600 according to an exemplary practice of the invention, including the following operations:
601: Capture or generate at least two image streams; (using at least one camera);
(wherein captured image streams can contain images of a scene); (wherein captured image streams can be pre-recorded image content); (wherein at least one camera is panoramic, night-vision, or thermal imaging); (wherein at least one IR TOF camera directly provides depth); 602: Execute feature correspondence function;
603: Generate data representation representative of captured images contained in captured image streams;
(representation can also be representati ve of disparity values or depth information);
(data representation can be pre-recorded):
604: Reconstruct two synthetic views, based on representation;
(use motion vector to modify respective view origins, during reconstructing, so as to produce intermediate image frames to be interposed between captured image frames in the captured image streams and interpose the intermediate image frames between the captured image frames so as to reduce apparent latency); (track location/position of user's head/eyes to generate motion vector usable in reconstructing synthetic views);
605: Display synthetic views to the user, via HMD;
606: Execute reconstructing and displaying such that each of the synthetic views has respective view origin corresponding to respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of user's left and right eyes, so as to provide user with substantially natural visual experience of perspective, binocular stereo and occlusion aspects of the scene, substantially as if user were directly viewing scene without an HMD.
FIG. 61 is a flowchart of a vehicle control system-related method 610 according to an exemplary practice of the invention, including the following operations:
611: Capture images of scene around at least a portion of vehicle (using at least one camera having a view of scene);
612: (Execute image rectification); 613: Execute feature correspondence function;
(by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between common features, to generate disparity values);
(detect common features between images captured by single camera over time);
(detect common features between corresponding images captured by two or more cameras);
614: Calculate corresponding depth information based on disparity values;
(or obtain depth information using IR TOF camera); 615: Generate from the images and corresponding depth information an image data stream for use by the vehicle control system.
FIG. 62 is a flowchart of another V3D method 620 according to an exemplary practice of the invention, which can utilize a view vector rotated camera configuration and/or a number of the following operations:
621: Execute image capture; 623: (of other user and scene surrounding other user); 624: (of remote scene);
625: (Use single camera (and detect common features between images captured over time));
626: (Use at least one color camera);
627: (Use at least one infrared structured light emitter);
628: (Use at least one camera which is an infra-red time-of-flight camera that directly provides depth information);
629: (Use at least two cameras (and detect common features between corresponding images captured by respective cameras));
6210: (Camera(s) for capturing images of the second user are located at or near the periphery or edges of a display device used by second user, display device used by second user having display screen viewable by second user and having a geometric center; synthetic view of second user corresponds to selected virtual camera location, selected virtual camera location corresponding to point at or proximate the geometric center);
6211: (Use a view vector rotated camera configuration in which the locations of first and second cameras define a line; rotate the line defined by first and second camera locations by a selected amount from selected horizontal or vertical axis to increase number of valid feature correspondences identified in typical real-world settings by feature correspondence function) (first and second cameras positioned relative to each other along epipolar lines);
6212: (Subsequent to capturing of images, rotate disparity values back to selected horizontal or vertical orientation along with captured images);
6213: (Subsequent to reconstructing of synthetic view, rotate synthetic view back to selected horizontal or vertical orientation);
6214: (Capture using exposure cycling);
6215: (Use at least three cameras arranged in substantially L-shaped configuration, such that pair of cameras is presented along first axis and second pair of cameras is presented along second axis substantially perpendicular to first axis).
FIG. 63 is a flowchart of an exposure cycling method 630 according to an exemplary practice of the invention (an illustrative sketch follows this flowchart description), including the following operations: 631: Dynamically adjust exposure of camera(s) on frame-by-frame basis to improve disparity estimation in regions outside exposed region: take series of exposures, including exposures lighter than and exposures darker than a visibility-optimal exposure; calculate disparity values for each exposure; and integrate disparity values into an overall disparity solution over time, to improve disparity estimation;
632: The overall disparity solution includes a disparity histogram into which disparity values are
integrated, the disparity histogram being converged over time, so as to improve disparity estimation.
633: (Analyze quality of overall disparity solution on respective dark, mid-range and light pixels to generate variance information used to control exposure settings of the camera(s), thereby to form a closed loop between quality of the disparity estimate and set of exposures requested from camera(s));
634: (Overall disparity solution includes disparity histogram; analyze variance of disparity histograms on respective dark, mid-range and light pixels to generate variance information used to control exposure settings of camera(s), thereby to form a closed loop between quality of disparity estimate and set of exposures requested from camera(s)).
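The following is a minimal sketch of the exposure-cycling idea: request a repeating cycle of darker, nominal and lighter exposures, compute disparity votes for each captured pair, and integrate the votes into a single per-pixel disparity histogram that converges over time. The `camera.capture_pair(exposure_ev=...)` call, the `estimate_disparity_votes` callable and the exposure offsets are hypothetical placeholders, not APIs from the source.

```python
import numpy as np

def exposure_cycle(camera, estimate_disparity_votes, n_bins, shape,
                   ev_offsets=(-1.0, 0.0, +1.0), n_frames=30):
    """Accumulate disparity votes across a cycle of exposures.

    shape  : (H, W) of the disparity solution
    n_bins : number of disparity ranges in the histogram
    """
    hist = np.zeros(shape + (n_bins,), dtype=np.uint32)
    for frame in range(n_frames):
        ev = ev_offsets[frame % len(ev_offsets)]          # cycle dark/mid/light
        left, right = camera.capture_pair(exposure_ev=ev)  # hypothetical camera API
        votes = estimate_disparity_votes(left, right, n_bins)  # (H, W, n_bins)
        hist += votes                                      # converge histogram over time
    return hist.argmax(axis=-1)    # highest-vote disparity range per pixel
```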
FIG. 64 is a flowchart of an image rectification method 640 according to an exemplary practice of the invention, including the following operations:
641: Execute image rectification;
642: (to compensate for optical distortion of each camera and relative misalignment of the cameras);
643: (executing image rectification includes applying 2D image space transform);
644: (applying 2D image space transform includes using GPGPU processor running shader program).
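The sketch below shows a conventional CPU-side way to apply such a 2D image-space rectification transform using OpenCV; the implementation described above runs the equivalent per-pixel mapping in a GPGPU shader. The camera matrices, distortion coefficients and stereo extrinsics are assumed to come from a prior calibration step; this is an illustration of the mapping, not the source's implementation.

```python
import cv2

def rectify_pair(img_l, img_r, K_l, D_l, K_r, D_r, R, T):
    """Undistort and rectify a stereo pair given calibration data."""
    h, w = img_l.shape[:2]
    # Rectification rotations and projection matrices for the two cameras.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K_l, D_l, K_r, D_r, (w, h), R, T)
    # Per-pixel lookup maps implementing the combined undistort + rectify warp.
    map1_l, map2_l = cv2.initUndistortRectifyMap(K_l, D_l, R1, P1, (w, h), cv2.CV_32FC1)
    map1_r, map2_r = cv2.initUndistortRectifyMap(K_r, D_r, R2, P2, (w, h), cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1_l, map2_l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1_r, map2_r, cv2.INTER_LINEAR)
    return rect_l, rect_r
```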
FIGS. 65A-B show a flowchart of a feature correspondence method 650 according to an exemplary practice of the invention, which can include a number of the following operations:
651 : Detect common features between corresponding images captured by the respective cameras;
652: (Detect common features between images captured by single camera over time; measure relative distance in image space between common features, to generate disparity values);
653: (Evaluate and combine vertical- and horizontal-axis correspondence information);
654: (Apply, to image pixels containing disparity solution, a coordinate transformation to a unified coordinate system (un-rectified coordinate system of the captured images));
655: Use a disparity histogram-based method of integrating data and determining correspondence: constructing disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel;
656: (Disparity histogram functions as probability density function (PDF) of disparity for given pixel, in which higher values indicate higher probability of corresponding disparity range being valid for given pixel);
657: (One axis of disparity histogram indicates given disparity range; second axis of histogram indicates number of pixels in kernel surrounding central pixel in question that are voting for given disparity range);
658: (Votes indicated by disparity histogram initially generated utilizing sum of square differences [SSD] method: executing SSD method with relatively small kernel to produce fast dense disparity map in which each pixel has selected disparity that represents lowest error; then, processing plurality of pixels to accumulate into disparity histogram a tally of number of votes for given disparity in relatively larger kernel surrounding pixel in question) (an illustrative sketch of this voting scheme follows this flowchart);
659: (Transform the disparity histogram into a cumulative distribution function (CDF) from which width of corresponding interquartile range can be determined, to establish confidence level in corresponding disparity solution);
6510: (Maintain a count of number of statistically significant modes in histogram, thereby to indicate modality);
6511: (Use modality as input to reconstruction, to control application of stretch vs. slide reconstruction method);
6512: (Maintain a disparity histogram over selected time interval, and accumulate samples into histogram, to compensate for camera noise or other sources of motion or error);
6513: (Generate fast disparity estimates for multiple independent axes, then combine corresponding, respective disparity histograms to produce statistically more robust disparity solution);
6514: (Evaluate interquartile (IQ) range of CDF of given disparity histogram to produce IQ result; if IQ result is indicative of area of poor sampling signal to noise ratio, due to camera over- or underexposure, then control camera exposure based on IQ result to improve poorly sampled area of given disparity histogram);
6515: (Test for only a small set of disparity values using small-kernel SSD method to generate initial results; populate corresponding disparity histogram with initial results; then use histogram votes to drive further SSD testing within given range to improve disparity resolution over time);
6516: (Extract sub-pixel disparity information from disparity histogram: where histogram indicates a maximum-vote disparity range and an adjacent, runner-up disparity range, calculate a weighted average disparity value based on ratio between number of votes for each of the adjacent disparity ranges);
6517: (The feature correspondence function includes weighting toward a center pixel in a sum-of-squared differences (SSD) approach: apply higher weight to the center pixel for which a disparity solution is sought and a lesser weight outside the center pixel, the lesser weight being proportional to distance of given kernel sample from the center);
6518: (The feature correspondence function includes optimizing generation of disparity values on GPGPU computing structures);
6519: (Refine correspondence information over time);
6520: (Retain a disparity solution over a time interval, and continue to integrate disparity solution values for each image frame over the time interval, to converge on improved disparity solution by sampling over time);
6521: (Fill unknowns in a correspondence information set with historical data obtained from previously captured images: if a given image feature is detected in an image captured by one camera, and no corresponding image feature is found in a corresponding image captured by another camera, then utilize data for a pixel corresponding to the given image feature, from a corresponding, previously captured image).
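The following is an illustrative NumPy sketch of the histogram-based correspondence steps referenced above: a small-kernel SSD produces a fast dense disparity estimate, a larger surrounding kernel votes into a per-pixel disparity histogram, and a sub-pixel value is inferred from the winning bin and its runner-up. Kernel sizes, disparity ranges and function names are assumptions for illustration, not taken from the source.

```python
import numpy as np
from scipy.ndimage import convolve

def ssd_best_disparity(left, right, max_disp, k=2):
    """Small-kernel SSD: per-pixel disparity with lowest matching error."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    best_err = np.full((h, w), np.inf)
    best_disp = np.zeros((h, w), dtype=np.int64)
    kernel = np.ones((2 * k + 1, 2 * k + 1))          # box-sum over the small kernel
    for d in range(max_disp + 1):
        shifted = np.roll(right, d, axis=1)           # aligns right[x - d] with left[x]
        err = convolve((left - shifted) ** 2, kernel, mode="nearest")
        better = err < best_err
        best_err[better] = err[better]
        best_disp[better] = d
    return best_disp

def vote_histogram(best_disp, max_disp, vote_k=4):
    """Tally votes from a larger surrounding kernel into a per-pixel histogram."""
    h, w = best_disp.shape
    hist = np.zeros((h, w, max_disp + 1), dtype=np.uint16)
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    for dy in range(-vote_k, vote_k + 1):
        for dx in range(-vote_k, vote_k + 1):
            neighbour = np.roll(np.roll(best_disp, dy, axis=0), dx, axis=1)
            np.add.at(hist, (rows, cols, neighbour), 1)
    return hist

def subpixel_disparity(hist_pixel):
    """Weighted average of the winning bin and its stronger adjacent bin."""
    d = int(np.argmax(hist_pixel))
    lo, hi = max(d - 1, 0), min(d + 1, len(hist_pixel) - 1)
    runner = lo if hist_pixel[lo] >= hist_pixel[hi] else hi
    total = float(hist_pixel[d]) + float(hist_pixel[runner])
    if total == 0:
        return float(d)
    return (d * hist_pixel[d] + runner * hist_pixel[runner]) / total
```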
FIG. 66 is a flowchart of a method 660 for generating a data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
661: Generate data structure representing 2D coordinates of control point in image space, and containing a disparity value treated as a pixel velocity in screen space with respect to a given movement of a given view vector; and utilize the disparity value in combination with movement vector to slide a pixel in a given source image in selected directions, in 2D, to enable a reconstruction of 3D image movement (an illustrative sketch of such a data structure follows this flowchart);
662: (Each camera generates a respective camera stream; and the data structure contains a sample buffer index, stored in association with control point coordinates, that indicates which camera stream to sample in association with given control point);
663: (Determine whether a given pixel should be assigned a control point);
664: (Assign control points along image edges: execute computations enabling identification of image edges); 665: (Flag a given image feature with a reference count indicating how many samples reference the given image feature, to differentiate a uniquely referenced image feature, and sample corresponding to the uniquely referenced image feature, from repeatedly referenced image features; and utilize reference count extracting unique samples, to enable reduction in bandwidth requirements);
666: (Utilize the reference count to encode and transmit a given sample exactly once, even if a pixel or image feature corresponding to the sample is repeated in multiple camera views, to enable reduction in bandwidth requirements).
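The following is a minimal sketch of the control-point data structure described in this flowchart: 2D image-space coordinates, a disparity value treated as a screen-space pixel velocity, a sample-buffer index selecting which camera stream to sample, and a reference count usable for deduplicating repeated samples. Field and method names are illustrative assumptions, not taken from the source.

```python
from dataclasses import dataclass

@dataclass
class ControlPoint:
    x: float                 # 2D image-space coordinates
    y: float
    disparity: float         # pixel velocity in screen space per unit view motion
    sample_buffer: int       # index of the camera stream to sample
    ref_count: int = 1       # how many samples reference this feature

    def slide(self, view_dx: float, view_dy: float) -> tuple:
        """Slide the point in 2D by the view-movement vector scaled by disparity."""
        return (self.x + self.disparity * view_dx,
                self.y + self.disparity * view_dy)

# Example: a control point placed on an image edge, slid for a small head movement.
cp = ControlPoint(x=120.0, y=64.0, disparity=3.5, sample_buffer=0)
new_xy = cp.slide(view_dx=0.4, view_dy=0.0)   # -> (121.4, 64.0)
```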
FIGS. 67A-B show a flowchart of an image reconstruction method 670, according to an exemplary practice of the invention, which can include a number of the following operations:
671: Reconstruct synthetic view based on data representation and tracking information; execute 3D image reconstruction by warping 2D image, using control points sliding given pixel along a head movement vector at a displacement rate proportional to disparity, based on tracking information and disparity values;
672: (wherein disparity values are acquired from feature correspondence function or control point data stream);
673: (Use tracking information to control 2D crop box: synthetic view is reconstructed based on view origin, and then cropped and scaled to fill user's display screen view window; define minima and maxima of crop box as function of user's head location with respect to display screen and dimensions of display screen view window);
674: (Execute 2D warping reconstruction of selected view based on selected control points: designate set of control points, respective control points corresponding to respective, selected pixels in a source image; slide control points in selected directions in 2D space, wherein the control points are slid proportionally to respective disparity values; interpolate data values for pixels between the selected pixels corresponding to the control points; to create a synthetic view of the image from a selected new perspective in 3D space);
675: (Rotate source image and control point coordinates so that rows or columns of image pixels are parallel to the vector between the original source image center and the new view vector defined by the selected new perspective);
676: (Rotate the source image and control point coordinates to align the view vector to image scanlines; iterate through each scanline and each control point for a given scanline, generating a line element, beginning and ending at each control point in 2D image space, with the addition of the corresponding disparity value multiplied by the corresponding view vector magnitude to the corresponding x-axis coordinate; assign a texture coordinate to the beginning and ending points of each generated line element equal to their respective, original 2D location in the source image; interpolate texture coordinates linearly along each line element to create a resulting image in which image data between the control points is linearly stretched);
677: (Rotate resulting image back by the inverse of the rotation applied to align the view vector with the scanlines);
678: (Link control points between scanlines, as well as along scanlines, to create polygon elements defined by control points, across which interpolation is executed);
679: (For a given source image, selectively slide image foreground and image background independently of each other: sliding is utilized in regions of large disparity or depth change);
6710: (Determine whether to utilize sliding: evaluate disparity histogram to detect multi-modal behavior indicating that given control point is on an image boundary for which allowing foreground and background to slide independent of each other presents better solution than interpolating depth between foreground and background; disparity histogram functions as probability density function (PDF) of disparity for a given pixel, in which higher values indicate higher probability of the corresponding disparity range being valid for the given pixel);
6711: (Use at least one sample integration function table (sift), the sift including a table of sample integration functions for one or more pixels in a desired output resolution of an image to be displayed to the user; a given sample integration function maps an input view origin vector to at least one known, weighted 2D image sample location in at least one input image buffer).
FIG. 68 is a flowchart of a display method 680, according to an exemplary practice of the invention, which can include a number of the following operations:
681: Display synthetic view to user on display screen;
682: (Display synthetic view to user on a 2D display screen: update display in real-time, based on tracking information, so that display appears to the user to be a window into a 3D scene responsive to user's head or eye location);
683: (Display synthetic view to user on binocular stereo display device);
684: (Display synthetic view to user on lenticular display that enables auto-stereoscopic viewing).
FIG. 69 is a flowchart of a method 690 according to an exemplary practice of the invention, utilizing a multi-level disparity histogram, and which can also include the following: 691: Capture images of scene, using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis;
692: Execute feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function including constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, and the constructing of a multi-level disparity histogram includes executing a fast dense disparity estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes:
692.1: Each level is assigned a level number, and each successively higher level is characterized by a lower image resolution;
692.2: (Downsampling is provided by reducing image resolution via low-pass filtering);
692.3: (Downsampling includes using a weighted summation of a kernel in level [n-1] to produce a pixel value in level [n], and the normalized kernel center position remains the same across all levels);
692.4: (For a given desired disparity solution at full image resolution, the FDDE votes for every image level are included in the disparity solution);
692.5: (Maintain in memory unit a summation stack, for executing summation operations relating to feature correspondence);
693: Generate a multi-level histogram including a set of initially independent histograms at different levels of resolution:
693.1: Each histogram bin in a given level represents votes for a disparity determined by the FDDE at that level;
693.2: Each histogram bin in a given level has an associated disparity uncertainty range, and the disparity uncertainty range represented by each histogram bin is a selected multiple wider than the disparity uncertainty range of a bin in the preceding level;
694: Apply a sub-pixel shift to the disparity values at each level during downsampling, to negate
rounding error effect: apply half pixel shift to only one of the images in a stereo pair at each level of downsampling;
694.1: Apply sub-pixel shift implemented inline, within the weights of the filter kernel utilized to implement the downsampling from level to level. 695: Execute histogram integration, including executing a recursive summation across all the FDDE levels;
695.1: During summation, modify the weighting of each level to control the amplitude of the effect of lower levels in overall voting, by applying selected weighting coefficients to selected levels;
696: Infer a sub-pixel disparity solution from the disparity histogram, by calculating a sub-pixel offset based on the number of votes for the maximum vote disparity range and the number of votes for an adjacent, runner-up disparity range.
FIG. 70 is a flowchart of a method 700 according to an exemplary practice of the invention, utilizing RUD image space and including the following operations (an illustrative sketch of this pipeline follows the flowchart):
701: Capture images of scene, using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and for each camera pair axis, execute image capture using the camera pair to generate image data;
702: Apply/execute rectification and undistorting transformations to transform the image data into RUD image space;
703: Iteratively downsample to produce multiple, successively lower resolution levels;
704: Execute FDDE calculations for each level to compile FDDE votes for each level;
705: Gather FDDE disparity range votes into a multi-level histogram;
706: Determine the highest ranked disparity range in the multi-level histogram;
707: Process the multi-level histogram disparity data to generate a final disparity result.
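The following is a schematic sketch of the multi-level pipeline of FIGS. 69-70: build successively lower-resolution (low-pass filtered) levels, run a fast dense disparity estimate at each level, and sum the per-level votes, upsampled and weighted, into one combined histogram. The `fdde_votes` callable is a placeholder for the per-level matcher; the 2x2 box filter, the bin-spreading scheme and the level weights are illustrative assumptions, not the source's implementation.

```python
import numpy as np

def downsample(img):
    """Halve resolution with a simple 2x2 box filter (low-pass + decimate)."""
    img = np.asarray(img, dtype=np.float32)
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def multilevel_histogram(left, right, fdde_votes, n_levels=4, n_bins=64,
                         level_weights=(1.0, 0.5, 0.25, 0.125)):
    """fdde_votes(l, r, bins) is expected to return an (H_level, W_level, bins) array."""
    h, w = left.shape
    assert h % 2 ** (n_levels - 1) == 0 and w % 2 ** (n_levels - 1) == 0
    levels = [(np.asarray(left, np.float32), np.asarray(right, np.float32))]
    for _ in range(n_levels - 1):
        l, r = levels[-1]
        levels.append((downsample(l), downsample(r)))

    combined = np.zeros((h, w, n_bins), dtype=np.float32)
    for n, (l, r) in enumerate(levels):
        votes = fdde_votes(l, r, n_bins >> n)     # coarser, wider bins at higher levels
        # Spread each coarse vote over the 2**n finer pixels and finer bins it
        # covers, then weight the level's contribution to the overall voting.
        votes_full = np.kron(votes, np.ones((2 ** n,) * 3, dtype=np.float32))
        combined += level_weights[n] * votes_full
    return combined.argmax(axis=-1)               # winning full-resolution disparity bin
```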
FIG. 71 is a flowchart of a method 710 according to an exemplary practice of the invention, utilizing an injective constraint aspect and including the following operations:
711: Capture images of a scene, using at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair;
712: Execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function including: generating a disparity solution based on the disparity values, and applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera, and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution, in response to violation of the injective constraint, and wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
FIG. 72 is a flowchart of a method 720 for applying an injective constraint, according to an exemplary practice of the invention, including the following operations:
721: Maintain a reference count for each pixel in the co-domain;
722: Does reference count for the pixels in the co-domain exceed '1'?
723: If the count exceeds '1':
724: Signal a violation and respond to the violation with a selected error correction approach.
FIG. 73 is a flowchart of a method 730 relating to error correction approaches based on injective constraint, according to an exemplary practice of the invention (an illustrative sketch follows these operations), including one or more of the following:
731: First-come, first-served: assign priority to the first element in the domain to claim an element in the co-domain, and if a second element in the domain claims the same co-domain element, invalidating that subsequent match and designating that subsequent match to be invalid;
732: Best match wins: compare the actual image matching error or corresponding histogram vote count between the two possible candidate elements in the domain against the contested element in the co-domain, and designate as winner the domain candidate with the best match;
733: Smallest disparity wins: if there is a contest between candidate elements in the domain for a given co-domain element, wherein each candidate element has a corresponding disparity, selecting the domain candidate with the smallest disparity and designating the others as invalid;
734: Seek alternative candidates: select and test the next best domain candidate, based on a selected criterion, and iterating the selecting and testing until the violation is eliminated or a computational time limit is reached.
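The following is an illustrative sketch of the injective constraint of FIGS. 71-73: track which domain pixel (left image) currently claims each co-domain pixel (right image), detect when a second claim arrives, and resolve the violation using the "best match wins" policy as one example. The data layout and the matching-error input are assumptions for illustration.

```python
import numpy as np

def enforce_injective(disparity, match_error, invalid=-1):
    """disparity[y, x] = d means left pixel (x, y) claims right pixel (x - d, y)."""
    h, w = disparity.shape
    owner = -np.ones((h, w), dtype=np.int64)   # which left x currently owns each right pixel
    out = disparity.copy()
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xr = x - d
            if d == invalid or xr < 0 or xr >= w:
                continue
            prev = owner[y, xr]
            if prev < 0:
                owner[y, xr] = x                        # reference count goes 0 -> 1
            elif match_error[y, x] < match_error[y, prev]:
                out[y, prev] = invalid                  # best match wins: evict previous owner
                owner[y, xr] = x
            else:
                out[y, x] = invalid                     # violation: subsequent match invalid
    return out
```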
FIG. 74 is a flowchart of a head/eye/face location estimation method 740 according to an exemplary practice of the invention, including the following operations:
741: Capture images of the second user, using at least one camera having a view of the second user's face; 742: Execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
743: Generate a data representation, representative of the captured images and the corresponding
disparity values;
744: Estimate a three-dimensional (3D) location of the first user's head, face or eyes, to generate tracking information, which can include the following:
744.1: Pass a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane;
744.2: Use an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head; 744.3: Extract from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculate therefrom a representative 3D head position;
744.4: If the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signal a valid 3D head position result;
745: Reconstruct a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, including reconstructing the synthetic view based on the generated data representation and the generated tracking information.
FIG. 75 is a flowchart of a method 750 providing further optional operations relating to the 3D location estimation shown in FIG. 74, according to an exemplary practice of the invention, including the following:
751: Determine, from the first estimate of head and eye location and face rotation angle, an estimated center-of-face position;
752: Determine an average depth value for the face by extracting three-dimensional (3D) points via the disparity values for a selected small area located around the estimated center-of-face position;
753: Utilize the average depth value to determine a depth range that is likely to encompass the entire head;
754: Utilize the estimated center-of-face position, face rotation angle, and depth range to execute a second ray march to determine a best-fit rectangle that includes the head;
755: Calculate, for both horizontal and vertical axes, vectors that are perpendicular to each respective axis but spaced at different intervals; 756: For each of the calculated vectors, test the corresponding 3D points starting from a position outside the head region and working inwards to the horizontal or vertical axis;
757: When a 3D point is encountered that falls within the determined depth range, denominate that point as a valid extent of a best-fit head rectangle;
758: From each ray march along each axis, determine a best-fit rectangle for the head, and extracting therefrom all 3D points that lie within the best-fit rectangle, and calculating therefrom a weighted average;
759: If the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, signal a valid 3D head position result.
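The following is a minimal sketch of the final steps of FIGS. 74-75: average the 3D points that fall inside the best-fit head rectangle and within the estimated depth range, and accept the result only if enough of the points are valid. The threshold, the point-cloud layout (an H x W x 3 array with NaN where disparity is unknown) and the averaging (unweighted here) are illustrative assumptions.

```python
import numpy as np

def head_position(points3d, rect, depth_range, min_valid_fraction=0.3):
    """Return a representative 3D head position, or None if too few valid points."""
    x0, y0, x1, y1 = rect                       # best-fit head rectangle in image space
    z_near, z_far = depth_range                 # depth range likely to contain the head
    patch = points3d[y0:y1, x0:x1, :].reshape(-1, 3)
    z = patch[:, 2]
    valid = ~np.isnan(z)
    valid[valid] = (z[valid] >= z_near) & (z[valid] <= z_far)
    if valid.sum() < min_valid_fraction * patch.shape[0]:
        return None                             # not enough valid points: no result
    return patch[valid].mean(axis=0)            # representative 3D head position
```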
FIG. 76 is a flowchart of optional sub-operations 760 relating to 3D location estimation, according to an exemplary practice of the invention, which can include a number of the following:
761: Downsample captured image before passing it to the 2D facial feature detector.
762: Interpolate image data from video frame to video frame, based on the time that has passed from a previous video frame to a given video frame.
763: Convert image data to luminance values.
FIG. 77 is a flowchart of a method 770 according to an exemplary practice of the invention, utilizing URUD image space and including the following operations:
771: Capture images of a scene, using at least three cameras having a view of the scene, the cameras being arranged in a substantially L-shaped configuration wherein a first pair of cameras is disposed along a first axis and a second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axes, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera;
772: Execute a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values;
773: Generate a data representation, representative of the captured images and the corresponding disparity values;
774: Utilize an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified. FIG. 78 is a flowchart of a method 780 relating to optional operations in RUD/URUD image space according to an exemplary practice of the invention, including the following operations:
781: Execute a stereo correspondence operation on the image data in a rectified, undistorted (RUD) image space, and storing resultant disparity data in a RUD space coordinate system;
782: Store the resultant disparity data in a URUD space coordinate system;
783: Generate disparity histograms from the disparity data in either RUD or URUD space, and store the disparity histograms in a unified URUD space coordinate system (and apply a URUD to RUD coordinate transformation to obtain per-axis disparity values).
FIG. 79 is a flowchart of a method 790 relating to private disparity histograms according to an exemplary practice of the invention, including the following operations:
791: Capture images of a scene using at least one camera having a view of the scene;
792: Execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values, using a disparity histogram method to integrate data and determine correspondence, which can include:
792.1: Construct a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel;
792.2: Optimize generation of disparity values on a GPU computing structure, by generating, in the GPU computing structure, a plurality of output pixel threads and for each output pixel thread, maintaining a private disparity histogram in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure;
793: Generate a data representation, representative of the captured images and the corresponding disparity values.
FIG. 80 is a flowchart of a method 800 further relating to private disparity histograms according to an exemplary practice of the invention, including the following operations:
801: Store the private disparity histogram such that each pixel thread writes to and reads from the corresponding private disparity histogram on a dedicated portion of shared local memory in the GPU;
802: Organize shared local memory in the GPU at least in part into memory words; the private disparity histogram is characterized by a series of histogram bins indicating the number of votes for a given disparity range; and if a maximum possible number of votes in the private disparity histogram is known, multiple histogram bins can be packed into a single word of the shared local memory and accessed using bitwise GPU access operations.
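The following illustrates the bin-packing idea in Python for clarity; on the GPU the same shifts and masks would be applied to 32-bit words of shared local memory. The 8-bit bin width (at most 255 votes per bin) and the 32-bit word size are illustrative assumptions.

```python
BITS_PER_BIN = 8                       # assumes at most 255 votes per bin
BINS_PER_WORD = 32 // BITS_PER_BIN     # four bins packed per 32-bit word
MASK = (1 << BITS_PER_BIN) - 1

def read_bin(words, bin_index):
    word = words[bin_index // BINS_PER_WORD]
    shift = (bin_index % BINS_PER_WORD) * BITS_PER_BIN
    return (word >> shift) & MASK

def add_vote(words, bin_index):
    shift = (bin_index % BINS_PER_WORD) * BITS_PER_BIN
    words[bin_index // BINS_PER_WORD] += 1 << shift   # assumes the bin does not overflow

# A 64-bin private histogram packed into 16 words:
hist_words = [0] * (64 // BINS_PER_WORD)
add_vote(hist_words, 10)
add_vote(hist_words, 10)
assert read_bin(hist_words, 10) == 2
```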
FACIAL SIGNATURE ASPECTS OF THE INVENTION
Identification, authentication or matching of a user or subject, by the user's facial features, can be useful in a wide range of settings. These may include controlling or limiting access to systems, enabling rapid or simplified access to systems or to a particular use account or use profile on a system, or other security purposes. Exemplary practices and embodiments of the invention enable such identification, authentication or matching, by generating a Facial Signature based on images of the user's or subject's face, or face and head, as described in greater detail below.
The following discussion and the corresponding, accompanying drawing figures, relate to exemplary Facial Signature aspects, embodiments and practices of the invention.
Those skilled in the art will understand that the digital processor elements of the embodiments of the invention depicted in the accompanying drawing figures, such as in, but not limited to, FIGS. 1, 7-13, 18, 33, and 47-49, can be employed to execute the Facial Signature functions of exemplary practices and embodiments of the invention described herein, including image capture, image rectification, feature correlation/disparity value processing, and Facial Signature generation functions. By way of example, the Facial Signature aspects of the invention can be executed on otherwise conventional processing elements and platforms provided by or associated with known forms of desktop computers, laptop computers, tablet computers, smartphones, and associated additional or peripheral hardware elements, such as cameras, suitably configured in accordance with exemplary practices of the invention.
Identification and Matching/Authentication Example
In an exemplary practice of the invention, the front-end aspects of the V3D processing pipeline described above, i.e., aspects of Image Capture, Image Rectification and Feature Correspondence, are employed, but instead of constructing a representation intended for 3D streaming of a scene for visualizing it from different views (see, e.g., FIGS. 7 and 8, depicting exemplary practices and embodiments of the V3D invention), the V3D process front-end can be configured to construct a "Facial Signature" for the purposes of subsequently identifying an individual person, or user of a system or resource, in a secure manner that is substantially more difficult to forge than a regular 2D facial image.
FIG. 85 is a flowchart of an exemplary practice of the Facial Signature aspects using V3D process elements of the invention, including capturing images of the user's or subject's face (851), executing image rectification to compensate for camera optical distortion and alignment (852), executing feature correspondence to produce disparity/depth values (853), eliminating the image background (854), and generating a facial signature data representation (855).
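The following is a minimal sketch of how the FIG. 85 pipeline could be orchestrated. Each helper passed in is a placeholder for the corresponding V3D stage described elsewhere in this document (rectification, correspondence, background elimination); the FacialSignature structure and all function names are illustrative assumptions, not the source's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FacialSignature:
    contours_3d: np.ndarray      # depth/contour samples of the face (NaN off-face)
    texture_2d: np.ndarray       # accompanying 2D image from one of the cameras

def build_facial_signature(raw_images, rectify, correspond, remove_background):
    rectified = [rectify(img) for img in raw_images]        # step 852
    disparity = correspond(rectified)                       # step 853
    depth, face_mask = remove_background(disparity)         # step 854
    contours = np.where(face_mask, depth, np.nan)
    return FacialSignature(contours_3d=contours,            # step 855
                           texture_2d=rectified[0])
```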
The enhanced level of security provided by the Facial Signature aspect of the invention is enabled in part because the depth stereo estimation of the V3D method of the invention described in this document requires all of the facial features to be presented to the camera(s) at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor. Creating a forgery would require an accurate physical model of the face in the real world. By requiring multiple poses, the forger's challenge of constructing an accurate 3D model becomes highly impractical.
By way of example, FIG, 81 illustrates an exemplary practice of the Facial Signature aspect of the invention, including obtaining images from the carnera(s) (81, 1). generating rectified images (81.2), executing disparity/depth estimation (81.3), executing background elimination (81.4), and combining with 2D color information (81.5a, 81.5b, 81.5c).. which can occur using multiple poses of the human user/subject, as described in greater detail below.
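By way of illustration only, the following Python sketch outlines the ordering of the FIG. 81 stages; the stage functions shown are hypothetical placeholders standing in for the camera, rectification and feature-correspondence components described elsewhere in this document, and are not taken from the specification.

```python
import numpy as np

# Hypothetical stage functions; real implementations would wrap the camera
# driver, the per-camera rectification maps and the feature-correspondence kernel.
def rectify(raw_left, raw_right):
    # Placeholder: a real system applies per-camera undistortion/rectification maps (81.2).
    return raw_left, raw_right

def estimate_disparity(left, right):
    # Placeholder: a real system runs the feature-correspondence / disparity search (81.3).
    return np.zeros(left.shape[:2], dtype=np.float32)

def facial_signature_front_end(raw_left, raw_right, face_box):
    """Steps 81.1-81.5: rectify, estimate disparity, drop background, keep paired 2D colour."""
    left, right = rectify(raw_left, raw_right)       # 81.2
    disparity = estimate_disparity(left, right)      # 81.3
    x, y, w, h = face_box
    face_disp = disparity[y:y + h, x:x + w]           # 81.4: background elimination
    face_rgb = left[y:y + h, x:x + w]                 # 81.5: 2D colour data paired with depth
    return face_disp, face_rgb
```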
In this configuration, the facial signature could be a combination of the 3D facial contour information and the regular 2D image from one or more of the cameras. A system or process in accordance with the Facial Signature aspect of the invention could either store the 3D contour data in the signature, or simply use the 2D image of the face but use the 3D facial contours just to confirm that the image(s) depict an actual human face with credible 3D proportions that was viewed by the cameras at the same location as the 2D image.
A method in accordance with the Facial Signature aspect can also include an enrollment phase in which the human user or subject would be requested, by the system, to move his or her head into different orientations, and, optionally, strike a number of alternative facial poses, such as "smile" or "wink", so that the system can establish a robust scan (or multiple scans) of the human subject's facial proportions. In an exemplary practice of the invention, an enrolled Facial Signature is generated from these scans.
During the matching process, a few seconds of images of the user's or subject's head can be captured in real-time, resulting in hundreds of individual captures, each slightly different, and then correlated with the facial signature to confirm a match.
Exemplary practices of the invention, including the facial signature aspects of the invention, can be configured for a variety of purposes, including, but not limited to, the following:
Reliable identification of a unique individual:
The facial signature generated from the depth information extracted from the V3D front-end can be used to identify a specific individual. Such an identification would be more reliable than a conventional 2D identification system due to the ability to take into account the actual 3D coordinates of facial features with respect to other facial features. It would also be much more difficult to spoof, or forge. Conventional 2D facial identification systems can be easily spoofed by presenting an image of the real user's face from a display or hard-copy. However, because the disparity or depth detection algorithms of the present invention actually compute the distance to real world features from multiple perspectives, a Facial Signature process or system in accordance with the invention would eliminate such vulnerabilities.
Security factor in an authentication system:
The facial signature aspects of the invention can be combined with other security factors, such as a fingerprint or a pass-code, to provide a high level of security for accessing a user account on a device or system, or for other authentication purposes.
In a hybrid configuration with a conventional 2D face identification system:
Although the 3D contour data could alone be used to identify a face, combining it with the existing 2D image from one or more cameras would add further security.
In addition to a full 2D and 3D scan authentication, it would be possible to implement a simpler approach by fully identifying the human user or subject, via the 2D image, and using the facial signature only to confirm that an actual human-proportioned face was presented at an overlapping location of the 2D match. This would eliminate the possibility of forgery by presenting a fake 2D image of the person being spoofed, since the depth detecting system would perceive a flat image, rather than the correct 3D contours, and reject the attempted forgery.
Using a 2D bounding rectangle to reduce search space:
Existing 2D facial detection systems return one or more rectangles or "boxes" defining the 2D extent of a human subject's face location. In an exemplary practice of the present invention, a 2D facial detection operation could be executed, and then used to minimize the amount of processing required for the 3D calculation by limiting the calculations to within that box. Utilizing the 2D facial detection operation in this manner can reduce the system's power consumption, and reduce the time required for facial recognition on a given device.
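By way of illustration only, the following Python sketch shows one possible way of using a conventional 2D face detector to bound the disparity search, here using OpenCV's stock face detector and block matcher as stand-ins; the particular detector, matcher and parameter values are assumptions, not requirements of the invention.

```python
import cv2
import numpy as np

def face_limited_disparity(left_gray, right_gray):
    """Run the (expensive) disparity search only inside a detected 2D face box."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(left_gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None, None

    x, y, w, h = faces[0]                              # first detected face box
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)

    # Crop both rectified views to the face box, keeping a margin on the left
    # edge equal to the disparity search range so matching near the box edge works.
    margin = 64
    x0 = max(0, x - margin)
    disp = matcher.compute(left_gray[y:y + h, x0:x + w],
                           right_gray[y:y + h, x0:x + w]).astype(np.float32) / 16.0
    return (x, y, w, h), disp[:, x - x0:]              # disparity restricted to the face box
```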
Forgery protection through multiple poses in the facial signature:
It is theoretically possible to construct a 3D sculpture or 3D print of a person's head to spoof a visual authentication system. However, this possibility can be eliminated by using multiple poses from a large number of captured image frames to confirm identity. This can be implemented by employing a high-performance depth detection system that can scan many poses in a short time frame, and requiring the user or subject to perform head movements or facial expression changes.
Enrollment process:
As part of the system, a user would train the device by generating a unique facial signature. In an exemplary practice of the invention, the system would request the user to present a series of desired head movements relative to the device, or a series of facial expressions, such as "smile" or "wink." In an exemplary practice of the invention, an enrolled Facial Signature is generated from these scans. By collecting a series of possible poses, the signature would have higher dimensionality than would a single pose.
Matching process:
With a properly trained system, the matching process could simply observe or capture images of the user for a few seconds, and with a sufficiently high-performance depth detection system, would capture many frames of 3D and 2D data. In accordance with the invention, this would be correlated with the facial signature captured during the enrollment process, and a probability of match score would be generated. This score would then be compared with a threshold to confirm or deny an identity match.
Evolving or updating the facial signature over time:
Although the facial features of a human user or subject typically will remain relatively constant over time, changes will occur over time, due to changing head and facial hair, ageing, and other factors. Accordingly, in an exemplary practice of the invention, the facial signature itself is updated or evolved over time. In particular, an exemplary practice of the facial signature method of the invention includes updating or evolving the facial signature itself on every successful match, or on every nth successful match, where n is a selected integer, in order to accommodate these changes.
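By way of illustration only, a minimal Python sketch of one possible update rule follows; the blend factor and the choice of n are illustrative assumptions, and the specification does not prescribe a particular update formula.

```python
import numpy as np

def maybe_update_signature(enrolled_hist, candidate_hist, match_count,
                           n=10, blend=0.05):
    """Evolve the enrolled histogram signature on every nth successful match.

    n and blend are illustrative parameters; the specification only states that
    the signature is updated on every match or on every nth match.
    """
    if match_count % n != 0:
        return enrolled_hist
    updated = (1.0 - blend) * enrolled_hist + blend * candidate_hist
    return updated / updated.sum()     # keep the evolved histogram normalized
```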
Histogram-Based Facial Signature Representation
In accordance with the invention, one method of representing the facial signature is in the form of one or more combined histograms taken directly from the summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth. These combined histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
In this regard, FIGS. 82-83 show an exemplary image processed in accordance with an exemplary practice of the Facial Signature aspects of the invention. FIG. 82 is an example of an image of a human user or subject captured by at least one camera, and FIG. 83 is an example of a representation of image data corresponding to the image of FIG. 82, processed in accordance with an exemplary practice of the invention. FIG. 84 shows a histogram representation corresponding to the image(s) of FIGS. 82-83, generated in accordance with an exemplary practice of the Facial Signature aspects of the invention.
As shown in the histogram of FIG. 84 (which corresponds to the image(s) of FIGS. 82 and 83), the X-axis of the histogram would represent a disparity (or depth) range, and the Y-axis would represent the normalized count of image samples that fell within that range. During the populating of the histogram, a conventional 2D face detector can be employed to provide a face rectangle and location of the basic facial features, such as eyes, nose and mouth. See, e.g., FIG. 83, which indicates, among other aspects, a rectangle surrounding the human subject's face. Only samples within the facial rectangle surrounding the face would be accumulated into the combined histogram, along with a rejection of samples falling outside the statistical majority of those within the facial rectangle, in order to remove background samples. In addition, the disparity and depth points could be projected into a canonical coordinate system defined by a plane constructed from the basic facial features. During the enrollment process, the histogram would be accumulated over several frames over a period of time, but each set of samples would be transformed to always lie on the facial plane. This allows for many samples of the facial depth relationships to be captured into one or more combined histograms, potentially for a series of poses. This is analogous to taking a mold of a user's or subject's face, but by using statistical processes in accordance with the invention.
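By way of illustration only, the following Python sketch shows one possible accumulation of the face-rectangle disparity samples into a normalized histogram; the outlier rule used to approximate the "statistical majority" (here, within two standard deviations of the median) and the bin parameters are illustrative assumptions.

```python
import numpy as np

def face_disparity_histogram(disparity, face_box, num_bins=64, disp_range=(0.0, 64.0)):
    """Accumulate disparity samples inside the face rectangle into a normalized
    histogram, rejecting samples outside the statistical majority (background)."""
    x, y, w, h = face_box
    samples = disparity[y:y + h, x:x + w].ravel()
    samples = samples[samples > 0]                     # drop invalid disparities
    if samples.size == 0:
        return np.zeros(num_bins)

    # Keep only the statistical majority: here, within 2 sigma of the median.
    # This particular rule is an illustrative choice; the spec does not fix it.
    med, std = np.median(samples), np.std(samples)
    kept = samples[np.abs(samples - med) < 2.0 * std]

    hist, _ = np.histogram(kept, bins=num_bins, range=disp_range)
    hist = hist.astype(np.float64)
    return hist / hist.sum() if hist.sum() > 0 else hist
```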
During identification, the same process would be performed on the candidate user's face. Once a candidate identification histogram is captured, it would be subtracted from the set of enrolled histograms and the vector distance would constitute a matching score. By comparing the matching score against a programmable threshold, access could be granted or denied.
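By way of illustration only, the following Python sketch shows one possible matching step; the Euclidean vector distance and the threshold value are illustrative assumptions consistent with, but not mandated by, the description above.

```python
import numpy as np

def match_score(candidate_hist, enrolled_hists):
    """Subtract the candidate histogram from each enrolled histogram and use the
    smallest vector (Euclidean) distance as the matching score."""
    distances = [np.linalg.norm(candidate_hist - h) for h in enrolled_hists]
    return min(distances)

def authenticate(candidate_hist, enrolled_hists, threshold=0.15):
    """Grant access when the best matching score is below a programmable threshold.
    The threshold value here is purely illustrative."""
    return match_score(candidate_hist, enrolled_hists) < threshold
```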
In addition, this method could be used in isolation or paired in a hybrid configuration with conventional 2D image matching of the face to provide a further authentication factor.
FIGS. 85-88 are flowcharts illustrating method aspects and exemplary practices of the invention. The methods depicted in these flowcharts are examples only; the organization, order and number of operations in the exemplary practices can be varied; and the exemplary practices and methods can be arranged or ordered differently, and include different functions, whether singly or in combination, while still being within the spirit and scope of the present invention. Items described below in parentheses are, among other aspects, optional in a given practice of the invention.
FIG. 85 is a flowchart of a method 850 for generating a facial signature data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
851: Capture images of user's or subject's face or face and head;
851.1: (using at least one camera having a view of the user's face);
851.2: (can use multiple cameras);
851.3: (can include T-O-F or structured light camera);
852: Execute image rectification function to compensate for camera optical distortion and alignment;
853: Execute feature correspondence function(s):
853.1: Detect common features between corresponding images captured by camera(s) or single camera over time;
853.2: Generate disparity/depth values and feature correspondence data representation representative of captured images and corresponding disparity/depth values;
854: Eliminate background portion(s) of image;
855: Generate facial signature data representation (can be histogram-based representation).
FIG. 86 is a flowchart of further method aspects 860 for generating a facial signature data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
860.1: (The method or system can utilize stereo depth estimation to verify that human facial features are presented to camera(s) at correct distance ratios between camera(s) or from structured light or time-of-flight sensor);
860.2: (The aspect of identifying a human user or subject takes into account actual 3D coordinates of facial features with respect to other facial features);
860.3: (The feature correspondence function or depth detection function includes computing distances between facial features from multiple perspectives);
860.4: (The facial signature can be a combination of 3D facial contour information and 2D image data from one or more camera(s));
860.5: (3D contour data can be stored in facial signature data representation);
860.6: (A facial signature generated in accordance with the invention can be utilized as a security factor in an authentication system, either alone or in combination with other security factors);
860.7: (Update a facial signature on every successful match, or on every nth successful match, where n is a selected integer);
860.8: (3D facial contour data can be combined with 2D image data from one or more cameras in a conventional 2D face identification system, to create a hybrid 3D/2D face identification system);
860.9: (3D facial contour data can be used to confirm that a face having credible 3D human facial proportions was presented to the camera(s) at an overlapping spatial location of captured 2D image(s));
860.10: (A 2D bounding rectangle, defining a 2D extent of the human user's or subject's face location, can be used to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption);
860.11: (The facial signature data representation can be a histogram-based facial signature data representation).
FIG. 87 is a flowchart of method aspects 870 for generating a histogram-based facial signature data representation, according to an exemplary practice of the invention, which can include a number of the following operations:
870.1: (The facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth);
870.2: (The histogram represents normalized relative proportion of facial feature depths across a plane parallel to the user's or subject's face);
870.3: (The X-axis of the histogram represents a given disparity or depth range; the Y-axis represents normalized count of image samples that fall within the given range);
870.4: (During population of the histogram, a conventional 2D face detector can provide a face rectangle and location of basic facial features, and only samples within the face rectangle are accumulated into the histogram);
870.5: (Samples falling outside a statistical majority of samples within the face rectangle can be rejected, so as to remove background samples (defining "background" as anything outside the statistical majority of samples within the face rectangle));
870.6: (Disparity and depth points can be projected into a canonical coordinate system defined by a plane geometrically constructed from or defined by basic facial features such as eyes, nose, mouth; see the illustrative sketch following this list);
870.7: (The histogram representation can be used in combination with conventional 2D face matching to provide an additional authentication factor).
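By way of illustration only, the following Python sketch shows one possible construction of such a canonical facial-plane coordinate system from three landmark points and the projection of 3D points into it; the landmark choice and axis conventions are illustrative assumptions.

```python
import numpy as np

def canonical_face_frame(left_eye, right_eye, mouth):
    """Build an orthonormal frame (origin + axes) from three basic facial landmarks.
    The landmark choice and axis conventions are illustrative assumptions."""
    origin = (left_eye + right_eye + mouth) / 3.0
    x_axis = right_eye - left_eye
    x_axis /= np.linalg.norm(x_axis)
    normal = np.cross(x_axis, mouth - left_eye)        # facial-plane normal
    normal /= np.linalg.norm(normal)
    y_axis = np.cross(normal, x_axis)
    return origin, np.stack([x_axis, y_axis, normal])  # rotation with basis vectors as rows

def to_canonical(points, origin, rotation):
    """Project 3D points (N x 3) into the canonical facial-plane coordinate system."""
    return (points - origin) @ rotation.T
```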
FIG. 88 is a flowchart of a facial signature method aspect 880, including enrollment and matching phases of an exemplary facial signature method in accordance with the invention, which can include a number of the following operations:
881: Capture images (using at least one camera) for the enrollment phase (can utilize and require a selected number (n) of poses of the human user or subject);
882: Execute enrollment phase functions:
882.1: (Each set of samples of captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram);
882.2: (Multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses);
883: Capture images (using at least one camera) for the matching phase (can utilize and require a selected number (n) of poses of the human user or subject);
884: Execute matching phase functions:
884.1: (Candidate histogram is accumulated over multiple captured image frames over a period of time);
884.2: (Once candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate vector distance constituting degree-of-match score; then compare degree-of-match score to selected threshold to confirm or deny match with each enrolled signature).

While the foregoing description and the accompanying drawing figures provide details that will enable those skilled in the art to practice aspects of the invention, it should be recognized that the description is illustrative in nature and that many modifications and variations thereof will be apparent to those skilled in the art having the benefit of these teachings. It is accordingly intended that the invention herein be defined solely by any claims that may be appended hereto and that the invention be interpreted as broadly as permitted by the prior art.

Claims

We claim:
1. A method of generating a facial signature for use in identifying a human user or subject, the method comprising:
capturing images of the user's face, the capturing comprising utilizing at least one camera having a view of the user's or subject's face;
executing an image rectification function to compensate for optical distortion and alignment of the at least one camera;
executing a feature correspondence function by detecting common features between
corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and
utilizing the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject, in a secure, difficult to forge manner.
2. The method of claim 1 wherein the capturing comprises utilizing at least two cameras, each having a view of the user's or subject's face; and wherein executing a feature correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
3. The method of claim 1 wherein:
the capturing comprises utilizing at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and
the feature correspondence data representation is representative of the captured images and corresponding depth information.
4. The method of claim 1 wherein:
the capturing comprises utilizing a single camera having a view of the user's or subject's face; and
executing a feature correspondence function comprises detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
5. The method of claims 2 or 3 wherein the identifying utilizes stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
6. The method of claim 5 wherein the identifying takes into account the actual 3D coordinates of facial features with respect to other facial features.
7. The method of claim 6 wherein the feature correspondence function or depth detection function comprises computing distances between facial features from multiple perspectives.
8. The method of claims 2 or 3 wherein the facial signature is a combination of 3D facial contour information and a 2D image from one or more of the cameras.
9. The method of claim 8 wherein 3D contour data is stored in the facial signature data representation.
10. The method of claims 2 or 3 wherein the facial signature is utilized as a security factor in an authentication system.
11. The method of claim 10 wherein the facial signature is utilized as a security factor in an authentication system in combination with other security factors.
12. The method of claim 11 wherein the other security factors comprise any of a pass-code, a fingerprint or other biometric information.
13. The method of claim 8 wherein the 3D facial contour data is combined with a 2D image from one or more cameras in a conventional 2D face identification system to create a hybrid 3D/2D face identification system.
14. The method of claim 13 wherein the method utilizes 3D facial contour data solely to confirm that a face having credible 3D human facial proportions was presented to the cameras at an overlapping spatial location of the captured 2D image.
15. The method of claim 2 or 3 further comprising utilizing a 2D bounding rectangle, defining the 2D extent of the face location, to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption.
16. The method of claims 2 or 3 further comprising prompting the user or subject to present multiple distinct facial poses or head positions, and utilizing a depth detection system to scan the multiple facial poses or head positions across a series of image frames, so as to increase protection against forgery of the facial signature.
17. The method of claims 2 or 3 wherein generating a unique facial signature further comprises executing an enrollment phase, the enrollment phase comprising:
prompting the user or subject to present to the cameras a plurality of selected head movements or positions, or a series of selected facial poses, and collecting image frames from a plurality of head positions or facial poses for use in generating the unique facial signature representative of the user.
18. The method of claim 17 further comprising a matching phase, the matching phase
comprising:
utilizing the cameras to capture, over an interval of time, a plurality of frames of 3D and 2D data representative of the user's face;
correlating the captured data with the facial signature generated during the enrollment phase, thereby to generate a probability of match score; and
comparing the probability of match score with a selected threshold value, thereby to confirm or deny an identity match.
19. The method of claim 18 further comprising:
during the enrollment phase, generating an enrolled facial signature containing data
corresponding to multiple image scans of a user's or subject's face, the multiple image scans
corresponding to a plurality of the user's or subject's head positions or facial poses; and during the matching phase, requiring at least a minimum number of captured image frames corresponding to different facial or head positions matching the multiple scans within the enrolled signature.
20. The method of claim 19 further comprising: generating a histogram based facial signature representation, whereby a facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth.
21. The method of claim 20 wherein the histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
22. The method of claim 21 wherein the X-axis of the histogram represents a given disparity or depth range, and the Y-axis of the histogram represents the normalized count of image samples that fall within the given range.
23. The method of claim 22 wherein, during population of the histogram, a conventional 2D face detector provides a face rectangle and location of basic facial features, and wherein only samples within the face rectangle are accumulated into the histogram.
24. The method of claim 23 further comprising rejecting samples falling outside the statistical majority of samples within the face rectangle, so as to remove background samples.
25. The method of claim 24 further comprising projecting disparity and depth points into a canonical coordinate system defined by a plane constructed from the 3D coordinates of the basic facial features.
26. The method of claim 22 wherein, during the enrollment phase, a histogram is accumulated over multiple captured image frames over a period of time.
27. The method of claim 26 wherein, during the enrollment phase, each set of samples of the captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram.
28. The method of claim 27 wherein, during the enrollment phase, multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses.
29. The method of claim 28 wherein, during the matching phase, a candidate histogram is accumulated over multiple captured image frames over a period of time.
30. The method of claim 29 wherein, during the matching phase, once a candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate a vector distance constituting a degree-of-match score.
31. The method of claim 30 further comprising comparing the degree-of-match score to a selected threshold to confirm or deny a match with each enrolled signature.
32. The method of claim 31 wherein the histogram representation is used in combination with conventional 2D face matching to provide an additional authentication factor.
33. The method of claim 1 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every successful match.
34. The method of claim 1 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every nth successful match, where n is a selected integer.
35. A program product for use with a digital processing system, for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the user's or subject's face, utilizing the at least one camera;
execute an image rectification function to compensate for optical distortion and alignment of the at least one camera;
execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation representative of the captured images and the corresponding disparity values; and
utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject in a secure, difficult to forge manner.
36. The program product of claim 35 wherein the capturing comprises utilizing at least two cameras, each having a view of the user's or subject's face; and wherein executing a feature
correspondence function comprises detecting common features between corresponding images captured by the respective cameras.
37. The program product of claim 35 wherein:
the capturing comprises utilizing at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and
the feature correspondence data representation is representative of the captured images and corresponding depth information.
38. The program product of claim 35 wherein:
the capturing comprises utilizing a single camera having a view of the user's or subject's face; and executing a feature correspondence function comprises detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
39. The program product of claims 36 or 37 wherein the identifying utilizes stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
40. The program product of claim 39 wherein the identifying takes into account the actual 3D coordinates of facial features with respect to other facial features.
41. The program product of claim 40 wherein the feature correspondence function or depth detection function comprises computing distances between facial features from multiple perspectives.
42. The program product of claims 36 or 37 wherein the facial signature is a combination of 3D facial contour information and a 2D image from one or more of the cameras.
43. The program product of claim 42 wherein 3D contour data is stored in the facial signature data representation.
44. The program product of claims 36 or 37 wherein the facial signature is utilized as a security factor in an authentication system.
45. The program product of claim 44 wherein the facial signature is utilized as a security factor in an authentication system in combination with other security factors.
46. The program product of claim 45 wherein the other security factors comprise any of a pass-code, a fingerprint or other biometric information.
47. The program product of claim 42 wherein the 3D facial contour data is combined with a 2D image from one or more cameras in a conventional 2D face identification system to create a hybrid 3D/2D face identification system.
48. The program product of claim 47 wherein 3D facial contour data is utilized solely to confirm that a face having credible 3D human facial proportions was presented to the cameras at an overlapping spatial location of the captured 2D image.
49. The program product of claims 36 or 37 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
utilize a 2D bounding rectangle, defining the 2D extent of the face location, to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption.
50. The program product of claims 36 or 37 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
prompt the user or subject to present multiple distinct facial poses or head positions, and utilizing a depth detection system to scan the multiple facial poses or head positions across a series of image frames, so as to increase protection against forgery of the facial signature.
51. The program product of claims 36 or 37 wherein generating a unique facial signature further comprises executing an enrollment phase, the enrollment phase comprising:
prompting the user or subject to present to the cameras a plurality of selected head movements or positions, or a series of selected facial poses, and collecting image frames from a plurality of head positions or facial poses for use in generating the unique facial signature representative of the user.
52. The program product of claim 51 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
enable a matching phase, the matching phase comprising:
utilizing the cameras to capture, over an interval of time, a plurality of frames of 3D and 2D data representative of the user's face;
correlating the captured data with the facial signature generated during the enrollment phase, thereby to generate a probability of match score; and
comparing the probability of match score with a selected threshold value, thereby to confirm or deny an identity match.
53. The program product of claim 52 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
generate, during the enrollment phase, an enrolled facial signature containing data corresponding to multiple image scans of a user's or subject's face, the multiple image scans corresponding to a plurality of the user's or subject's head positions or facial poses; and
require, during the matching phase, at least a minimum number of captured image frames corresponding to different facial or head positions matching the multiple scans within the enrolled signature.
54. The program product of claim 53 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
generate a histogram based facial signature representation, whereby a facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth.
55. The program product of claim 54 wherein the histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
56. The program product of claim 55 wherein the X-axis of the histogram represents a given disparity or depth range, and the Y-axis of the histogram represents the normalized count of image samples that fall within the given range.
57. The program product of claim 56 wherein, during population of the histogram, a conventional 2D face detector provides a face rectangle and location of basic facial features, and wherein only samples within the face rectangle are accumulated into the histogram.
58. The program product of claim 57 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to reject samples falling outside the statistical majority of samples within the face rectangle, so as to remove background samples.
59. The program product of claim 58 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to project disparity and depth points into a canonical coordinate system defined by a plane constructed from the 3D coordinates of the basic facial features.
60. The program product of claim 56 wherein, during the enrollment phase, a histogram is accumulated over multiple captured image frames over a period of time.
61. The program product of claim 60 wherein, during the enrollment phase, each set of samples of the captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram.
62. The program product of claim 61 wherein, during the enrollment phase, multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses.
63. The program product of claim 62 wherein, during the matching phase, a candidate histogram is accumulated over multiple captured image frames over a period of time.
64. The program product of claim 63 wherein, during the matching phase, once a candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate a vector distance constituting a degree-of-match score.
65. The program product of claim 64 further comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to compare the degree-of-match score to a selected threshold to confirm or deny a match with each enrolled signature.
66. The program product of claim 65 wherein the histogram representation is used in combination with conventional 2D face matching to provide an additional authentication factor.
67. The program product of claim 35 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every successful match.
68. The program product of claim 35 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every nth successful match, where n is a selected integer.
69. A digital processing system, for generating a facial signature for use in identifying a human user or subject, the digital processing system comprising at least one camera having a view of the user's or subject's face, and a digital processing resource comprising at least one digital processor, the digital processing resource being operable to:
capture images of the user's or subject's face, utilizing the at least one camera;
execute an image rectification function to compensate for optical distortion and alignment of the at least one camera;
execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera, and measuring a relative distance in image space between the common features, to generate disparity values and a feature correspondence data representation
representative of the captured images and the corresponding disparity values; and
utilize the feature correspondence data representation to generate a facial signature data representation, the facial signature data representation being usable to accurately identify the user or subject, in a secure, difficult to forge manner.
70. The system of claim 69 wherein the system comprises at least two cameras having a view of the user's or subject's face; the capturing comprises utilizing the at least two cameras; and wherein executing a feature correspondence function comprises detecting common features between
corresponding images captured by the respective cameras.
71. The system of claim 69 wherein:
the capturing comprises utilizing at least one camera having a view of the user's or subject's face and which is an infra-red time-of-flight camera or structured light camera that directly provides depth information; and
the feature correspondence data representation is representative of the captured images and corresponding depth information.
72. The system of claim 69 wherein:
the capturing comprises utilizing a single camera having a view of the user's or subject's face; and
executing a feature correspondence function comprises detecting common features between images captured by the single camera over time and measuring a relative distance in image space between the common features, to generate disparity values.
73. The system of claims 70 or 71 wherein the identifying utilizes stereo depth estimation to verify that human facial features are presented to the cameras at the correct distance ratios between the cameras or from the structured light or time-of-flight sensor.
74. The system of claim 73 wherein the identifying takes into account the actual 3D coordinates of facial features with respect to other facial features.
75. The system of claim 74 wherein the feature correspondence function or depth detection function comprises computing distances between facial features from multiple perspectives.
76. The system of claims 70 or 71 wherein the facial signature is a combination of 3D facial contour information and a 2D image from one or more of the cameras.
77. The system of claim 76 wherein 3D contour data is stored in the facial signature data representation.
78. The system of claims 70 or 71 wherein the facial signature is utilized as a security factor in an authentication system.
79. The system of claim 78 wherein the facial signature is utilized as a security factor in an authentication system in combination with other security factors.
80. The system of claim 79 wherein the other security factors comprise any of a pass-code, a fingerprint or other biometric information.
81. The system of claim 76 wherein the 3D facial contour data is combined with a 2D image from one or more cameras in a conventional 2D face identification system to create a hybrid 3D/2D face identification system.
82. The system of claim 81 wherein the system utilizes 3D facial contour data solely to confirm that a face having credible 3D human facial proportions was presented to the cameras at an overlapping spatial location of the captured 2D image.
83. The system of claims 70 or 71 wherein the digital processing resource is further operable to: utilize a 2D bounding rectangle, defining the 2D extent of the face location, to limit search space and limit calculations to a region defined by the rectangle, thereby increasing speed of recognition and reducing power consumption.
84. The system of claims 70 or 71 wherein the digital processing resource is further operable to: prompt the user or subject to present multiple distinct facial poses or head positions, and utilizing a depth detection system to scan the multiple facial poses or head positions across a series of image frames, so as to increase protection against forgery of the facial signature.
85. The system of claims 70 or 71 wherein generating a unique facial signature further
comprises executing an enrollment phase, the enrollment phase comprising:
prompting the user or subject to present to the cameras a plurality of selected head movements or positions, or a series of selected facial poses, and collecting image frames from a plurality of head positions or facial poses for use in generating the unique facial signature representative of the user.
86. The system of claim 85 wherein the digital processing resource is further operable to: enable a matching phase, the matching phase comprising:
utilizing the cameras to capture, over an interval of time, a plurality of frames of 3D and 2D data representative of the user's face;
correlating the captured data with the facial signature generated during the enrollment phase, thereby to generate a probability of match score; and
comparing the probability of match score with a selected threshold value, thereby to confirm or deny an identity match.
87. The system of claim 86 wherein the digital processing resource is further operable to: generate, during the enrollment phase, an enrolled facial signature containing data corresponding to multiple image scans of a user's or subject's face, the multiple image scans corresponding to a plurality of the user's or subject's head positions or facial poses; and
require, during the matching phase, at least a minimum number of captured image frames corresponding to different facial or head positions matching the multiple scans within the enrolled signature.
88. The system of claim 87 wherein the digital processing resource is further operable to:
generate a histogram based facial signature representation, whereby a facial signature is represented as one or more histograms obtained from a summation of per-pixel disparity histograms within the feature correspondence calculation, or generated from depth data from a sensor capable of directly perceiving depth.
89. The system of claim 88 wherein the histograms represent the normalized relative proportion of facial feature depths across a plane parallel to the user's face.
90. The system of claim 89 wherein the X-axis of the histogram represents a given disparity or depth range, and the Y-axis of the histogram represents the normalized count of image samples that fall within the given range.
91. The system of claim 90 wherein, during population of the histogram, a conventional 2D face detector provides a face rectangle and location of basic facial features, and wherein only samples within the face rectangle are accumulated into the histogram.
92. The system of claim 91 wherein the digital processing resource is further operable to cause the digital processing resource to reject samples falling outside the statistical majority of samples within the face rectangle, so as to remove background samples.
93. The system of claim 92 wherein the digital processing resource is further operable to cause the digital processing resource to project disparity and depth points into a canonical coordinate system defined by a plane constructed from the 3D coordinates of the basic facial features.
94. The system of claim 90 wherein, during the enrollment phase, a histogram is accumulated over multiple captured image frames over a period of time.
95. The system of claim 94 wherein, during the enrollment phase, each set of samples of the captured image frames undergoes an affine transform to lie on a common facial plane, to enable multiple samples of facial depth relationships to be accumulated into a histogram.
96. The system of claim 95 wherein, during the enrollment phase, multiple samples of facial depth relationships are accumulated into a histogram across a series of facial positions or poses.
97. The system of claim 96 wherein, during the matching phase, a candidate histogram is accumulated over multiple captured image frames over a period of time.
98. The system of claim 97 wherein, during the matching phase, once a candidate histogram is accumulated, it is subtracted from a set of enrolled histograms to generate a vector distance constituting a degree-of-match score.
99. The system of claim 98 wherein the digital processing resource is further operable to compare the degree-of-match score to a selected threshold to confirm or deny a match with each enrolled signature.
100. The system of claim 99 wherein the histogram representation is used in combination with conventional 2D face matching to provide an additional authentication factor.
101. The system of claim 69 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every successful match.
102. The system of claim 69 wherein the facial signature is utilized as a factor in an authentication process in which a human user or subject is successfully authenticated if selected criteria are met, and further comprising: updating the facial signature on every nth successful match, where n is a selected integer.
103. A video communication method that enables a first user to view a second user with direct virtual eye contact with the second user, the method comprising:
capturing images of the second user, the capturing comprising utilizing at least one camera having a view of the second user's face;
generating a data representation, representative of the captured images;
reconstructing a synthetic view of the second user, based on the representation; and
displaying the synthetic view to the first user on a display screen used by the first user;
the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
104. A video communication method that enables a user to view a remote scene in a manner that gives the user a visual impression of being present with respect to the remote scene, the method comprising:
capturing images of the remote scene, the capturing comprising utilizing at least two cameras each having a view of the remote scene;
executing a feature correspondence function by detecting common features between
corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generating a data representation, representative of the captured images and the corresponding disparity values;
reconstructing a synthetic view of the remote scene, based on the representation; and displaying the synthetic view to the first user on a display screen used by the first user;
the capturing, detecting, generating, reconstructing and displaying being executed such that: (a) the user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and
(b) the user is provided an immersive visual experience of the remote scene.
105. A method of facilitating self-portraiture of a user utilizing a handheld device to take the self-portrait, the handheld mobile device having a display screen for displaying images to the user, the method comprising:
providing at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self-portrait;
capturing images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen;
estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generating a data representation, representative of the captured images;
reconstructing a synthetic view of the user, based on the generated data representation and the generated tracking information;
displaying to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with realtime visual feedback.
106. A method of facilitating composition of a photograph of a scene, by a user utilizing a handheld device to take the photograph, the handheld device having a display screen on a first side for displaying images to the user, and at least one camera on a second, opposite side of the handheld device, for capturing images, the method comprising:
capturing images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph;
estimating a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generating a data representation, representative of the captured images;
reconstructing a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and
displaying to the user, on the display screen during the setup time, the synthetic view of the scene; thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with realtime visual feedback.
107. A method of displaying images to a user utilizing a binocular stereo head-mounted display (HMD), the method comprising: capturing at least two image streams using at least one camera attached or mounted on or proximate to an external portion or surface of the HMD, the captured image streams containing images of a scene;
generating a data representation, representative of captured images contained in the captured image streams;
reconstructing two synthetic views, based on the representation; and
displaying the synthetic views to the user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes,
so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
108. A method of capturing and displaying image content on a binocular stereo head-mounted display (HMD), the method comprising:
capturing at least two image streams using at least one camera, the captured image streams containing images of a scene;
generating a data representation, representative of captured images contained in the captured image streams;
reconstructing two synthetic views, based on the representation; and
displaying the synthetic views to a user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes,
so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
109. A method of generating an image data stream for use by a control system of an autonomous vehicle, the method comprising:
capturing images of a scene around at least a portion of the vehicle, the capturing comprising utilizing at least one camera having a view of the scene;
executing a feature correspondence function by detecting common features between
corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
calculating corresponding depth information based on the disparity values; and generating from the images and corresponding depth information an image data stream for use by the control system.
110. A program product for use with a digital processing system, for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising at least one camera having a view of the second user's face, a display screen for use by the first user, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the second user, utilizing the at least one camera;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the second user, based on the .representation; and
display the synthetic view to the first user on the display screen for use by the first, user:
tire capturing, generating, reconstructing and displaying bein executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second use r in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
111, A program product for use with a digital processing system, for enabling a first user to view a remote scene wrth the visual impression of being present with respect to the remote scene, the digital processing system comprising at least two cameras, each having a view of the remote scene, a display screen fo use by the first user, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructioiis stored on a non-transitory digital processor-readable medium., which when executed in the digital processing resource cause the digital processing resource to:
capture images of the remote scene, utilizing the at least two cameras;
execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generate a data representation, representative of the captured images and the corresponding disparity values;
reconstruct a synthetic view of the remote scene, based on the representation; and
display the synthetic view to the first user on the display screen;
the capturing, detecting, generating, reconstructing and displaying being executed such that: (a) the first user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and
(b) the first user is provided an immersive visual experience of the remote scene.
112. A program product for use with a handheld digital processing device, for facilitating self-portraiture of a user utilizing the handheld device to take the self portrait, the handheld device having a digital processor, a display screen for displaying images to the user, and at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self portrait, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processor cause the digital processor to:
capture images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen;
estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the user, based on the generated data representation and the generated tracking information; and
display to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with realtime visual feedback.
113. A program product for use with a handheld digital processing device, for facilitating composition of a photograph of a scene by a user utilizing the handheld device to take the photograph, the handheld device having a digital processor, a display screen on a first side for displaying images to the user, and at least one camera on a second, opposite side of the handheld device, for capturing images, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processor cause the digital processor to:
capture images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph;
estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and
display to the user, on the display screen during the setup time, the synthetic view of the scene;
thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with realtime visual feedback.
114. A program product for enabling display of images to a user utilizing a binocular stereo head-mounted display (HMD), the HMD having at least one camera attached or mounted on or proximate to an external portion or surface of the HMD, the HMD having, or being in communication with, a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture at least two image streams using the at least one camera, the captured image streams containing images of a scene;
generate a data representation, representative of captured images contained in the captured image streams;
reconstruct two synthetic views, based on the representation; and
display the synthetic views to the user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes,
so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
115. A program product for enabling display of captured image content to a user utilizing a binocular stereo head-mounted display (HMD), the captured image content comprising at least two image streams captured or generated by at least one camera, the captured image streams containing images of a scene, and the HMD having, or being in communication with, a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
generate a data representation, representative of captured images contained in the captured image streams;
reconstruct two synthetic views, based on the representation; and
display the synthetic views to a user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes,
so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
116. A program product for enabling the generation of an image data stream for use by a control system of an autonomous vehicle, the vehicle having at least one camera with a view of a scene around at least a portion of the vehicle and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the scene around at least a portion of the vehicle, using the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
calculate corresponding depth information based on the disparity values; and
generate from the images and corresponding depth information an image data stream for use by the control system.
117. A digital processing system for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising:
at least one camera, having a view of the second user's face;
a display screen for use by the first user; and
a digital processing resource comprising at least one digital processor, the digital processing resource being operable to:
capture images of the second user, utilizing the at least one camera;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the second user, based on the representation; and
display the synthetic view to the first user on the display screen for use by the first user;
the capturing, generating, reconstructing and displaying being executed such that the first user can have direct virtual eye contact with the second user through the first user's display screen, by the reconstructing and displaying of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, even if no camera has a direct eye contact gaze vector to the second user.
118. A digital processing system for enabling a first user to view a remote scene with the visual impression of being present with respect to the remote scene, the digital processing system comprising:
at least two cameras, each having a view of the remote scene;
a display screen for use by the first user; and
a digital processing resource comprising at least one digital processor, the digital processing resource being operable to:
capture images of the remote scene, utilizing the at least two cameras;
execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generate a data representation, representative of the captured images and the corresponding disparity values;
reconstruct a synthetic view of the remote scene, based on the representation; and display the synthetic view to the first user on the display screen;
the capturing, detecting, generating, reconstructing and displaying being executed such that: (a) the first user is provided the visual impression of looking through his display screen as a physical window to the remote scene, and
(b) the first user is provided an immersive visual experience of the remote scene.
119. A system operable in a handheld digital processing device, for facilitating self-portraiture of a user utilizing the handheld device to take the self portrait, the system comprising:
a digital processor;
a display screen for displaying images to the user; and
at least one camera around the periphery of the display screen, the at least one camera having a view of the user's face at a self-portrait setup time during which the user is setting up the self portrait; the system being operable to:
capture images of the user during the setup time, utilizing the at least one camera around the periphery of the display screen;
estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the user, based on the generated data representation and the generated tracking information; and
display to the user, on the display screen during the setup time, the synthetic view of the user; thereby enabling the user, while setting up the self-portrait, to selectively orient or position his gaze or head, or the handheld device and its camera, with realtime visual feedback.
120. A system operable in a handheld digital processing device, for facilitating composition of a photograph of a scene by a user utilizing the handheld device to take the photograph, the system comprising:
a digital processor;
a display screen on a first side of the handheld device for displaying images to the user, and at least one camera on a second, opposite side of the handheld device, for capturing images; the system being operable to:
capture images of the scene, utilizing the at least one camera, at a photograph setup time during which the user is setting up the photograph;
estimate a location of the user's head or eyes relative to the handheld device during the setup time, thereby generating tracking information;
generate a data representation, representative of the captured images;
reconstruct a synthetic view of the scene, based on the generated data representation and the generated tracking information, the synthetic view being reconstructed such that the scale and perspective of the synthetic view has a selected correspondence to the user's viewpoint relative to the handheld device and the scene; and display to the user, on the display screen during the setup time, the synthetic view of the scene; thereby enabling the user, while setting up the photograph, to frame the scene to be photographed, with selected scale and perspective within the display frame, with realtime visual feedback.
121. A system for enabling display of images to a user utilizing a binocular stereo head-mounted display (HMD), the system comprising:
at least one camera attached or mounted on or proximate to an external portion or surface of the HMD; and
a digital processing resource comprising at least one digital processor;
the system being operable to:
capture at least two image streams using the at least one camera, the captured image streams containing images of a scene;
generate a data representation, representative of captured images contained in the captured image streams;
reconstruct two synthetic views, based on the representation; and
display the synthetic views to the user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes,
so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
122. A system for enabling display of captured image content to a user utilizing a binocular stereo head-mounted display (HMD), the system comprising a digital processing resource comprising at least one digital processor; the captured image content comprising at least two image streams captured or generated by at least one camera, the captured image streams containing images of a scene, and the HMD having, or being in communication with, a digital processing resource comprising at least one digital processor,
the system being operable to:
generate a data representation, representative of captured images contained in the captured image streams;
reconstruct two synthetic views, based on the representation; and
display the synthetic views to a user, via the HMD;
the reconstructing and displaying being executed such that each of the synthetic views has a respective view origin corresponding to a respective virtual camera location, wherein the respective view origins are positioned such that the respective virtual camera locations coincide with respective locations of the user's left and right eyes, so as to provide the user with a substantially natural visual experience of the perspective, binocular stereo and occlusion aspects of the scene, substantially as if the user were directly viewing the scene without an HMD.
123. An image processing system for enabling the generation of an image data stream for use by a control system of an autonomous vehicle, the image processing system comprising:
at least one camera with a view of a scene around at least a portion of the vehicle; and
a digital processing resource comprising at least one digital processor;
the system being operable to:
capture images of the scene around at least a portion of the vehicle, using the at least one camera; execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
calculate corresponding depth information based on the disparity values; and
generate from the images and corresponding depth information an image data stream for use by the control system.
124. A video capture and processing method, comprising:
capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis; and
executing a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises:
constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising:
executing a Fast Dense Disparity Estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes.
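As an illustrative note, not part of the claims: the Python sketch below shows the multi-level voting idea, substituting a trivial per-pixel absolute-difference matcher for the FDDE pattern matcher recited in the claim; each pyramid level contributes one disparity vote per full-resolution pixel to a per-pixel histogram. All names and parameters are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 box filtering (a crude low-pass downsample)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w].astype(np.float64)
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def level_disparity(left, right, max_disp):
    """Per-pixel winner-take-all disparity along rectified scanlines (absolute-difference cost)."""
    h, w = left.shape
    best = np.zeros((h, w), dtype=np.int64)
    best_cost = np.full((h, w), np.inf)
    for d in range(max_disp + 1):
        cost = np.full((h, w), np.inf)
        cost[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        better = cost < best_cost
        best[better] = d
        best_cost[better] = cost[better]
    return best

def multilevel_disparity_histogram(left, right, levels=3, max_disp=32):
    """Accumulate per-pixel disparity votes from several downsampled levels.

    Returns an (H, W, max_disp + 1) histogram: histogram[y, x, d] counts the
    votes for disparity d (in full-resolution pixel units) at pixel (y, x).
    """
    h, w = left.shape
    hist = np.zeros((h, w, max_disp + 1), dtype=np.int32)
    l, r = left.astype(np.float64), right.astype(np.float64)
    for level in range(levels):
        scale = 2 ** level
        d_lvl = level_disparity(l, r, max(1, max_disp // scale))
        # Convert the coarse result back to full-resolution disparity units and
        # let each coarse pixel vote for all full-resolution pixels it covers.
        d_full = np.clip(d_lvl * scale, 0, max_disp)
        d_up = np.kron(d_full, np.ones((scale, scale), dtype=np.int64))[:h, :w]
        ys, xs = np.indices(d_up.shape)
        hist[ys, xs, d_up] += 1
        l, r = downsample(l), downsample(r)
    return hist
```

In practice the per-level votes would come from the FDDE matcher itself, and the resulting histogram would typically be weighted and filtered before a disparity is selected for each pixel.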
125. A video capture and processing method, comprising:
capturing images of a scene, the capturing comprising utilizing at least first and second cameras having a view of the scene, the cameras being arranged along an axis to configure a stereo camera pair; executing a feature correspondence function by detecting common features between
corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function further comprising:
generating a disparity solution based on the disparity values;
applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
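As an illustrative note, not part of the claims: a minimal Python sketch of checking the injective (one-to-one) constraint on a left-to-right disparity solution, flagging for later correction every left-image pixel whose claimed right-image pixel has already been claimed by a lower-cost match; the function and parameter names are hypothetical.

```python
import numpy as np

def injective_violations(disparity, cost):
    """Flag disparity entries that violate the injective constraint.

    disparity : HxW integer map; left pixel (y, x) claims right pixel (y, x - d).
    cost      : HxW matching cost; among competing claims, the lowest cost wins.
    Returns a boolean HxW mask of losing (to-be-corrected) entries.
    """
    h, w = disparity.shape
    violations = np.zeros((h, w), dtype=bool)
    for y in range(h):
        best = {}  # right-image x -> (cost, left-image x) of the current best claim
        for x in range(w):
            tx = x - int(disparity[y, x])
            if tx < 0:
                violations[y, x] = True  # claim falls outside the co-domain
                continue
            if tx not in best or cost[y, x] < best[tx][0]:
                if tx in best:
                    violations[y, best[tx][1]] = True  # previous winner loses
                best[tx] = (cost[y, x], x)
            else:
                violations[y, x] = True
    return violations
```

The flagged pixels could then be re-estimated, interpolated from neighbors, or assigned the next-best histogram candidate, depending on the correction strategy chosen.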
126. A video capture method that enables a first user to view a second user with direct virtual eye contact with the second user, the method comprising:
capturing images of the second user, the capturing comprising utilizing at least one camera having a view of the second user's face;
executing a feature correspondence function by detecting common features between
corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
generating a data representation, representative of the captured images and the corresponding disparity values;
estimating a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and
reconstructing a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein the location estimating comprises:
passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane;
utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head;
extracting from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculating therefrom a representative 3D head position; and
if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result.
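As an illustrative note, not part of the claims: a compact Python sketch of the final stages of the recited head-location estimate, assuming a per-pixel 3D point map (for example, derived from the disparity values) and a best-fit face rectangle from the 2D detector. Using the median as the representative 3D position and a fixed valid-fraction threshold are assumptions made for illustration, not requirements of the claim.

```python
import numpy as np

def estimate_head_position(points_3d, face_rect, min_valid_fraction=0.3):
    """Estimate a representative 3D head position from points inside a face rectangle.

    points_3d : HxWx3 array of per-pixel 3D points (NaN where no depth is available).
    face_rect : (x0, y0, x1, y1) best-fit rectangle from the 2D facial feature detector.
    Returns (position, is_valid): the median 3D point in the rectangle, and whether
    the fraction of valid points exceeds the selected threshold.
    """
    x0, y0, x1, y1 = face_rect
    region = points_3d[y0:y1, x0:x1].reshape(-1, 3)
    valid = ~np.isnan(region).any(axis=1)
    if region.shape[0] == 0 or valid.sum() / region.shape[0] < min_valid_fraction:
        return None, False  # too few valid 3D points: do not signal a valid result
    position = np.median(region[valid], axis=0)  # robust representative position
    return position, True
```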
127. A video capture and processing method comprising:
capturing images of a scene, the capturing comprising utilizing at least three cameras having a view of the scene, the cameras being arranged in a substantially "L"-shaped configuration wherein a first pair of cameras is disposed along a first axis and second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axis, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera; executing a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generating a data representation, representative of the captured images and the corresponding disparity values; and further comprising:
utilizing an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified.
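As an illustrative note, not part of the claims: the URUD idea is to undo the lens's polynomial distortion so that disparity data from the two stereo axes can be related in a common image space, while skipping stereo rectification so each camera keeps its own orientation. A simplified Python sketch with a radial-only (k1, k2) distortion model follows; the model and names are assumptions for illustration.

```python
import numpy as np

def undistort_points(pts, K, dist_coeffs, iterations=5):
    """Map distorted pixel coordinates into a URUD-style undistorted image space.

    Removes a polynomial radial distortion model (k1, k2) by fixed-point iteration,
    but applies no rectification, so the image keeps its original orientation.
    pts: Nx2 distorted pixel coordinates; K: 3x3 intrinsics; dist_coeffs: (k1, k2).
    """
    k1, k2 = dist_coeffs
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (pts[:, 0] - cx) / fx          # normalized (distorted) coordinates
    y = (pts[:, 1] - cy) / fy
    xu, yu = x.copy(), y.copy()
    for _ in range(iterations):        # invert x_d = x_u * (1 + k1*r^2 + k2*r^4)
        r2 = xu**2 + yu**2
        factor = 1.0 + k1 * r2 + k2 * r2**2
        xu, yu = x / factor, y / factor
    return np.stack([xu * fx + cx, yu * fy + cy], axis=1)
```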
128. A video capture and processing method comprising:
capturing images of a scene, the capturing comprising utilizing at least one camera having a view of the scene;
executing a feature correspondence function by detecting common features between
corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; and
generating a data representation, representative of the captured images and the corresponding disparity values;
wherein the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram -based method comprising:
constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and
optimizing generation of disparity values on a GPU computing structure, the optimizing comprising:
generating, in the GPU computing structure, a plurality of output pixel threads;
for each output pixel thread, maintaining a private disparity histogram in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
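As an illustrative note, not part of the claims: the Python sketch below mimics the per-output-pixel histogram voting described above. On a GPU, each output pixel would be handled by its own thread, which would keep its small histogram in registers or other storage physically close to the compute units; here each row of the hist array stands in for one thread's private histogram. All names are hypothetical.

```python
import numpy as np

def per_pixel_histogram_disparity(vote_maps, max_disp):
    """Winner-take-all disparity from per-pixel histograms of disparity votes.

    vote_maps : list of HxW integer disparity maps (e.g. votes from several
                levels or stereo axes). Each row of `hist` plays the role of
                one GPU thread's private histogram for its output pixel.
    """
    h, w = vote_maps[0].shape
    hist = np.zeros((h * w, max_disp + 1), dtype=np.int32)
    for votes in vote_maps:
        # Accumulate one vote per pixel into that pixel's private histogram.
        np.add.at(hist, (np.arange(h * w), np.clip(votes, 0, max_disp).ravel()), 1)
    return hist.argmax(axis=1).reshape(h, w)
```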
129. A program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the scene, utilizing the at least first and second cameras; and
execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises constructing a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising:
executing a Fast Dense Disparity Estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes.
130. A program product for use with a digital processing system, the digital processing system comprising at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the scene, utilizing the at least first and second cameras; and
execute a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises: generating a disparity solution based on the disparity values; and
applying an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
131. A program product for use with a digital processing system, for enabling a first user to view a second user with direct virtual eye contact with the second user, the digital processing system comprising at least one camera having a view of the second user's face, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the second user, utilizing the at least one camera;
execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
generate a data representation, representative of the captured images and the corresponding disparity values;
estimate a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and
reconstruct a synthetic view of the second user, based on the representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein the 3D location estimating comprises:
passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane;
utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head;
extracting from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculating therefrom a representative 3D head position; and
if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result.
132. A program product for use with a digital processing system, for enabling capture and processing of images of a scene, the digital processing system comprising (i) at least three cameras having a view of the scene, the cameras being arranged in a substantially "L"-shaped configuration wherein a first pair of cameras is disposed along a first axis and second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axis, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera, and (ii) a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to:
capture images of the scene, utilizing the at least three cameras;
execute a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generate a data representation, representative of the captured images and the corresponding disparity values; and
utilize an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified.
133. A program product for use with a digital processing system, for enabling image capture and processing, the digital processing system comprising at least one camera having a view of a scene, and a digital processing resource comprising at least one digital processor, the program product comprising digital processor-executable program instructions stored on a non-transitory digital processor-readable medium, which when executed in the digital processing resource cause the digital processing resource to: capture images of the scene, utilizing the at least one camera;
execute a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; and
generate a data representation, representative of the captured images and the corresponding disparity values;
wherein the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method comprising:
constructing a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and
optimizing generation of disparity values on a GPU computing structure, the optimizing comprising:
generating, in the GPU computing structure, a plurality of output pixel threads;
for each output pixel thread, maintaining a private disparity histogram in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
134. A video capture and processing system, the system comprising:
at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair having a camera pair axis; and
a digital processor operable to receive image data from the cameras and process the received image data;
the system being operable to:
capture images of the scene, utilizing the at least first and second cameras; and
execute, utilizing the processor, a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, wherein the feature correspondence function comprises:
constructing, utilizing the processor, a multi-level disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel, the constructing of a multi-level disparity histogram comprising:
executing, utilizing the processor, a Fast Dense Disparity Estimate (FDDE) image pattern matching operation on successively lower-frequency downsampled versions of the input stereo images, the successively lower-frequency downsampled versions constituting a set of levels of FDDE histogram votes.
135. A video capture and processing system, the system comprising:
at least first and second cameras having a view of a scene, the cameras being arranged along an axis to configure a stereo camera pair; and a digital processor operable to receive image data from the cameras and process the received image data;
the system being operable to:
capture images of the scene, utilizing the at least first and second cameras;
execute, utilizing the processor, a feature correspondence function by detecting common features between corresponding images captured by the respective cameras and measuring a relative distance in image space between the common features, to generate disparity values, the feature correspondence function further comprising:
generating, utilizing the processor, a disparity solution based on the disparity values;
applying, utilizing the processor, an injective constraint to the disparity solution based on domain and co-domain, wherein the domain comprises pixels for a given image captured by the first camera and the co-domain comprises pixels for a corresponding image captured by the second camera, to enable correction of error in the disparity solution in response to violation of the injective constraint, wherein the injective constraint is that no element in the co-domain is referenced more than once by elements in the domain.
136. A video capture system that enables a first user to view a second user with direct virtual eye contact with the second user, the system comprising:
at least one camera having a view of the second user's face; and
a digital processor operable to receive image data from the at least one camera and process the received image data;
the system being operable to:
capture images of the second user, utilizing the at least one camera;
execute, utilizing the processor, a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values;
generate, utilizing the processor, a data representation, representative of the captured images and the corresponding disparity values;
estimate, utilizing the processor, a three-dimensional (3D) location of the first user's head, face or eyes, thereby generating tracking information; and
reconstruct, utilizing the processor, a synthetic view of the second user, based on the
representation, to enable a display to the first user of a synthetic view of the second user in which the second user appears to be gazing directly at the first user, wherein the reconstructing of a synthetic view of the second user comprises reconstructing the synthetic view based on the generated data representation and the generated tracking information; and wherein the location estimating comprises:
passing a captured image of the first user, the captured image including the first user's head and face, to a two-dimensional (2D) facial feature detector that utilizes the image to generate a first estimate of head and eye location and a rotation angle of the face relative to an image plane; utilizing an estimated center-of-face position, face rotation angle, and head depth range based on the first estimate, to determine a best-fit rectangle that includes the head;
extracting from the best-fit rectangle all 3D points that lie within the best-fit rectangle, and calculating therefrom a representative 3D head position; and
if the number of valid 3D points extracted from the best-fit rectangle exceeds a selected threshold in relation to the maximum number of possible 3D points in the region, then signaling a valid 3D head position result.
137. A video capture and processing system, the system comprising:
at least three cameras having a view of a scene, the cameras being arranged in a substantially "L"-shaped configuration wherein a first pair of cameras is disposed along a first axis and second pair of cameras is disposed along a second axis intersecting with, but angularly displaced from, the first axis, wherein the first and second pairs of cameras share a common camera at or near the intersection of the first and second axis, so that the first and second pairs of cameras represent respective first and second independent stereo axes that share a common camera; and
a digital processor operable to receive image data from the at least three cameras and process the received image data;
the system being operable to:
capture images of the scene, utilizing the at least three cameras;
execute, utilizing the processor, a feature correspondence function by detecting common features between corresponding images captured by the at least three cameras and measuring a relative distance in image space between the common features, to generate disparity values;
generate, utilizing the processor, a data representation, representative of the captured images and the corresponding disparity values; and further comprising:
utilization, by the processor, of an unrectified, undistorted (URUD) image space to integrate disparity data for pixels between the first and second stereo axes, thereby to combine disparity data from the first and second axes, wherein the URUD space is an image space in which polynomial lens distortion has been removed from the image data but the captured image remains unrectified.
138. A video capture and processing system, the system comprising:
at least one camera having a view of a scene; and
a digital processor operable to receive image data from the at least one camera and process the received image data;
the system being operable to:
capture images of the scene, utilizing the at least one camera;
execute, utilizing the processor, a feature correspondence function by detecting common features between corresponding images captured by the at least one camera and measuring a relative distance in image space between the common features, to generate disparity values; and
generate, utilizing the processor, a data representation, representative of the captured images and the corresponding disparity values; wherein the feature correspondence function utilizes a disparity histogram-based method of integrating data and determining correspondence, the disparity histogram-based method comprising:
constructing, utilizing the processor, a disparity histogram indicating the relative probability of a given disparity value being correct for a given pixel; and
optimizing generation of disparity values on a GPU computing structure, the optimizing comprising:
generating, in the GPU computing structure, a plurality of output pixel threads;
for each output pixel thread, maintaining a private disparity histogram, in a storage element associated with the GPU computing structure and physically proximate to the computation units of the GPU computing structure.
PCT/US2016/032213 2015-03-21 2016-05-12 Facial signature methods, systems and software WO2016183380A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US15/573,475 US10853625B2 (en) 2015-03-21 2016-05-12 Facial signature methods, systems and software
EP16793565.9A EP3295372A4 (en) 2015-05-12 2016-05-12 Facial signature methods, systems and software
US17/107,413 US20210192188A1 (en) 2015-03-21 2020-11-30 Facial Signature Methods, Systems and Software

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201562160563P 2015-05-12 2015-05-12
US62/160,563 2015-05-12
PCT/US2016/023433 WO2016154123A2 (en) 2015-03-21 2016-03-21 Virtual 3d methods, systems and software
USPCT/US2016/023433 2016-03-21

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/023433 Continuation-In-Part WO2016154123A2 (en) 2015-03-21 2016-03-21 Virtual 3d methods, systems and software

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US15/573,475 A-371-Of-International US10853625B2 (en) 2015-03-21 2016-05-12 Facial signature methods, systems and software
US17/107,413 Continuation US20210192188A1 (en) 2015-03-21 2020-11-30 Facial Signature Methods, Systems and Software

Publications (1)

Publication Number Publication Date
WO2016183380A1 true WO2016183380A1 (en) 2016-11-17

Family

ID=57248607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/032213 WO2016183380A1 (en) 2015-03-21 2016-05-12 Facial signature methods, systems and software

Country Status (2)

Country Link
EP (1) EP3295372A4 (en)
WO (1) WO2016183380A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3289971A1 (en) * 2016-08-29 2018-03-07 Panasonic Intellectual Property Management Co., Ltd. Biometric device and biometric method
CN109299729A (en) * 2018-08-24 2019-02-01 四川大学 Vehicle checking method and device
CN109492571A (en) * 2018-11-02 2019-03-19 北京地平线机器人技术研发有限公司 Identify the method, apparatus and electronic equipment at human body age
WO2019180538A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Remote user identity validation with threshold-based matching
US10762663B2 (en) 2017-05-16 2020-09-01 Nokia Technologies Oy Apparatus, a method and a computer program for video coding and decoding
US10892901B1 (en) 2019-07-05 2021-01-12 Advanced New Technologies Co., Ltd. Facial data collection and verification
WO2021004055A1 (en) * 2019-07-05 2021-01-14 创新先进技术有限公司 Method, device and system for face data acquisition and verification
US10997736B2 (en) 2018-08-10 2021-05-04 Apple Inc. Circuit for performing normalized cross correlation
US11227405B2 (en) * 2017-06-21 2022-01-18 Apera Ai Inc. Determining positions and orientations of objects
CN114937271A (en) * 2022-05-11 2022-08-23 中维建通信技术服务有限公司 Intelligent communication data input and correction method
US11960639B2 (en) 2021-08-29 2024-04-16 Mine One Gmbh Virtual 3D methods, systems and software

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040240711A1 (en) * 2003-05-27 2004-12-02 Honeywell International Inc. Face identification verification using 3 dimensional modeling
US20050063566A1 (en) * 2001-10-17 2005-03-24 Beek Gary A . Van Face imaging system for recordal and automated identity confirmation
US20070183653A1 (en) * 2006-01-31 2007-08-09 Gerard Medioni 3D Face Reconstruction from 2D Images
US20120121142A1 (en) * 2009-06-09 2012-05-17 Pradeep Nagesh Ultra-low dimensional representation for face recognition under varying expressions
US20130021490A1 (en) * 2011-07-20 2013-01-24 Broadcom Corporation Facial Image Processing in an Image Capture Device
US20130070116A1 (en) * 2011-09-20 2013-03-21 Sony Corporation Image processing device, method of controlling image processing device and program causing computer to execute the method
US20140050372A1 (en) * 2012-08-15 2014-02-20 Qualcomm Incorporated Method and apparatus for facial recognition
US20150066764A1 (en) * 2013-09-05 2015-03-05 International Business Machines Corporation Multi factor authentication rule-based intelligent bank cards

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7242807B2 (en) * 2003-05-05 2007-07-10 Fish & Richardson P.C. Imaging of biometric information based on three-dimensional shapes

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050063566A1 (en) * 2001-10-17 2005-03-24 Beek Gary A . Van Face imaging system for recordal and automated identity confirmation
US20040240711A1 (en) * 2003-05-27 2004-12-02 Honeywell International Inc. Face identification verification using 3 dimensional modeling
US20070183653A1 (en) * 2006-01-31 2007-08-09 Gerard Medioni 3D Face Reconstruction from 2D Images
US20120121142A1 (en) * 2009-06-09 2012-05-17 Pradeep Nagesh Ultra-low dimensional representation for face recognition under varying expressions
US20130021490A1 (en) * 2011-07-20 2013-01-24 Broadcom Corporation Facial Image Processing in an Image Capture Device
US20130070116A1 (en) * 2011-09-20 2013-03-21 Sony Corporation Image processing device, method of controlling image processing device and program causing computer to execute the method
US20140050372A1 (en) * 2012-08-15 2014-02-20 Qualcomm Incorporated Method and apparatus for facial recognition
US20150066764A1 (en) * 2013-09-05 2015-03-05 International Business Machines Corporation Multi factor authentication rule-based intelligent bank cards

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3295372A4 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10561358B2 (en) 2016-08-29 2020-02-18 Panasonic Intellectual Property Management Co., Ltd. Biometric device and biometric method
EP3289971A1 (en) * 2016-08-29 2018-03-07 Panasonic Intellectual Property Management Co., Ltd. Biometric device and biometric method
US10413229B2 (en) 2016-08-29 2019-09-17 Panasonic Intellectual Property Management Co., Ltd. Biometric device and biometric method
US10762663B2 (en) 2017-05-16 2020-09-01 Nokia Technologies Oy Apparatus, a method and a computer program for video coding and decoding
US11227405B2 (en) * 2017-06-21 2022-01-18 Apera Ai Inc. Determining positions and orientations of objects
GB2585168A (en) * 2018-03-23 2020-12-30 Ibm Remote user identity validation with threshold-based matching
WO2019180538A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Remote user identity validation with threshold-based matching
US10839238B2 (en) 2018-03-23 2020-11-17 International Business Machines Corporation Remote user identity validation with threshold-based matching
GB2585168B (en) * 2018-03-23 2021-07-14 Ibm Remote user identity validation with threshold-based matching
US10997736B2 (en) 2018-08-10 2021-05-04 Apple Inc. Circuit for performing normalized cross correlation
CN109299729B (en) * 2018-08-24 2021-02-23 四川大学 Vehicle detection method and device
CN109299729A (en) * 2018-08-24 2019-02-01 四川大学 Vehicle checking method and device
CN109492571A (en) * 2018-11-02 2019-03-19 北京地平线机器人技术研发有限公司 Identify the method, apparatus and electronic equipment at human body age
US10892901B1 (en) 2019-07-05 2021-01-12 Advanced New Technologies Co., Ltd. Facial data collection and verification
WO2021004055A1 (en) * 2019-07-05 2021-01-14 创新先进技术有限公司 Method, device and system for face data acquisition and verification
US11960639B2 (en) 2021-08-29 2024-04-16 Mine One Gmbh Virtual 3D methods, systems and software
CN114937271A (en) * 2022-05-11 2022-08-23 中维建通信技术服务有限公司 Intelligent communication data input and correction method
CN114937271B (en) * 2022-05-11 2023-04-18 中维建通信技术服务有限公司 Intelligent communication data input and correction method

Also Published As

Publication number Publication date
EP3295372A4 (en) 2019-06-12
EP3295372A1 (en) 2018-03-21

Similar Documents

Publication Publication Date Title
US11106275B2 (en) Virtual 3D methods, systems and software
US20210192188A1 (en) Facial Signature Methods, Systems and Software
EP3295372A1 (en) Facial signature methods, systems and software
KR101994121B1 (en) Create efficient canvas views from intermediate views
US11755956B2 (en) Method, storage medium and apparatus for converting 2D picture set to 3D model
CN109615703B (en) Augmented reality image display method, device and equipment
US10158939B2 (en) Sound Source association
US20190213778A1 (en) Fusing, texturing, and rendering views of dynamic three-dimensional models
US9684953B2 (en) Method and system for image processing in video conferencing
US9286718B2 (en) Method using 3D geometry data for virtual reality image presentation and control in 3D space
US9813693B1 (en) Accounting for perspective effects in images
US20150379720A1 (en) Methods for converting two-dimensional images into three-dimensional images
US9380263B2 (en) Systems and methods for real-time view-synthesis in a multi-camera setup
US10453244B2 (en) Multi-layer UV map based texture rendering for free-running FVV applications
US20180310025A1 (en) Method and technical equipment for encoding media content
JPH07296185A (en) Three-dimensional image display device
Schenkel et al. Natural scenes datasets for exploration in 6DOF navigation
CN114358112A (en) Video fusion method, computer program product, client and storage medium
US20230122149A1 (en) Asymmetric communication system with viewer position indications
US20230152883A1 (en) Scene processing for holographic displays
TW201528208A (en) Image mastering systems and methods
WO2023093279A1 (en) Image processing method and apparatus, and device, storage medium and computer program product
US11960639B2 (en) Virtual 3D methods, systems and software
US11189047B2 (en) Gaze based rendering for audience engagement
CN116685905A (en) Multi-view video display method, device, display apparatus, medium, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16793565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016793565

Country of ref document: EP