uDepth: Real-time 3D Depth Sensing on the Pixel 4

Posted by Michael Schoenberg, uDepth Software Lead and Adarsh Kowdle, uDepth Hardware/Systems Lead, Google Research

The ability to determine 3D information about the scene, called depth sensing, is a valuable tool for developers and users alike. Depth sensing is a very active area of computer vision research, with recent work ranging from applications like portrait mode and AR to fundamental sensing advances such as transparent object detection. Typical RGB-based stereo depth sensing techniques can be computationally expensive, suffer in regions with low texture, and fail completely in extreme low light conditions.

Because the face unlock feature on Pixel 4 must work at high speed and in darkness, it called for a different approach. To this end, the front of the Pixel 4 contains a real-time infrared (IR) active stereo depth sensor, called uDepth. A key computer vision capability on the Pixel 4, this technology helps the authentication system identify the user while also protecting against spoof attacks. It also supports a number of novel capabilities, such as after-the-fact photo retouching, depth-based segmentation of a scene, background blur, portrait effects and 3D photos.


Recently, we provided access to uDepth as an API on Camera2, using the Pixel Neural Core, two IR cameras, and an IR pattern projector to provide time-synchronized depth frames (in DEPTH16) at 30Hz. The Google Camera App uses this API to bring improved depth capabilities to selfies taken on the Pixel 4. In this post, we explain broadly how uDepth works, elaborate on the underlying algorithms, and discuss applications with example results for the Pixel 4.
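For developers consuming these frames, each DEPTH16 sample packs a range measurement and a confidence estimate into 16 bits. The snippet below is a minimal numpy sketch based on the layout documented for Android's ImageFormat.DEPTH16 (low 13 bits: range in millimeters; high 3 bits: confidence code); it is illustrative and not part of the uDepth API itself.

```python
import numpy as np

def decode_depth16(samples: np.ndarray):
    """Unpack Android DEPTH16 samples (uint16) into range and confidence.

    Per the ImageFormat.DEPTH16 documentation, the low 13 bits hold the
    range in millimeters and the high 3 bits hold a confidence code, where
    0 means 100% confidence and codes 1..7 map to (code - 1) / 7.
    """
    samples = samples.astype(np.uint16)
    range_mm = samples & 0x1FFF                 # low 13 bits: range in millimeters
    conf_code = ((samples >> 13) & 0x7).astype(np.float32)  # high 3 bits
    confidence = np.where(conf_code == 0, 1.0, (conf_code - 1.0) / 7.0)
    return range_mm, confidence

# Example: one sample at 1250 mm with confidence code 7 (~86% confidence).
sample = np.array([(7 << 13) | 1250], dtype=np.uint16)
print(decode_depth16(sample))
```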

Overview of Stereo Depth Sensing
All stereo camera systems reconstruct depth using parallax. To observe this effect, look at an object, close one eye, then switch which eye is closed. The apparent position of the object will shift, with closer objects appearing to move more. uDepth is part of the family of dense local stereo matching techniques, which estimate parallax computationally for each pixel. These techniques evaluate a region surrounding each pixel in the image formed by one camera, and try to find a similar region in the corresponding image from the second camera. When calibrated properly, the reconstructions generated are metric, meaning that they express real physical distances.
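As a concrete illustration of why calibrated stereo yields metric output, the sketch below applies the standard pinhole relation between disparity and depth, Z = f * B / d, where f is the focal length in pixels and B the camera baseline. The numbers are placeholders chosen for illustration and do not reflect the Pixel 4's actual calibration.

```python
def disparity_to_depth_m(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert stereo disparity (pixels) to metric depth (meters) via Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # no measurable shift: point is effectively at infinity
    return focal_px * baseline_m / disparity_px

# Placeholder values: f = 800 px, baseline = 2 cm, disparity = 20 px  ->  Z = 0.8 m.
print(disparity_to_depth_m(20.0, 800.0, 0.02))
```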

[Figure: Pixel 4 front sensor setup, an example of an active stereo system.]

To deal with textureless regions and cope with low-light conditions, we make use of an “active stereo” setup, which projects an IR pattern into the scene that is detected by stereo IR cameras. This approach makes low-texture regions easier to identify, improving results and reducing the computational requirements of the system.

What Makes uDepth Distinct?
Stereo sensing systems can be extremely computationally intensive, and it’s critical that a sensor running at 30Hz is low power while remaining high quality. uDepth leverages a number of key insights to accomplish this.


One such insight is that given a pair of regions that are similar to each other, most corresponding subsets of those regions are also similar. For example, given two 8x8 patches of pixels that are similar, it is very likely that the top-left 4x4 sub-region of each member of the pair is also similar. This informs the uDepth pipeline’s initialization procedure, which builds a pyramid of depth proposals by comparing non-overlapping tiles in each image and selecting those that are most similar. This process starts with 1x1 tiles and accumulates support hierarchically until an initial low-resolution depth map is generated.
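To make the idea concrete, here is a deliberately simplified, single-level version of tile-based initialization in numpy: each non-overlapping tile of the left IR image is compared against tile-sized patches along the same rows of the right image, and the most similar one defines a coarse disparity proposal. The real pipeline builds a pyramid of such proposals and accumulates support hierarchically; none of the names or parameters below come from the actual implementation.

```python
import numpy as np

def coarse_tile_disparities(left, right, tile=8, max_disp=64):
    """Single-level sketch of tile-based stereo initialization.

    For each non-overlapping tile in the left image, search along the same
    rows of the right image for the tile-sized patch with the lowest sum of
    absolute differences, and record that offset as the tile's coarse
    disparity proposal.
    """
    h, w = left.shape
    disp = np.zeros((h // tile, w // tile), dtype=np.int32)
    for ty in range(h // tile):
        for tx in range(w // tile):
            y0, x0 = ty * tile, tx * tile
            ref = left[y0:y0 + tile, x0:x0 + tile].astype(np.int32)
            best_cost, best_d = None, 0
            for d in range(0, min(max_disp, x0) + 1):
                cand = right[y0:y0 + tile, x0 - d:x0 - d + tile].astype(np.int32)
                cost = np.abs(ref - cand).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[ty, tx] = best_d
    return disp
```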

After initialization, we apply a novel technique for neural depth refinement to support the regular grid pattern illuminator on the Pixel 4. Typical active stereo systems project a pseudo-random grid pattern to help disambiguate matches in the scene, but uDepth is capable of supporting repeating grid patterns as well. Repeating structure in such patterns produces regions that look similar across stereo pairs, which can lead to incorrect matches. We mitigate this issue with a lightweight (75k parameter) convolutional architecture that uses IR brightness and neighbor information to adjust incorrect matches, in less than 1.5 ms per frame.

[Figure: Neural depth refinement architecture.]
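The exact refinement architecture is not spelled out here beyond its rough size (about 75k parameters), its inputs (IR brightness and neighboring depth information), and its runtime budget, so the PyTorch snippet below is only a hypothetical stand-in showing the general shape of such a model: a stack of small convolutions predicting a per-pixel correction to the initial depth.

```python
import torch
import torch.nn as nn

class TinyDepthRefiner(nn.Module):
    """Hypothetical stand-in for a lightweight depth refinement network.

    Takes an IR brightness image and an initial depth map and predicts a
    residual correction. The layer sizes here are arbitrary and end up
    well under the ~75k parameters cited for the real network.
    """
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, ir: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([ir, depth], dim=1)  # N x 2 x H x W
        return depth + self.net(x)         # refined depth = initial depth + residual

model = TinyDepthRefiner()
print(sum(p.numel() for p in model.parameters()))  # parameter count of this sketch
```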

Following neural depth refinement, good depth estimates are iteratively propagated from neighboring tiles. This and subsequent pipeline steps leverage another insight key to the success of uDepth: natural scenes are typically locally planar, with only small nonplanar deviations. This permits us to find planar tiles that cover the scene, and only later refine individual depths for each pixel in a tile, greatly reducing computational load.
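A minimal sketch of the planar idea: a tile's disparity is modeled as a slanted plane d(x, y) = a*x + b*y + c, so the whole tile can be scored with just three parameters and per-pixel depths refined only afterwards. The cost function and out-of-bounds penalty below are illustrative choices, not the ones used by uDepth.

```python
import numpy as np

def plane_cost(left, right, tile_box, plane):
    """Matching cost of a slanted-plane hypothesis for one tile.

    The plane (a, b, c) models disparity as d(x, y) = a*x + b*y + c over
    the tile; the cost is the sum of absolute brightness differences
    between each left pixel and the right pixel it maps to.
    """
    a, b, c = plane
    y0, y1, x0, x1 = tile_box
    cost = 0.0
    for y in range(y0, y1):
        for x in range(x0, x1):
            d = a * x + b * y + c
            xr = int(round(x - d))
            if 0 <= xr < right.shape[1]:
                cost += abs(float(left[y, x]) - float(right[y, xr]))
            else:
                cost += 255.0  # penalize hypotheses that leave the image
    return cost
```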

Finally, the best match from among neighboring plane hypotheses is selected, with subpixel refinement and invalidation if no good match could be found.
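The post does not detail the subpixel step, but a common approach, shown here only as an example, fits a parabola through the matching costs at the winning disparity and its two neighbors and invalidates the result when no well-defined minimum exists.

```python
def subpixel_disparity(d, c_minus, c0, c_plus):
    """Parabolic sub-pixel refinement around an integer disparity winner d.

    Fits a parabola through the costs at d-1, d, d+1 and returns the
    disparity at its minimum, or None (invalidation) when the costs do not
    form a reliable minimum. A generic technique, not uDepth's exact rule.
    """
    denom = c_minus - 2.0 * c0 + c_plus
    if denom <= 0:                      # flat or ambiguous cost curve
        return None
    return d + 0.5 * (c_minus - c_plus) / denom
```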

[Figure: Simplified depth architecture. Green components run on the GPU, yellow on the CPU, and blue on the Pixel Neural Core.]

When a phone experiences a severe drop, it can result in the factory calibration of the stereo cameras diverging from the actual position of the cameras. To ensure high-quality results during real-world use, the uDepth system is self-calibrating. A scoring routine evaluates every depth image for signs of miscalibration, and builds up confidence in the state of the device. If miscalibration is detected, calibration parameters are regenerated from the current scene. This follows a pipeline consisting of feature detection and correspondence, subpixel refinement (taking advantage of the dot profile), and bundle adjustment.

[Figure: Left: Stereo depth with inaccurate calibration. Right: After autocalibration.]
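As a toy illustration of the scoring idea (the actual routine is not published): in a well-rectified stereo pair, corresponding features lie on the same image row, so persistent vertical offsets between matched features are a sign of miscalibration that can be accumulated over frames before triggering the recalibration pipeline. All names and thresholds below are assumptions made for illustration.

```python
import numpy as np

def calibration_score(matches_left, matches_right, pixel_tol=0.5):
    """Fraction of feature matches whose vertical offset exceeds pixel_tol.

    matches_left and matches_right are N x 2 arrays of (x, y) pixel
    coordinates of corresponding features in the rectified images.
    """
    dy = np.abs(matches_left[:, 1] - matches_right[:, 1])
    return float(np.mean(dy > pixel_tol))

class CalibrationMonitor:
    """Accumulates per-frame scores and flags persistent miscalibration."""
    def __init__(self, threshold=0.2, frames_needed=30):
        self.threshold = threshold
        self.frames_needed = frames_needed
        self.bad_frames = 0

    def update(self, score):
        self.bad_frames = self.bad_frames + 1 if score > self.threshold else 0
        return self.bad_frames >= self.frames_needed  # True => recalibrate
```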

For more details, please refer to Slanted O(1) Stereo, upon which uDepth is based.

Depth for Computational Photography
The raw data from the uDepth sensor is designed to be accurate and metric, which is a fundamental requirement for face unlock. Computational photography applications such as portrait mode and 3D photos have very different needs. In these use cases, it is not critical to achieve video frame rates, but the depth should be smooth, edge-aligned and complete in the whole field-of-view of the color camera.

[Figure: Left to right: raw depth sensing result, predicted depth, 3D photo. Notice the smooth rotation of the wall, demonstrating a continuous depth gradient rather than a single focal plane.]

To achieve this, we trained an end-to-end deep learning architecture that enhances the raw uDepth data, inferring a complete, dense 3D depth map. We use a combination of RGB images, people segmentation, and raw depth, with a dropout scheme that forces the network to make use of information from each of the inputs.

[Figure: Architecture for computational photography depth enhancement.]
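The post only states that a dropout scheme forces the network to use every input; the snippet below sketches one plausible way to do that during training, randomly zeroing the segmentation or raw-depth branch for a sample so the model cannot lean on a single cue. The probabilities and masking choices are assumptions, not the published recipe.

```python
import torch

def drop_inputs(rgb, segmentation, raw_depth, p_drop=0.2):
    """Illustrative input-dropout for training the enhancement network.

    With probability p_drop, independently zero the segmentation and
    raw-depth inputs of a training sample; only these two auxiliary
    branches are masked in this sketch.
    """
    if torch.rand(1).item() < p_drop:
        segmentation = torch.zeros_like(segmentation)
    if torch.rand(1).item() < p_drop:
        raw_depth = torch.zeros_like(raw_depth)
    return rgb, segmentation, raw_depth
```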

To acquire ground truth, we leveraged a volumetric capture system that can produce near-photorealistic models of people using a geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. We added Pixel 4 phones to the setup and synchronized them with the rest of the hardware (lights and cameras). The generated training data consists of a combination of real images as well as synthetic renderings from the Pixel 4 camera viewpoint.

[Figure: Data acquisition overview.]

Putting It All Together
With all of these components in place, uDepth produces both a depth stream at 30Hz (exposed via Camera2), and smooth, post-processed depth maps for photography (exposed via Google Camera App when you take a depth-enabled selfie). The smooth, dense, per-pixel depth that our system produces is available on every Pixel 4 selfie with Social Media Depth features enabled, and can be used for post-capture effects such as bokeh and 3D photos for social media.

[Figure: Example applications. Notice the multiple focal planes in the 3D photo on the right.]

Finally, we are happy to provide a demo application for you to play with that visualizes a real-time point cloud from uDepth — download it here (this app is for demonstration and research purposes only and not intended for commercial use; Google will not provide any support or updates). This demo app visualizes 3D point clouds from your Pixel 4 device. Because the depth maps are time-synchronized and in the same coordinate system as the RGB images, a textured view of the 3D scene can be shown, as in the example visualization below:

[Figure: Example single-frame, RGB point cloud from uDepth on the Pixel 4.]
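Because each depth frame is time-synchronized and shares a coordinate system with the RGB image, lifting it to a colored point cloud is a straightforward pinhole back-projection, sketched below. The intrinsics (fx, fy, cx, cy) are placeholders that would normally come from the device calibration; this is not the demo app's actual code.

```python
import numpy as np

def depth_to_point_cloud(depth_m, rgb, fx, fy, cx, cy):
    """Back-project a metric depth map into a colored 3D point cloud.

    Each valid depth pixel (u, v) with depth Z becomes the 3D point
    X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z, colored by the RGB
    pixel at the same location.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    valid = z > 0                       # drop invalidated (zero-depth) pixels
    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid]
    return points, colors
```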

Acknowledgements
This work would not have been possible without the contributions of many, many people, including but not limited to Peter Barnum, Cheng Wang, Matthias Kramm, Jack Arendt, Scott Chung, Vaibhav Gupta, Clayton Kimber, Jeremy Swerdlow, Vladimir Tankovich, Christian Haene, Yinda Zhang, Sergio Orts Escolano, Sean Ryan Fanello, Anton Mikhailov, Philippe Bouchilloux, Mirko Schmidt, Ruofei Du, Karen Zhu, Charlie Wang, Jonathan Taylor, Katrina Passarella, Eric Meisner, Vitalii Dziuba, Ed Chang, Phil Davidson, Rohit Pandey, Pavel Podlipensky, David Kim, Jay Busch, Cynthia Socorro Herrera, Matt Whalen, Peter Lincoln, Geoff Harvey, Christoph Rhemann, Zhijie Deng, Daniel Finchelstein, Jing Pu, Chih-Chung Chang, Eddy Hsu, Tian-yi Lin, Sam Chang, Isaac Christensen, Donghui Han, Speth Chang, Zhijun He, Gabriel Nava, Jana Ehmann, Yichang Shih, Chia-Kai Liang, Isaac Reynolds, Dillon Sharlet, Steven Johnson, Zalman Stern, Jiawen Chen, Ricardo Martin Brualla, Supreeth Achar, Mike Mehlman, Brandon Barbello, Chris Breithaupt, Michael Rosenfield, Gopal Parupudi, Steve Goldberg, Tim Knight, Raj Singh, Shahram Izadi, as well as many other colleagues across Devices and Services, Google Research, Android and X.

