Stereoscopic Vision for the Blind or Visually Impaired

This page discusses how The vOICe technology supports binocular vision with suitable stereoscopic camera hardware. In orientation and mobility applications for blind users, the orientation component is supported well by the standard single camera setup, but the mobility component can benefit from better depth and distance perception in order to detect nearby objects and obstacles. Binocular vision makes this possible, and will thus further enhance the applicability and versatility of The vOICe as an ETA (Electronic Travel Aid) for the blind in addition to its general vision substitution and synthetic vision features that realize augmented reality for the blind.

In designing an electronic travel aid (ETA) for the blind, a key advantage of a sonar device over a camera used to be that sonar makes it relatively easy to measure distances to nearby obstacles, allowing a warning signal to be generated when there is a collision threat. With a single camera this turns out to be extremely hard, because the distance information in a static camera view is essentially ambiguous: deriving distances from recognized objects requires considerable a priori knowledge about the physical world. This is what people blinded in one eye do all the time, without apparent effort, but for machine vision the required recognition of objects remains a daunting task. One partial solution is to derive size and distance information from video sequences of a single camera while moving around (people blind in one eye do this as well), but a more powerful and reliable method is to use binocular vision, also called stereoscopic vision or stereopsis. By comparing the slight differences between images obtained from two different simultaneous viewpoints, the distances to nearby objects and obstacles can be estimated. This approach also works with static scenes. Moreover, once distances and apparent (angular) sizes in the camera view are known, the user can deduce the actual sizes of nearby objects without first having to recognize them.

The vOICe binocular processing uses so-called anaglyphic video input. An anaglyph is an image created by combining two viewpoints through different color filters. For instance, the left-eye view may be taken through a red filter while the right-eye view is taken through a green or cyan filter. The two differently colored views are then overlaid on top of each other.
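The overlaying step can be sketched in a few lines of Python. This is only an illustration, not The vOICe's actual implementation; the function name and the use of NumPy arrays are assumptions. It places the left-eye greyscale view in the red channel and the right-eye view in the green and blue (cyan) channels:

```python
import numpy as np

def make_anaglyph(left_gray, right_gray):
    """Combine two greyscale eye views into a red/cyan anaglyph.

    left_gray, right_gray: 2-D uint8 arrays of identical shape,
    the left-eye and right-eye camera images.
    """
    h, w = left_gray.shape
    anaglyph = np.zeros((h, w, 3), dtype=np.uint8)
    anaglyph[..., 0] = left_gray    # red channel   <- left-eye view
    anaglyph[..., 1] = right_gray   # green channel <- right-eye view
    anaglyph[..., 2] = right_gray   # blue channel  <- right-eye view (G+B = cyan)
    return anaglyph
```

Starting from greyscale views, as here, matters for the later disparity analysis: each color channel then carries the full brightness of one eye's view.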
Sighted viewers can then again see the image in 3D by looking at the anaglyph through red-green glasses: the red filter in front of the left eye blocks the green or cyan component and transmits only the left-eye red image, while the green filter in front of the right eye transmits only the right-eye image. The human brain subsequently combines the slightly different views into one perceived three-dimensional view. Distance information is then apparent from something called "disparity": the small visual displacements of the red and green/cyan color components for nearby objects. Whereas the left-eye and right-eye views may coincide perfectly for far-away objects, the mismatch for a nearby object tells how close this object is. This is the principle upon which The vOICe's 3D stereo vision support is based. Instead of the human brain, The vOICe now analyses the different color components for visual displacements and from that derives a distance or depth map. By default, The vOICe stereo vision option maps distance to brightness, but since this comes at the expense of visual information at large distances (important for orientation) and at the expense of surface textures, other options are available for experienced users to map distance information to spatialized sound without losing the other valuable visual information. However, distinguishing foreground from background will then again be harder than with the default stereo vision distance mapping mode.
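As an illustration of how a depth map can be derived from disparity, here is a deliberately crude block-matching sketch in Python. The function name and parameters are assumptions, and The vOICe's actual matching algorithm is not published. It compares the left-eye and right-eye greyscale views (the red and green channels of the anaglyph) at a range of horizontal shifts and maps the best-matching shift, the disparity, to brightness, so that closer means brighter:

```python
import numpy as np

def disparity_map(left_gray, right_gray, max_disp=16, block=8):
    """Crude block-matching disparity: brighter output = closer object.

    For each block of the right-eye view, find the horizontal shift of the
    left-eye view that minimizes the sum of absolute differences (SAD).
    """
    h, w = left_gray.shape
    L = left_gray.astype(np.int32)
    R = right_gray.astype(np.int32)
    out = np.zeros((h // block, w // block), dtype=np.uint8)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            patch = R[y0:y0 + block, x0:x0 + block]
            best_d, best_cost = 0, None
            for d in range(max_disp + 1):
                if x0 + d + block > w:
                    break  # candidate window would fall outside the image
                cand = L[y0:y0 + block, x0 + d:x0 + d + block]
                cost = int(np.abs(patch - cand).sum())  # SAD matching cost
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            out[by, bx] = best_d * 255 // max_disp  # disparity -> brightness
    return out
```

Real implementations add texture checks and sub-pixel refinement; this sketch only shows the core idea that nearby objects yield larger left-right displacements.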
The following example illustrates how a binocular view with stereo images from a visually highly cluttered scene is processed by The vOICe into a distance map where brighter (louder) means closer. The nearby tree trunk clearly stands out in the resulting distance map, as do some parts of the parked car right behind the tree, whereas the clutter of the visually complex distant background is completely suppressed. The sky and the distant houses, other parked cars and trees are rendered invisible in favour of nearby objects and obstacles. The 18K MP3 audio sample gives the corresponding soundscape for the extracted distance map. Registered users can load the anaglyph image into The vOICe Learning Edition after switching to the Stereoscopic View mode via the menu Options | 3D to get a distance map derived in real-time from the disparities in this anaglyph view. When experimenting with other anaglyphs, note that the color filters may have to be different and, more importantly, that only anaglyphs derived from greyscale views will give good results, due to the required strict separation of the left and right views.
Various stereoscopic vision options can be set via the menu Edit | Stereoscopic Preferences. Among the radio button mapping options is also the possibility to sound the left-eye image to the left ear and the right-eye image to the right ear (or the other way around, by swapping the color filter selections for the left-eye and right-eye views). When the camera views are properly calibrated to have coinciding views at large distances, the soundscapes will be the same as without stereo vision for distant items and landmarks, and differences will arise only from visual disparity at close range. A key advantage over sounding a depth map is that distant landmarks - important for orientation - are not discarded, but a disadvantage is that visual disparity will be less salient than when sounding the corresponding depth map. The following example illustrates, for a view from the Avatar 3D science fiction movie, how the disparity of nearby objects causes a subtle but noticeably stronger spatialization in the corresponding anaglyph soundscape. It is not yet known whether blind people can learn to exploit this. Can you hear differences between the regular non-anaglyph soundscape and the 3D anaglyph soundscape?

Home-made setup using HeavyMath Cam 3D driver or Minoru 3D webcam
An easy and cheap way to quickly hack together very basic stereo vision support for The vOICe is to use two identical webcams with a WDM driver (most modern webcams should qualify). The third-party program HeavyMath Cam 3D lets you capture from the two webcams and show live anaglyph video on the screen. Alternatively, you can use the Minoru 3D webcam in combination with Microsoft AMCAP to show the anaglyph view on the screen. The Minoru 3D webcam contains two identical webcams mounted in a convenient rigid frame.
In addition, you run the registered version of The vOICe and turn on the active window client sonification mode via Control F9. Next you Alt-Tab to the HeavyMath Cam 3D, Minoru 3D or AMCAP window to capture and sound the live anaglyph view with The vOICe. Then you switch The vOICe to its stereo vision mode via the menu Options | 3D | Stereoscopic View and, depending on the current settings in the menu Edit | Stereoscopic Preferences, The vOICe will use the anaglyph screen view to calculate and sound a live depth map, or sound the left camera view to the left ear and the right camera view to the right ear. Moreover, blind users can perform horizontal and vertical camera calibration of the two views independently by selecting the option for sounding the difference between the left-eye and right-eye views, and then adjusting the (mis)alignment until all distant visual items vanish from the view, indicating a perfect match between the left and right views at large distances. A limitation of the current procedure is that you will also see/hear the window borders and menu of the anaglyph window, but this can be alleviated by setting a relatively high capture resolution such as VGA. You will also typically need to adjust some parameter settings in The vOICe, such as for disparity, to obtain acceptable results. Of course, with two separate webcams you also need to improvise some stable fixture to mount and adjust the two webcams such that their views coincide at infinity. Finally, always start the anaglyph viewing software first and only then The vOICe, such that The vOICe will not connect to (and thereby block) one of the two webcams that the anaglyph viewing software needs.
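The idea behind the difference-based calibration can be sketched as follows. This is a hypothetical helper operating on NumPy images, not The vOICe's actual code: when the trial offsets compensate the camera misalignment, distant items cancel out (near-black, hence near-silent) and only nearby, disparate objects remain audible.

```python
import numpy as np

def alignment_difference(left, right, dx=0, dy=0):
    """Shift the right-eye view by (dx, dy) pixels and return |left - right|.

    At the correct offsets, far-away content coincides in both views and
    the difference image goes dark except for nearby (disparate) objects.
    """
    shifted = np.roll(np.roll(right, dy, axis=0), dx, axis=1)
    return np.abs(left.astype(np.int16) - shifted.astype(np.int16)).astype(np.uint8)
```

In the real procedure the user adjusts the offsets by ear rather than inspecting pixel values, but the termination criterion is the same: distant items vanishing from the difference view.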

Home-made setup using Microsoft Kinect
If you have any program that shows a live Kinect depth map on the computer screen, you can again make use of The vOICe active window client sonification mode via Control F9, and Alt-Tab to the window that shows the Kinect depth map. Thus it is very easy to create an auditory display version of the Kinect for the Blind project that won second place in the Russian finals of the Microsoft Imagine Cup 2011. In this case you do not need the stereo vision features of The vOICe, because the Kinect device and its driver directly give a real-time 3D depth map. Unfortunately, the Kinect completely fails in typical outdoor lighting conditions where sunshine overwhelms its projected infrared dot patterns, and the Kinect is also too bulky for unobtrusive head-mounted use.

Stereo vision hardware
With the exception of the Minoru 3D Webcam, there are no suitable and affordable stereo video cameras on the market yet. However, a physicist or electronics engineer should be able to design and construct a dedicated stereo camera ("3D camera") setup by combining some standard commercially available components. One could use two black-and-white (greyscale) cameras to have greyscale video directly. The image on the left shows the simplified schematics that are obtained when using greyscale cameras. The greyscale video signal from the left-eye camera could be used as the "red" (R) signal for the RGB input of a video capture card while the greyscale video signal from the right-eye camera could form the "green" (G) or "cyan" (G+B) signal for this same RGB input. Note that the two cameras need to be "genlocked" to have synchronized video signals that can be captured as separate video signals and then merged into one color signal. The use of genlock requires at least one of the cameras to offer a synchronization input. Without genlock support in the cameras, the capture card must have proper provisions for video frame synchronization.
Vendors that can offer an end-user hardware solution for anaglyph video generation for The vOICe are welcome to report, for possible inclusion in the third-party suppliers page.

As an alternative to using black-and-white cameras, one could also use two genlocked color cameras, in which case one should first mix the RGB signals from each color camera into greyscale video, because greyscale-based anaglyphs are needed for good results. This is shown schematically in the image on the left. The image on the right again stresses that picking and capturing individual color components directly is not a good idea: that would, for instance, render a bright red object on a black background invisible in the right-eye view, while in reality its brightness should still make it stand out - as needed for making a distance map from the left-right viewing disparity! Next, the greyscale video signal from the left-eye camera could be used as the "red" (R) signal for the RGB input of a regular video capture card, while the greyscale video signal from the right-eye camera could form the "green" (G) or "cyan" (G+B) signal for this same RGB input.
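The RGB-to-greyscale mixing step can be sketched with the standard ITU-R BT.601 luminance weights. This is an illustration under assumed names; a hardware mixer may use slightly different weights, but the point stands: a bright red object keeps its brightness in both eye views instead of disappearing from one of them.

```python
import numpy as np

def rgb_to_grey(rgb):
    """Mix an RGB frame into greyscale using ITU-R BT.601 luminance weights.

    rgb: array of shape (h, w, 3), uint8. Returns an (h, w) uint8 image.
    """
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 R, G, B weights
    return (rgb.astype(np.float64) @ weights).clip(0, 255).astype(np.uint8)
```

A pure red pixel thus still contributes roughly 30% of full brightness to both greyscale views, rather than being visible in one view and black in the other.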
The vOICe's advanced stereo vision functionality has so far undergone only limited testing, and good results under all circumstances cannot be guaranteed. Especially in mobile applications, one has to carefully consider the possible safety hazards caused by any depth mapping artefacts and inaccuracies. Step-downs in particular will remain hard to detect reliably, while it can clearly be dangerous when any nearby or fast-approaching objects go entirely undetected ("time-to-impact" is often a more relevant measure than actual physical distance, and is applied in The vOICe's monocular "collision threat analysis" option). On the other hand, some depth mapping artefacts and inaccuracies may be tolerable. For instance, (small) parts of nearby objects may appear to be at a larger distance, as long as some parts of these nearby objects still get their correct nearby depth reading. "False alarms", where even small parts of distant objects appear to be at close range, are more disturbing. Receding objects may even be deliberately filtered out, as they would not normally present a safety hazard, while the reduction of clutter could reduce the mental load for the blind user.
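The time-to-impact idea can be sketched from two successive range readings. This is a hypothetical function for illustration only; The vOICe's actual collision threat analysis is not published:

```python
def time_to_impact(distance_m, prev_distance_m, dt_s):
    """Estimate time-to-impact from two successive range readings.

    distance_m:      current estimated distance to the object, in metres
    prev_distance_m: estimated distance one frame earlier, in metres
    dt_s:            time between the two readings, in seconds

    Returns the estimated seconds until impact, or None for receding or
    stationary objects, which present no collision threat and could be
    filtered out to reduce clutter.
    """
    closing_speed = (prev_distance_m - distance_m) / dt_s  # m/s, > 0 = approaching
    if closing_speed <= 0:
        return None  # receding or stationary: no collision threat
    return distance_m / closing_speed
```

This makes concrete why time-to-impact can matter more than raw distance: a slow pedestrian two metres away may pose less of a threat than a cyclist ten metres away closing fast.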
Left and right camera images must be carefully calibrated, such that they coincide at infinity (parallel camera model). Lacking that, the machine equivalent of the medical condition of "strabismus" (eye misalignment) may yield poor stereo vision results.

For adding speech recognition support during mobile use, visit The vOICe Command add-on page. It allows for switching the stereoscopic vision mode on and off by simply speaking a command into the microphone.