Tuesday, 19 March 2013

Lee Going Perceptual - Part Five

Voice Control And Other Animals

Rather than repeat my blog post, here is the link to the original one:


And my video for this week:

And There's More

After I posted the blog I solved the Voice Control problem.  Here is the code and solution - hurray!

First you set up your grammar:

// Grammar initialization
pxcUID gid;

And then you look for the respective value in the callback function:

gLatestVoiceCommand = -1;
int labelvalue = cmd->label;
if ( labelvalue > 0 )
    gLatestVoiceCommand = labelvalue;

Presto, a VERY fast voice recognition system!   When realisation dawned, I kicked myself, then wanted to tell someone, so I am telling my blog :)

Signing Off

Decided to break my challenge tradition and do some Ultimate Coding during the week to get a good demo ready for GDC. Another project on my plate shrinks in size this week, so I should have some quality time :)

Monday, 11 March 2013

Lee Going Perceptual - Part Four

This Is Lee Calling

With the realisation that I couldn't top my week three video, I decided the smart thing was to get my head down and code the necessaries to turn my prototype into a functioning app. This meant adding a front end to the app and getting the guts of the conferencing functionality coded.

I also vowed not to bore the judges and fellow combatants to tears this week, and will stick mainly to videos and pictures.

Latest Progress

This main video covers the new additions to the app, and attempts to demonstrate conferencing in action. With only one head and two cameras, the biggest challenge was where to look.

I also made a smaller video from a mobile camera so you can see both PCs in the same shot, and you will also catch a glimpse of what the Perceptual Camera data looks like once it’s been chewed up by network packets.

Top priority in week five will be to reduce the network packet size and improve the rendered visuals so we don’t see the disconcerting transition effects in the current version.

How My Perceptual 3D Avatar Works

One of the Ultimate Coder Judges asked how the 3D avatar was constructed in the week three demo, and it occurs to me that this information may be of use to other coders so here is a short summary of the technique.

The Gesture Camera provides a 16-bit plane of depth data streaming in at 30 frames per second, which produces very accurate measurements of distance from the camera to a point on the subject in front of the camera. This depth data also provides a reference offset to allow you to lookup the colour at that point too.

Once the camera is set-up and actively sending this information, I create a grid of polygons, 320 by 240 evenly spaced on the X and Y axis. The Z axis of the vertex at each corner is controlled by the depth data, so a point furthest from the camera would have the greatest Z value. Looking at this 3D construct front on you would see the polygons with higher Z values nearer to the render viewpoint. I then take the camera colour detail for that point and modify the ‘Diffuse’ element of the respective vertex to match it. The 3D model is not textured. The vertex coordinates are so densely packed together that they produce a highly detailed representation of the original colour stream.
This process is repeated 30 times per second in sync with the rate at which the video stream outputs each frame providing a high fidelity render.  Points that are too far in the distance have the alpha component of the diffuse set to zero making them invisible to the user. This removes the backdrop from the rendered mesh creating an effective contour.
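The grid construction described above can be sketched as follows. The vertex layout, ARGB colour format and the depth cut-off value are illustrative assumptions of mine, not the app's actual code:

```cpp
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; uint32_t diffuse; }; // diffuse = ARGB

// Build a 320x240 vertex grid from one frame of 16-bit depth data plus the
// colour value already looked up for each depth sample. Points beyond
// farCutoff get alpha 0 so the backdrop disappears from the render.
std::vector<Vertex> BuildAvatarMesh(const uint16_t* depth,
                                    const uint32_t* colour,
                                    uint16_t farCutoff)
{
    const int W = 320, H = 240;
    std::vector<Vertex> mesh(W * H);
    for (int y = 0; y < H; ++y)
    {
        for (int x = 0; x < W; ++x)
        {
            int i = y * W + x;
            Vertex& v = mesh[i];
            v.x = (float)x;        // evenly spaced on X...
            v.y = (float)y;        // ...and Y
            v.z = (float)depth[i]; // depth drives Z
            if (depth[i] > farCutoff)
                v.diffuse = colour[i] & 0x00FFFFFF; // alpha 0: invisible
            else
                v.diffuse = colour[i] | 0xFF000000; // opaque, vertex-coloured
        }
    }
    return mesh;
}
```

Because the colour lives in the per-vertex diffuse rather than a texture, the mesh and its colouring are rebuilt together in one pass per frame.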

The advantage in converting camera stream data into vertex data is that you have direct access to a 3D representation of any object in front of the camera, and the possibility exists to apply reduction and optimisation algorithms from the 3D world that could never have been used on a 2D stream.

Voice Over In Pain

Here is a summary of my attempt to get voice networking into my app. I tried to compile the Linphone SDK in Visual Studio: no joy. An old VS2008 project I found on Google Code: no joy. LibJingle, to see if that would help: no joy. I checked out and attempted to compile MyBoghe: many dependency errors (no joy). After about six hours of fruitless toil, I found myself looking closer to home. It turns out Dark Basic Pro released a module many moons back called DarkNET which provides full TCP/UDP networking commands and, yes you guessed it, built-in VOIP commands! A world of pain has been reduced to about six commands that are fully compatible with the language I am using. Once I discovered this, my conferencing app came on in leaps and bounds.

Signing Off

As promised I have kept my blog shorter this week. I hope you liked the app and video, and please let me know if you would like blog five to include lots of source code. Next week is my last week for finishing app functionality, so we shall see VOIP (so you can hear and speak in the conference call) and optimisations and compatibility testing so you can run the app in a variety of scenarios.  Given the time constraints, I am aiming to limit the first version of the app to two users in order to cram as much fidelity into the visuals as possible.  This will also give me time to refine and augment the Perceptual Computing elements of the app, and show off more of what the Gesture Camera can do.

P.S. Hope Sascha is okay after sitting on all those wires. Ouch!

Source Code For Swipe Tracking

A request in the comments section of the sister blog to this one suggested some source code would be nice. I have extracted the best 'Perceptual Bit' from week four which is the swipe gesture detection. As you will see, it is remarkably simple (and a bit of a fudge), but it works well enough most of the time. Here you go:

// track left/right sweeps (a small state machine: the nearest depth point
// must cross three X thresholds right-to-left before it counts as a swipe;
// the lives reset value of 15 frames below is a guess)
if ( bHaveDepthCamera )
{
	if ( iNearestX[1]!=0 )
	{
		iNearestY[1] = 0;
		// arm the swipe when the point starts right of centre
		if ( iSwipeMode==0 && iNearestX[1] > 160+80 ) { iSwipeMode=1; iSwipeModeLives=15; }
		if ( iSwipeMode==1 && iNearestX[1] > 160+0 )
		{
			// point has entered the centre-right band
			if ( iNearestX[1] < 160+80 ) { iSwipeMode=2; iSwipeModeLives=15; }
			if ( iSwipeModeLives-- < 0 ) iSwipeMode=0;
		}
		if ( iSwipeMode==2 && iNearestX[1] > 160-80 )
		{
			// point has entered the centre-left band
			if ( iNearestX[1] < 160+0 ) { iSwipeMode=3; iSwipeModeLives=15; }
			if ( iSwipeModeLives-- < 0 ) iSwipeMode=0;
		}
		if ( iSwipeMode==3 && iNearestX[1] < 160-80 )
		{
			// swiped
			iNearestY[1] = 5;
			iSwipeMode = 0;
		}
		if ( iSwipeModeLives-- < 0 ) iSwipeMode=0;
	}
}

Monday, 4 March 2013

Lee Going Perceptual - Part Three

Gazing At The Future

Welcome back to my humble attempt to re-write the rule book on teleconferencing software, a journey that will see it dragged from its complacent little rectangular world. It’s true we've had 3D for years, but we've never been able to communicate accurately and directly in that space. Thanks to the Gesture Camera, we now have the first in what will be a long line of high fidelity super accurate perceptual devices.  It is a pleasure to develop for this ground breaking device, and I hope my ramblings will light the way for future travels and travellers. So now, I will begin my crazy rant.

Latest Progress

You may recall that last week I proposed to turn that green face blob into a proper head and transmit it across to another device. The good news is that my 3D face looks a lot better; the bad news is that getting it transmitted is going to have to wait. Taking the advice of the judges, I dug out a modern webcam product and realised the value-adds were nothing more than novelties. The market has stagnated, and the march of Skype and Google Talk does nothing more than perpetuate a flat and utilitarian experience.

I did come to appreciate however that teleconferencing cannot be taken lightly. It’s a massive industry and serious users want a reliable, quality experience that helps them get on with their job. Low latency, ease of use, backwards compatibility and essential conferencing features are all required if a new tool is to supplant the old ones.

Voice Over I.P. Technology

I was initially tempted to write my own audio streaming system to carry audio data to the various participants in the conference call, but after careful study of existing solutions and the highly specialised disciplines required, I decided to take the path of least resistance and use an existing open source solution. At first I decided to use the same technology Google Talk uses for audio exchange, but after a few hours of research and light development it turned out a vital API was no longer available for download, mainly because Google had bought the company in question and moved the technology onto HTML5 and JavaScript. As luck would have it, Google partnered with another company, which it did not buy, called Linphone, and they provide a great open source solution that is also cross-platform compatible with all the major desktops and mobiles.


Long story short, this new API is right up to date, and my test across two Windows PCs, a Mac and an iPad in four-way audio conferencing mode worked a treat. Next week I shall be breaking down the sample provided to obtain the vital bits of code needed to implement audio and packet exchange between my users. As a bonus, I am going to write it in such a way that existing Linphone client apps can call into my software to join the conference call, so anyone with a regular webcam or even a mobile phone can join in. I will probably stick a large 3D handset in the chair in place of a 3D avatar, just for fun.

On a related note, I have decided to postpone even thinking about voice recognition until the surrounding challenges have been conquered. It never pays to spin too many plates!

Gaze Solved? – Version One

In theory, this should be a relatively simple algorithm. Find the head, then find the eyes, then grab the RGB around the eyes only. Locate the pupil at each eye, take the average, and produce a look vector. Job’s a good one, right? Well, no. At first I decided to run away and find a sample I once saw at an early Beta preview of the Perceptual SDK, which created a vector from face rotation and was pretty neat. Unfortunately that sample was not included in Beta 3, and it was soon apparent why. On exploring the commands for getting ‘landmark’ data, I noticed my nose was missing. More strikingly, all the roll, pitch and yaw values were empty too. Finding this out from the sample, rather than after adding the code to my main app, saved me a bucket load of time. Phew. I am sure it will be fixed in a future SDK (or I was doing something silly and it does work), but I can’t afford the time to write even one email to Intel support (who are great by the way). I needed Gaze now!

I plumped for option two: write everything myself using only the depth data as my source. I set to work and implemented the first version of my Gaze Algorithm. I have detailed the steps in case you like it and want to use it:
  1. Find the furthest depth point from the upper half of the camera depth data
  2. March left and right to find the points at which the ‘head’ depth data stops
  3. Now we know the width of the head, trace downwards to find the shoulder
  4. Once you have a shoulder coordinate, use that to align the Y vector of the head
  5. You now have a stable X and Y vector for head tracking (and Z of course)
  6. Scan all the depth between the ears of the face, down to the shoulder height
  7. Add all depth values together, weighting them as the coordinate moves left/right
  8. Do the same for top/bottom, weighting them with a vertical multiplier
  9. You are essentially using the nose and facial features to track the bulk of the head
  10. Happily, this bulk determines the general gaze direction of the face
  11. You have to enhance the depth around the nose to get better gaze tracking
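
The weighting in steps 7 and 8 can be sketched as a single weighted average over the face window. The window bounds and the nearness weighting below are my assumptions, not the exact code from the app:

```cpp
#include <cstdint>

// Estimate a horizontal gaze offset from a depth window covering the face:
// each sample is weighted by how far it sits left/right of the window
// centre, so the bulk of the head (nose included) pulls the result toward
// the side the face is turned to. Negative = looking left, positive = right.
float EstimateGazeX(const uint16_t* depth, int w, int h,
                    int left, int right, int top, int bottom)
{
    double weighted = 0.0, total = 0.0;
    int centreX = (left + right) / 2;
    for (int y = top; y < bottom; ++y)
    {
        for (int x = left; x < right; ++x)
        {
            // Nearer points (smaller depth values) should count more,
            // so invert the 16-bit depth reading into a "nearness" weight.
            double nearness = 65535.0 - (double)depth[y * w + x];
            weighted += nearness * (double)(x - centreX);
            total += nearness;
        }
    }
    return total > 0.0 ? (float)(weighted / total) : 0.0f;
}
```

The same sum with a vertical multiplier gives the Y component, and boosting the weight of samples near the nose (step 11) sharpens the result.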

I have included my entire source code to date for the two DBP commands you saw in the last blog so you can see how I access the depth and colour data, create 3D constructs and handle the interpretation of the depth information.  This current implementation is only good enough to determine which corner of the screen you are looking at, but I feel with more work this can be refined to provide almost pinpoint accurate gazing.

Interacting With Documents

One thing I enjoyed when tinkering with the latest version was holding up a piece of paper, maybe with a sketch on it, and shouting ‘scan’ in a firm voice. Nothing happened of course, but I imagined what could happen. We still doodle on paper, or have some article or clipping during a meeting. It would be awesome if you could hold it up, bark a command, and the computer would turn it into a virtual item in the conference. Other attendees could then pick it up (copy it, I guess), and once received could view it or print it during the call. It would be like fax, but faster!

I even thought of tying your tablet into the conference call too, so when a document is shared it instantly goes onto a tablet carousel and everyone who has a tablet can view the media. It could work in reverse as well: find a website or application, then just wave the tablet in front of the camera; the camera would detect you are waving your tablet and instantly copy the contents of the tablet screen to the others in the meeting. It was around this time I switched part of my brain off so I could finish up and record the video for your viewing pleasure.

Developer Tips

TIP 1 : Infra-red gesture cameras and a 6 AM sunrise do not mix very well. As I was gluing Saturday and Sunday together, the sun’s rays blasted through the window and disintegrated the virtual me. Fortunately a wall helped out a few hours later. For accurate usage of the gesture camera, ensure you are not bathed in direct sunlight!

TIP 2 : If you think you can smooth out and tame the edges of your depth data, think again. I gave this one about five hours of solid thought and tinkering, and I concluded that you can only get smoothing by substantially trimming the depth shape. As the edges of a shape leap from almost zero to a full depth reading, it is very difficult to filter or accommodate them. In order to move on, I moved on, but I have a few more ideas and many more days to crack this one. The current fuzzy edges are not bad as such, but they are something you might associate with low quality, so I want to return to this. The fact is the depth data around the edges is very dodgy, and some serious edge-cleaning techniques will need to be employed to overcome this feature of the hardware.
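One cheap way to do that trimming is to erode the valid-depth mask: discard any sample whose neighbourhood contains a far or invalid reading, which shrinks the contour by one pixel per pass and removes the unreliable rim. A minimal sketch, where the cutoff value and the 3x3 window are my assumptions:

```cpp
#include <cstdint>
#include <vector>

// Trim the unreliable rim of a depth shape: a sample survives only if all
// 8 neighbours are also within the cutoff. Each pass shrinks the contour
// by one pixel, trading silhouette size for a cleaner edge.
std::vector<uint8_t> TrimDepthEdges(const uint16_t* depth, int w, int h,
                                    uint16_t farCutoff)
{
    std::vector<uint8_t> valid(w * h, 0);
    for (int y = 1; y < h - 1; ++y)
    {
        for (int x = 1; x < w - 1; ++x)
        {
            bool keep = true;
            for (int dy = -1; dy <= 1 && keep; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (depth[(y + dy) * w + (x + dx)] > farCutoff)
                        { keep = false; break; }
            if (keep) valid[y * w + x] = 1;
        }
    }
    return valid;
}
```

Running two or three passes trims more aggressively; the cost is that thin features such as fingers erode away quickly.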

The Code

Last week you had some DBP code. This week, try some C++. It's pretty horrible, unoptimised code, but it's all there and you might glean some cut-and-paste usage from something that's been proven to compile and work:

Next Time

Now I have the two main components running side by side, the 3D construction and the audio conferencing, next week should be a case of gluing them together in a tidy interface. One of the judges has thrown down the gauntlet that the app should support both Gesture Camera AND Ultrabook, so I am going to pretend the depth camera is ‘built’ into the Ultrabook and treat it as one device. As I am writing the app from scratch, my interface design will make full use of touch when touch makes sense and intuitive use of perception for everything else.

P.S. The judges’ video blog was a great idea and fun to watch!   Hope you all had a good time in Barcelona and managed to avoid getting run over by all those meals on wheels.