
Tracking a Human Skeleton and Ball Simultaneously Using Kinect (20-25 FPS)

Capture and Framework Development

In this chapter, all resources used in developing the capture framework are outlined and examined in detail. The capabilities of both the hardware and the software are reviewed, and then the setup and configuration of the final capture framework are discussed.

Resources Used:

Microsoft Kinect: The Microsoft Kinect device is a collection of the resources required for tracking both the human player and the ball. The resources bundled with the Kinect device are:

Depth Camera: A depth camera is required for capturing the depth data of the environment. In other words, it measures the distance of objects from the camera. This serves as the basis for measuring the positions of both the ball and the player’s joints in 3-dimensional space.

RGB Camera: An RGB camera is needed for detecting the ball. The computer vision technique of colour filtering is used to isolate the ball in the RGB image provided by the camera. This, however, means that prior knowledge of the ball colour is necessary and that the ball colour must be sufficiently distinct from that of the surrounding objects.

Human Skeleton Detection Algorithms: The Microsoft Kinect Software Development Kit (SDK) provides the algorithms for human skeleton detection, along with the functions for accessing the tracked data. Khoshelham (2011) showed that the Kinect tracking algorithm functions to a desirable degree within a range of 1m to 5m from the Kinect camera, with the best measurements (a maximum error of 3cm) recorded between 1m and 3m from the device. His findings support the choice of keeping the experiment area to a limited space of 4m by 4m, with the Kinect camera at one end and the player 3m away from it.

Blue Backdrop: The blue backdrop used could have been any other neutral colour. Its purpose is to provide a blue-screen effect and block out background noise caused by surrounding objects of a similar colour to the ball, or by reflective objects in the background that may confuse the tracking system. [As will be seen in the images, the blue backdrop was never actually used, although I had planned to use it in the belief that it would greatly improve the tracking speed and accuracy of the final system.]

Computer Vision Library: EmguCV (EmguCV, 2008), a C# wrapper of the well-known OpenCV library, is used to assist in processing the captured RGB images. The library provides helpful functions such as image dilation, smoothing, colour filtering, grayscale and binary conversion of colour images, blob detection, and circle detection in binary images, all of which play a major role in tracking the ball.
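
To give a feel for these helper functions, the short sketch below chains a few of them using the EmguCV 2.x-era Image<TColor, TDepth> API (the file name, kernel size, and threshold values are illustrative, not the project's actual settings):

    using Emgu.CV;
    using Emgu.CV.Structure;

    // Load a captured RGB frame from disk (the file name is a placeholder).
    Image<Bgr, byte> colour = new Image<Bgr, byte>("frame.png");
    Image<Bgr, byte> smoothed = colour.SmoothGaussian(5);          // Gaussian smoothing
    Image<Gray, byte> gray = smoothed.Convert<Gray, byte>();       // grayscale conversion
    Image<Gray, byte> binary = gray.ThresholdBinary(new Gray(128), // binary conversion
                                                    new Gray(255));
    Image<Gray, byte> dilated = binary.Dilate(2);                  // dilation, 2 iterations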

Coding4Fun Toolkit: The Coding4Fun Kinect Toolkit (Rutkas et al., 2013) is a library of helper functions mainly developed to serve as extension methods for the Kinect Sensor class. They ease working with the Kinect SDK, for example by converting Kinect frames into formats that other libraries can consume.

Custom Program: I wrote a C# program in Visual Studio that links and manages everything together and serves as the ball juggling tracking program. The program provides an interface for starting and stopping a tracking activity, and an option for writing the captured data (both the ball and the human skeleton) to an external XML file for future processing. The program also provides feedback during tracking to let the user know tracking is in progress. The feedback takes the form of a matchstick figure overlaid on the recorded RGB image to represent the human player, and a circle that represents the tracked ball. The matchstick figure and the circle are animated in real time with the recorded data during tracking, so large errors in the tracking data can be easily spotted by an observer. Smaller tracking errors, however, can only be found by post-processing the recorded data.
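
As a rough illustration of the XML output step, the sketch below writes one frame of tracked data with System.Xml; the element and attribute names are hypothetical, as the program's actual schema is not reproduced here:

    using System.Xml;
    using Microsoft.Kinect;

    // Writes one frame of tracked data: the ball centre plus all 20 joints,
    // all in skeleton space (metres). Element/attribute names are illustrative.
    static void WriteFrame(XmlWriter w, double timeSeconds, SkeletonPoint ball, Skeleton player)
    {
        w.WriteStartElement("Frame");
        w.WriteAttributeString("Time", timeSeconds.ToString("F3"));

        w.WriteStartElement("Ball");
        w.WriteAttributeString("X", ball.X.ToString("F3"));
        w.WriteAttributeString("Y", ball.Y.ToString("F3"));
        w.WriteAttributeString("Z", ball.Z.ToString("F3"));
        w.WriteEndElement();

        foreach (Joint joint in player.Joints)
        {
            w.WriteStartElement("Joint");
            w.WriteAttributeString("Type", joint.JointType.ToString());
            w.WriteAttributeString("X", joint.Position.X.ToString("F3"));
            w.WriteAttributeString("Y", joint.Position.Y.ToString("F3"));
            w.WriteAttributeString("Z", joint.Position.Z.ToString("F3"));
            w.WriteEndElement();
        }
        w.WriteEndElement();
    }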

Surrounding Lighting: The surrounding lighting needs to remain constant during the tracking process, as a major change in lighting would alter the ball colour perceived by the RGB camera. If the colour change is large enough that it drifts out of the colour tracking threshold, the system will be unable to detect the ball.


Fig 2.1 The Setup Diagram

How the system works

The whole ball juggling tracking system can be broken down into 3 major tasks:

  1. Tracking the human player.

  2. Tracking the ball.

  3. Putting both tracked data in the same coordinate space with real-world measurements.

1. Tracking the Human Player

Thanks to the Kinect and the Kinect SDK, tracking the human player is simplified to the task of enabling skeleton tracking and reading the skeleton frame data output by the SDK. However, in order to display the tracking result to an observer, the skeletal joint data is mapped to the colour space using the SDK’s coordinate mapper methods, and the resulting points are used to draw a matchstick figure on top of the colour image. In recent releases of the SDK (versions 1.6 and above), the output data are in real-world measurements of metres. The skeleton frame contains data for 20 major skeleton joints, including the head, neck, torso and hips, as well as the major joints of the arms and legs. Whenever a skeleton frame is ready, an event is fired and a callback function is used to access the most recently tracked joints.
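
This event-driven flow can be sketched as follows, assuming the v1.6-era Kinect SDK API (drawing code and error handling omitted):

    using Microsoft.Kinect;

    KinectSensor sensor = KinectSensor.KinectSensors[0];
    sensor.SkeletonStream.Enable(); // activate skeleton tracking
    sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);
    sensor.SkeletonFrameReady += (s, e) =>
    {
        using (SkeletonFrame frame = e.OpenSkeletonFrame())
        {
            if (frame == null) return; // frame was dropped
            Skeleton[] skeletons = new Skeleton[frame.SkeletonArrayLength];
            frame.CopySkeletonDataTo(skeletons);
            foreach (Skeleton sk in skeletons)
            {
                if (sk.TrackingState != SkeletonTrackingState.Tracked) continue;
                // Joint positions are already in metres (skeleton space).
                SkeletonPoint head = sk.Joints[JointType.Head].Position;
                // Map to colour space to position the matchstick overlay.
                ColorImagePoint p = sensor.CoordinateMapper.MapSkeletonPointToColorPoint(
                    head, ColorImageFormat.RgbResolution640x480Fps30);
                // The figure is then drawn at (p.X, p.Y) on the colour image.
            }
        }
    };
    sensor.Start();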

2. Tracking the Ball

a. Detect the ball’s 2-d position in the RGB frame.

b. Detect the ball’s depth in the depth frame.

c. Convert the detected position to real-world measurements.

d. Speed problems and optimisations.

a. Detecting the Ball’s 2-d Position in the RGB Frame


Fig 2.2 Comparative result of colour filtering the RGB frame from the Kinect camera, using a threshold near the ball colour


Fig 2.3 Result of colour filtering the colour image


Fig 2.4 Result of Canny Edge algorithm on the binary image

The Kinect SDK is used to initialise both colour frame capture and depth frame capture (this is already a prerequisite of the skeleton frame capture mentioned above). Whenever a colour frame is ready, a callback function immediately accesses the most recent colour data output by the Kinect RGB camera. With the help of the Coding4Fun third-party library, the colour frame is converted to a format acceptable to the EmguCV library for further processing. The image is first colour filtered using the ball colour as the threshold, which isolates the ball in a binary image (Fig 2.3). The Canny edge algorithm is then applied to the colour-filtered image to obtain only the edges of the binary image (Fig 2.4). This is done to reduce the load on the subsequent Hough circle algorithm, which is expensive when applied to a binary image containing a large blob. The reason becomes clear once the Hough circle detection algorithm is explained: it processes the position of every non-black pixel and uses a voting threshold to decide whether a sufficient number of such pixels fall on the approximate circumference of a circle. Since the algorithm visits every non-black pixel in the binary image when searching for a circle, reducing the blob to only its edges greatly speeds up the circle detection process.
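
The pipeline can be sketched as follows with the EmguCV 2.x-era API; the colour bounds and Hough thresholds below are placeholders for the calibrated values, not the numbers actually used:

    using System.Drawing;
    using Emgu.CV;
    using Emgu.CV.Structure;

    // Returns the ball centre in colour-frame pixels, or null if no circle is found.
    static PointF? DetectBall(Image<Bgr, byte> frame)
    {
        // Colour filter: keep only pixels near the known ball colour (Fig 2.3).
        Image<Gray, byte> mask = frame.InRange(new Bgr(0, 100, 150),
                                               new Bgr(80, 200, 255));
        // Canny edge detection: reduce the blob to its outline (Fig 2.4).
        Image<Gray, byte> edges = mask.Canny(100, 50);
        // Hough circle search over the remaining edge pixels.
        CircleF[] circles = edges.HoughCircles(
            new Gray(100), // internal Canny threshold
            new Gray(50),  // accumulator threshold: votes needed for a circle
            2.0,           // accumulator resolution (dp)
            20.0,          // minimum distance between detected centres
            5, 100)[0];    // minimum and maximum radius, in pixels
        return circles.Length > 0 ? (PointF?)circles[0].Center : null;
    }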

b. Detecting the Ball Depth in the Depth Frame

The 2-d position of the ball obtained in the previous step is used to look up the depth frame and find the distance of the ball from the camera. However, because there is a slight offset between the positions of the Kinect’s RGB camera and its depth camera, the raw ball position from the RGB frame cannot be used directly to reference the ball’s depth in the depth frame: doing so introduces a small shift error that becomes severe when the ball is far from the Kinect device. In other words, the viewing frustums of the two cameras are slightly offset, so the RGB/colour space differs slightly from the depth space. To resolve this, the centre of the ball obtained from the colour frame is mapped (converted) to its corresponding depth-frame equivalent using the coordinate mapper methods provided by the SDK. The result of this conversion is the 3-d position of the ball in the depth space. It is worth mentioning that although the Z-axis (depth axis) measurements in the depth frame are in real-world units (millimetres), the measurements along its X and Y axes are not.
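
A minimal sketch of this lookup, assuming the v1.7-era SDK API and 640 x 480 streams, with sensor, depthFrame, ballX and ballY available from the earlier steps:

    // Map the whole colour frame into depth space once, then index the result
    // at the ball's colour-space pixel.
    DepthImagePixel[] depthPixels = new DepthImagePixel[640 * 480];
    depthFrame.CopyDepthImagePixelDataTo(depthPixels);

    DepthImagePoint[] colourToDepth = new DepthImagePoint[640 * 480];
    sensor.CoordinateMapper.MapColorFrameToDepthFrame(
        ColorImageFormat.RgbResolution640x480Fps30,
        DepthImageFormat.Resolution640x480Fps30,
        depthPixels, colourToDepth);

    // Depth-space point for the ball centre: .Depth is in millimetres, while
    // .X and .Y are still depth-frame pixel coordinates.
    DepthImagePoint ballDepthPoint = colourToDepth[ballY * 640 + ballX];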

c. Converting the 3-d Position of the Ball to Real-World Measurements

Up to this point, most of the work of detecting the ball has already been carried out. The purpose of this step is to ensure that the ball position measured by the system is in the same coordinate space as the measurements of the human player. Most importantly, this step also enables the accuracy testing of the ball tracking carried out later, when measurements made by the system are compared against known real-world positions of the ball. To convert the ball position data to real-world measurements, the 3-d ball position obtained in the depth space in the previous step is mapped to the skeleton space [as of version 2.0 of the Kinect SDK, the skeleton space is called the camera space (Pterneas, 2014)] using the designated coordinate mapper methods of the SDK. After this, there are three corresponding points for the detected ball centre, one in each of the colour space, the depth space, and the skeleton space (a 3-d space in real-world measurements with the Kinect device as the origin). This allows the ball position to be known and usable in all three frames. For example, the colour-space equivalent of the ball centre is used to draw a circle on top of the colour image, providing real-time visual feedback of the tracking process to an observer, whereas the skeleton-space equivalent is stored together with the human skeletal joint data in an external XML file for future processing or reference.
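
Continuing the sketch from the previous step (same assumptions), the depth-space point is mapped to skeleton space, and back to colour space for the feedback circle:

    // Skeleton space: metres, with the Kinect device as the origin.
    SkeletonPoint ballSkeleton = sensor.CoordinateMapper.MapDepthPointToSkeletonPoint(
        DepthImageFormat.Resolution640x480Fps30, ballDepthPoint);

    // Colour space: pixel position used to draw the on-screen feedback circle.
    ColorImagePoint ballColour = sensor.CoordinateMapper.MapDepthPointToColorPoint(
        DepthImageFormat.Resolution640x480Fps30, ballDepthPoint,
        ColorImageFormat.RgbResolution640x480Fps30);

    // ballSkeleton is what gets written to the XML log alongside the joints;
    // ballColour positions the drawn circle on the colour image.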

d. Speed Problems and Optimisations


Fig 2.5 Using a Region of Interest around the detected ball to optimise performance

While the initial system worked, there was an unforeseen performance problem: the frame rate averaged a very low 5-10 FPS (frames per second). This meant that many frames were being dropped, since the Kinect frame rate is expected to be around 30 FPS. On closer inspection of the program code, the Hough circle algorithm turned out to be taking more than 70% of the total processing time. Although the Hough circle search had already been optimised by the Canny edge step, which considerably reduced the number of pixels considered when searching for a circle, the algorithm still has to visit each and every pixel of the 640 x 480 image every frame to decide whether it is set, and this caused the speed issue. To resolve it, an ROI (Region of Interest) was drawn around the detected ball centre every frame. Both the Canny edge process and the Hough circle search were limited to the ROI from the previous frame, except in the first frame or after a frame in which no ball was detected (Fig 2.5). This optimisation improved the total frame processing speed to about 20-25 FPS.
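
A minimal sketch of the ROI optimisation, using EmguCV's Image.ROI property; the 120-pixel window size is illustrative, and mask, lastX and lastY (the previous detection) are assumed from the earlier steps:

    // Restrict processing to a window around the last detected centre.
    Rectangle roi = new Rectangle(lastX - 60, lastY - 60, 120, 120);
    roi.Intersect(new Rectangle(0, 0, mask.Width, mask.Height)); // clamp to the image
    mask.ROI = roi; // subsequent operations see only this window
    CircleF[] circles = mask.Canny(100, 50)
        .HoughCircles(new Gray(100), new Gray(50), 2.0, 20.0, 5, 100)[0];
    mask.ROI = Rectangle.Empty; // reset the ROI for the next frame
    if (circles.Length > 0)
    {
        // Circle coordinates are relative to the ROI, so add its offset back.
        PointF centre = new PointF(circles[0].Center.X + roi.X,
                                   circles[0].Center.Y + roi.Y);
    }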

3. Putting both tracked data in the same coordinate space with real-world measurements

This final task is accomplished by the mapping described in step (c) above: converting the ball centre into skeleton space places the ball in the same real-world coordinate space as the tracked skeleton joints, so both can be recorded and displayed together. Fig 2.6 shows the complete system in action, with the matchstick figure and the ball circle overlaid on the colour image.


Fig 2.6 The tracking system with real time tracking feedback overlaid on colour image

References

EmguCV. 2008. [Online]. Available at: <http://www.emgu.com/wiki/index.php/Main_Page> [Accessed 2012].

Khoshelham, K. 2011. Accuracy Analysis of Kinect Depth Data. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 38(5), pp. 133-138.

Pterneas, V. 2014. Understanding Kinect Coordinate Mapping. [Online]. Available at: <http://pterneas.com/2014/05/06/understanding-kinect-coordinate-mapping/> [Accessed June 2014].

Rutkas, C., Peek, B. and Fernandez, D. 2013. Coding4Fun Kinect Toolkit. [Online]. Available at: <http://c4fkinect.codeplex.com/> [Accessed 2013].


Mustapha Othman
