
How can we use body motion and gestures from multiple participants in shared gaming experiences for big screens?

Details

We used MoveNet, a TensorFlow model that detects body keypoints such as the nose, elbows, and knees.
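As a rough sketch of what this looks like, the snippet below runs the single-pose MoveNet "Lightning" model from TensorFlow Hub on one frame. The model URL, input size, and helper name follow the public TensorFlow Hub listing rather than our actual pipeline, which also has to handle multiple people.

```python
# Minimal sketch: run MoveNet (single-pose "Lightning") on one RGB frame.
# Model URL and 192x192 input size follow the TensorFlow Hub listing; our real
# pipeline differs (multi-person, live camera feed).
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

def detect_keypoints(frame):
    """frame: an RGB image array of shape (height, width, 3)."""
    img = tf.expand_dims(frame, axis=0)
    img = tf.image.resize_with_pad(img, 192, 192)
    img = tf.cast(img, dtype=tf.int32)
    outputs = movenet(img)
    # Shape [1, 1, 17, 3]: 17 keypoints as (y, x, confidence), normalized to [0, 1].
    return outputs["output_0"][0, 0].numpy()
```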

In early deployment we ran into performance and accuracy issues, which we resolved by moving processing onto the GPU and tuning the keypoint confidence threshold.
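The confidence tuning amounts to discarding keypoints the model is unsure about. A minimal sketch, assuming the (y, x, confidence) output format above; the 0.3 cut-off is only a placeholder, not the value we settled on:

```python
# Illustrative only: drop keypoints below a confidence cut-off so noisy
# detections don't trigger gestures. The 0.3 value is a placeholder.
MIN_CONFIDENCE = 0.3

def filter_keypoints(keypoints, min_confidence=MIN_CONFIDENCE):
    """keypoints: iterable of (y, x, confidence) rows from MoveNet."""
    return [kp if kp[2] >= min_confidence else None for kp in keypoints]
```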

The model provides raw keypoint data, which we refined and fed into custom logic that turns keypoint positions into pose detections. For example, if a hand keypoint is higher than the head keypoint, we can determine that the user is raising their hand.
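A sketch of that rule, using MoveNet's keypoint ordering (0 = nose, 9/10 = wrists); the confidence cut-off and the use of the nose as the "head" reference are illustrative choices:

```python
# Hand-raise rule: a wrist keypoint above the nose keypoint counts as a raised hand.
# Indices follow MoveNet's keypoint ordering; thresholds are illustrative.
NOSE, LEFT_WRIST, RIGHT_WRIST = 0, 9, 10

def is_hand_raised(keypoints, min_confidence=0.3):
    """keypoints: (17, 3) array of (y, x, confidence); y grows downward in the image."""
    nose_y, _, nose_score = keypoints[NOSE]
    if nose_score < min_confidence:
        return False
    for wrist in (LEFT_WRIST, RIGHT_WRIST):
        wrist_y, _, wrist_score = keypoints[wrist]
        # Smaller y means higher in the frame, so a raised hand sits above the nose.
        if wrist_score >= min_confidence and wrist_y < nose_y:
            return True
    return False
```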

We recorded videos of people performing various actions, which were then replayed and used to verify prediction accuracy.
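One way to script that replay loop is sketched below; OpenCV (cv2), the helper names, and the per-frame accuracy metric are assumptions for illustration, not our exact test harness.

```python
# Sketch of a replay check: feed a recorded clip through detection + classification
# and report how often the expected pose is recognized. cv2 and the helper names
# (detect_fn, classify_fn) are assumptions, not our actual harness.
import cv2

def evaluate_clip(video_path, expected_pose, detect_fn, classify_fn):
    capture = cv2.VideoCapture(video_path)
    hits, total = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        keypoints = detect_fn(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if classify_fn(keypoints) == expected_pose:
            hits += 1
        total += 1
    capture.release()
    return hits / total if total else 0.0
```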

Motion Tracking

Finally, we refined the process to identify users in different zones to support multiplayer, and to have Unity work with input from multiple players. We also added a config file that allows us to support different setups, such as different camera locations, angles, and player positions.
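As a sketch of the zone idea, each player can be given a horizontal band of the camera frame, and a detected pose is assigned to whichever band its keypoints fall in. The JSON field names below are invented for illustration and do not reflect our actual config format.

```python
# Illustrative zone assignment: each player owns a horizontal band of the frame.
# The JSON layout and field names are assumptions, not our real config format.
import json

def load_zones(config_path):
    # Example config: {"zones": [{"player": 1, "x_min": 0.0, "x_max": 0.5}, ...]}
    with open(config_path) as f:
        return json.load(f)["zones"]

def assign_player(keypoints, zones, min_confidence=0.3):
    """Assign a detected pose to a player by the mean x position of its keypoints."""
    xs = [x for _, x, score in keypoints if score >= min_confidence]
    if not xs:
        return None
    center_x = sum(xs) / len(xs)
    for zone in zones:
        if zone["x_min"] <= center_x < zone["x_max"]:
            return zone["player"]
    return None
```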

In the future, we could explore the research topic of giving “memory” to this process, allowing it to identify keypoints on one person consistently. For example, if we find a group of keypoints at one specific position, and they are still close to that position in the next frame, it’s likely that those keypoints belong to the same person. Right now, the pipeline simply detects points and doesn’t make this association. We could also continue to improve the handling of noise and the filtering of false detections.
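One simple way this “memory” could work is a nearest-neighbour match between frames, as sketched below; we did not implement this, and the centroid distance and threshold are only illustrative.

```python
# Sketch of the "memory" idea (not implemented): match each pose in the current
# frame to the closest pose from the previous frame, so keypoints stay attached
# to the same person. Distance metric and threshold are illustrative.
def pose_center(pose, min_confidence=0.3):
    pts = [(y, x) for y, x, score in pose if score >= min_confidence]
    if not pts:
        return None
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

def match_to_previous(current_poses, previous_poses, max_distance=0.1):
    """Return, for each current pose, the index of the matching previous pose (or None)."""
    matches = []
    for pose in current_poses:
        c = pose_center(pose)
        best, best_dist = None, max_distance
        for i, prev in enumerate(previous_poses):
            pc = pose_center(prev)
            if c is None or pc is None:
                continue
            dist = ((c[0] - pc[0]) ** 2 + (c[1] - pc[1]) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = i, dist
        matches.append(best)
    return matches
```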

Overall, we are happy with and impressed by the detection capabilities across different lighting and environmental conditions.