Augmenting Sports Videos with VisCommentator


Augmenting sports videos with data visualizations is gaining traction given its ability to communicate insights and explicate player strategies engagingly. However, creating such augmented videos is challenging, as it requires considerable domain expertise in sports analytics and video editing skills. To ease the creation process, we present a design space that characterizes augmented sports videos at the element level (what the constituents are) and the clip level (how the constituents are organized), derived by systematically reviewing 233 examples collected from TV channels, teams, and leagues. The design space provides guidance for selecting data insights and visualizations for various purposes. Informed by the design space, we present VisCommentator, which facilitates the creation of augmented videos for table tennis with recommendations of data insights and visualizations. A user study with seven domain experts confirms the usefulness and effectiveness of the system. Another study with sports fans found the resulting videos informative and engaging.

Demo Application

An online demo is currently unavailable due to hardware limitations (e.g., a GPU is required). The source code will be released once the paper is accepted.


Demo 1: The augmented clip reveals that the ball rotates so fast that the player in red can only return it to a narrow area, most likely the top-left corner of his opponent’s court. This gives a chance to his opponent, who performs an active attack in the next move and wins the game.



Design Space

Fig. 4. A clip-level design space for augmented sports videos: Data Level (vertical) and Narrative Order (horizontal). The numbers in the cells give the occurrences of each combination in our corpus. Darker cells mean more occurrences. The last row and column present the ratio of each option to its dimension.

Table 1. An element-level design space for augmented sports videos. We use the combination frequencies to support visualization recommendation in the system.
Narrative       Data Level  Data Type        point  line  area  glyph  label  content  progress  camera
Linear          event       kinematics         484   223     0     83     20       11        15      19
Linear          event       playerdistance       0    73     0      3     67        0         0       1
Linear          event       playerarea           0     0    39      0      0        0         0       0
Linear          event       playertext           0     0     0      0    107        0         0       0
Linear          event       courtarea            0     8    63      0      4        0         0       0
Linear          event       playeraction        11     2     0     85    177        2       461      67
Linear          event       playerformation      3    66    20      6     21        0         9       1
Linear          event       ballevent           50    15    15     12      3        0         7       0
Linear          object      kinematics          45    16     0      4      3        2         2       1
Linear          object      playerdistance       0     1     0      0      8        0         0       0
Linear          object      playerarea           0     1     7      0      0        0         0       1
Linear          object      playertext           0     0     0      0     12        0         1       0
Flash Forward   event       kinematics         208   230     0     24     12       77         6       8
Flash Forward   event       playerdistance       1    40     0      1     45        0         0       0
Flash Forward   event       playerarea           0     0    22      0      0        0         0       0
Flash Forward   event       playertext           0     0     0      0     16        0         0       0
Flash Forward   event       courtarea            0     3    24      0      1        1         0       0
Flash Forward   event       playeraction         0     1     0     23     44        5       222      22
Flash Forward   event       playerformation      1    19     5      1      8        0         3       0
Flash Forward   event       ballevent           69    10    51      3      8        0         0       0
Flash Forward   object      kinematics          12    13     0      0      0        6         0       0
Flash Forward   object      playerarea           0     0     1      0      0        0         0       0
Flash Forward   object      playertext           0     0     0      0      1        0         0       0
Flash Backward  event       kinematics          34    29     0      9      1        0         6       0
Flash Backward  event       playerdistance       0     2     0      0      0        0         0       0
Flash Backward  event       playerarea           0     0     4      0      0        0         0       0
Flash Backward  event       playertext           0     0     0      0      6        0         0       0
Flash Backward  event       courtarea            0     3     0      0      0        0         0       0
Flash Backward  event       playeraction         0     1     0      5     11        0       106      20
Flash Backward  event       playerformation      0     3     0      0      0        0         0       0
Flash Backward  event       ballevent            5     0     0      0      4        0         0       0
Flash Backward  object      kinematics           1     1     0      0      0        0         0       0
Flash Backward  object      playerarea           0     0     2      0      0        0         0       0
ZigZag          event       kinematics         129    31     0      9      0        2         4       1
ZigZag          event       playerdistance       0     2     0      0      1        0         0       0
ZigZag          event       playerarea           1     0    13      0      0        0         0       0
ZigZag          event       playertext           0     0     0      0      6        0         0       0
ZigZag          event       courtarea            0     5     6      0      1        0         0       0
ZigZag          event       playeraction         0     0     0      1     22        0       182      27
ZigZag          event       playerformation      0     3     1      0      0        0         0       0
ZigZag          event       ballevent            3     0     0      0      0        0         0       0
Grouped         event       kinematics           8     6     0      3      2        0         1       0
Grouped         event       playerdistance       0     4     0      0      3        0         0       0
Grouped         event       playerarea           0     0     0      0      0        0         0       0
Grouped         event       playertext           0     0     0      0      1        0         0       0
Grouped         event       courtarea            0     1     1      0      0        0         0       0
Grouped         event       playeraction         0     0     0      2      2        0         6       5
Grouped         event       playerformation      0     2     0      0      0        0         0       0
Grouped         event       ballevent            2     0     2      0      0        0         0       0
TimeFolk        event       kinematics          56    22     0      0      0       23         2       2
TimeFolk        event       playerdistance       0     3     0      0      5        0         0       0
TimeFolk        event       playerarea           0     0    11      0      0        0         0       0
TimeFolk        event       playertext           0     0     0      0      2        0         0       0
TimeFolk        event       courtarea            0     4     4      0      1        0         0       0
TimeFolk        event       playeraction         7    11     0      2      5        4        41       6
TimeFolk        event       playerformation      0     4     1      0      6        0         0       0
TimeFolk        event       ballevent            3     8     5      0      1        2         0       0
TimeFolk        object      kinematics           3     2     0      0      0        5         0       0
TimeFolk        object      playerdistance       0     2     0      0      2        0         0       0
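
One way the combination frequencies in Table 1 could drive recommendation is to rank the visual types by how often they co-occur with a given narrative order, data level, and data type. The sketch below is only an illustration under that assumption, not the system's implementation; the `recommend` helper and the dictionary layout are hypothetical, and only two rows of Table 1 are copied in.

```python
VISUAL_TYPES = ["point", "line", "area", "glyph",
                "label", "content", "progress", "camera"]

# Two rows copied from Table 1; keys are (narrative, data level, data type).
FREQ = {
    ("Linear", "event", "kinematics"): [484, 223, 0, 83, 20, 11, 15, 19],
    ("Linear", "event", "playeraction"): [11, 2, 0, 85, 177, 2, 461, 67],
}


def recommend(narrative, level, data_type, top_k=3):
    """Rank visual types by combination frequency (hypothetical helper)."""
    counts = FREQ[(narrative, level, data_type)]
    ranked = sorted(zip(VISUAL_TYPES, counts), key=lambda p: p[1], reverse=True)
    # Drop visual types that never occur with this combination.
    return [name for name, count in ranked[:top_k] if count > 0]


if __name__ == "__main__":
    print(recommend("Linear", "event", "kinematics"))
    print(recommend("Linear", "event", "playeraction"))
```

With these two rows, kinematics data in a linear narrative would favor points and lines, while player actions would favor progress and label elements.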

Machine Learning Library and Code

1. BodyPix2

We use BodyPix2 from Google to detect and segment players in a video. BodyPix2 is built on TensorFlow.js and can run in Node.js environments. The hyperparameters we used are:

// for bodyPix.load
    "architecture": "ResNet50",
    "outputStride": 16,
    "quantBytes": 4
// for net.segmentPerson
    "flipHorizontal": false,
    "internalResolution": "full",
    "segmentationThreshold": 0.7,
    "maxDetections": 10,
    "scoreThreshold": 0.2,
    "nmsRadius": 20,
    "minKeypointScore": 0.3,
    "refineSteps": 10

2. Event Recognition

We use a spatial temporal graph convolutional network (ST-GCN) [1] to recognize player events, such as the technique a player uses in a stroke.

2.1. Input & output

Following the pipeline proposed in ST-GCN, we first collect table tennis videos with high resolution (1080p) and frame rate (50 fps). We then use a pose estimation algorithm to retrieve the body joints from each frame and construct a spatial temporal graph in which each node is a joint and each edge connects either two joints of a body or the same joint in adjacent frames. Finally, this graph is used as the input to the ST-GCN.
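
The graph construction described above can be sketched as follows. This is a minimal illustration, not the code used in the system: the five-joint skeleton, the `BONES` list, and the `build_st_edges` helper are hypothetical, and a real pose estimator outputs more joints per body.

```python
# Hypothetical 5-joint skeleton: pairs of joints connected within one body.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def build_st_edges(num_frames, num_joints=NUM_JOINTS, bones=BONES):
    """Build edges over nodes identified as (frame, joint).

    Spatial edges follow the bones inside each frame; temporal edges link
    the same joint in two adjacent frames.
    """
    spatial = [((t, a), (t, b)) for t in range(num_frames) for a, b in bones]
    temporal = [((t, j), (t + 1, j))
                for t in range(num_frames - 1)
                for j in range(num_joints)]
    return spatial, temporal


if __name__ == "__main__":
    spatial, temporal = build_st_edges(num_frames=3)
    print(len(spatial), "spatial edges,", len(temporal), "temporal edges")
```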

The output of the ST-GCN is fed to a standard SoftMax classifier, which assigns the graph to the corresponding technique category.
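
As a small illustration of the classification step, the sketch below applies SoftMax to a vector of per-category scores and picks the most probable category. The scores are made up, and using the seven technique names as labels is a simplification for readability; the `classify` helper is hypothetical.

```python
import math

# Illustrative labels only; the trained model distinguishes more categories.
TECHNIQUES = ["topspin", "reverse", "push", "short",
              "pendulum", "attack", "others"]


def softmax(scores):
    """Numerically stable SoftMax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels=TECHNIQUES):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Made-up scores standing in for the network output.
    print(classify([2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))
```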

2.2. Training

We kept the network structure the same as that introduced in the paper [1]. To gather training data, we collected thousands of strokes from hundreds of world tournament games played during 2016-2018. These data were manually labeled from videos and organized into structured JSON files. A sample file fragment is as follows:


To train the ST-GCN, we kept the number of records (video fragments) for each technique equal so that the scale of joints from different categories is almost the same. To this end, we first divided the records of each technique into two categories: the player facing the screen and the player with their back to the screen. Then, we grouped the techniques with few records into an "others" category and discarded extra records to keep the number of records consistent across categories. Finally, we obtained seven techniques (i.e., topspin, reverse, push, short, pendulum, attack, and others) and 14 categories with 4375 records each.
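
The balancing procedure above can be sketched as follows. This is an illustration under assumptions, not the preprocessing script we used: the record fields `technique` and `facing`, the `min_per_technique` threshold, and the choice of random downsampling to discard extra records are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_records(records, min_per_technique, seed=0):
    """Merge rare techniques into "others", split every technique by facing,
    and downsample each (technique, facing) category to the smallest size."""
    technique_counts = Counter(r["technique"] for r in records)
    groups = defaultdict(list)
    for r in records:
        technique = r["technique"]
        if technique_counts[technique] < min_per_technique:
            technique = "others"  # too few records to form its own category
        groups[(technique, r["facing"])].append(r)
    target = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    # Discard extra records so that all categories have the same size.
    return {key: rng.sample(g, target) for key, g in groups.items()}


if __name__ == "__main__":
    demo = ([{"technique": "topspin", "facing": "front"}] * 5
            + [{"technique": "topspin", "facing": "back"}] * 4
            + [{"technique": "lob", "facing": "front"}] * 3
            + [{"technique": "flick", "facing": "back"}] * 2)
    for key, group in balance_records(demo, min_per_technique=4).items():
        print(key, len(group))
```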

3. Causality

We use FGES (Fast Greedy Equivalence Search) [2] to extract the recommended data for later use. The toolkit can be found in [3].

3.1. Input

To generate a causal graph, we organize the table tennis data into a table in which each row represents a rally and each column represents a technique. Each cell records how many times the corresponding technique is used in that rally.

Input Table Sample
Topspin Reverse Push Pendulum Flick Lob Twist Block Short Others Attack Smash
2 0 0 1 0 0 0 0 0 0 1 0
2 1 0 0 0 0 0 2 1 0 1 0
0 0 0 1 1 0 0 2 1 0 1 0
0 0 0 1 1 0 0 2 1 0 1 0
2 1 0 0 0 0 0 1 0 0 2 1
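
A table of this shape can be built from per-rally stroke lists as sketched below. The column order follows the sample above; the `rallies_to_table` helper and the example stroke sequence are hypothetical, chosen so that the resulting row matches the first sample row.

```python
from collections import Counter

TECHNIQUES = ["Topspin", "Reverse", "Push", "Pendulum", "Flick", "Lob",
              "Twist", "Block", "Short", "Others", "Attack", "Smash"]


def rallies_to_table(rallies, columns=TECHNIQUES):
    """Turn each rally (a list of stroke technique names) into one row of
    per-technique frequencies, in the given column order."""
    table = []
    for strokes in rallies:
        counts = Counter(strokes)
        table.append([counts.get(name, 0) for name in columns])
    return table


if __name__ == "__main__":
    # Hypothetical rally: stroke order is made up for illustration.
    rallies = [["Pendulum", "Topspin", "Topspin", "Attack"]]
    for row in rallies_to_table(rallies):
        print(" ".join(str(v) for v in row))
```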

3.2. Output



[1] Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455, 2018.

[2] Fast Greedy Equivalence Search (FGES) Algorithm for Discrete Variables. Available at: (Accessed: 10 September 2020)

[3] py-causal. Available at: (Accessed: 10 September 2020)