The 20BN-JESTER dataset is a large collection of densely labeled video clips that show humans performing predefined hand gestures in front of a laptop camera or webcam. The dataset was created by a large number of crowd workers. It allows for training robust machine learning models to recognize human hand gestures. It is available free of charge for academic research. Commercial licenses are available upon request.
A paper with supplementary material can be found here.
The video data is provided as one large TGZ archive, split into parts of at most 1 GB each. The total download size is 22.8 GB. The archive contains directories numbered from 1 to 148092. Each directory corresponds to one video and contains JPG images with a height of 100 px and variable width. The JPG images were extracted from the original videos at 12 frames per second. The filenames of the JPGs start at 00001.jpg. The number of JPGs varies because the length of the original videos varies.
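As a rough illustration of the layout described above, the following sketch lists the frames of one video directory in temporal order and estimates the length of the original clip from the frame count. The function names are hypothetical and not part of any official tooling; the sketch assumes only what is stated above: zero-padded JPG filenames starting at 00001.jpg and an extraction rate of 12 frames per second.

```python
from pathlib import Path

FPS = 12  # extraction rate stated in the dataset description


def list_frames(video_dir):
    """Return the JPG frames of one video directory in temporal order."""
    # Zero-padded names (00001.jpg, 00002.jpg, ...) sort correctly as strings.
    return sorted(Path(video_dir).glob("*.jpg"))


def clip_duration_seconds(num_frames, fps=FPS):
    """Approximate length of the original clip from its extracted frame count."""
    return num_frames / fps
```

For example, a directory holding 36 frames would correspond to a clip of about 3 seconds. The path passed to `list_frames` depends on where the archive was extracted.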
This dataset may be used for academic research free of charge under the license agreement below. If you seek to use the data for commercial purposes, please contact us.
|Total number of videos||
|Test Set (w/o labels)||
|Doing other things|12,416|
|Pulling Hand In|5,379|
|Pulling Two Fingers In|5,315|
|Pushing Hand Away|5,434|
|Pushing Two Fingers Away|5,358|
|Rolling Hand Backward|5,031|
|Rolling Hand Forward|5,165|
|Sliding Two Fingers Down|5,410|
|Sliding Two Fingers Left|5,345|
|Sliding Two Fingers Right|5,244|
|Sliding Two Fingers Up|5,262|
|Turning Hand Clockwise|3,980|
|Turning Hand Counterclockwise|4,181|
|Zooming In With Full Hand|5,307|
|Zooming In With Two Fingers|5,355|
|Zooming Out With Full Hand|5,330|
|Zooming Out With Two Fingers|5,379|
If you have successfully built a classification model on the training set and it performs well on the validation set, we encourage you to run it on the test set (which is published without any class labels, as you might have noticed). Please prepare a .csv file with the video's id in the first column and your predicted class label in the second column (as a string matching the wording used in the training and validation sets). Please use a semicolon as the separator. You can then upload your .csv file here (user login required) to be ranked in the leaderboard and to benchmark your approach against those of other machine learners. We are looking forward to your submission.
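The submission format described above can be produced with Python's standard `csv` module. This is a minimal sketch, not an official tool: the function name `write_submission` is hypothetical, and the sketch assumes that no header row is expected, since the description does not mention one.

```python
import csv


def write_submission(predictions, path):
    """Write (video_id, predicted_label) pairs as a semicolon-separated file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";")
        for video_id, label in predictions:
            # Labels must match the wording used in the training/validation sets.
            writer.writerow([video_id, label])
```

A call such as `write_submission([(1, "Pulling Hand In"), (2, "Doing other things")], "submission.csv")` would write one `id;label` row per video. The ids in this call are placeholders, not real test-set ids.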
RFEEN, 20 Crops
Ford's Gesture Recognition System
L. Shi, Y. Zhang, C. Jian, and L. Hanqing, "Gesture Recognition using Spatiotemporal Deformable Convolutional Representation" in IEEE International Conference on Image Processing (ICIP), 2019.
Motion Fused Frames (MFFs)
Spatiotemporal Two Streams network
3D CNN Architecture
Motion Feature Network (MFNet)
SSNet RGB resnet
Temporal Pyramid Relation Network for Video-Based Gesture Recognition, 2018 25th IEEE International Conference on Image Processing (ICIP)
TRN - 8 segments
3D CNN - Multi time scale evaluation
TRN (CVPR'18 submission)
TRN + BNInception
3D CNN for transfer learning
2D and 3D fused network
Inceptionv2 - TRN16
One Stream Modified-I3D
RT C3D - 16 Frames
3D ResNet 101
3D convolutional neural network
Twenty Billion Neuron's Jester System
Basic finetune MobileNetV2 (pretrained imagenet) + LSTM output
3D-ResNet101 trained on Kinetics
Just random guessing...
test label (from 0 to 26)
[test_run] 3D RGB 16F