## Labelling unlabelled videos from scratch with multi-modal self-supervision

This repo contains the implementation of _Labelling unlabelled videos from scratch with multi-modal self-supervision_, which learns clusters from multi-modal data in a self-supervised way.


![Teaser Image](misc/splash.png)


## Key contributions

**(1) Clustering does not come for free**

Even very strong feature representations, such as those of a supervised-pretrained R(2+1)D-18 or a MIL-NCE S3D network, underperform our method, which _learns_ clusters.

**(2) Truly multi-modal clustering yields robust clusters**

Since our method treats each modality as an _augmentation_ of the other, it learns to give stable predictions even if one modality is degraded.

## Installation

This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0.

1. Install required packages using `conda env create -f environment.yml`

2. Activate conda environment using `conda activate lab_vid`

3. Install [FAISS](https://github.com/facebookresearch/faiss/blob/master/INSTALL.md) via conda.

4. See below for how to run evaluation and pretraining.


## Training

To train a model from scratch locally, run:
```
python3 main.py --output-dir {SAV_FOLDER} --dataset {vggsound, kinetics, ave, kinetics_sound} --model {avc, avc_concat}
```

On SLURM:
```
sbatch scripts/run_main.sh ${USE_MLP} ${NUM_CLS} ${HEADCOUNT} ${AV_ALIGN} ${DIST} ${N_GROUPS} ${IND_GROUPS} ${DATASET} ${NUM_DATA_SAMPLES}
# e.g. sbatch scripts/run_main.sh True 309 10 True gauss 1 2 vggsound 170752
```

To train a model from a checkpoint, run:
```
python3 finetune.py --ckpt-path vggsound_ckpt.pth
```
On SLURM:
```
sbatch scripts/finetune.sh ${USE_MLP} ${NUM_CLS} ${HEADCOUNT} ${AV_ALIGN} ${DIST} ${CKPT_PATH} ${IND_GROUPS} ${DATASET} ${NUM_DATA_SAMPLES}
# e.g. sbatch scripts/finetune.sh True 309 10 True gauss vggsound_ckpt.pth 2 vggsound 170752
```

Please replace the `XXX` placeholders in the SLURM scripts:
- SBATCH directives
- SAV_FOLDER

## Evaluation: Quantitative results

To obtain results for our model, please first download the pretrained model from [here](https://www.dropbox.com/s/utb9yyupszfh41h/vggsound_ckpt_200.pth).
Place the downloaded checkpoint in the repo under `pretrained_models/`.

**Clustering quality**.

To evaluate the clustering quality of our model, run the following commands:
```
python3 cluster_fit.py --weights-path ${WEIGHTS_PATH} --output-dir ${OUTPUT_DIR} --run-id ${EXP_DESC} --dataset {vggsound, kinetics, ave, kinetics_sound} --mode train
python3 k_means.py --path ${OUTPUT_DIR}/${EXP_DESC}.pkl --ncentroids ${NUM_CLS} --use-ours True
# Set NUM_CLS={kinetics: 400, ave: 28, vggsound: 309, kinetics_sound: 32}
```

Results:

|                     | `NMI` | `ARI` | `Acc` | `<H>`| `<pmax>` |
| -------------       | -   |   - |  -  |   - |       - |
| Ours on VGGSound    | 56.7 | 22.5 | 32.3 | 2.4 | 38.0 |
| Ours on Kinetics    | 24.9 | 2.5 | 6.6 | 4.4 | 8.7 |
| Ours on Kinetics-S  | 50.2 | 31.4 | 43.2 | 1.7 | 48.1 |
| Ours on AVE         | 64.4 | 43.0 | 54.8 | 1.2 | 58.6 |
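
The columns above can be computed from predicted cluster ids and ground-truth labels. The following is a minimal sketch, not the repo's `k_means.py`: it assumes `Acc` is the accuracy after a one-to-one Hungarian matching of clusters to classes, and that `<H>` and `<pmax>` are the mean per-cluster label entropy and the mean per-cluster maximal purity. The function names and the log base (bits) are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency(y_true, y_pred):
    """Count matrix with rows = predicted clusters, columns = ground-truth classes."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    counts = np.zeros((p.max() + 1, t.max() + 1), dtype=np.int64)
    np.add.at(counts, (p, t), 1)  # accumulate one count per sample
    return counts


def clustering_accuracy(counts):
    """Accuracy under the best one-to-one cluster-to-class matching (Hungarian)."""
    rows, cols = linear_sum_assignment(-counts)  # negate to maximise matched samples
    return counts[rows, cols].sum() / counts.sum()


def mean_entropy_and_purity(counts):
    """Mean over clusters of the label entropy (bits, an assumption) and the top label fraction."""
    probs = counts / counts.sum(axis=1, keepdims=True)
    logs = np.log2(probs, where=probs > 0, out=np.zeros_like(probs))
    entropy = -(probs * logs).sum(axis=1)
    return entropy.mean(), probs.max(axis=1).mean()


if __name__ == "__main__":
    # Toy labels, purely illustrative.
    c = contingency([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
    print(clustering_accuracy(c), mean_entropy_and_purity(c))
```

Lower mean entropy and higher mean purity both indicate clusters that are dominated by a single ground-truth class.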


To obtain results for the baseline models we benchmark against:

1. Download pretrained models:
```
bash scripts/download_models.sh
```

2. Obtain features for the different methods:
```
# MIL-NCE
python3 cluster_fit.py --mil-nce True --output-dir ${OUTPUT_DIR} --run-id ${EXP_DESC} --dataset {vggsound, kinetics, ave, kinetics_sound} --mode train
# XDC
python3 cluster_fit.py --xdc True --output-dir ${OUTPUT_DIR} --run-id ${EXP_DESC} --dataset {vggsound, kinetics, ave, kinetics_sound} --mode train
# DPC
python3 cluster_fit.py --dpc True --output-dir ${OUTPUT_DIR} --run-id ${EXP_DESC} --dataset {vggsound, kinetics, ave, kinetics_sound} --mode train
```

3. Run FAISS k-means on features:
```
python3 k_means.py --path ${OUTPUT_DIR}/${EXP_DESC}.pkl --ncentroids ${NUM_CLS} --use-ours False
# Set NUM_CLS={kinetics: 400, ave: 28, vggsound: 309, kinetics_sound: 32}
```
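
For readers unfamiliar with what the k-means step does to the extracted feature matrix, here is a dependency-light sketch of Lloyd's algorithm in plain NumPy. It is a stand-in for illustration only: it is neither FAISS nor the repo's `k_means.py`, and the function name and parameters are hypothetical. It also materialises all sample-to-centroid distances at once, so it is only suitable for small inputs.

```python
import numpy as np


def kmeans(features, ncentroids, niter=20, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster id for every row of features)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=np.float64)
    # Initialise centroids from distinct, randomly chosen samples.
    centroids = x[rng.choice(len(x), size=ncentroids, replace=False)].copy()
    for _ in range(niter):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(ncentroids):
            members = x[labels == k]
            if len(members):  # keep the previous centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)
```

The returned cluster ids play the role of the predicted labels that are then compared against the ground-truth classes.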

**Video Action Recognition**.

To evaluate weights on video action recognition, run the following:
```
python3 finetune_video.py --dataset {ucf101, hmdb51} --fold {1,2,3} --weights-path ${WEIGHTS_PATH}
```

*HMDB-51*

|                 | Fold 1 | Fold 2 | Fold 3 | 3-fold mean |
| -------------   | - | - | - | - |
| Ours (VGGSound) | 53.32 | 53.62 | 52.32 | 53.1 |

*UCF-101*

|                 | Fold 1 | Fold 2 | Fold 3 | 3-fold mean |
| -------------   | - | - | - | - |
| Ours (VGGSound) | 86.35 | 88.14 | 88.75 | 87.7 |

**Video Retrieval**.

To evaluate weights on video retrieval, run:
```
python3 video_retrieval.py --dataset {ucf101, hmdb51} --fold 1 --weights-path ${WEIGHTS_PATH}
```

*HMDB-51*

|                 | R@1 | R@5 | R@20 |
| -------------   | - | - | - |
| Ours (VGGSound) | 24.8 | 47.6 | 75.5 |

*UCF-101*

|                 | R@1 | R@5 | R@20 |
| -------------   | - | - | - |
| Ours (VGGSound) | 52.0 | 68.6 | 84.5 |
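
As an illustration of how such retrieval numbers are typically computed, the sketch below assumes the three columns report recall at k = 1, 5, 20: a query counts as a hit if at least one of its k nearest gallery clips shares its class. This is an assumption about the protocol, not a description of `video_retrieval.py`; the function name, the cosine-similarity choice, and the query/gallery split are hypothetical.

```python
import numpy as np


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 5, 20)):
    """Fraction of queries with a same-class item among their k nearest gallery items."""
    q = np.asarray(query_feats, dtype=np.float64)
    g = np.asarray(gallery_feats, dtype=np.float64)
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)
    # L2-normalise so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = q @ g.T
    # Indices of the most similar gallery items, best first.
    order = np.argsort(-sims, axis=1)[:, :max(ks)]
    hits = gallery_labels[order] == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

In a typical setup the gallery would hold features of the training split and the queries those of the test split of the chosen fold.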

## Evaluation: Qualitative results

**Cluster visualization**.

To obtain an interactive cluster visualization such as the one provided in the supplementary information (SI), first train the model on the dataset, then run:
```
python3 get_clusters_vggsounds.py --path ${CKPT_PATH} --jobid ${VGGS_JOBID} --num-clusters 309 ;
python3 get_clusters_kinetics.py --path ${CKPT_PATH} --jobid ${KIN_JOBID} --num-clusters 400 ;
cd cluster_vis;
python3 preprocess.py --kinetics-path cluster_vis/${KIN_JOBID}.pkl --vgg-sound-path cluster_vis/${VGGS_JOBID}.pkl

# Then open index.html in your browser
```
