# zipfian-whitening

## 1. Install dependencies
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## 2. Set up word2vec embeddings for sentence-transformers
First, download the word2vec binary (`GoogleNews-vectors-negative300.bin.gz`) from Google Drive into the `models` directory: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g

Then decompress the archive and convert it to text format:
```bash
# Unzip word2vec file
cd models
gunzip GoogleNews-vectors-negative300.bin.gz
cd ..
# Convert word2vec binary file to txt format so that it can be integrated with Sentence-Transformers
python src/convert_w2c_binary_to_text.py
```
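You should now have the following files (plain `gunzip` deletes the `.bin.gz` archive after decompressing, so it appears only if you kept a copy, e.g. with `gunzip -k`):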
```
models
├── GoogleNews-vectors-negative300.bin
├── GoogleNews-vectors-negative300.bin.gz
├── GoogleNews-vectors-negative300.txt
└── GoogleNews-vectors-negative300.vocab
```
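
For readers who want to see what the conversion step involves, here is a minimal sketch of a word2vec binary-to-text converter in plain Python. It is not the repository's `src/convert_w2c_binary_to_text.py`; the function name `convert_word2vec_binary_to_text` is hypothetical, and the sketch assumes the standard word2vec binary layout (a `vocab_size dim` header line, then each word followed by a space and `dim` little-endian float32 values). It writes only the `.txt` file, since the layout of the `.vocab` file is not described here.

```python
import struct


def convert_word2vec_binary_to_text(bin_path, txt_path):
    """Convert a word2vec binary file to the word2vec text format.

    Assumes the standard binary layout: an ASCII header "vocab_size dim\n",
    then, for each entry, the word, a single space, and dim float32 values.
    Returns the number of words written.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        vocab_size, dim = map(int, src.readline().decode("utf-8").split())
        dst.write(f"{vocab_size} {dim}\n")

        for _ in range(vocab_size):
            # Read the word one byte at a time, up to the separating space.
            word_bytes = bytearray()
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # some writers emit a newline after each vector
                    word_bytes.extend(ch)
            word = word_bytes.decode("utf-8", errors="ignore")

            # Read the vector: dim little-endian 4-byte floats.
            buf = src.read(4 * dim)
            if len(buf) != 4 * dim:
                raise EOFError(f"truncated vector for word {word!r}")
            vector = struct.unpack(f"<{dim}f", buf)

            dst.write(word + " " + " ".join(f"{x:.6g}" for x in vector) + "\n")

    return vocab_size
```

Under these assumptions, calling `convert_word2vec_binary_to_text("models/GoogleNews-vectors-negative300.bin", "models/GoogleNews-vectors-negative300.txt")` would produce the text file listed above. The output is streamed line by line, so the full embedding matrix never has to fit in memory.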

## 3. Reproduce experimental results
```bash
source .venv/bin/activate
bash scripts/run_all.sh
```

All results are saved under `results/`.
To visualize them, run the notebooks `notebooks/experiments_norm.ipynb` and `notebooks/table_generator.ipynb`.

## 4. Other
The English Wikipedia word frequency data is taken from Arora et al. (2017) [1]: https://github.com/PrincetonML/SIF/raw/master/auxiliary_data/enwiki_vocab_min200.txt
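
As an illustration of how such a frequency file can be turned into unigram probabilities, here is a minimal sketch. It assumes each line of the file holds one whitespace-separated `word count` pair; the function name `load_unigram_probs` is hypothetical, and this is not necessarily how the code in this repository reads the file.

```python
def load_unigram_probs(path):
    """Read "word count" lines and return a dict of word -> probability.

    Assumption: one whitespace-separated "word count" pair per line.
    Lines that do not match this shape are skipped.
    """
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            word, count = parts
            counts[word] = counts.get(word, 0.0) + float(count)

    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no word counts found in {path}")

    # Normalize raw counts so the probabilities sum to 1.
    return {word: count / total for word, count in counts.items()}
```

For example, `load_unigram_probs("enwiki_vocab_min200.txt")` would return the empirical word distribution, with the local file path being whatever location the file was downloaded to.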

[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. "A Simple but Tough-to-Beat Baseline for Sentence Embeddings." In ICLR, 2017.