# SetBench: A Benchmark for Evaluating Large Language Models on Set Operations

## Overview
SetBench is a synthetic benchmark for evaluating how robustly LLMs follow instructions when performing set operations. It varies the conditions of each task, including the set operation, the set sizes, the nature of the set members, and how the sets are constructed.

## Dataset
The dataset includes everything needed to reproduce the benchmark and to evaluate an LLM on set operations. It consists of the following files:

| File | Description |
| --- | --- |
| `data/set-bench-config.txt` | Configuration file containing hyperparameters |
| `data/set-bench.parquete` | Parquet file containing prompts, ground truth, and hyperparameters |

To load the Parquet file, first install the dependencies, then read it with pandas:

```shell
pip install pandas
pip install pyarrow
```

```python
import pandas as pd

# Read the benchmark into a DataFrame (pandas uses pyarrow as the Parquet engine).
data = pd.read_parquet("data/set-bench.parquete")

print(data.shape)  # (182448, 16)
print(data.columns.tolist())
# ['SetOperation', 'OperandSize', 'TokenType', 'TokenLength', 'MaxValue',
#  'PromptingMethod', 'DemonstrationPhrasing', 'NumberOfDemonstrations',
#  'TokenFrequency', 'AreTokensSimilar', 'RelationshipBetweenSetsAAndB',
#  'OverlapPercentage', 'SetA', 'SetB', 'GroundTruth', 'Prompt']
```

## Parameters
This section describes the hyperparameters used when sampling prompts. A value of `-1` in any hyperparameter list below means that, for some prompts, the constraint associated with that hyperparameter was not applied. See the paper for further details.
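
The `-1` convention can be handled with a simple mask when filtering rows. The sketch below is illustrative only: it uses a small hand-made DataFrame in place of the real file, assumes the column is stored as integers, and the helper name `split_by_constraint` is hypothetical rather than part of any dataset tooling.

```python
import pandas as pd

UNSET = -1  # sentinel: the constraint was not in effect for this prompt


def split_by_constraint(df: pd.DataFrame, column: str):
    """Return (constrained, unconstrained) rows for one hyperparameter column."""
    unset_mask = df[column] == UNSET
    return df[~unset_mask], df[unset_mask]


# Toy stand-in for the real data; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "SetOperation": ["Union", "Union", "Intersection", "Difference"],
        "TokenLength": [-1, 3, 5, -1],
    }
)

constrained, unconstrained = split_by_constraint(toy, "TokenLength")
print(len(constrained), len(unconstrained))
```

Keeping the two groups separate avoids accidentally treating `-1` as a real token length when aggregating results.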

### Common Hyperparameters
| Parameter | Values |
| --- | --- |
| SetOperation | `['Difference', 'Intersection', 'Symmetric difference', 'Union']` |
| OperandSize | `[2, 4, 8, 16]` |
| TokenType | `['Deceptive Words', 'Numbers', 'Overlapping Numbers', 'Overlapping Words', 'Words', 'Words by Frequency']` |
| TokenLength | `[-1, 1, 2, 3, 4, 5]` |
| PromptingMethod | `['Base Prompt', 'CoT']` |
| DemonstrationPhrasing | `['Formal Language', 'Plain Language']` |
| NumberOfDemonstrations | `[0, 1, 3, 5]` |
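
For reference, the four operation names in the table correspond to Python's built-in set operators. The following is a minimal sketch, not the benchmark's own scoring code: the function name `apply_set_operation` is hypothetical, the direction of `Difference` (A minus B) is an assumption, and the way `SetA` and `SetB` are serialized in the Parquet file is not specified here, so the example works on plain Python sets.

```python
def apply_set_operation(operation: str, a: set, b: set) -> set:
    """Compute the result for one of the four operation names in the table."""
    if operation == "Difference":
        return a - b  # assumption: Difference means A minus B
    if operation == "Intersection":
        return a & b
    if operation == "Symmetric difference":
        return a ^ b
    if operation == "Union":
        return a | b
    raise ValueError(f"Unknown operation: {operation!r}")


# Illustrative operands of size 4 (one of the listed OperandSize values).
set_a = {"apple", "pear", "plum", "fig"}
set_b = {"plum", "fig", "kiwi", "lime"}

for name in ["Difference", "Intersection", "Symmetric difference", "Union"]:
    print(name, sorted(apply_set_operation(name, set_a, set_b)))
```

Comparing results as sets rather than lists makes the check independent of the order in which elements are listed.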

### Lexico-semantic Hyperparameters
For the hyperparameter `RelationshipBetweenSetsAAndB`, a value of `0` means "Semantically disjoint" and a value of `1` means "Semantically intermingled". These values are used in the deceptive sets evaluation in the paper. For details, see Sections 3.3.2 and 4.2 of the paper.

| Parameter | Values |
| --- | --- |
| TokenFrequency | `[-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]` |
| AreTokensSimilar | `[0, 1]` |
| RelationshipBetweenSetsAAndB | `[-1, 0, 1]` |
| OverlapPercentage | `['0.5', 'None']` |
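
When analyzing results, it can help to replace the numeric codes of `RelationshipBetweenSetsAAndB` with readable labels. The sketch below assumes the column holds integers and uses a toy DataFrame rather than the real file; the label `"Not applicable"` for `-1`, the constant name `RELATIONSHIP_LABELS`, and the pairing of token types with codes in the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Labels for 0 and 1 come from the dataset description; the -1 label is an assumption.
RELATIONSHIP_LABELS = {
    -1: "Not applicable",
    0: "Semantically disjoint",
    1: "Semantically intermingled",
}

# Toy stand-in for the real data; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "TokenType": ["Deceptive Words", "Deceptive Words", "Numbers"],
        "RelationshipBetweenSetsAAndB": [0, 1, -1],
    }
)

# Map each code to its label, then keep only rows where the setting applies.
toy["RelationshipLabel"] = toy["RelationshipBetweenSetsAAndB"].map(RELATIONSHIP_LABELS)
deceptive = toy[toy["RelationshipBetweenSetsAAndB"] != -1]
print(deceptive[["TokenType", "RelationshipLabel"]])
```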

## Resources

The following resources were used to construct the dataset:

- `resources/words.zip`: English vocabulary used in the paper.
- `resources/deciles.json`: English vocabulary grouped by deciles of rank frequency as determined by the Google Books Ngram corpus, as described in the paper.
