Efficient-HELM
This tutorial shows you how to locally add your model to the HELM Classic leaderboard at a fraction of the cost of a full run, using a technique from IBM Research described in the paper Efficient Benchmarking (of Language Models) (Perlitz et al., 2023).
Warning: this tutorial currently works only with the HELM Classic leaderboard. Other leaderboards are not yet supported.
Download HELM leaderboard results
First, to compare your model against the latest and greatest models on the HELM Classic leaderboard, download an archive of all previous HELM Classic results. Start by setting the leaderboard version:

```bash
export LEADERBOARD_VERSION=v0.3.0
```

Then download the archive and expand it into HELM's results directory:
```bash
curl -O https://storage.googleapis.com/crfm-helm-public/benchmark_output/archives/$LEADERBOARD_VERSION/run_stats.zip && \
mkdir -p benchmark_output/runs/$LEADERBOARD_VERSION && \
unzip run_stats.zip -d benchmark_output/runs/$LEADERBOARD_VERSION
```
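To sanity-check the download, you can list the expanded results; a quick look, assuming the default `benchmark_output` location used above:

```bash
# List a few of the downloaded per-run stats directories.
ls benchmark_output/runs/$LEADERBOARD_VERSION | head
```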
With these files in your results directory, the web UI will show all the HELM leaderboard models alongside your own.
Run Efficient-HELM
According to Efficient Benchmarking (of Language Models) (Perlitz et al., 2023), a paper from IBM Research that systematically analyzed benchmark design choices using the HELM benchmark as an example, one can run the HELM benchmark with a fraction of the examples and still get a reliable estimate of a full run.
Specifically, the authors computed the 95% confidence interval (CI) of a model's rank location, relative to its rank in a full run, as a function of the number of examples used per scenario, and arrived at the following tradeoffs[^1]:
| Examples Per Scenario | 95% CI of Rank Location | Compute Saved |
|---|---|---|
| 10 | ±5 | ×400 |
| 20 | ±4 | ×200 |
| 50 | ±3 | ×80 |
| 200 | ±2 | ×20 |
| 1000 | ±1 | ×4 |
| All | ±1 | ×1 |
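For example, at 20 examples per scenario, your model's estimated rank should land within ±4 positions of its full-run rank with 95% confidence, at roughly 1/200 of the compute of a full evaluation.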
Choose your point on the tradeoff: how accurate does your rank need to be, and how long are you willing to wait? Once you have chosen, set the number of examples per scenario and the model you want to run:
```bash
export EXAMPLES_PER_SCENARIO=10 && \
export MODEL_TO_RUN=openai/gpt2
```
That's it. Run the following to download the config file:
```bash
wget https://raw.githubusercontent.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_entries_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_entries_$EXAMPLES_PER_SCENARIO.conf
```
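If you want to peek at what you downloaded, the first few entries give a feel for the run configuration; a quick sanity check (the exact entry syntax is HELM's run-entries format and may vary across versions):

```bash
# Show the first few run entries from the downloaded config.
head -n 5 run_entries_$EXAMPLES_PER_SCENARIO.conf
```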
and this one to run the benchmark:
```bash
helm-run \
    --conf-paths run_entries_$EXAMPLES_PER_SCENARIO.conf \
    --suite $LEADERBOARD_VERSION \
    --max-eval-instances $EXAMPLES_PER_SCENARIO \
    --models-to-run $MODEL_TO_RUN \
    --cache-instances \
    --num-train-trials 1 \
    --skip-completed-runs
```
The first run will take some time, since all the data (regardless of the number of examples chosen) has to be downloaded and prepared.
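Once helm-run finishes, you can check that your model's runs landed next to the downloaded leaderboard results. A minimal sketch, assuming HELM's default layout, in which run directory names embed the model name with `/` replaced by `_`:

```bash
# List only the run directories produced for your model.
# (Assumes HELM's default run naming; adjust the pattern if your layout differs.)
ls benchmark_output/runs/$LEADERBOARD_VERSION | grep "${MODEL_TO_RUN//\//_}"
```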
Summarize and serve your results
To see how your model fits into the latest leaderboard, process and aggregate your results with:
```bash
helm-summarize --suite $LEADERBOARD_VERSION
```
And serve with:
```bash
helm-server
```
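Then open http://localhost:8000 in your browser (helm-server's default address) to browse the leaderboard, with your model listed alongside the official results.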
References
Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M. and Choshen, L., 2023. Efficient Benchmarking (of Language Models). arXiv preprint arXiv:2308.11696.
[^1]: Note that the quantities in the table are the 95% CI of the rank location and are thus very conservative estimates. In our experiments, we did not observe deviations above ±2 for any of the options above.