Code Structure
Warning — This document is stale: it was last modified more than ten months ago. The information below may be outdated and incorrect. Please proceed with caution!
Bird's-Eye View
Here's a bird's-eye view of how the benchmarking process interacts with the main
classes (see `benchmark`):

- A `Scenario` (given by a `ScenarioSpec`) specifies a task and a data distribution. It specifies a set of `Instance`s, where each `Instance` has an input (e.g., question) and a set of `Reference` outputs (e.g., multiple choice answers).
- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s. Each `Instance` is given a unique ID. The set of `Instance`s is augmented according to `DataAugmenterSpec`.
- An `Adapter` (given by an `AdaptationSpec`) takes a list of `Instance`s and adapts it to a set of `Request`s to the API (e.g., the model, temperature, number of in-context training examples). Formally, the output is a `ScenarioState` containing a set of `RequestState`s, where each `RequestState` consists of a `Request` and any metadata used to track the role of this `Request` (e.g., the relevant `Instance` and `Reference`).
- An `Executor` (given by an `ExecutionSpec`) executes each `Request` in the `RequestState`s to produce a `RequestResult` for each one; everything is encapsulated in a `ScenarioState`.
- A `Metric` (given by a `MetricSpec`) takes a `ScenarioState` containing `RequestResult`s and produces a set of `Stat`s (e.g., accuracy, accuracy@5, toxicity, bias, etc.).
- A `Runner` is the top-level controller that runs the above steps and is driven by a set of `RunSpec`s.
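The flow above can be sketched as a toy, self-contained pipeline. The class names mirror HELM's, but every body here is a simplified stand-in (the "executor" just echoes a reference), not the actual implementation:

```python
# Toy sketch of the Scenario -> Adapter -> Executor -> Metric flow.
# All classes and functions are simplified stand-ins, not HELM's real code.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Reference:
    output: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Instance:
    input: str
    references: List[Reference]


@dataclass
class Request:
    prompt: str


@dataclass
class RequestState:
    instance: Instance
    request: Request
    result: str = ""  # filled in by the "executor" below


def adapt(instances: List[Instance]) -> List[RequestState]:
    # Adapter stand-in: turn each Instance into a Request (no in-context examples).
    return [RequestState(inst, Request(prompt=inst.input)) for inst in instances]


def execute(states: List[RequestState]) -> List[RequestState]:
    # Executor stand-in: a "model" that happens to echo the first reference.
    for state in states:
        state.result = state.instance.references[0].output
    return states


def evaluate(states: List[RequestState]) -> Dict[str, float]:
    # Metric stand-in: exact-match accuracy against the first reference.
    correct = sum(s.result == s.instance.references[0].output for s in states)
    return {"accuracy": correct / len(states)}


instances = [Instance("2+2=", [Reference("4", tags=["correct"])])]
stats = evaluate(execute(adapt(instances)))
```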
There are three types of classes:
- Specifications (e.g., `AdapterSpec`, `ExecutionSpec`, `RunSpec`): specified manually by the user. Note that `Scenario` and `Metric` are subclassed, so they are constructed by `ObjectSpec`, which specifies the subclass name and a free-form dictionary of arguments.
- States (e.g., `Instance`, `ScenarioState`, `Request`, `RequestResult`): these are automatically generated and can be serialized.
- Controllers (e.g., `Scenario`, `Adapter`, `Executor`, `Metric`, `Runner`): these have the bulk of the code and should not be serialized.
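The "subclass name plus free-form args" construction pattern can be sketched as follows. This is a hypothetical, simplified version for illustration, not HELM's actual `ObjectSpec` implementation:

```python
# Hypothetical sketch of ObjectSpec-style construction: a spec names a class
# by its Python path and carries a free-form dict of constructor arguments.
import importlib
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ObjectSpec:
    class_name: str                           # e.g. "mypkg.scenarios.MyScenario"
    args: Dict[str, Any] = field(default_factory=dict)


def create_object(spec: ObjectSpec) -> Any:
    # Split "package.module.ClassName" into module path and class name,
    # import the module, look up the class, and call it with the args.
    module_name, class_name = spec.class_name.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.args)
```

For example, `create_object(ObjectSpec("datetime.date", {"year": 2024, "month": 1, "day": 2}))` builds a `datetime.date` by name.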
Adding new scenarios
In order to implement new scenarios:

- Create a new Python file in the `scenarios` folder.
- Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
- `YourScenario` should implement `get_instances`, a method that downloads the dataset files if they don't already exist and returns a list of `Instance`s. Each `Instance` must have a list of (potentially one) `Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in a `Reference` instance's `tags` argument. In addition, you must specify the `split` of the `Instance` as one of the `TRAIN_SPLIT`, `VALID_SPLIT`, or `TEST_SPLIT` constants as in `scenario.py`.
  - For `Scenario`s with datasets that cannot be publicly shared, place a copy of the dataset at path `restricted/<Name of the Scenario>` and read from that path. See `NewsQAScenario` and `ICEScenario` for some examples.
  - Note that you need not enumerate every possible correct answer (nor must there even necessarily be a correct answer).
- Make sure to document your scenario well with a clear docstring.
- In addition, specify its `name`, `description`, and `tags`.
- Identify the appropriate metric for your task in one of the `*_metrics.py` files. If the metric you'd like to use does not exist, follow the directions in Adding new metrics. Many will be in `basic_metrics.py`.
- Define a function in `run_specs.py` annotated with `run_spec_function` to:
  - Construct a `ScenarioSpec` for your scenario using a class name corresponding to the Python path of the class (e.g. `helm.benchmark.scenarios.your_scenario.YourScenario`) and any arguments, which must be passed as a dictionary of `args`.
  - Construct an `AdapterSpec` for your scenario specifying the type of language model generation which must be performed for the task.
  - Construct one or more `MetricSpec` objects for your task, specifying the class name with the Python path of the object, with the same arguments as the `ScenarioSpec` constructor.
  - Construct and return a `RunSpec` object, with a `name` corresponding to the scenario name and any patterns to match in curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`, and `groups`.
- Attempt to run your task with `venv/bin/helm-run -r yourscenarioname:arg=value`, where `yourscenarioname` matches the `name` specified in `YourScenario`.
- Update `src/helm/benchmark/static/contamination.yaml` with models that were trained on your scenario (i.e. contaminated).
- Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
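The overall shape of a new scenario looks roughly like the sketch below. To keep the example self-contained, the `Instance`/`Reference` dataclasses and the tag/split constants are simplified stand-ins for the ones defined in `scenario.py`; the real classes have more fields:

```python
# Schematic shape of a new scenario. Instance, Reference, and the constants
# below are simplified stand-ins for the definitions in HELM's scenario.py.
from dataclasses import dataclass, field
from typing import List

CORRECT_TAG = "correct"
TRAIN_SPLIT, VALID_SPLIT, TEST_SPLIT = "train", "valid", "test"


@dataclass
class Reference:
    output: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Instance:
    input: str
    references: List[Reference]
    split: str


class YourScenario:
    """A toy question-answering scenario (docstring documents the dataset/task)."""

    name = "your_scenario"
    description = "Toy QA scenario used to illustrate the required structure."
    tags = ["question_answering"]

    def get_instances(self) -> List[Instance]:
        # A real scenario would download and cache the dataset files here
        # if they don't already exist, then parse them into Instances.
        return [
            Instance(
                input="What is the capital of France?",
                references=[
                    Reference("Paris", tags=[CORRECT_TAG]),
                    Reference("Lyon"),  # an incorrect multiple-choice option
                ],
                split=TEST_SPLIT,
            )
        ]
```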
Adding new metrics
To add a new metric, first determine if your metric is generic and likely to be widely used, or specific to your task.

- For generic metrics:
  - Add a method to `basic_metrics.py` which takes two arguments: the `gold` answer and the model's `prediction`.
  - Add your method to the `metric_fn_mapping` lookup.
- For task-specific metrics:
  - Create a new `yourtask_metrics.py` file for class `YourTaskMetric`, which inherits from `Metric` in `metric.py`.
  - Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects.

Your metric is responsible for producing `Stat` objects:

- Each `Stat` should correspond to a distinct aggregate measurement over the generated examples. Some may have one metric (e.g. accuracy), while others may quantify multiple aspects (e.g. multiple distance metrics).
- For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`. Usually, there will only be one value for each `Stat`, but multiple can be used, e.g. to show variance.
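The pieces above can be sketched together as follows. The `gold`/`prediction` signature matches the description above, but the `Stat` class here is a minimal hypothetical accumulator, not HELM's real `Stat` (which tracks more aggregates):

```python
# Sketch of a generic metric function, its registration in a lookup table,
# and a minimal Stat-like accumulator. All of this is illustrative.
from typing import Dict, List


def exact_match(gold: str, pred: str) -> float:
    # Generic metric: 1.0 on an exact match (ignoring surrounding whitespace).
    return 1.0 if gold.strip() == pred.strip() else 0.0


# Lookup from metric name to metric function, mirroring metric_fn_mapping.
metric_fn_mapping = {"exact_match": exact_match}


class Stat:
    """Minimal stand-in: one named aggregate measurement over examples."""

    def __init__(self, name: str):
        self.name = name
        self.values: List[float] = []

    def add(self, value: float) -> "Stat":
        # Usually called once per Stat, but multiple values are allowed,
        # e.g. to expose variance across examples.
        self.values.append(value)
        return self

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


# Accumulate the metric over two (gold, prediction) pairs.
stat = Stat("exact_match")
for gold, pred in [("Paris", "Paris"), ("Paris", "Lyon")]:
    stat.add(metric_fn_mapping["exact_match"](gold, pred))
```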
Data augmentations
To apply data augmentation, create a `DataAugmenterSpec` with a list of
`PerturbationSpec`s and pass it into `RunSpec`. The following is an example:
```python
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="helm.benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec
)
```
In the example above, the `DataPreprocessor` will augment the set of evaluation instances by perturbing
the original set of instances with the `ExtraSpacePerturbation`, where spaces in the text are
replaced with `num_spaces` spaces.
We currently only support applying a single perturbation to an instance; chaining multiple perturbations and applying them to a single instance is not supported.
Adding a new perturbation
- To add a new perturbation to the framework, create a new file at `src/helm/benchmark/augmentations` with the name `<Name of perturbation>_perturbation.py`, e.g., `typo_perturbation.py`. Inside the file, create a new class (name it `<Name of the perturbation>Perturbation`, e.g., `TypoPerturbation`) that extends the abstract class `Perturbation` and implements the `perturb` method, which takes in text and outputs the perturbed text.
- Add a test for the new perturbation in `test_perturbation.py`.
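As a concrete sketch, here is roughly what the extra-space perturbation from the previous section could look like. The abstract `Perturbation` base below is a simplified stand-in for the one in `helm.benchmark.augmentations`, and the behavior shown is illustrative:

```python
# Sketch of a perturbation class; the abstract base is a stand-in for
# HELM's Perturbation in the augmentations package.
from abc import ABC, abstractmethod


class Perturbation(ABC):
    @abstractmethod
    def perturb(self, text: str) -> str:
        """Take in text and return the perturbed text."""


class ExtraSpacePerturbation(Perturbation):
    name = "extra_space"

    def __init__(self, num_spaces: int):
        self.num_spaces = num_spaces

    def perturb(self, text: str) -> str:
        # Replace each single space with num_spaces spaces.
        return text.replace(" ", " " * self.num_spaces)
```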
Supporting new Hugging Face tokenizers
- Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
- In `HuggingFaceTokenizers`, we load and cache tokenizers in memory. Add logic to handle the tokenizer in the `load_tokenizer` method.
- Add a test in `test_huggingface_tokenizer.py` to make sure we can load the tokenizer from Hugging Face.
- Add a new class `<Name of tokenizer>WindowService` in file `<Name of tokenizer>_window_service.py`. Follow what we did for `GPTJWindowService`.
- Import the new `WindowService` and map the model(s) to it in `WindowServiceFactory`.
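The load-and-cache pattern mentioned above can be sketched as follows. This is an illustrative stand-in, not HELM's actual `HuggingFaceTokenizers` class; the injected `loader` callable stands in for something expensive like a Hugging Face tokenizer download:

```python
# Illustrative sketch of loading and caching tokenizers in memory, keyed by
# name. The loader parameter stands in for an expensive load (e.g. from
# Hugging Face); HELM's real class does not take a loader argument.
from typing import Any, Callable, Dict


class HuggingFaceTokenizers:
    _tokenizers: Dict[str, Any] = {}

    @classmethod
    def load_tokenizer(cls, name: str, loader: Callable[[str], Any]) -> Any:
        # Only invoke the (expensive) loader on the first request for a name;
        # subsequent calls return the cached tokenizer object.
        if name not in cls._tokenizers:
            cls._tokenizers[name] = loader(name)
        return cls._tokenizers[name]
```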
HEIM (text-to-image evaluation)
The overall code structure is the same as HELM’s.
When adding new scenarios and metrics for image generation, place the Python files under the `image_generation` package
(e.g., `src/helm/benchmark/scenarios/image_generation`).