Configuring Autograders
Pawtograder’s autograder is a GitHub Action, pawtograder/assignment-action,
that runs inside each student repository on every push. It overlays the
student’s submission onto your grader repo, runs the linter, build, and
instructor test suite (and optionally the student’s own tests with
mutation/coverage analysis), then reports per-test results, scores, and
artifacts back to Pawtograder. This page is the reference for the
pawtograder.yml config that drives that flow.
The pawtograder.yml schema is published at
https://raw.githubusercontent.com/pawtograder/assignment-action/refs/tags/v3/pawtograder.schema.json.
Reference it from the top of your YAML to get IDE autocomplete.
On This Page
grade.yml Workflow
The workflow file students run, plus action inputs and outputs.
pawtograder.yml Configuration
Top-level reference for build, gradedParts, submissionFiles, and friends.
Dependencies
Gate parts or units on prior results.
Feedbot
LLM-generated hints attached to failing tests, including per-test hints from custom graders.
Examples
Working pawtograder.yml files for Java, Python, and mutation testing.
Empty Submission Detection
How Pawtograder flags submissions that haven’t been changed from the starter.
Submission Viewer
How files and grader artifacts render in the UI.
Rerunning the Autograder
Regrade existing submissions against a chosen grader version.
Test Insights & Bulk Regrading
Find systemic test failures and regrade affected submissions in bulk.
Running the Grader Locally
Iterate on your grader outside GitHub Actions.
Architecture Overview
Advanced: the three repos involved and the action’s step-by-step flow.
Running a Forked Action
Advanced: point students at your fork of the action.
The grade.yml Workflow
The handout repository ships with a .github/workflows/grade.yml that is
cloned into each student repository. You must edit this file to install any
language toolchains or dependencies your build needs before the action runs.
The action itself only does grading — it does not install Java, Python,
Node, etc.
A minimal Java workflow looks like this:
The student’s code is checked out into submission/. The action downloads
the grader into a sibling grader/ directory. Both you and the student can
view the run output (including the action’s job summary table) under the
Actions tab of the student repo.
Action Inputs
| Input | Required | Description |
|---|---|---|
| grading_server | yes | URL of the Pawtograder API, typically https://api.pawtograder.com. |
| action_ref | yes | Pass ${{ github.action_ref }}; the server uses it to record which grader version ran. |
| action_repository | yes | Pass ${{ github.action_repository }}; used for the same reason. |
| regression_test_job | no | Numeric ID of a regression-test job. When set, the action swaps the roles of “submission” and “grader” so that a known grader version can be run against a snapshot of a student submission. Set by the Pawtograder backend when launching regression tests, not by hand. |
| handout_repo | no | Deprecated. Ignored as of v3, will be removed in v4. Handout detection is now performed server-side. |
Action Outputs
| Output | Description |
|---|---|
| score | The numeric score reported by the grader. |
| status | A human-readable status message. |
The pawtograder.yml Configuration
pawtograder.yml lives at the root of the grader/solution repo. There is
currently exactly one grader type (grader: overlay) and it has three
required top-level sections:
- build: how to build, lint, and test the project.
- gradedParts: what tests are worth what points, organized into parts.
- submissionFiles: which files from the student repo are collected and overlaid onto the grader.
Optional top-level fields:
- feedbot, llm, mutantAdvice: LLM-based features (see Feedbot and Mutation Test Units).
- maxImplementationHints: N shows full output for at most N failing tests across all regular units. Once the limit is reached, additional failing tests still count against the score but are summarized as “N additional failing tests not shown.” This is a running total across the entire submission, so put the most important parts first in gradedParts if you care which hints “win.” For per-unit suppression, use hide_output: true on a regular unit (see below).
- maxMutantHints: N caps the mutantAdvice hints shown to students; it is covered with the mutation example in Mutation Test Units.
- fallbackFiles provides defaults for files the student didn’t submit (see below).
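The running-total behavior of maxImplementationHints can be pictured with a small sketch. This is an illustration only, not the action’s code: the function name apply_hint_budget and the (test name, output) tuple shape are hypothetical, and the failing tests are assumed to arrive already in gradedParts order.

```python
from typing import List, Optional, Tuple


def apply_hint_budget(
    failing_tests: List[Tuple[str, str]],
    max_hints: Optional[int],
) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Split failing tests into those shown in full and a one-line summary.

    failing_tests holds (test_name, output) pairs in gradedParts order.
    max_hints=None models a config that omits maxImplementationHints.
    """
    if max_hints is None or len(failing_tests) <= max_hints:
        return list(failing_tests), None
    shown = failing_tests[:max_hints]
    hidden = len(failing_tests) - max_hints
    # Hidden tests still fail and still cost points; only their output is dropped.
    return shown, f"{hidden} additional failing tests not shown."


failures = [
    ("PartA.testParse", "expected 3 but was 4"),
    ("PartA.testEmpty", "NullPointerException"),
    ("PartB.testRender", "expected <p> but was <div>"),
]
shown, summary = apply_hint_budget(failures, max_hints=2)
```

With a budget of 2, both PartA failures keep their output and the PartB failure is folded into the summary, which is why part order matters.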
build
The only required field is preset. The other fields are conditional:
script_info and venv apply to the python-script preset,
student_tests controls mutation/coverage features, and timeouts_seconds
overrides the built-in timeouts.
Presets
- java-gradle: builds with ./gradlew test, uses Surefire XML for test results, JaCoCo for coverage, Checkstyle for linting, and Pitest for mutation testing. The grader repo must contain a working build.gradle.
- python-script: runs the shell commands you provide in script_info (see below). Use this when you want full control over how tests, coverage, and mutation are produced.
- none: disables building, linting, and testing entirely. The action still records the submission and runs handgrading flows. Useful for write-only / artifact-only assignments.
Linter
- policy: ignore reports lint errors in the grading summary, but tests still run.
- policy: fail skips the rest of grading if the linter finds errors, and the student receives a zero. The submission does not count against any per-assignment submission cap if you have configured one.
student_tests
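Controls what to do with the student’s own test suite. Tests are run in two
contexts: against the instructor’s reference implementation (instructor_impl) and against the student’s own implementation.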
Mutation analysis under instructor_impl only runs if the student’s tests first pass against the instructor’s reference solution. The rationale is that if a student’s tests fail against a known-correct implementation, they’re asserting wrong behavior, so their mutation score isn’t meaningful. The action surfaces those failing tests in a dedicated “your test suite contains incorrect tests” message.
timeouts_seconds
All sub-fields are optional; the defaults are:
| Phase | Default (seconds) |
|---|---|
| build | 600 |
| instructor_tests | 300 |
| student_tests | 300 |
| mutants | 1800 |
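Because every sub-field is optional, the effective timeouts are the defaults with any configured values layered on top. A minimal sketch of that merge follows; resolve_timeouts is a hypothetical helper name, and the dictionary simply restates the table above.

```python
from typing import Dict, Optional

# Defaults copied from the table above (seconds).
DEFAULT_TIMEOUTS_SECONDS: Dict[str, int] = {
    "build": 600,
    "instructor_tests": 300,
    "student_tests": 300,
    "mutants": 1800,
}


def resolve_timeouts(overrides: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Return the defaults with any timeouts_seconds overrides applied."""
    return {**DEFAULT_TIMEOUTS_SECONDS, **(overrides or {})}


# Example: only the mutation phase is given more time.
effective = resolve_timeouts({"mutants": 3600})
```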
venv and script_info (Python preset)
For the python-script preset, you supply the shell commands the builder
should run for each phase:
script_info fields are required even if a given phase isn’t used —
provide a no-op command if you don’t need one. cache_key keys the cached
venv across runs; bump it when requirements.txt changes.
artifacts
A list of files or directories the grader will produce and upload to the
submission view. Each entry has a name (shown in the UI), a path
(relative to the grading workspace, or absolute), and optional data (a
free-form object — for example, { "format": "zip", "display": "html_site" }
tells the UI to render a directory as a navigable HTML site).
Coverage and mutation reports (generated when report_mutation_coverage or
report_branch_coverage is enabled) are added to this list at runtime, so
you don’t need to declare them yourself.
gradedParts
Each part has a name and an array of gradedUnits. Optional fields:
- hide_until_released: true means students cannot see this part’s score or test output until the submission is released for grading.
- dependencies: see Dependencies below.
- hideFeedbot: true means Feedbot will not generate hints for any failing test in this part.
Regular Test Units
- tests may be a single string or an array of strings. Each string is matched as a prefix against the fully qualified test names emitted by the test runner (for JUnit, package.ClassName.testMethod). A prefix like CreditCardPublicTest. matches every method on that class.
- testCount is the number of tests you expect to match. Setting this explicitly is intentional: it prevents a typo in a prefix from silently awarding full marks for zero tests.
- points is the unit’s max score.
- allow_partial_credit defaults to false. When false, the student earns points only if all matched tests pass and the number of passing tests equals testCount. When true, the student earns points * (passing / testCount).
- hide_output: true replaces student-visible test output with “Output for this test is intentionally hidden.” The full output is still recorded as hidden_output and is visible to staff.
- hideFeedbot: true suppresses Feedbot hints for this unit only.
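The scoring rules above can be sketched in a few lines. This is a reading of the documented rules, not the action’s implementation: score_regular_unit and the results mapping (fully qualified test name to pass/fail) are hypothetical, and the test names are invented for the example.

```python
from typing import Dict, List, Union


def score_regular_unit(
    results: Dict[str, bool],
    tests: Union[str, List[str]],
    test_count: int,
    points: float,
    allow_partial_credit: bool = False,
) -> float:
    """Score one regular unit from a {test name: passed} mapping."""
    if test_count <= 0:
        raise ValueError("testCount must be a positive integer")
    prefixes = [tests] if isinstance(tests, str) else list(tests)
    # A test belongs to the unit when its name starts with any configured prefix.
    matched = [
        passed
        for name, passed in results.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    ]
    passing = sum(matched)
    if allow_partial_credit:
        return points * passing / test_count
    # All-or-nothing: every matched test passes AND the count equals testCount.
    return float(points) if all(matched) and passing == test_count else 0.0


results = {
    "CreditCardPublicTest.testValid": True,
    "CreditCardPublicTest.testExpired": False,
    "CreditCardHiddenTest.testLuhn": True,
}
strict = score_regular_unit(results, "CreditCardPublicTest.", 2, 10)
partial = score_regular_unit(results, "CreditCardPublicTest.", 2, 10, allow_partial_credit=True)
typo = score_regular_unit(results, "CreditCardPublcTest.", 2, 10)
```

The typo case matches no tests, so the passing count (0) differs from testCount (2) and the unit scores zero instead of silently awarding full marks.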
Mutation Test Units
- locations is an array of strings. Each entry can be a class name (the unit counts mutants whose location starts with that class), a class with a line range (ClassName:startLine:endLine or, accepted equivalently, ClassName-startLine-endLine), or the name of a Pitest mutator (matched against the mutator field of each mutant).
- Scoring uses either breakPoints or linearScoring, not both:
  - breakPoints: array of { minimumMutantsDetected, pointsToAward }. The unit picks the first (highest-numbered) breakpoint whose threshold the student met. Order them descending; the unit’s max score is taken from the first entry.
  - linearScoring: { total_faults, points } awards (detected / total_faults) * points.
- hideFeedbot: true has the same meaning as on regular units.
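The two scoring modes can be compared with a short sketch. The function names are hypothetical, the breakpoint values are invented, and treating a threshold as “met” when detected is greater than or equal to minimumMutantsDetected is an assumption about the comparison.

```python
from typing import Dict, List


def score_breakpoints(detected: int, break_points: List[Dict[str, int]]) -> int:
    """Return pointsToAward for the first breakpoint whose threshold is met.

    break_points must be ordered descending by minimumMutantsDetected.
    """
    for break_point in break_points:
        if detected >= break_point["minimumMutantsDetected"]:
            return break_point["pointsToAward"]
    return 0  # no threshold reached


def score_linear(detected: int, total_faults: int, points: float) -> float:
    """Return (detected / total_faults) * points, as documented."""
    return detected / total_faults * points


break_points = [
    {"minimumMutantsDetected": 8, "pointsToAward": 10},
    {"minimumMutantsDetected": 5, "pointsToAward": 6},
    {"minimumMutantsDetected": 2, "pointsToAward": 3},
]
stepped = score_breakpoints(6, break_points)
linear = score_linear(3, total_faults=4, points=8)
```

Breakpoints give coarse tiers (6 detected mutants lands in the middle tier here), while linear scoring rewards each additional detected mutant equally.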
When mutantAdvice is configured at the top level, mutants the student
didn’t detect can show a personalized hint.
maxMutantHints caps the total number of mutantAdvice hints shown
across all mutation units in a single submission. Like
maxImplementationHints, it’s a running total — order gradedParts so
the most important parts come first if you care which hints “win.” Omit
it to show all available hints.
submissionFiles
- files are the source/implementation files that get overlaid onto the grader for the instructor test runs.
- testFiles are student-written tests; they are kept separate so they can be overlaid only when the action wants to grade the student’s own tests, and so that mutation/coverage analysis has a clean target.
- Patterns are GitHub Actions globs: ** for “any subdirectories”, * for “any name in this directory”. You can list a literal file alongside a glob to make that file required.
fallbackFiles
Optional. The path (relative to the grader repo) of a directory whose
contents should be copied into the grading workspace for any file the
student did not submit. Useful when students may delete files that your
test harness expects to exist.
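One way to picture fallbackFiles is a copy that only fills holes. The sketch below is an assumption about the behavior, not the action’s code: apply_fallback_files is a hypothetical name, and “the student did not submit it” is approximated as “the file is absent from the workspace”.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def apply_fallback_files(fallback_dir: Path, workspace: Path) -> List[str]:
    """Copy each fallback file into the workspace unless it already exists."""
    copied = []
    for src in sorted(fallback_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(fallback_dir)
        dest = workspace / rel
        if dest.exists():
            continue  # the student's version wins
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(rel.as_posix())
    return copied


def demo() -> dict:
    """Exercise the helper on a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as tmp:
        fallback, workspace = Path(tmp, "fallback"), Path(tmp, "workspace")
        (fallback / "src").mkdir(parents=True)
        (workspace / "src").mkdir(parents=True)
        (fallback / "src" / "A.java").write_text("fallback A")
        (fallback / "src" / "B.java").write_text("fallback B")
        (workspace / "src" / "A.java").write_text("student A")
        copied = apply_fallback_files(fallback, workspace)
        return {
            "copied": copied,
            "a": (workspace / "src" / "A.java").read_text(),
            "b": (workspace / "src" / "B.java").read_text(),
        }
```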
Dependencies
Both gradedParts and gradedUnits accept a dependencies array. If any
dependency is not met, that part (or unit) is replaced in the feedback with
a message explaining which dependency failed instead of the actual grading
output.
A dependency may be written in any of three forms:
- If minScore is omitted, the dependency requires the maximum score for the referenced part or unit.
- minScore is a raw score, not a percentage.
- When a part’s dependencies fail, the entire part is replaced with one feedback entry.
- When a unit’s dependencies fail (but the part’s are satisfied), only that unit is replaced.
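The minScore rule can be sketched as follows. Whichever written form a dependency takes, this sketch assumes it reduces to a target name plus an optional minScore; the Dependency and Scored types, the target field, and the part names are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Scored(NamedTuple):
    score: float      # raw score earned
    max_score: float  # maximum raw score available


class Dependency(NamedTuple):
    target: str                       # hypothetical: referenced part or unit name
    min_score: Optional[float] = None


def unmet_dependencies(
    dependencies: List[Dependency], results: Dict[str, Scored]
) -> List[str]:
    """Return the targets whose score falls below the required raw score."""
    unmet = []
    for dependency in dependencies:
        scored = results[dependency.target]
        # An omitted minScore means the full maximum score is required.
        required = scored.max_score if dependency.min_score is None else dependency.min_score
        if scored.score < required:
            unmet.append(dependency.target)
    return unmet


results = {"Compiles": Scored(5, 5), "Public tests": Scored(6, 10)}
blocked_by = unmet_dependencies(
    [Dependency("Compiles"), Dependency("Public tests", min_score=8)], results
)
```

A non-empty result is the case where the part or unit would be replaced with a message naming the failed dependency.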
Feedbot
Feedbot is optional, LLM-generated feedback that the grading server can attach to failing tests. When enabled in pawtograder.yml, the action
includes an llm block on each failing test result so the grading server
knows which model and account to use.
- enabled is required for Feedbot to run at all.
- provider, model, account, and spec_url are all required when enabled: true. If any are missing, Feedbot is disabled for the run and a warning is written to the visible output.
- spec_url should point to a markdown file with the assignment spec. The action fetches it at grading time with a 10-second timeout; if the fetch fails, Feedbot is disabled for that run and the failure is logged.
- prompt selects the response strategy. The two built-ins are chain_of_thought (default) and checklist. Any other string is used as a free-form custom strategy instruction; the embedded assignment spec and the underlying role/rules are not changed.
- account selects which set of provider credentials the server uses. For example, account: cs2100 will look up OPENROUTER_API_KEY_cs2100 (falling back to OPENROUTER_API_KEY) when Feedbot dispatches the call.
Each provider uses the following key lookups (where {account} is the value of the account
field):
- openai: OPENAI_API_KEY or OPENAI_API_KEY_{account}.
- azure: AZURE_OPENAI_ENDPOINT plus AZURE_OPENAI_KEY (or AZURE_OPENAI_KEY_{account}).
- anthropic: ANTHROPIC_API_KEY or ANTHROPIC_API_KEY_{account}.
- openrouter: OPENROUTER_API_KEY or OPENROUTER_API_KEY_{account}. Use models like openai/gpt-4o-mini, anthropic/claude-3-haiku, google/gemini-pro.
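The account-specific lookup with a shared fallback can be sketched like this. It assumes the keys are exposed to the server as environment variables, which the text implies but does not state; resolve_api_key is a hypothetical name, and the azure entry covers only the key (AZURE_OPENAI_ENDPOINT is needed separately).

```python
import os
from typing import Mapping, Optional

# Base key name per provider, taken from the list above.
KEY_VARIABLES = {
    "openai": "OPENAI_API_KEY",
    "azure": "AZURE_OPENAI_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_api_key(
    provider: str,
    account: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer KEY_{account}; fall back to the shared KEY; None if neither is set."""
    base = KEY_VARIABLES[provider]
    if account:
        specific = env.get(f"{base}_{account}")
        if specific:
            return specific
    return env.get(base)


# Example environment with one course-specific key and one shared key.
example_env = {
    "OPENROUTER_API_KEY": "shared-key",
    "OPENROUTER_API_KEY_cs2100": "cs2100-key",
}
course_key = resolve_api_key("openrouter", "cs2100", example_env)
other_key = resolve_api_key("openrouter", "cs3500", example_env)
```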
To suppress Feedbot for a particular part or unit, set hideFeedbot: true on the part or unit.
Per-Test Hints from Custom Graders
When Feedbot is enabled, the action automatically emits an extra_data.llm block on each failing test so the server knows which
model/account to invoke. If you are writing a custom python-script
grader and want to provide per-test hint configuration directly (rather
than going through the feedbot block), you can author the llm block
in your test output yourself:
The provider, model, and account fields use the same key lookups
documented above.
Examples
Java with Gradle and JUnit
Python with Custom Scripts
Java with Mutation Testing (Pitest)
This example also grades the student’s own tests for fault-detection strength. The Gradle plugin used is info.solidsoft.pitest.
build.gradle enables the Pitest plugin:
Empty Submission Detection
Pawtograder automatically flags submissions whose collected files are identical to (or essentially unchanged from) the starter code. These show up in the grading interface so you can quickly find students who pushed without making any actual changes (for example, students who set up the repository but never started the assignment). Empty submission detection looks only at the files that match submissionFiles, so it respects whatever scope you defined for the
assignment.
Submission Viewer
Submission files and grader-generated artifacts are displayed side by side in the submission viewer. Submitted files:
- Text files render with syntax highlighting.
- Markdown files (.md, .markdown) render as formatted HTML with code-block highlighting, images, tables, and links.
- Binary files (images, PDFs, executables) are stored with the submission and exposed as a download button alongside file metadata.
Grader artifacts render according to the data object on each artifact:
- Plain-text artifacts (.txt, .log) render with line numbers and syntax highlighting.
- Markdown artifacts render as formatted HTML.
- Directory artifacts with data: { format: zip, display: html_site } are uploaded as a zip and rendered as a navigable HTML site (this is how JaCoCo/Pitest HTML reports show up).
- Other binary artifacts are exposed as downloads.
A rubric check can reference an artifact by setting
annotation_target: artifact on the rubric check and naming the artifact
in the artifact field. See the
Rubrics documentation for details.
Rerunning the Autograder
You can rerun the autograder on an existing submission from the assignment page, the test-insights page, or an individual submission. Reruns keep the original submission record (same timestamp, same submission count) and replace the autograder result. Each rerun lets you choose which grader version to use:
- The current grader (latest commit on the grader repo’s default branch).
- A specific commit from the recent history list.
- A manual SHA, for precise version control.
Test Insights and Bulk Regrading
The Test Insights view groups identical test failures across the whole class so you can quickly find systemic problems (a flaky test, an ambiguous spec, an off-by-one in your reference solution). From any error group you can:
- See the number of affected submissions and their average score.
- View and copy the email addresses of affected students.
- Pin globally important issues so they remain visible across assignments.
- Launch a regrade with those submissions preselected on the rerun-autograder dialog.
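The grouping idea can be illustrated with a small sketch. How Pawtograder decides that two failures are “identical” is not described here, so this version assumes an exact match on test name and output; the Failure record and group_failures function are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple, Tuple


class Failure(NamedTuple):
    submission_id: str
    score: float       # overall score of the submission
    test_name: str
    output: str


def group_failures(failures: List[Failure]) -> Dict[Tuple[str, str], dict]:
    """Group failures by (test name, output) and summarize each group."""
    groups: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)
    for failure in failures:
        # Keyed by submission so a submission is counted once per group.
        groups[(failure.test_name, failure.output)][failure.submission_id] = failure.score
    return {
        key: {"affected": len(subs), "average_score": mean(subs.values())}
        for key, subs in groups.items()
    }


failures = [
    Failure("s1", 40.0, "ParserTest.testEmpty", "NullPointerException"),
    Failure("s2", 60.0, "ParserTest.testEmpty", "NullPointerException"),
    Failure("s3", 80.0, "RenderTest.testTable", "timed out"),
]
insights = group_failures(failures)
```

A group with many affected submissions and the same output is the kind of systemic failure worth checking against the reference solution before regrading.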
Running the Grader Locally
You can run the grader against a local solution and a local submission without involving GitHub Actions or the Pawtograder server. From a clone of pawtograder/assignment-action:
The run uses a pawtograder-grading/
directory in your current working directory; delete it between runs
(or you may hit EACCES errors copying files).
Architecture Overview
When the action runs in the student repo, it:
1. Authenticates with the grading server. GitHub issues an OIDC token to the workflow. The action sends that token to the autograder-create-submission edge function. The grading server verifies the token (so it knows which repo and commit the request came from), runs security checks, registers a new submission, and returns a one-time download URL for the matching grader repository tarball.
2. Downloads the grader and reads pawtograder.yml. The action extracts the grader tarball alongside the student’s checkout and reads pawtograder.yml from the grader repo. The config selects an “overlay” grader and a build preset (java-gradle, python-script, or none).
3. Overlays student files onto the grader. For each glob in submissionFiles.files and submissionFiles.testFiles, the action deletes the matching files in the grader checkout and copies the student’s files in. This is the “overlay”: the grader repo provides the harness, the student’s files are layered on top.
4. Lints, builds, and runs instructor tests. The selected builder runs the linter (if configured), then a clean build, then the instructor test suite. Results are parsed into per-test pass/fail records. If linter.policy: fail is set and the linter fails, or if the build fails, grading stops and a zero is recorded.
5. Optionally runs student tests and mutation analysis. If student_tests is configured, the action resets the grader’s solution files, layers in only the student test files, and runs them against the instructor implementation (and optionally mutation testing). It can also run the student’s tests against the student’s own implementation to report branch and mutation coverage.
6. Scores parts and units, resolves dependencies. Scores are computed for every gradedUnit, summed into gradedPart scores, and then dependency rules are applied: units or parts whose dependencies aren’t satisfied are replaced with a message instead of their actual results.
7. Submits feedback and uploads artifacts. The action calls autograder-submit-feedback with the tests, lint output, and logs. If the grader emitted any artifacts, they are uploaded to Supabase storage via the signed URLs returned by the server. A summary table is also written to the GitHub Actions job summary.
If the server identifies the repository as the handout, it returns a handout_notice and
the action exits successfully without grading.
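The overlay step (delete the grader’s matching files, then copy the student’s in) can be sketched as below. This is an illustration, not the action’s code: overlay is a hypothetical name, and pathlib globbing is used as a stand-in that only approximates GitHub Actions glob semantics.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def overlay(submission: Path, grader: Path, patterns: Iterable[str]) -> List[str]:
    """Delete grader files matching each glob, then copy in the student's matches."""
    copied = []
    for pattern in patterns:
        # Remove the grader's own copies first so stale files cannot survive.
        for stale in list(grader.glob(pattern)):
            if stale.is_file():
                stale.unlink()
        for src in sorted(submission.glob(pattern)):
            if not src.is_file():
                continue
            rel = src.relative_to(submission)
            dest = grader / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(rel.as_posix())
    return copied


def demo() -> dict:
    """Build a throwaway submission/ and grader/ pair and overlay one glob."""
    with tempfile.TemporaryDirectory() as tmp:
        submission, grader = Path(tmp, "submission"), Path(tmp, "grader")
        for root in (submission, grader):
            (root / "src" / "main").mkdir(parents=True)
        (grader / "src" / "test").mkdir()
        (grader / "src" / "main" / "Calc.java").write_text("reference")
        (grader / "src" / "main" / "Helper.java").write_text("reference helper")
        (grader / "src" / "test" / "CalcTest.java").write_text("instructor test")
        (submission / "src" / "main" / "Calc.java").write_text("student")
        copied = overlay(submission, grader, ["src/main/**/*.java"])
        return {
            "copied": copied,
            "calc": (grader / "src" / "main" / "Calc.java").read_text(),
            "helper_exists": (grader / "src" / "main" / "Helper.java").exists(),
            "test": (grader / "src" / "test" / "CalcTest.java").read_text(),
        }
```

In the demo, the reference Helper.java disappears because it matches the glob and the student did not submit one, which is the situation fallbackFiles is meant to cover; the instructor test is untouched because it falls outside the pattern.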
Running a Forked Action
The grading action is fully open source at pawtograder/assignment-action,
so if the pawtograder.yml schema documented above isn’t expressive enough
for what your assignment needs, you can fork the action and point your
assignment’s grading workflow at your fork instead. Common reasons to do
this include adding a new build preset, changing how scores are computed,
or wiring up custom artifact handling.
To use a fork, change the uses: line in grade.yml to point at your
fork and the ref (tag, branch, or commit SHA) you want students to run: