
Inside TryHackMe’s AI-Powered Certification Grading System

Written by Marta Strzelec and Vlad Boldura

Real-time evaluation using AI and large language models (LLMs) plays a central role in all professional certifications we create at TryHackMe. The use of AI in these assessments is not just a passing trend; it’s a structural shift reshaping how professional certifications are built, delivered, and validated globally. TryHackMe is leading the way in ensuring all assessments are credible and uphold the highest standards of integrity while innovating to improve the assessment process.

We built an AI-powered grading system to solve problems that impact your certification experience - things like slow feedback, inconsistent results, and unclear grading expectations. Human grading works well in many cases but doesn't scale easily or consistently.

We’ve invested heavily in a rigorous process, combining expert design, state-of-the-art technology, and a human-in-the-loop validation system to ensure fair, consistent, and trustworthy assessments. We are committed to transparently addressing any concerns and are happy to share what we learned.

Deep Dive into the Framework Design

If you’re a learner, educator, or certification designer, this is our transparent look into how we’ve engineered fast, fair, and scalable AI grading, and what we’re improving.

Understanding the Challenges of Grading

How assessment questions work

Let’s consider how any exam can be graded and validated: broadly speaking, you can assess the candidate’s skill and knowledge by asking closed questions and open (free-form) questions. In TryHackMe’s professional certifications, like many traditional certification programmes, those two elements are combined to balance a broad test scope with accurate grading.

Closed questions have exact-match answers, such as an IP address, a flag, or a multiple-choice selection. They are easy to validate: the candidate either gave the correct answer or did not. Open questions, like report-based submissions, are much more complex to grade. You can’t have an exact word-for-word match when candidates submit free-form writing: every person will have a unique writing style.
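To illustrate the difference, here is a minimal sketch (not our production code) of how a closed question can be validated with a normalised exact match, and why the same approach cannot work for free-form writing:

```python
# A minimal sketch of closed-question validation, assuming a simple
# normalised exact-match check (not TryHackMe's actual implementation).
def grade_closed_question(submitted: str, expected: str) -> bool:
    """Return True if the answer matches exactly (case/whitespace-insensitive)."""
    return submitted.strip().lower() == expected.strip().lower()

# Closed questions are binary: correct or not.
print(grade_closed_question(" 10.10.14.2 ", "10.10.14.2"))      # True
print(grade_closed_question("THM{flag_123}", "THM{flag_124}"))  # False

# Free-form answers can't be graded this way: two correct reports will almost
# never be string-identical, so a rubric-based evaluation is needed instead.
```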

Now that we understand the difference between the two, let’s focus on open questions and how to grade them reliably.

Human-powered grading

Humans are the traditional choice for grading open questions. Training graders and scoring each assessment takes time, so a delay in result delivery is a significant issue. One of the other challenges with using humans to grade assessments is score reliability.

Inter-rater reliability

Inter-rater reliability refers to keeping scores consistent when different people grade the same input. Imagine, for example, a talent show where three judges each hold up a score: the numbers will vary because human-powered grading is inherently subjective.

It’s not just talent shows - high-stakes assessments (such as undergraduate portfolios, teacher assessments, programming assignments, and essays) and the reliability of their scoring have been academically studied. The common conclusion is that moderate and high inter-rater reliability can only be achieved with added effort: carefully constructed criteria and rubrics, rater training, and consensus-building exercises. Even with those factors present, inter-rater reliability can remain inconsistent, with the highest levels of agreement reported at 80%: even with best practices and standards in place, human raters agreed with each other in only 80% of cases.

Intra-rater reliability

Intra-rater reliability comes into play when the same input is graded multiple times by the same person. It measures whether scores stay consistent despite factors such as time of day and fatigue.

Studies have shown that human raters are unreliable at replicating their own scores, even when following criteria, rubrics, and scales. Without rubrics and similar tools, some graders had self-agreement scores as low as 7% (one human grader disagreed with 93% of their own scores when grading the same submissions a second time). This low score can be caused by raters becoming fatigued, growing more lenient or harsher over time, or simply changing their expectations.

How AI Transforms the Process

Despite inconsistent reliability and longer turnaround times, traditional certification programmes defaulted to humans in their grading process because machines could not accurately grade free-form submissions. This compromise was acceptable until technology matured enough to help overcome these challenges. At TryHackMe, we needed a new path and created an AI-powered system that we could trust to validate, assess, and grade free-form writing. Our goal is to combine the fairness of human-powered testing with AI’s programmatic consistency and speed. In our certification exams, your reports are graded almost instantly, with high precision, human-in-the-loop oversight, and granular feedback provided to exam takers.

It’s important to note that automated scoring systems aren’t new - they’ve been in use long enough to be rigorously studied, showing these systems can match human scoring accuracy while outperforming humans on cost and speed. The paradigm shift resulting from the rising popularity of LLMs is simply the next step in this progression. This change is not an easy task, and success is not guaranteed. Industry voices warn against using LLMs for grading, pointing out reliability issues. Current research continues to emphasise the need for nuance and expertise in using LLMs to avoid the risks of AI “taking shortcuts” and acting on biases. This nuance and precision are what we set out to achieve at TryHackMe.

Challenges of using AI for grading assessments

To understand some of the principles behind our system’s design, it’s worth mentioning the unique challenges that come with using AI.

Data Privacy and Security

As always with LLMs, this is a big concern. The solution was relatively straightforward: we ensured all data we sent to the AI grader was anonymised correctly.
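As an illustration, a hypothetical pre-processing step might strip identifiers before a submission ever reaches the model. The patterns and field names below are assumptions for the sketch, not our exact pipeline:

```python
import re

# Hypothetical anonymisation pass run before a submission is sent to the LLM.
# The patterns below are illustrative assumptions, not an exhaustive PII filter.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "candidate_id": re.compile(r"\bTHM-\d{6}\b"),  # hypothetical ID format
}

def anonymise(text: str) -> str:
    """Replace personally identifiable tokens with neutral placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(anonymise("Report by jane.doe@example.com (THM-123456): the host was compromised."))
# Report by [REDACTED_EMAIL] ([REDACTED_CANDIDATE_ID]): the host was compromised.
```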

LLM positivity bias

Modern LLMs are biased toward being overly positive due to the dataset they’re trained on. We take this into account when testing and calibrate our instructions so that the grader's tone is neutral.
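As a hedged illustration, a calibration instruction in the grader’s system prompt might look like the snippet below; the wording is an assumption, not our production prompt:

```python
# Illustrative system-prompt fragment for counteracting positivity bias.
# The exact wording is an assumption, not TryHackMe's production prompt.
CALIBRATION_INSTRUCTIONS = """
You are a neutral technical assessor. Score strictly against the rubric.
Do not award points for politeness, confidence, or effort alone.
If a criterion is not explicitly evidenced in the submission, score it as not met.
Keep feedback factual and free of praise or encouragement.
"""
```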

Consistency and reproducibility

Due to a degree of randomness built into how LLMs work, achieving consistency while using AI to grade was not trivial. This problem was at the top of our minds, especially as we set out to provide a more reliable and consistent solution than human graders. While more work needs to be done, we achieved high reproducibility levels through model benchmarking and configuration.

Algorithmic Bias

As a result of their training datasets, LLMs often respond based on hidden biases - for example, a preference for a specific style of written English (generic, overly verbose). For TryHackMe, mitigating this risk lies mostly in a robust prompt-testing process.

How to create a reliable AI-powered evaluation system

To harness AI effectively in our certifications, we’ve built a robust, expert-led framework that embeds human oversight into every process layer. That way, the result for our users achieves our goal: an instant, AI-powered assessment of their exam built on a rigorous, human-first design process.

Our team built a robust AI grading engine, pushing the boundaries of what's possible in certification assessments. We tackled significant technical challenges in developing a system that could replicate human grading consistency while providing instant results. The engineering team designed a comprehensive prompting suite to support the rapid iteration of our prompting strategy. We're especially proud of the strong intra-rater consistency we achieved, with minimal variation - comparable to the performance of trained human evaluators. While we've made substantial progress, our work is far from over as we continue refining the system and exploring new frontiers in AI-powered assessment.

Ugur Doktur, Machine Learning Engineer

Creating a baseline

To start, you need to carefully design baseline criteria for grading (which you should do for any exam, even if you are using humans for evaluation) and the foundations the AI needs. At TryHackMe, we combine our team’s cyber security industry experience with our internal AI expertise to create this baseline.

Step 1: Expert-defined criteria

For any exam evaluation, you need a strong foundation. At TryHackMe, our Content Engineering team, composed of cyber security industry experts, defines all scoring logic, structure, weights, and report formats. Content Engineers are field specialists with industry experience, and their objectives are:

  • To reflect on-the-job experience through the grading criteria
  • To cover a comprehensive range of skills through graded items
  • To prevent min-maxing - a strategy where users attempt to game the system by skipping sections or optimising for shortcuts rather than demonstrating real understanding.

Step 2: Curated validation data

With a clear scoring structure, we define reporting baselines representing performance levels ranging from Excellent to Poor. We build these examples using:

  • Real-world reports
  • Industry benchmarks
  • In-house expertise, where we write our examples based on industry experience

Step 3: LLM prompt creation

At this stage, we create detailed and robust prompts for our AI models, including the evaluation criteria and training set examples. Each graded item has a dedicated prompt to maximise precision and reduce hallucination risk.
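For illustration only, a dedicated per-item grading prompt might be assembled along these lines; the structure, field names, and JSON output format are assumptions rather than our exact templates:

```python
# Illustrative per-item prompt assembly. Criterion text, weights, and examples
# are placeholders, not TryHackMe's actual grading material.
def build_grading_prompt(criterion: dict, examples: list[dict], submission: str) -> str:
    example_block = "\n\n".join(
        f"Example ({ex['level']}):\n{ex['text']}\nScore: {ex['score']}/{criterion['max_points']}"
        for ex in examples
    )
    return f"""You are grading one criterion of a security report.

Criterion: {criterion['name']} (max {criterion['max_points']} points)
Guidance: {criterion['guidance']}

Calibration examples:
{example_block}

Candidate submission:
{submission}

Return a JSON object: {{"score": <int>, "justification": "<one paragraph>"}}."""
```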

We’ve developed a state-of-the-art prompting suite tailored for testing, comparison, and refinement to support a scalable and high-precision prompt engineering workflow. It supports automated and manual assessments, giving us deep visibility into how prompts perform across varied contexts and user intents.

Step 4: Consistency-driven model configuration

To guarantee reproducible scoring, we tune parameters such as temperature and seed. In LLMs, temperature controls randomness in output: lower values make responses more deterministic, while higher values allow more variation. The seed sets the starting point for the model’s random choices, ensuring that repeated runs with the same input and configuration yield identical results.

With correct tuning, we can achieve high reproducibility of scores while allowing flexibility for varying writing styles and formats.
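A minimal sketch of such a configuration, assuming an OpenAI-style chat completions client (the model name and parameter values are placeholders, not our production settings):

```python
from openai import OpenAI

client = OpenAI()

def grade(prompt: str) -> str:
    """Call the model with settings chosen for reproducibility."""
    response = client.chat.completions.create(
        model="gpt-4o",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # minimise randomness in token selection
        seed=42,                 # pin the starting point for any remaining randomness
    )
    return response.choices[0].message.content
```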

Validating and testing

What follows is a rigorous testing phase. We test using multiple approaches to maintain our North Star metrics (accuracy and consistency). We run all test cases with iterative improvements aimed at refining our prompts. The test cases cover different levels of sophistication, ranging from nonsensical inputs to “almost correct” answers.

Case 1: LLM Benchmarking

We benchmark multiple LLMs to ensure the most reliable configuration, including testing different models and configurations of temperature and seed parameters.

Case 2: Testing for nonsensical inputs

We test our prompts against inputs that make no sense in the exam context, such as pancake recipes or gardening tips. The goal is to ensure that the AI never awards points for random inputs.
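A sketch of such a test, assuming a hypothetical `grade_submission` helper that wraps the AI grader and returns a numeric score:

```python
# Illustrative off-topic input test. grade_submission(text) -> int is a
# hypothetical helper wrapping the AI grader, not a real TryHackMe API.
import pytest

from grader import grade_submission  # hypothetical module for the sketch

NONSENSE_INPUTS = [
    "Mix flour, eggs, and milk, then fry the pancakes on medium heat.",
    "Water your tomatoes twice a week and prune the lower leaves.",
    "asdkjh qwe 123 lorem ipsum dolor sit amet",
]

@pytest.mark.parametrize("text", NONSENSE_INPUTS)
def test_nonsense_scores_zero(text):
    assert grade_submission(text) == 0
```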

Case 3: Comparative prompt tuning

As the next step, we run comparative tests between different iterations of prompts, selecting the best-performing ones. With each test, the prompt that “wins” is the one that assigns the grade closest to the human-assigned score. We run this many times per prompt to find the best possible prompting strategy for each user submission. Our growing internal expertise now makes each test round faster and more effective.

A key feature of our prompting suite is its ability to compare multiple prompt variations side by side, helping us identify the most effective phrasing. Beyond evaluation, the suite includes a refinement assistant powered by the LLM, where we input a current prompt, its output, and the desired result. The model then suggests optimised versions, accelerating iteration and promoting a more structured, outcome-driven approach to prompt development.
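A simplified sketch of the selection logic, assuming a hypothetical `grade_with_prompt` helper and a set of submissions with known human-assigned scores:

```python
# Illustrative prompt comparison: the "winning" prompt variant is the one whose
# scores, averaged over repeated runs, land closest to the human-assigned scores.
# grade_with_prompt(prompt, submission) -> int is a hypothetical helper.
def mean_absolute_gap(prompt: str, labelled: list[tuple[str, int]], runs: int = 5) -> float:
    gaps = []
    for submission, human_score in labelled:
        for _ in range(runs):
            gaps.append(abs(grade_with_prompt(prompt, submission) - human_score))
    return sum(gaps) / len(gaps)

def pick_best_prompt(prompts: list[str], labelled: list[tuple[str, int]]) -> str:
    return min(prompts, key=lambda p: mean_absolute_gap(p, labelled))
```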

Case 4: Testing for partial or imperfect inputs

We test all of our prompts on imperfect submission examples. Those examples include bad grammar, partially correct answers, incomplete answers, and the like. The goal is to ensure the AI grader aligns with the human-assigned grade for both perfect and imperfect submissions. We also test submissions of the same quality presented in different formats (for example, bullet-point submissions versus continuous text).

Case 5: Testing against generic and AI-generated submissions

As part of our imperfect submission testing, we evaluate and iterate on how the grader handles AI-generated content. To prevent generic, low-effort submissions from receiving passing scores, we incorporate AI-pattern detection into our prompts, such as recognising vague phrasing, lack of contextual specifics, or overly generic language. This check ensures that if a user relies on AI to generate the entirety of their submission, the system can identify it and flag responses that lack evidence of genuine user effort.

We don’t discourage using AI; we expect professionals to use modern tools. Our detection logic flags submissions that lack substance, not those that use assistance well, and we require users to show understanding and present findings in a way that uses AI-generated content as a starting point, not the final submission.

Case 6: Consistency testing

Finally, we test prompts for consistency, running the same submissions multiple times, on different days, and at various times of day to ensure scores provided by the AI grader have high intra-rater agreement metrics.
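A minimal harness for this kind of check might look like the sketch below, again assuming a hypothetical `grade_submission` helper:

```python
import statistics
from collections import Counter

# Illustrative intra-rater consistency check: grade the same submission
# repeatedly and summarise how often the modal score recurs.
def consistency_report(submission: str, runs: int = 50) -> dict:
    scores = [grade_submission(submission) for _ in range(runs)]
    counts = Counter(scores)
    modal_score, modal_count = counts.most_common(1)[0]
    return {
        "agreement_rate": modal_count / runs,   # share of runs matching the modal score
        "std_dev": statistics.stdev(scores),    # spread of the assigned scores
        "distribution": dict(counts),
    }
```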

Maintenance and human oversight

Our work does not end once a system is designed, tested, and integrated into the exam. Continuous maintenance, tuning, and oversight are needed to ensure the AI performs as expected. We have identified a couple of actions that help us maintain the system.

Action 1: Spot checks

We manually review a selection of reports after scoring to identify areas for improvement and maintain the integrity of the scoring. When we first released the SAL1 Certification, we manually reviewed every one of the first submissions to ensure the system worked well. We now continue to perform spot checks and improve prompts regularly.

Action 2: Edge case reviews

Manual review is triggered in critical or ambiguous cases to ensure correct function. We do this more often after releasing a new prompt and commit to periodic reviews afterwards. Examples of triggers include:

  • Scores that are well below or well above average
  • Inconsistent user scores (for example, if a user’s reports score well in one section of the exam and badly in another)
  • Scores close to the pass/fail threshold (for example, if the user fails or passes the exam by 5 points or fewer)
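The triggers above can be expressed as simple flagging rules; here is a sketch, where the thresholds are assumptions for illustration rather than our actual review policy:

```python
# Illustrative flagging rules mirroring the triggers listed above.
def needs_manual_review(score: float, cohort_mean: float, cohort_std: float,
                        section_scores: list[float], pass_mark: float) -> bool:
    far_from_average = abs(score - cohort_mean) > 2 * cohort_std
    uneven_sections = (max(section_scores) - min(section_scores)) > 0.5 * max(section_scores)
    near_threshold = abs(score - pass_mark) <= 5
    return far_from_average or uneven_sections or near_threshold
```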

Current performance

At the time of writing, we were seeing the following general metrics:

Intra-rater agreement score

The AI grader assigns the same score to the same input in 72% of cases. However, the variation is minimal even when the score differs, with a standard deviation of 0.89 points on a 20-point scale.

Example data over 50 runs for a report summary sample:

  • The grader assigned a score of 12 points in 36 runs
  • The grader assigned a score of 14 points in 13 runs
  • The grader assigned a score of 13 points in 1 run

This places the AI grader at the level of a well-trained, consistent human grader who has a detailed rubric available. While more improvements are needed, this score shows high levels of consistency.

We did not observe variations in agreement rates over time (different days and times of day).
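The figures above can be reproduced from the run counts with a few lines of arithmetic (a sketch using the sample distribution listed):

```python
import statistics

# Reconstructing the intra-rater figures from the sample distribution above:
# 36 runs scored 12, 13 runs scored 14, 1 run scored 13 (50 runs total).
scores = [12] * 36 + [14] * 13 + [13] * 1

agreement = scores.count(12) / len(scores)                 # modal score recurred in 36/50 runs
print(f"agreement rate: {agreement:.0%}")                  # 72%
print(f"std dev: {statistics.stdev(scores):.2f} points")   # ~0.89 on the 20-point scale
```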

Inter-rater reliability score

The traditionally understood inter-rater reliability score can’t be applied here, as there’s technically only one grader: the AI. We will use this opportunity to discuss the AI grader’s performance against human graders.

We continue improving each AI grading prompt until it reliably scores within 2-4 points of the human-assigned grade. While our internal tolerance is up to 4 points for some edge cases, our target remains 2 points to ensure accuracy and trust. This gives us a high level of human agreement, typically around 94%, adjusted for reproducibility and reliability.

Edge-case optimisation

One of the most challenging aspects of training the AI grader was edge cases, where humans were notoriously better in the past. In these cases, users submit dummy data or random strings, or actively try to “break” the grader using injection techniques. The key strategies we found worked well were (a simplified sketch follows the list):

  • Creating scoring “guardrails” for the AI grader, much like you would for a human grader. Defining sub-grading steps produced better, more consistent results overall.
  • Defining the scope and expected user output helped solve many edge cases and improved grading stability.
  • In the case of PT1, ample training data, such as open-source penetration test reports from some of the most reputable companies worldwide, also proved useful for fine-tuning.
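One way to picture the “guardrails” idea is an aggregate built from capped sub-scores, so no single section can dominate and off-scope content earns nothing. The sub-step names and caps below are assumptions for the sketch:

```python
# Illustrative "guardrail" scoring: the overall grade is the sum of capped
# sub-scores for defined sub-grading steps. Names and caps are placeholders.
SUB_STEPS = {
    "scope_and_methodology": 4,
    "findings_accuracy": 8,
    "evidence_and_reproduction_steps": 5,
    "remediation_advice": 3,
}

def aggregate(sub_scores: dict[str, int]) -> int:
    """Clamp each sub-score to its cap and ignore anything outside the defined scope."""
    return sum(
        max(0, min(sub_scores.get(step, 0), cap))
        for step, cap in SUB_STEPS.items()
    )
```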

Summary & Future Improvements

We are incredibly proud of our framework and hope this post sets a standard for transparency in building processes. We’re currently planning future improvements to the system and will continue to innovate to make it even more reliable. Some of our improvement plans are:

Periodic retesting

While we have already done a few iterative improvement cycles for SAL1 prompts, the certification was only published earlier this year. For SAL1 and all other certifications we create, we look forward to tracking our AI grader's performance over time.

User input testing

We plan to run more iterative prompt improvements using real user reports submitted during examinations. While we will not dramatically change the existing implementation, what we can learn from the AI grading of actual user-generated inputs can be implemented in our future efforts. In the spirit of our human-led process, we are not planning to train models using these reports but instead use them to continuously stress-test our system.

Anti-cheating

We at TryHackMe take cheating prevention very seriously. We have taken steps to ensure our exams employ various strategies to make cheating difficult without introducing friction. Still, AI can help even further by identifying submissions from different users that are similar to each other and might indicate cheating.
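One hedged sketch of such a check, using off-the-shelf TF-IDF cosine similarity rather than any specific detection model we use; the threshold is an assumption:

```python
# Illustrative cross-submission similarity check using TF-IDF cosine similarity.
# Threshold and approach are assumptions, not TryHackMe's detection pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_similar_submissions(submissions: dict[str, str],
                             threshold: float = 0.9) -> list[tuple[str, str, float]]:
    """Return pairs of user IDs whose report text is suspiciously similar."""
    user_ids = list(submissions)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(submissions.values())
    sims = cosine_similarity(matrix)
    flagged = []
    for i in range(len(user_ids)):
        for j in range(i + 1, len(user_ids)):
            if sims[i, j] >= threshold:
                flagged.append((user_ids[i], user_ids[j], float(sims[i, j])))
    return flagged
```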

Further transparency

While we aren’t publishing specific grading accuracy metrics at this stage, we track model performance internally, including agreement with human-assigned scores, consistency across reruns, and performance against adversarial prompts. We plan to share more insights in future transparency updates.

Addendum: notes on company culture

TryHackMe’s culture embeds experimentation and iterative improvements at the core of everything we do. We strive to continuously learn from what we’re doing well and how we fail. When we started this journey, we trusted the process would get us to a place where we were satisfied because that same process had worked for us so many times before.

Everything we do is always at scale and with the user in mind. The question was not “Should we use AI or humans to grade assessments?” but “How can we support thousands of users with fast, fair, and accurate results?” We knew we did not want to engineer ourselves into a system that couldn’t grow as fast as we could! Using AI was simply the answer that made the most sense.

While the drive for innovation got us here, our confidence in the system comes from these key operating methods: our iterative approach and focus on the user.

Addendum: notes on the cyber security certification industry

It’s worth noting that the industry has addressed reliability in cyber security certifications to some extent. Multiple certifying bodies, such as (ISC)² and CompTIA, have published articles on how they approach the problem and have provided insight into their rubrics.

The most common trend in addressing reliability seems to be a shift from subjective to objective grading, from report-based to MCQ-based formats. This allows for rigorous psychometric oversight of examinations and ensures high levels of reliability and standardisation.

However, this standardisation comes at a cost to realism, where candidates are tested in examinations in one way and have their on-the-job performance assessed in another. Free-form reporting, investigation, and decision-making are core parts of actual security work - yet they remain the hardest to assess consistently. Certifications that include free-form reports in their evaluations do provide insight into their process, employing effective tactics to improve fairness through rubrics, multi-stage reviews, and live debriefs. However, even these providers do not publish concrete reliability data (e.g. agreement rates between graders), and the inherent variability of human scoring remains unaddressed.

It was essential for us not to accept this status quo. When we discussed human-powered grading at TryHackMe, the reliability of evaluators became a problem we wanted to consider seriously. With our AI grader, we attempt to bridge the gap between the reliability of fully objective assessments and the realism of report-based ones. We’re open to future collaboration with others to drive further innovation and raise our standards to benefit our users and the industry.

Marta Strzelec, May 29, 2025
