DGrid AI’s latest research tackles a core flaw in decentralized AI scoring

by · crypto.news

DGrid AI introduces a new Proof of Quality framework designed to evaluate AI outputs and improve reward distribution across decentralized networks.

Summary

  • DGrid AI’s new PoQ research introduces reference-free scoring to reward AI nodes without needing correct answers.
  • DGrid trained specialized AI judges to score output quality, improving decentralized AI reward systems at scale.
  • DGrid AI’s new Proof of Quality models help decentralized AI networks evaluate responses accurately without ground truth data.

Decentralized AI networks have a payment problem that researchers have been quietly working around for years, and a recent paper from DGrid AI puts the issue directly on the table. The quality scoring systems powering node rewards have largely depended on having the correct answer on hand to compare against. In production, that answer rarely exists.

The paper, the fourth in DGrid’s ongoing research series on Proof of Quality (PoQ), proposes a trained alternative and publishes the numbers behind it. PoQ uses small evaluator models to score each output’s quality, and those scores drive the rewards. Cheap, and it scales.

DGrid built this up brick by brick: a cost-aware version that bakes latency into the payout math, an adversarial-robustness layer that holds up when scorers lie or slack off, and a framework that splits “quality” into parts you can inspect. Solid engineering. And every layer kept slamming into the same wall.

How the scoring problem developed

The basic structure of a decentralized inference network creates a measurement challenge. Independent nodes run language models and respond to user queries. Those responses need to be scored because scores determine pay. Cryptographic verification of every computation would be technically airtight but prohibitively expensive at scale, so the practical path has been automated quality evaluation using smaller models.

DGrid’s earlier work built out that approach incrementally, adding latency-adjusted payouts, defenses against manipulative scorers, and a more granular breakdown of what “quality” actually means in a scoring context. What it could not fully resolve was the evaluation signal itself.

The strongest signal the team had was semantic similarity: compare the model’s output to a known correct answer and measure the distance between them in embedding space. That works in benchmark environments where reference answers exist. It does not work in a live network where users are asking open-ended questions and no ground truth is waiting in a database.
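To make the dependency on a reference answer concrete, here is a minimal sketch of similarity-based scoring. It assumes the output and the reference have already been turned into embedding vectors by some encoder; the function names and the toy two-dimensional vectors are illustrative and not taken from DGrid's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def reference_based_score(output_vec, reference_vec):
    """Hypothetical scorer: without a reference embedding it cannot run at all."""
    if reference_vec is None:
        raise ValueError("no reference answer available for this query")
    return cosine_similarity(output_vec, reference_vec)

# Toy embeddings (assumed, not real model output).
same_direction = reference_based_score([1.0, 0.0], [2.0, 0.0])
orthogonal = reference_based_score([1.0, 0.0], [0.0, 3.0])
```

The `None` branch is the live-network case: an open-ended user question arrives with no stored answer, so this kind of scorer has nothing to compare against.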

Off-the-shelf alternatives tested worse. An NLI cross-encoder, a model class designed to assess logical entailment between sentences, returned a Pearson correlation of −0.363 when used to rate answer quality without a reference answer. A negative correlation means its scores tended to fall as answer quality rose, so it leaned toward favoring poor responses over good ones. That is not a usable evaluation tool.
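For readers less familiar with the metric, the sketch below computes a Pearson correlation from scratch and shows what the sign means. The score lists are made-up toy data, chosen only so that one evaluator tracks quality and the other inverts it; they are not figures from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy data: a quality label per answer and two hypothetical evaluators.
true_quality = [1.0, 2.0, 3.0, 4.0, 5.0]
aligned_scores = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with quality
inverted_scores = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as quality rises

r_aligned = pearson(true_quality, aligned_scores)    # close to +1
r_inverted = pearson(true_quality, inverted_scores)  # close to -1
```

A value near zero would mean the evaluator's scores carry little linear information about quality; a value below zero, as with the NLI cross-encoder, means the scores point the wrong way.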

What the paper proposes

Instead of adapting existing models, the researchers trained three judges specifically for reference-free quality scoring. Each takes a question and a response as input and outputs a score from 0 to 10, with no correct answer provided.

The three models differ primarily in size and speed:

  • TextCNN (~10M parameters) runs in approximately 1 millisecond per call, making it suitable for high-throughput first-pass filtering.
  • MiniLM (22M parameters) sits in the middle at around 13 milliseconds.
  • DeBERTa (184M parameters) takes roughly 15 milliseconds and is optimized for accuracy.

Training followed a two-stage process. The models were first pre-trained on UltraFeedback, a public dataset of GPT-4-graded responses, before fine-tuning on the network’s own task distribution. The intent was to give the judges a broad baseline understanding of quality before narrowing their focus to the specific scoring context.

The core result

On a held-out test set of 300 examples, the DeBERTa judge achieved a Pearson correlation of 0.747 against the ground-truth proxy — without access to any reference answer. The reference-based evaluators from the prior framework, which did have access to correct answers, reached a maximum of 0.647.

The gap has a straightforward explanation. The older evaluators were similarity metrics measuring cosine distance to a reference embedding. The new judges were optimized end-to-end for the scoring task itself. The performance difference reflects that distinction more than any architectural breakthrough.

One caveat the authors include: the ground truth used here is itself a proxy — token-level word overlap rather than human judgment. The judges correlate well with this metric, but whether word overlap reliably reflects what a human would consider a quality response is a separate, unresolved question.
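The article does not spell out the exact overlap formula, so the following is a sketch under an assumption: a token-level F1 of the kind commonly used in question-answering benchmarks. The function name and example sentences are illustrative.

```python
from collections import Counter

def token_f1(candidate, reference):
    """Token-level F1 overlap between two strings (lowercased, whitespace-split)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
verbatim = token_f1("the cat sat on the mat", reference)
paraphrase = token_f1("a feline rested on a rug", reference)
```

Here the verbatim copy scores 1.0, while the paraphrase shares only the word "on" and scores about 0.17 despite carrying the same meaning. That gap is the kind of mismatch with human judgment the caveat is pointing at.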

Two deployment-oriented features accompany the judges. A cascading pipeline routes queries through the lightweight model first and escalates to heavier models only when scores are ambiguous, reducing evaluation costs by up to 72.7% at the most aggressive threshold setting, though correlation drops to around 0.51 in that configuration. An online calibration mechanism, running without manual tuning, consistently identifies semantic quality as the dominant signal and adjusts weights accordingly, assigning it 4.7 times its starting weight over time.
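The cascading idea can be sketched in a few lines. This is an illustration under stated assumptions, not DGrid's implementation: the ambiguity band (`low`, `high`), the stub judges, and all names are hypothetical, and real judges would be the trained models rather than the placeholder functions used here.

```python
from typing import Callable, List, Tuple

# A judge maps (question, response) to a score on the 0-10 scale.
Judge = Callable[[str, str], float]

def cascade_score(
    question: str,
    response: str,
    judges: List[Tuple[str, Judge]],
    low: float = 3.0,   # assumed lower edge of the ambiguous band
    high: float = 7.0,  # assumed upper edge of the ambiguous band
) -> Tuple[float, str]:
    """Run judges cheapest-first; escalate only while the score is ambiguous."""
    if not judges:
        raise ValueError("at least one judge is required")
    last_index = len(judges) - 1
    for index, (name, judge) in enumerate(judges):
        score = judge(question, response)
        confident = score < low or score > high
        if confident or index == last_index:
            return score, name
    raise AssertionError("unreachable")

# Placeholder judges standing in for a fast model and an accurate one.
def fast_stub(question: str, response: str) -> float:
    return 8.5 if len(response.split()) >= 3 else 5.0

def accurate_stub(question: str, response: str) -> float:
    return 6.0

JUDGES = [("fast", fast_stub), ("accurate", accurate_stub)]

clear_case = cascade_score("What is 2+2?", "The answer is 4.", JUDGES)
ambiguous_case = cascade_score("What is 2+2?", "Four", JUDGES)
```

In this sketch, narrowing the band sends fewer queries to the expensive judge, which lowers cost but leaves more scores in the hands of the weakest model. That is the same trade-off the paper reports at its most aggressive threshold.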

Where the system still struggles

The judges perform unevenly across task types. On question answering, correlation reaches 0.830. On summarization, it falls to 0.199. The paper attributes this not to a failure in the judges themselves but to the evaluation metric used during training: raw word overlap is a poor measure of summarization quality, so models trained against it learn to track a weak signal. The authors describe this as the primary open problem rather than a known limitation being managed quietly.

That framing is consistent with how the paper presents its results overall — methodically, with the failure cases as clearly stated as the improvements. Four papers into this research thread, the work reads less like a product announcement and more like a team incrementally closing gaps in something they intend to actually deploy.