AI Paper Checker

6/12/2026 • 3 min read

AI-Powered Descriptive Answer Grading — Rubric-Driven, Explainable, and Fully Private

Automated Exam Evaluation • Semantic Analysis • Local AI • Maharashtra SSC Aligned

What It Is

Examiner AI is an automated grading system for descriptive exam answers — the long-form, open-ended responses that no multiple-choice grader can touch. It evaluates a student's written answer against a model answer using three independent signals: a configurable rubric engine aligned to Maharashtra SSC marking standards, semantic similarity analysis through neural sentence embeddings, and LLM-generated educator-style feedback — all running locally, with no student data ever leaving the machine.

Where most 'AI grading' tools simply ask a chatbot for a number, Examiner AI triangulates: the rubric provides the structure an examiner would use, the embeddings measure whether the student actually understood the concept, and the LLM explains the result in language a student can learn from.

The Problem It Solves

Descriptive answer evaluation is one of the most expensive bottlenecks in education. A single board exam can generate millions of handwritten long-form answers, each requiring minutes of a trained examiner's attention. Grading is slow, costly, exhausting, and inconsistent between examiners, and the student typically receives a number with no explanation of what was missing.

Most existing automated tools handle only objective questions. The descriptive answer, where actual understanding is demonstrated, remains almost entirely manual.

Examiner AI grades descriptive answers in seconds, with a transparent breakdown of exactly where every mark came from.

How It Grades

A Real Rubric Engine — Not a Prompt

At the core is a structured rubric system modeled on how board examiners actually mark. Each rubric defines criteria with maximum marks, partial-credit distributions, and keyword specifications. Three production-aligned templates ship today:

• Long Answer (10 marks) — multi-step solution structure, diagram presence and labeling, language clarity

• Short Answer (2 marks) — definition and keyword coverage with penalty deductions

• Numerical (10 marks) — formula correctness, substitution steps, unit validation, and final-value tolerance checking within ±2%

A penalty system applies conditional deductions for off-topic content, factual errors, and unit mistakes — the same deductions a human examiner would make.
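The rubric structure described above can be sketched as plain dataclasses. This is a minimal illustration, not the project's actual schema: the class names, the one-mark-per-criterion split, and the 0.5-mark deductions are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str
    max_marks: float
    keywords: list[str] = field(default_factory=list)


@dataclass
class Penalty:
    name: str
    deduction: float  # marks removed when the penalty is triggered


@dataclass
class Rubric:
    name: str
    criteria: list[Criterion]
    penalties: list[Penalty] = field(default_factory=list)

    @property
    def max_marks(self) -> float:
        return sum(c.max_marks for c in self.criteria)


def score_rubric(rubric: Rubric, awarded: dict[str, float], triggered: set[str]) -> float:
    """Sum per-criterion marks (capped at each maximum), subtract triggered penalties, floor at zero."""
    earned = sum(min(awarded.get(c.name, 0.0), c.max_marks) for c in rubric.criteria)
    deducted = sum(p.deduction for p in rubric.penalties if p.name in triggered)
    return max(0.0, earned - deducted)


# Hypothetical 2-mark Short Answer template (mark split and deductions are assumed).
short_answer = Rubric(
    name="Short Answer",
    criteria=[
        Criterion("definition", 1.0),
        Criterion("keywords", 1.0, ["force", "mass"]),
    ],
    penalties=[Penalty("off_topic", 0.5), Penalty("factual_error", 0.5)],
)
```

With this sketch, an answer that earns the full definition mark, half the keyword mark, and triggers the factual-error penalty would score 1.0 out of 2.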

Signal Extraction From the Student's Answer

Before scoring, the system analyzes the student's response across multiple dimensions: key-concept matching against the model answer, detection of numbered and sequential solution steps, diagram presence and label correctness, language quality heuristics, and numeric value extraction with tolerance comparison.
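Two of the signals listed above, numbered-step detection and numeric extraction with tolerance comparison, lend themselves to a short heuristic sketch. The regular expressions, the function names, and the choice to treat the last number in the answer as the final value are assumptions for illustration; the 2% relative tolerance matches the Numerical template.

```python
import re

# Lines that begin with "1.", "2)", "Step 3:" and similar markers.
STEP_PATTERN = re.compile(r"^\s*(?:step\s*)?(\d+)[.):]", re.IGNORECASE | re.MULTILINE)
# Integers and decimals, with an optional leading minus sign.
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def count_numbered_steps(answer: str) -> int:
    """Count lines that start with a step marker."""
    return len(STEP_PATTERN.findall(answer))


def extract_numbers(answer: str) -> list[float]:
    """Return every numeric literal in the answer, in order of appearance."""
    return [float(m) for m in NUMBER_PATTERN.findall(answer)]


def within_tolerance(value: float, expected: float, rel_tol: float = 0.02) -> bool:
    """True when value is within rel_tol (2% by default) of expected."""
    if expected == 0:
        return value == 0
    return abs(value - expected) <= rel_tol * abs(expected)


def final_value_matches(answer: str, expected: float, rel_tol: float = 0.02) -> bool:
    """Assumption: the last number in the answer is the student's final value."""
    numbers = extract_numbers(answer)
    return bool(numbers) and within_tolerance(numbers[-1], expected, rel_tol)
```

Heuristics like these are brittle (a trailing unit such as "m/s2" would be read as the final number), which is consistent with the roadmap item about replacing them with LLM-powered extraction.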

Semantic Similarity — Measuring Understanding, Not Word Matching

Both the model answer and the student answer are embedded using a neural sentence transformer, and cosine similarity measures how close the student's meaning is to the expected answer. A student who explains the concept in entirely different words still scores — because the system compares meaning, not vocabulary. This is the difference between checking memorization and checking understanding.
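The cosine similarity step can be shown in a few lines of plain Python. The toy three-dimensional vectors below stand in for real embeddings; in the actual pipeline the vectors would presumably come from the sentence transformer (for example via the Sentence Transformers `encode` method), which is not called here so the sketch stays dependency-free.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (real ones would have 384 dimensions).
model_answer_vec = [0.2, 0.7, 0.1]
paraphrased_vec = [0.4, 1.4, 0.2]   # same direction, different magnitude
unrelated_vec = [0.7, -0.2, 0.0]    # orthogonal to the model answer

similar = cosine_similarity(model_answer_vec, paraphrased_vec)
different = cosine_similarity(model_answer_vec, unrelated_vec)
```

Because cosine similarity depends only on direction, the paraphrased vector scores close to 1 even though its magnitude differs, while the orthogonal vector scores close to 0.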

Composite Scoring With Transparent Weights

The final score blends the rubric result (70% weight) with the similarity and LLM assessment (the remaining 30%, user-adjustable). An interactive chart breaks down each component's contribution, so an examiner can see precisely why a score is what it is and challenge any component independently.
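A minimal sketch of the weighted blend follows. The 70% rubric weight comes from the description above; splitting the remaining weight equally between similarity and the LLM assessment, normalizing every component to the range 0 to 1, and the function name are assumptions made for the example.

```python
def composite_score(
    rubric: float,
    similarity: float,
    llm: float,
    rubric_weight: float = 0.7,
    max_marks: float = 10.0,
) -> dict[str, float]:
    """Blend three normalized components (each in [0, 1]) into marks, keeping each contribution."""
    components = {"rubric": rubric, "similarity": similarity, "llm": llm}
    for name, value in components.items():
        # Raw cosine similarity can be negative, so it would need clamping before this call.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if not 0.0 <= rubric_weight <= 1.0:
        raise ValueError("rubric_weight must be in [0, 1]")

    # Assumption: the non-rubric weight is shared equally by similarity and LLM.
    secondary_weight = (1.0 - rubric_weight) / 2
    contributions = {
        "rubric": rubric_weight * rubric * max_marks,
        "similarity": secondary_weight * similarity * max_marks,
        "llm": secondary_weight * llm * max_marks,
    }
    contributions["total"] = sum(contributions.values())
    return contributions


breakdown = composite_score(rubric=0.8, similarity=0.9, llm=0.7)
```

Returning the per-component contributions alongside the total is what makes a decomposition chart possible: each bar is one entry of the dictionary.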

Educator-Style Feedback via Local LLM

A locally-running Llama 3 model generates constructive feedback for each answer: what was done well, which key points are missing, and what to improve. The feedback prompt is institution-customizable. If the LLM is unavailable, the system degrades gracefully to template feedback rather than failing.
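The graceful-degradation behavior can be sketched as a wrapper that tries the model and falls back to a template. Here `llm` is a hypothetical callable standing in for whatever function wraps the local Llama 3 call; the template wording and function names are likewise assumptions.

```python
from typing import Callable, Optional


def template_feedback(matched: list[str], missing: list[str]) -> str:
    """Deterministic fallback built only from rubric results."""
    covered = ", ".join(matched) if matched else "none identified"
    gaps = ", ".join(missing) if missing else "none"
    return f"Covered key points: {covered}. Missing key points: {gaps}."


def generate_feedback(
    prompt: str,
    matched: list[str],
    missing: list[str],
    llm: Optional[Callable[[str], str]] = None,
) -> str:
    """Use the LLM when it responds with text; otherwise return template feedback."""
    if llm is not None:
        try:
            text = llm(prompt).strip()
            if text:
                return text
        except Exception:
            # Deliberately broad: any model failure should degrade, not crash grading.
            pass
    return template_feedback(matched, missing)


def offline_llm(prompt: str) -> str:
    """Simulates the local model being unavailable."""
    raise ConnectionError("local model is not running")
```

Calling `generate_feedback("...", ["definition"], ["unit"], offline_llm)` returns the template text instead of raising, so grading can continue without the model.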

Hidden Technical Strengths

Triangulated Scoring Beats Single-Signal AI

No single grading signal is trustworthy alone: rubrics miss paraphrased understanding, embeddings miss structural requirements, and LLMs hallucinate marks. Examiner AI's architecture acknowledges this and triangulates all three, with the deterministic rubric carrying the most weight. This makes the grading design more defensible than one that relies on a single model.

Confidence Scoring With Human-in-the-Loop Design

Every evaluation carries a confidence score computed from evidence completeness and criterion match ratios. The system is designed to escalate low-confidence cases to human review: it is built as an examiner's assistant with an honesty mechanism, not a black-box replacement.
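One way such a confidence score could be computed is shown below. The source names the two inputs (evidence completeness and criterion match ratio) but not how they are combined, so the simple average, the 0.6 review threshold, and the function names are assumptions.

```python
def confidence_score(
    evidence_found: int,
    evidence_expected: int,
    criteria_matched: int,
    criteria_total: int,
) -> float:
    """Average of evidence completeness and criterion match ratio, each capped at 1.0."""
    if evidence_expected <= 0 or criteria_total <= 0:
        return 0.0  # nothing to measure against, so report no confidence
    completeness = min(evidence_found / evidence_expected, 1.0)
    match_ratio = min(criteria_matched / criteria_total, 1.0)
    return (completeness + match_ratio) / 2


def needs_human_review(confidence: float, threshold: float = 0.6) -> bool:
    """Flag evaluations whose confidence falls below the (assumed) threshold."""
    return confidence < threshold
```

For example, an answer with 3 of 4 expected evidence items and 2 of 4 criteria matched gets a confidence of 0.625 and is not flagged, while 1 of 4 on both counts gives 0.25 and is escalated.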

Fully Local AI — Student Data Never Leaves the Machine

The embedding model runs locally through Sentence Transformers, and the feedback LLM runs locally through Ollama. No student answer, no question paper, and no grade is ever transmitted to a cloud API. For educational institutions handling minors' exam data, this is not a nice-to-have: it is the difference between deployable and not.

Auditable by Design

Every rubric evaluation produces a structured JSON breakdown — criterion by criterion, mark by mark, penalty by penalty — exportable for audit trails. When a student or parent disputes a grade, the institution can show exactly how it was computed.
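A structured breakdown of this kind might look like the following sketch. The field names, the `answer_id` key, and the sample values are hypothetical; only the idea of a criterion-by-criterion, penalty-by-penalty JSON export comes from the description above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CriterionResult:
    criterion: str
    awarded: float
    max_marks: float
    evidence: str  # text from the answer that justified the mark


@dataclass
class PenaltyResult:
    penalty: str
    deduction: float
    reason: str


def build_breakdown(
    answer_id: str,
    criteria: list[CriterionResult],
    penalties: list[PenaltyResult],
) -> dict:
    """Assemble a JSON-serializable record of how the total was reached."""
    earned = sum(c.awarded for c in criteria)
    deducted = sum(p.deduction for p in penalties)
    return {
        "answer_id": answer_id,
        "criteria": [asdict(c) for c in criteria],
        "penalties": [asdict(p) for p in penalties],
        "total": max(0.0, earned - deducted),
        "max_total": sum(c.max_marks for c in criteria),
    }


report = build_breakdown(
    "sample-001",
    [
        CriterionResult("solution_steps", 4.0, 5.0, "Steps 1 to 4 present"),
        CriterionResult("diagram", 3.0, 5.0, "Diagram present, two labels missing"),
    ],
    [PenaltyResult("unit_mistake", 0.5, "Final answer given without units")],
)
audit_json = json.dumps(report, indent=2)
```

Because the record holds only plain types, it survives a round trip through `json.dumps` and `json.loads` unchanged, which is what an exported audit trail needs.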

Extensible Rubric Architecture

Rubric criteria are pluggable dataclasses — new criterion types (essay quality, citation format, code correctness) slot into the existing engine without touching the scoring pipeline. The same engine that grades SSC physics today can grade university essays tomorrow.
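The extensibility claim can be illustrated with a small sketch in which each criterion type is a dataclass exposing an `evaluate` method, and the scoring loop never needs to know which types exist. The class names, the `evaluate` interface, and the citation regular expression are assumptions, not the project's actual design.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str
    max_marks: float

    def evaluate(self, answer: str) -> float:
        raise NotImplementedError


@dataclass
class KeywordCriterion(Criterion):
    keywords: list[str] = field(default_factory=list)

    def evaluate(self, answer: str) -> float:
        """Award marks in proportion to the share of keywords found."""
        if not self.keywords:
            return 0.0
        text = answer.lower()
        hits = sum(1 for k in self.keywords if k.lower() in text)
        return self.max_marks * hits / len(self.keywords)


@dataclass
class CitationCriterion(Criterion):
    """A new criterion type added without changing score_answer."""
    pattern: str = r"\(\w+,? \d{4}\)"  # matches e.g. "(Newton, 1687)"

    def evaluate(self, answer: str) -> float:
        return self.max_marks if re.search(self.pattern, answer) else 0.0


def score_answer(answer: str, criteria: list[Criterion]) -> dict[str, float]:
    """The pipeline only relies on the shared evaluate() interface."""
    return {c.name: c.evaluate(answer) for c in criteria}


marks = score_answer(
    "Force equals mass times acceleration (Newton, 1687).",
    [
        KeywordCriterion("concepts", 2.0, ["force", "mass", "energy", "acceleration"]),
        CitationCriterion("citation", 1.0),
    ],
)
```

Adding a further criterion type would mean writing one more dataclass with its own `evaluate`; `score_answer` stays untouched.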

Technology Stack

• Python with Streamlit — interactive two-panel grading dashboard

• Sentence Transformers (all-MiniLM-L6-v2) — local 384-dimension semantic embeddings

• Ollama with Llama 3 — local LLM feedback generation, zero API cost

• Plotly — interactive score-decomposition visualizations

• NumPy / SciPy / Pandas — numeric tolerance checking and data handling

Everything runs on a single machine. No cloud dependency, no per-evaluation cost, no API keys required.

Current Stage and Roadmap

Examiner AI is a working MVP: the full grading pipeline is functional end-to-end, with three SSC-aligned rubric templates, live semantic analysis, and operational LLM feedback. The production roadmap covers:

• LLM-powered signal extraction returning evidence spans (replacing current heuristics)

• Persistent encrypted database with tamper-evident audit logging

• Role-based access for administrators, evaluators, and students

• Rubric authoring UI for non-technical educators

• Batch grading for full exam sets and question bank management

• Confidence-threshold escalation queues for human reviewers

Examiner AI treats grading the way examiners do: rubric first, evidence always, and every mark explainable.