Olito Labs
Methodology & Open-Source Protocol

Scoring the AI Gap

The complete codebook, rubric, and reproducibility results behind the AI-Native Score. Built to be audited, replicated, and extended.

01

Framework Overview

Why five dimensions, and why they combine the way they do.

The AI-Native Score measures demonstrated AI capability on a 0–100 scale. It is not a self-assessment. Every component is grounded in observable evidence from the interview: tools mentioned, workflows described, integrations built.

I chose five dimensions because together they span the behaviors that make up AI adoption. Each captures a distinct, non-overlapping signal:

Dimension | Weight | What It Measures | Why It Matters
Highest Capability Tier | 30% | Most advanced tool class actively used | Ceiling of capability. The tier determines the maximum output possible
Usage Sophistication | 25% | How they use tools, not just which ones | Two people using the same tool can produce 10x different output based on prompting skill
Tool Diversity | 20% | Number of distinct tool classes actively used | Breadth of exposure correlates with adaptability as the tool landscape shifts
Self-Directed Learning | 15% | Learned AI independently vs. firm mandate | Predicts future trajectory: self-directed learners climb tiers faster
Integration Depth | 10% | Embedded in daily workflow vs. occasional use | Frequency compounds. Daily users iterate faster and discover emergent capabilities

The score is composite rather than categorical. Two people in the same capability tier can score differently. A consultant who uses ChatGPT daily with custom instructions and projects scores higher than one who opens it once a week for ad hoc questions, even though both are in the same tier.

The capability ladder framing emerges from how the tiers relate to each other. Each tier builds on the one below: you cannot automate what you haven’t learned to prompt, and you can’t run agentic tools without pipeline scaffolding. The tiers are a ladder, not a menu. Stalling at any rung blocks growth above it.

Why not just measure capability tier?
Because tier alone misses the most important signal: how someone uses their tools. In my data, the highest-scoring participant (62/100) was in the same capability tier as participants scoring 20/100. The difference was sophistication, diversity, and integration depth. A single-axis measure would have placed them in the same bucket.
02

Scoring Codebook

The full rubric. Every level, every weight, every anchor example.

AI-Native Score (0–100)

I score each component independently from interview evidence and sum the points. The capability tier is scored 0–100 and multiplied by its 30% weight; the other four components carry rubric points already scaled to their weights (maximums of 25, 20, 15, and 10). No normalization or curve is applied. The scale is absolute.

1 Highest Capability Tier 30%

Tier | Class | Tools | Points
0 | None | No AI tools | 0
1 | Everyday Chat | ChatGPT (free/Plus basic), Gemini basic, firm internal chatbots, Microsoft Copilot basic | 25
2 | Advanced Prompting | ChatGPT Pro, Deep Research, Claude (artifacts, projects), Perplexity Pro, NotebookLM | 50
3 | Automation & Agentic | Claude Code, Cursor, Codex CLI, Windsurf, n8n, Zapier, Make, custom API integrations, workflow automation builders | 100
Tier is determined by the highest class of tool the participant has actively used (not just heard of). Evidence: specific tool names, described workflows, mentioned outputs.

2 Usage Sophistication 25%

Level | Points | Evidence Required | Interview Example
Builds Workflows | 25 | Creates multi-step automations, chains tools, builds custom integrations | “I built a pipeline that takes the sourcing data, runs it through Claude, then pushes to our CRM”
Advanced Prompting | 18 | System prompts, structured outputs, iterative refinement, custom instructions | “I set up a project with specific role instructions and context documents”
Basic Prompting | 10 | Simple Q&A, copy-paste results, basic conversations | “I ask it to summarize things for me, draft an email”
Copy & Paste | 5 | Pastes content in, copies output, no prompt craft | “I just throw the document in and see what comes back”
None | 0 | No usage |

3 Tool Diversity 20%

Distinct Tool Classes | Points
4+ classes | 20
3 classes | 15
2 classes | 10
1 class | 5
0 classes | 0

4 Self-Directed Learning 15%

Behavior | Points | Interview Evidence
Self-taught, explores independently | 15 | “Nobody showed me — I just started experimenting”
Mix of firm training + self-exploration | 10 | “The firm gave us a workshop but I went further on my own”
Only uses what firm provides/mandates | 5 | “I use whatever tools the firm gives us”
No learning initiative | 0 |

5 Integration Depth 10%

Behavior | Points
AI embedded in daily workflow, cannot work without it | 10
Regular use but could function without it | 7
Occasional/situational use | 4
Tried once or twice | 1
Never used | 0
03

AI-Assisted Coding

How an LLM applies a fixed codebook to interview data, and where humans stay in the loop.

Every interview follows the same pipeline. A human conducts and records a 30–60 minute semi-structured interview. Then Claude (Anthropic’s Sonnet 4.6) applies the scoring codebook to the full interview record, extracting 27 structured fields in a single pass: demographics, AI capability assessment, tool awareness, adoption barriers, workflow descriptions, and quantified impact metrics. For research purposes, the key output is the AI-Native Score: the 5-component weighted composite described in Section 2. The human reviews every extraction and can override any field.

The LLM acts as a coder, not an analyst. The rubric defines every level, every threshold, every decision boundary. The LLM’s job is to apply the codebook consistently, finding the interview evidence that maps to each rubric level. It does not invent categories or create novel assessments.

Pipeline From interview to scored record
Human Interview → Interview Record → Sonnet 4.6 + Codebook → Structured JSON → Human Review → SQLite DB

The extraction is a single API call. The system prompt embeds the full rubric and schema; the user message contains only the interview transcript:

Prompt Architecture
// System prompt (cached across extractions)
system: "You are an expert qualitative researcher.
  Extract structured data from this interview
  following the rubric and schema below exactly.
  Return ONLY valid JSON matching the schema."

// Full scoring rubric embedded in system prompt
+ [complete scoring-rubric.md]

// Full JSON schema embedded in system prompt
+ [extraction-schema.json]

// Model config
model:       claude-sonnet-4-6
temperature: 0
max_tokens:  16384

// User message: just the transcript
user: "{interviewTranscript}"

The system prompt is cached across extractions using Anthropic’s prompt caching, so the rubric and schema are sent to the API once and reused, reducing cost per interview significantly.
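As a concrete sketch, the cached call can be expressed as a plain request object in the shape the Messages API expects. This is an illustration, not the production code: `rubricMd` and `schemaJson` are placeholders for the real files, and the `cache_control` block marks the system prompt for Anthropic's prompt caching.

```typescript
// Placeholders standing in for the real rubric and schema documents
const rubricMd = "[complete scoring-rubric.md]";
const schemaJson = "[extraction-schema.json]";

// Builds the Messages API payload: a large, cacheable system block plus a
// per-interview user message containing only the transcript.
function buildExtractionRequest(interviewTranscript: string) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 16384,
    temperature: 0,
    // System blocks marked "ephemeral" are cached across calls, so the
    // rubric and schema are processed once and reused.
    system: [
      {
        type: "text",
        text:
          "You are an expert qualitative researcher. Extract structured " +
          "data from this interview following the rubric and schema below " +
          "exactly. Return ONLY valid JSON matching the schema.\n\n" +
          rubricMd + "\n\n" + schemaJson,
        cache_control: { type: "ephemeral" },
      },
    ],
    // Only this part varies between interviews
    messages: [{ role: "user", content: interviewTranscript }],
  };
}
```

Because only `messages` changes between calls, the large system block hits the cache on every interview after the first.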

What humans review and override
Every extraction is reviewed before entering the database. Common override scenarios: (1) the LLM misclassifies a capability tier when the participant describes a tool ambiguously, (2) the LLM misreads nuanced interview signals, (3) the interview documentation quality is poor and the LLM fills gaps with assumptions. Override rate across my dataset: approximately 8% of scored fields.

The output is validated against a strict schema using Zod, a TypeScript runtime validation library. Scores must be integers within defined bounds. Enums must match the allowed values exactly. If validation fails (a score out of range, a missing required field, an unrecognized category), the extraction is rejected and the pipeline automatically retries, up to three attempts. This mechanical validation catches most systematic errors before a human ever sees the data.
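The production pipeline expresses these rules with Zod; the sketch below is a hand-rolled equivalent of the same checks (integer bounds, exact enum match) plus the three-attempt retry loop. The field names and level labels here are hypothetical, for illustration only.

```typescript
// Hand-rolled stand-in for the Zod schema. Field names are hypothetical.
type Extraction = { aiNativeScore: number; usageSophistication: string };

// Illustrative enum values; the real allowed set lives in the schema file
const SOPHISTICATION_LEVELS = [
  "none", "copy_paste", "basic_prompting", "advanced_prompting", "builds_workflows",
];

function validate(raw: unknown): Extraction | null {
  const r = raw as Partial<Extraction> | null;
  if (r == null || typeof r !== "object") return null;
  // Scores must be integers within defined bounds
  const score = r.aiNativeScore;
  if (typeof score !== "number" || !Number.isInteger(score) || score < 0 || score > 100)
    return null;
  // Enums must match the allowed values exactly
  const level = r.usageSophistication;
  if (typeof level !== "string" || !SOPHISTICATION_LEVELS.includes(level)) return null;
  return { aiNativeScore: score, usageSophistication: level };
}

// If validation fails, the extraction is rejected and retried, up to three attempts
async function extractWithRetry(
  runExtraction: () => Promise<unknown>,
  maxAttempts = 3,
): Promise<Extraction> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = validate(await runExtraction());
    if (parsed !== null) return parsed; // valid: proceeds to human review
  }
  throw new Error(`extraction failed validation after ${maxAttempts} attempts`);
}
```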

04

Reproducibility Results

10 interviews, 5 runs each. How stable is the AI coder?

To test whether the LLM-as-coder produces consistent scores, I ran a test–retest experiment. I selected 10 interviews stratified across firm types and score ranges, then ran the full extraction pipeline 5 times per interview with identical prompts. No caching, no seed. Each run is an independent call to the API.

Protocol 10 interviews selected for reproducibility testing
# | Firm Type | AI-Native Score
1 | Research | 31.6
2 | MBB | 43.2
3 | GovTech | 44.6
4 | MBB | 51.2
5 | Big 4 | 56.8
6 | MBB | 58.4
7 | Independent | 58.8
8 | Government | 59.6
9 | Boutique | 63.0
10 | VCPE | 73.2
Results Scoring precision across 50 independent runs
±3.2 points
Average scoring variation on a 100-point scale, across 5 independent runs per interview
ICC = 0.92 “Excellent” agreement per Cicchetti (1994). Target was > 0.85.
Exhibit Individual scoring runs: 5 independent extractions per interview
[Strip chart: AI-Native Score (0–100), one point per independent scoring run (n = 50), grouped by interview: Research, MBB-1, GovTech, MBB-2, Big 4, MBB-3, Indep., Gov, Boutique, VCPE]
All 50 data points cluster in a narrow band of the 0–100 scale. Average variation per interview: ±3.2 points.
Source: Olito Labs Reproducibility Experiment. 10 interviews × 5 runs = 50 independent API calls. Claude Sonnet 4.6, temperature 0, no caching.

The AI-Native Score shows excellent reproducibility. The intraclass correlation coefficient (ICC(3,1) = 0.92) exceeds my pre-registered threshold of 0.85 (“excellent” per Cicchetti 1994). Average scoring variation was just ±3.2 points on a 100-point scale, driven primarily by one outlier interview (14-point range) whose record contained ambiguous signals about tool sophistication. Excluding that interview, mean SD drops to 2.7.
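For readers replicating the analysis, ICC(3,1) (two-way mixed, consistency, single measure) can be computed from the 10 × 5 score matrix as below. This is a minimal sketch under that definition, not the repository's analysis code.

```typescript
// ICC(3,1): consistency of single ratings, raters fixed. `scores[i][j]` is
// the score for interview i on run j (here a 10 x 5 matrix).
function icc31(scores: number[][]): number {
  const n = scores.length;     // subjects (interviews)
  const k = scores[0].length;  // raters (runs)
  const grand = scores.flat().reduce((a, b) => a + b, 0) / (n * k);
  const rowMeans = scores.map((r) => r.reduce((a, b) => a + b, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    scores.reduce((a, r) => a + r[j], 0) / n);

  // Decompose total variation into subject, rater, and residual sums of squares
  const ssRows = k * rowMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssCols = n * colMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssTotal = scores.flat().reduce((a, x) => a + (x - grand) ** 2, 0);
  const ssError = ssTotal - ssRows - ssCols;

  const msRows = ssRows / (n - 1);
  const msError = ssError / ((n - 1) * (k - 1));
  return (msRows - msError) / (msRows + (k - 1) * msError);
}
```

With perfectly consistent runs the residual mean square is zero and the ICC is 1; run-to-run noise inflates the error term and pulls the coefficient down.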

Categorical fields show strong agreement across all 50 runs:

Measure | Statistic | Value | Interpretation
Firm type classification | Fleiss’ κ | 1.00 | Perfect agreement
Usage sophistication | Fleiss’ κ | 0.83 | Substantial agreement
Capability tier | Fleiss’ κ | 0.67 | Moderate: boundary cases between Tier 2 and 3
AI-Native Score | ICC(3,1) | 0.92 | Excellent consistency
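Fleiss’ κ extends Cohen’s κ to more than two raters; in this design each of the 5 runs acts as a rater over the 10 interviews. A minimal sketch of the computation (category labels are illustrative):

```typescript
// Fleiss' kappa. `ratings[i]` holds the category labels the k runs
// assigned to subject i (all subjects rated by the same number of runs).
function fleissKappa(ratings: string[][]): number {
  const N = ratings.length;     // subjects
  const k = ratings[0].length;  // raters (runs)
  const categories = Array.from(new Set(ratings.flat()));

  // counts[i][j] = how many runs placed subject i in category j
  const counts = ratings.map((row) =>
    categories.map((c) => row.filter((x) => x === c).length));

  // Mean observed agreement across subjects
  const pBar = counts.reduce((acc, row) => {
    const pi = (row.reduce((a, c) => a + c * c, 0) - k) / (k * (k - 1));
    return acc + pi / N;
  }, 0);

  // Chance agreement from the marginal category proportions
  const pe = categories.reduce((acc, _, j) => {
    const pj = counts.reduce((a, row) => a + row[j], 0) / (N * k);
    return acc + pj * pj;
  }, 0);

  return (pBar - pe) / (1 - pe);
}
```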

The raw data for all 50 extractions is available in the open-source repository.

05

Limitations

What this methodology can and cannot tell you.

I designed this framework to be rigorous within its scope. It has limitations, and I state them openly so readers can judge for themselves.

  • Sample Bias My 50 participants are not a representative sample of knowledge workers. They skew toward consulting, finance, and technology, all industries with higher-than-average AI exposure. Most were recruited through professional networks, which introduces selection bias toward people willing to discuss AI usage. Generalization beyond this cohort should be cautious.
  • Single-Model Coder One model family (Claude) performed all extractions. Different LLMs might produce systematically different scores. I test reproducibility within-model (same model, multiple runs) but not across-model. Future work should include cross-model validation.
  • Self-Report Data The AI-Native Score is grounded in what participants describe, not what they demonstrably do. People may overstate or understate their tool usage. I mitigate this by looking for specific evidence (tool names, described workflows, concrete outputs) rather than self-assessments, but overstatement and understatement are still possible.
  • Interview Documentation Quality Automated documentation of interviews introduces errors, particularly in technical terms (tool names, programming concepts). Poor audio quality amplifies this. I manually correct obvious errors in tool names and technical vocabulary, but some context loss is inevitable.
  • Point-in-Time Measurement AI capability is changing rapidly. Scores measured in February 2026 reflect the tool landscape at that moment. Tools that are “Tier 3” today may become baseline within months. The framework is designed to be recalibrated, but any specific score is a snapshot.
  • Interview Framing Effects The semi-structured interview format means different interviewers may elicit different levels of detail. A participant who is not asked about automation tools won’t describe them, which could lower their score. I use a standard question protocol to minimize this, but interviewers still vary.
06

Open-Source Artifacts

Everything you need to replicate, extend, or critique this research.

07

Score Calculator

A simplified self-assessment. Answer four questions for a rough estimate of where you land.

This is a simplified self-assessment version of the scoring rubric described in Section 2. It applies the same five components and weights, but uses your own answers rather than evidence extracted from a structured interview. Treat it as a rough orientation, not a precise measurement. A full standalone version, with longer-form questions and LLM-assisted analysis for more accurate scoring, is in development.

Step 1 of 4

Which AI tools do you actively use?

Select all that apply. The calculator derives your capability tier and tool diversity from this.

Tier 1 Chat & Writing
ChatGPT Free/Plus
Google Gemini
Microsoft Copilot
Perplexity
Firm internal chatbot
Other
Tier 2 Research & Power Tools
ChatGPT Pro / higher thinking
Claude Project modes
Deep Research
NotebookLM
Perplexity Labs
Other
Tier 3 Agentic
Claude Code
Claude Cowork
Cursor
Codex
Devin
Other
None of these

When you use AI, which best describes your typical approach?

Pick the one that sounds most like you on a normal workday.

Copy & Paste
I paste content in and use whatever comes back
Basic Prompting
I ask questions, request summaries, draft emails and other straightforward tasks
Advanced Prompting
I craft detailed prompts, use custom instructions or projects, iterate on outputs
Builds Workflows
I chain tools together, build multi-step automations, or create systems

How did you learn to use AI tools?

Think about where your current skills actually came from.

Firm-provided
My company introduced them and I use what’s provided
Mix
Started with what was offered, then explored further on my own
Self-taught
Entirely self-directed. I found and learned tools independently

How embedded is AI in your daily work?

Be honest: where does AI actually sit in your routine?

Tried a few times
I’ve experimented but it’s not part of my routine
Occasional
I use it for specific tasks when I think of it
Regular
It’s part of my routine. I use it most days
Essential
I can’t imagine working without it. It’s embedded in everything
AI-Native Score: 0 / 100
Component Breakdown
Tier: 0 / 30
Sophistication: 0 / 25
Diversity: 0 / 20
Learning: 0 / 15
Integration: 0 / 10

Formula: Score = (Tier × 0.30) + Sophistication + Diversity + Learning + Integration. Tier is scored 0–100 and weighted to contribute up to 30 points; the other components contribute their rubric points directly (max 25 + 20 + 15 + 10 = 70).
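As a sketch, that formula maps directly to code; component points come from the rubric tables in Section 2, and the variable names below are mine, not the repository's.

```typescript
// Component points as defined by the rubric in Section 2
interface Components {
  tier: number;           // 0, 25, 50, or 100 (tier class)
  sophistication: number; // 0-25
  diversity: number;      // 0-20
  learning: number;       // 0-15
  integration: number;    // 0-10
}

// Tier contributes up to 30 points via its 30% weight; the other four
// components carry their rubric points directly (max 70).
function aiNativeScore(c: Components): number {
  return c.tier * 0.3 + c.sophistication + c.diversity + c.learning + c.integration;
}
```

For example, a Tier 2 user (50 points) with advanced prompting (18), three tool classes (15), mixed learning (10), and regular use (7) scores 50 × 0.30 + 18 + 15 + 10 + 7 = 65.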