Olito Labs
Methodology & Open-Source Protocol

Scoring the AI Gap

The complete codebook, rubric, and reproducibility results behind the AI-Native Score. Built to be audited, replicated, and extended.

01

Framework Overview

Why five dimensions, and why they combine the way they do.

The AI-Native Score measures demonstrated AI capability on a 0–100 scale. It is not a self-assessment. Every component is grounded in observable evidence from the interview: tools mentioned, workflows described, integrations built.

I chose five dimensions because together they span the behaviors that make up AI adoption. Each captures a distinct, non-overlapping signal:

Dimension | Weight | What It Measures | Why It Matters
Highest Capability Tier | 30% | Most advanced tool class actively used | Ceiling of capability. The tier determines the maximum output possible
Usage Sophistication | 25% | How they use tools, not just which ones | Two people using the same tool can produce 10x different output based on prompting skill
Tool Diversity | 20% | Number of distinct tool classes actively used | Breadth of exposure correlates with adaptability as the tool landscape shifts
Self-Directed Learning | 15% | Learned AI independently vs. firm mandate | Predicts future trajectory: self-directed learners climb tiers faster
Integration Depth | 10% | Embedded in daily workflow vs. occasional use | Frequency compounds. Daily users iterate faster and discover emergent capabilities

The score is composite rather than categorical. Two people in the same capability tier can score differently. A consultant who uses ChatGPT daily with custom instructions and projects scores higher than one who opens it once a week for ad hoc questions, even though both are in the same tier.

The capability ladder framing emerges from how the tiers relate to each other. Each tier builds on the one below: you cannot automate what you haven’t learned to prompt, and you can’t run agentic tools without pipeline scaffolding. The tiers are a ladder, not a menu. Stalling at any rung blocks growth above it.

Why not just measure capability tier?
Because tier alone misses the most important signal: how someone uses their tools. In my data, the highest-scoring participant (62/100) was in the same capability tier as participants scoring 20/100. The difference was sophistication, diversity, and integration depth. A single-axis measure would have placed them in the same bucket.
02

Scoring Codebook

The full rubric. Every level, every weight, every anchor example.

AI-Native Score (0–100)

I score each component independently from interview evidence and sum the points. The capability tier is scored 0–100 and multiplied by its 30% weight; the other four components carry rubric points already scaled to their weights (maximums of 25, 20, 15, and 10). No normalization or curve is applied. The scale is absolute.

1 Highest Capability Tier 30%

Tier | Class | Tools | Points
0 | None | No AI tools | 0
1 | Everyday Chat | ChatGPT (free/Plus basic), Gemini basic, firm internal chatbots, Microsoft Copilot basic | 25
2 | Advanced Prompting | ChatGPT Pro, Deep Research, Claude (artifacts, projects), Perplexity Pro, NotebookLM | 50
3 | Automation & Agentic | Claude Code, Cursor, Codex CLI, Windsurf, n8n, Zapier, Make, custom API integrations, workflow automation builders | 100
Tier is determined by the highest class of tool the participant has actively used (not just heard of). Evidence: specific tool names, described workflows, mentioned outputs.

2 Usage Sophistication 25%

Level | Points | Evidence Required | Interview Example
Builds Workflows | 25 | Creates multi-step automations, chains tools, builds custom integrations | “I built a pipeline that takes the sourcing data, runs it through Claude, then pushes to our CRM”
Advanced Prompting | 18 | System prompts, structured outputs, iterative refinement, custom instructions | “I set up a project with specific role instructions and context documents”
Basic Prompting | 10 | Simple Q&A, copy-paste results, basic conversations | “I ask it to summarize things for me, draft an email”
Copy & Paste | 5 | Pastes content in, copies output, no prompt craft | “I just throw the document in and see what comes back”
None | 0 | No usage |

3 Tool Diversity 20%

Distinct Tool Classes | Points
4+ classes | 20
3 classes | 15
2 classes | 10
1 class | 5
0 classes | 0

4 Self-Directed Learning 15%

Behavior | Points | Interview Evidence
Self-taught, explores independently | 15 | “Nobody showed me — I just started experimenting”
Mix of firm training + self-exploration | 10 | “The firm gave us a workshop but I went further on my own”
Only uses what firm provides/mandates | 5 | “I use whatever tools the firm gives us”
No learning initiative | 0 |

5 Integration Depth 10%

Behavior | Points
AI embedded in daily workflow, cannot work without it | 10
Regular use but could function without it | 7
Occasional/situational use | 4
Tried once or twice | 1
Never used | 0
03

AI-Assisted Coding

How an LLM applies a fixed codebook to interview data, and where humans stay in the loop.

Every interview follows the same pipeline. A human conducts and records a 30–60 minute semi-structured interview. Then Claude (Anthropic’s Sonnet 4.6) applies the scoring codebook to the full interview record, extracting 27 structured fields in a single pass: demographics, AI capability assessment, tool awareness, adoption barriers, workflow descriptions, and quantified impact metrics. For research purposes, the key output is the AI-Native Score: the 5-component weighted composite described in Section 2. The human reviews every extraction and can override any field.

The LLM acts as a coder, not an analyst. The rubric defines every level, every threshold, every decision boundary. The LLM’s job is to apply the codebook consistently, finding the interview evidence that maps to each rubric level. It does not invent categories or create novel assessments.

Pipeline From interview to scored record
Human Interview → Interview Record → Sonnet 4.6 + Codebook → Structured JSON → Human Review → SQLite DB

The extraction is a single API call. The system prompt embeds the full rubric and schema; the user message contains only the interview transcript:

Prompt Architecture
// System prompt (cached across extractions)
system: "You are an expert qualitative researcher.
  Extract structured data from this interview
  following the rubric and schema below exactly.
  Return ONLY valid JSON matching the schema."

// Full scoring rubric embedded in system prompt
+ [complete scoring-rubric.md]

// Full JSON schema embedded in system prompt
+ [extraction-schema.json]

// Model config
model:       claude-sonnet-4-6
temperature: 0
max_tokens:  16384

// User message: just the transcript
user: "{interviewTranscript}"

The system prompt is cached across extractions using Anthropic’s prompt caching, so the rubric and schema are sent to the API once and reused, reducing cost per interview significantly.
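As a concrete sketch, the cached call can be expressed as a plain request object in the shape the Messages API expects. This is an illustration, not the production code: `rubricMd` and `schemaJson` are placeholders for the real files, and the `cache_control` block marks the system prompt for Anthropic's prompt caching.

```typescript
// Placeholders standing in for the real rubric and schema documents
const rubricMd = "[complete scoring-rubric.md]";
const schemaJson = "[extraction-schema.json]";

// Builds the Messages API payload: a large, cacheable system block plus a
// per-interview user message containing only the transcript.
function buildExtractionRequest(interviewTranscript: string) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 16384,
    temperature: 0,
    // System blocks marked "ephemeral" are cached across calls, so the
    // rubric and schema are processed once and reused.
    system: [
      {
        type: "text",
        text:
          "You are an expert qualitative researcher. Extract structured " +
          "data from this interview following the rubric and schema below " +
          "exactly. Return ONLY valid JSON matching the schema.\n\n" +
          rubricMd + "\n\n" + schemaJson,
        cache_control: { type: "ephemeral" },
      },
    ],
    // Only this part varies between interviews
    messages: [{ role: "user", content: interviewTranscript }],
  };
}
```

Because only `messages` changes between calls, the large system block hits the cache on every interview after the first.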

What humans review and override
Every extraction is reviewed before entering the database. Common override scenarios: (1) the LLM misclassifies a capability tier when the participant describes a tool ambiguously, (2) the LLM misreads nuanced interview signals, (3) the interview documentation quality is poor and the LLM fills gaps with assumptions. Override rate across my dataset: approximately 8% of scored fields.

The output is validated against a strict schema using Zod, a TypeScript runtime validation library. Scores must be integers within defined bounds. Enums must match the allowed values exactly. If validation fails (a score out of range, a missing required field, an unrecognized category), the extraction is rejected and the pipeline automatically retries, up to three attempts. This mechanical validation catches most systematic errors before a human ever sees the data.
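The production pipeline expresses these rules with Zod; the sketch below is a hand-rolled equivalent of the same checks (integer bounds, exact enum match) plus the three-attempt retry loop. The field names and level labels here are hypothetical, for illustration only.

```typescript
// Hand-rolled stand-in for the Zod schema. Field names are hypothetical.
type Extraction = { aiNativeScore: number; usageSophistication: string };

// Illustrative enum values; the real allowed set lives in the schema file
const SOPHISTICATION_LEVELS = [
  "none", "copy_paste", "basic_prompting", "advanced_prompting", "builds_workflows",
];

function validate(raw: unknown): Extraction | null {
  const r = raw as Partial<Extraction> | null;
  if (r == null || typeof r !== "object") return null;
  // Scores must be integers within defined bounds
  const score = r.aiNativeScore;
  if (typeof score !== "number" || !Number.isInteger(score) || score < 0 || score > 100)
    return null;
  // Enums must match the allowed values exactly
  const level = r.usageSophistication;
  if (typeof level !== "string" || !SOPHISTICATION_LEVELS.includes(level)) return null;
  return { aiNativeScore: score, usageSophistication: level };
}

// If validation fails, the extraction is rejected and retried, up to three attempts
async function extractWithRetry(
  runExtraction: () => Promise<unknown>,
  maxAttempts = 3,
): Promise<Extraction> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = validate(await runExtraction());
    if (parsed !== null) return parsed; // valid: proceeds to human review
  }
  throw new Error(`extraction failed validation after ${maxAttempts} attempts`);
}
```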

04

Reproducibility Results

10 interviews, 5 runs each. How stable is the AI coder?

To test whether the LLM-as-coder produces consistent scores, I ran a test–retest experiment. I selected 10 interviews stratified across firm types and score ranges, then ran the full extraction pipeline 5 times per interview with identical prompts. No caching, no seed. Each run is an independent call to the API.

Protocol 10 interviews selected for reproducibility testing
# | Firm Type | AI-Native Score
1 | Research | 31.6
2 | MBB | 43.2
3 | GovTech | 44.6
4 | MBB | 51.2
5 | Big 4 | 56.8
6 | MBB | 58.4
7 | Independent | 58.8
8 | Government | 59.6
9 | Boutique | 63.0
10 | VCPE | 73.2
Results Scoring precision across 50 independent runs
±3.2 points
Average scoring variation on a 100-point scale, across 5 independent runs per interview
ICC = 0.92 “Excellent” agreement per Cicchetti (1994). Target was > 0.85.
Exhibit Individual scoring runs: 5 independent extractions per interview
[Strip chart: AI-Native Score (0–100), one point per independent scoring run (n = 50), grouped by interview: Research, MBB-1, GovTech, MBB-2, Big 4, MBB-3, Indep., Gov, Boutique, VCPE]
All 50 data points cluster in a narrow band of the 0–100 scale. Average variation per interview: ±3.2 points.
Source: Olito Labs Reproducibility Experiment. 10 interviews × 5 runs = 50 independent API calls. Claude Sonnet 4.6, temperature 0, no caching.

The AI-Native Score shows excellent reproducibility. The intraclass correlation coefficient (ICC(3,1) = 0.92) exceeds my pre-registered threshold of 0.85 (“excellent” per Cicchetti 1994). Average scoring variation was just ±3.2 points on a 100-point scale, driven primarily by one outlier interview (14-point range) whose record contained ambiguous signals about tool sophistication. Excluding that interview, mean SD drops to 2.7.
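For readers replicating the analysis, ICC(3,1) (two-way mixed, consistency, single measure) can be computed from the 10 × 5 score matrix as below. This is a minimal sketch under that definition, not the repository's analysis code.

```typescript
// ICC(3,1): consistency of single ratings, raters fixed. `scores[i][j]` is
// the score for interview i on run j (here a 10 x 5 matrix).
function icc31(scores: number[][]): number {
  const n = scores.length;     // subjects (interviews)
  const k = scores[0].length;  // raters (runs)
  const grand = scores.flat().reduce((a, b) => a + b, 0) / (n * k);
  const rowMeans = scores.map((r) => r.reduce((a, b) => a + b, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    scores.reduce((a, r) => a + r[j], 0) / n);

  // Decompose total variation into subject, rater, and residual sums of squares
  const ssRows = k * rowMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssCols = n * colMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssTotal = scores.flat().reduce((a, x) => a + (x - grand) ** 2, 0);
  const ssError = ssTotal - ssRows - ssCols;

  const msRows = ssRows / (n - 1);
  const msError = ssError / ((n - 1) * (k - 1));
  return (msRows - msError) / (msRows + (k - 1) * msError);
}
```

With perfectly consistent runs the residual mean square is zero and the ICC is 1; run-to-run noise inflates the error term and pulls the coefficient down.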

Categorical fields show strong agreement across all 50 runs:

Measure | Statistic | Value | Interpretation
Firm type classification | Fleiss’ κ | 1.00 | Perfect agreement
Usage sophistication | Fleiss’ κ | 0.83 | Substantial agreement
Capability tier | Fleiss’ κ | 0.67 | Moderate: boundary cases between Tier 2 and 3
AI-Native Score | ICC(3,1) | 0.92 | Excellent consistency
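Fleiss’ κ extends Cohen’s κ to more than two raters; in this design each of the 5 runs acts as a rater over the 10 interviews. A minimal sketch of the computation (category labels are illustrative):

```typescript
// Fleiss' kappa. `ratings[i]` holds the category labels the k runs
// assigned to subject i (all subjects rated by the same number of runs).
function fleissKappa(ratings: string[][]): number {
  const N = ratings.length;     // subjects
  const k = ratings[0].length;  // raters (runs)
  const categories = Array.from(new Set(ratings.flat()));

  // counts[i][j] = how many runs placed subject i in category j
  const counts = ratings.map((row) =>
    categories.map((c) => row.filter((x) => x === c).length));

  // Mean observed agreement across subjects
  const pBar = counts.reduce((acc, row) => {
    const pi = (row.reduce((a, c) => a + c * c, 0) - k) / (k * (k - 1));
    return acc + pi / N;
  }, 0);

  // Chance agreement from the marginal category proportions
  const pe = categories.reduce((acc, _, j) => {
    const pj = counts.reduce((a, row) => a + row[j], 0) / (N * k);
    return acc + pj * pj;
  }, 0);

  return (pBar - pe) / (1 - pe);
}
```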

The raw data for all 50 extractions is available in the open-source repository.

05

Limitations

What this methodology can and cannot tell you.

I designed this framework to be rigorous within its scope. It has limitations, and I state them openly so readers can judge for themselves.

  • Sample Bias My 50 participants are not a representative sample of knowledge workers. They skew toward consulting, finance, and technology, all industries with higher-than-average AI exposure. Most were recruited through professional networks, which introduces selection bias toward people willing to discuss AI usage. Generalization beyond this cohort should be cautious.
  • Single-Model Coder One model family (Claude) performed all extractions. Different LLMs might produce systematically different scores. I test reproducibility within-model (same model, multiple runs) but not across-model. Future work should include cross-model validation.
  • Self-Report Data The AI-Native Score is grounded in what participants describe, not what they demonstrably do. People may overstate or understate their tool usage. I mitigate this by looking for specific evidence (tool names, described workflows, concrete outputs) rather than self-assessments, but overstatement and understatement are still possible.
  • Interview Documentation Quality Automated documentation of interviews introduces errors, particularly in technical terms (tool names, programming concepts). Poor audio quality amplifies this. I manually correct obvious errors in tool names and technical vocabulary, but some context loss is inevitable.
  • Point-in-Time Measurement AI capability is changing rapidly. Scores measured in February 2026 reflect the tool landscape at that moment. Tools that are “Tier 3” today may become baseline within months. The framework is designed to be recalibrated, but any specific score is a snapshot.
  • Interview Framing Effects The semi-structured interview format means different interviewers may elicit different levels of detail. A participant who is not asked about automation tools won’t describe them, which could lower their score. I use a standard question protocol to minimize this, but interviewers still vary.
06

Open-Source Artifacts

Everything you need to replicate, extend, or critique this research.

07

Score Calculator

A simplified self-assessment. Answer four questions for a rough estimate of where you land.

This is a simplified self-assessment version of the scoring rubric described in Section 2. It applies the same five components and weights, but uses your own answers rather than evidence extracted from a structured interview. Treat it as a rough orientation, not a precise measurement. A full standalone version, with longer-form questions and LLM-assisted analysis for more accurate scoring, is in development.

Step 1 of 4

Which AI tools do you actively use?

Select all that apply. The calculator derives your capability tier and tool diversity from this.

Tier 1 Chat & Writing
ChatGPT Free/Plus
Google Gemini
Microsoft Copilot
Perplexity
Firm internal chatbot
Other
Tier 2 Research & Power Tools
ChatGPT Pro / higher thinking
Claude Project modes
Deep Research
NotebookLM
Perplexity Labs
Other
Tier 3 Agentic
Claude Code
Claude Cowork
Cursor
Codex
Devin
Other
None of these

When you use AI, which best describes your typical approach?

Pick the one that sounds most like you on a normal workday.

Copy & Paste
I paste content in and use whatever comes back
Basic Prompting
I ask questions, request summaries, draft emails and other straightforward tasks
Advanced Prompting
I craft detailed prompts, use custom instructions or projects, iterate on outputs
Builds Workflows
I chain tools together, build multi-step automations, or create systems

How did you learn to use AI tools?

Think about where your current skills actually came from.

Firm-provided
My company introduced them and I use what’s provided
Mix
Started with what was offered, then explored further on my own
Self-taught
Entirely self-directed. I found and learned tools independently

How embedded is AI in your daily work?

Be honest: where does AI actually sit in your routine?

Tried a few times
I’ve experimented but it’s not part of my routine
Occasional
I use it for specific tasks when I think of it
Regular
It’s part of my routine. I use it most days
Essential
I can’t imagine working without it. It’s embedded in everything
AI-Native Score: 0 / 100
Component Breakdown
Tier: 0 / 30
Sophistication: 0 / 25
Diversity: 0 / 20
Learning: 0 / 15
Integration: 0 / 10

Formula: Score = (Tier × 0.30) + Sophistication + Diversity + Learning + Integration. Tier is scored 0–100 and weighted to contribute up to 30 points; the other components contribute their rubric points directly (max 25 + 20 + 15 + 10 = 70).
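As a sketch, that formula maps directly to code; component points come from the rubric tables in Section 2, and the variable names below are mine, not the repository's.

```typescript
// Component points as defined by the rubric in Section 2
interface Components {
  tier: number;           // 0, 25, 50, or 100 (tier class)
  sophistication: number; // 0-25
  diversity: number;      // 0-20
  learning: number;       // 0-15
  integration: number;    // 0-10
}

// Tier contributes up to 30 points via its 30% weight; the other four
// components carry their rubric points directly (max 70).
function aiNativeScore(c: Components): number {
  return c.tier * 0.3 + c.sophistication + c.diversity + c.learning + c.integration;
}
```

For example, a Tier 2 user (50 points) with advanced prompting (18), three tool classes (15), mixed learning (10), and regular use (7) scores 50 × 0.30 + 18 + 15 + 10 + 7 = 65.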