Olito Labs
The complete codebook, rubric, and reproducibility results behind the AI-Native Score. Built to be audited, replicated, and extended.
Why five dimensions, and why they combine the way they do.
The AI-Native Score measures demonstrated AI capability on a 0–100 scale. It is not a self-assessment. Every component is grounded in observable evidence from the interview: tools mentioned, workflows described, integrations built.
I chose five dimensions because together they cover AI adoption without redundancy. Each captures a distinct signal:
| Dimension | Weight | What It Measures | Why It Matters |
|---|---|---|---|
| Highest Capability Tier | 30% | Most advanced tool class actively used | Ceiling of capability. The tier determines the maximum output possible |
| Usage Sophistication | 25% | How they use tools, not just which ones | Two people using the same tool can produce 10x different output based on prompting skill |
| Tool Diversity | 20% | Number of distinct tool classes actively used | Breadth of exposure correlates with adaptability as the tool landscape shifts |
| Self-Directed Learning | 15% | Learned AI independently vs. firm mandate | Predicts future trajectory: self-directed learners climb tiers faster |
| Integration Depth | 10% | Embedded in daily workflow vs. occasional use | Frequency compounds. Daily users iterate faster and discover emergent capabilities |
The score is composite rather than categorical: two people in the same capability tier can score differently. A consultant who uses ChatGPT daily with custom instructions and projects scores higher than one who opens it once a week for ad hoc questions.
The capability ladder framing emerges from how the tiers relate to each other. Each tier builds on the one below: you cannot automate what you haven’t learned to prompt, and you can’t run agentic tools without pipeline scaffolding. The tiers are a ladder, not a menu. Stalling at any rung blocks growth above it.
The full rubric. Every level, every weight, every anchor example.
I score each component independently from interview evidence, multiply each by its weight, and sum the results. No normalization or curve is applied; the scale is absolute.
| Tier | Class | Tools | Points |
|---|---|---|---|
| 0 | None | No AI tools | 0 |
| 1 | Everyday Chat | ChatGPT (free/Plus basic), Gemini basic, firm internal chatbots, Microsoft Copilot basic | 25 |
| 2 | Advanced Prompting | ChatGPT Pro, Deep Research, Claude (artifacts, projects), Perplexity Pro, NotebookLM | 50 |
| 3 | Automation & Agentic | Claude Code, Cursor, Codex CLI, Windsurf, n8n, Zapier, Make, custom API integrations, workflow automation builders | 100 |
| Usage Sophistication Level | Points | Evidence Required | Interview Example |
|---|---|---|---|
| Builds Workflows | 25 | Creates multi-step automations, chains tools, builds custom integrations | “I built a pipeline that takes the sourcing data, runs it through Claude, then pushes to our CRM” |
| Advanced Prompting | 18 | System prompts, structured outputs, iterative refinement, custom instructions | “I set up a project with specific role instructions and context documents” |
| Basic Prompting | 10 | Simple Q&A, copy-paste results, basic conversations | “I ask it to summarize things for me, draft an email” |
| Copy & Paste | 5 | Pastes content in, copies output, no prompt craft | “I just throw the document in and see what comes back” |
| None | 0 | No usage | — |
| Distinct Tool Classes | Points |
|---|---|
| 4+ classes | 20 |
| 3 classes | 15 |
| 2 classes | 10 |
| 1 class | 5 |
| 0 classes | 0 |
| Self-Directed Learning Behavior | Points | Interview Evidence |
|---|---|---|
| Self-taught, explores independently | 15 | “Nobody showed me — I just started experimenting” |
| Mix of firm training + self-exploration | 10 | “The firm gave us a workshop but I went further on my own” |
| Only uses what firm provides/mandates | 5 | “I use whatever tools the firm gives us” |
| No learning initiative | 0 | — |
| Integration Depth | Points |
|---|---|
| AI embedded in daily workflow, cannot work without it | 10 |
| Regular use but could function without | 7 |
| Occasional/situational use | 4 |
| Tried once or twice | 1 |
| Never used | 0 |
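Putting the tables together, the composite can be computed as below. This is an illustrative sketch, not the repository's actual code; it assumes the tier score (0–100) is scaled by its 30% weight, while the other four tables already award weight-scaled points (their maxima of 25, 20, 15, and 10 sum with the tier's 30 to exactly 100).

```typescript
// Illustrative sketch of the AI-Native Score composite (not the repo's code).
// Assumes tier points (0-100) carry the 30% weight explicitly, while the other
// component tables already award weight-scaled points.
interface ComponentScores {
  tierPoints: number;           // 0, 25, 50, or 100
  sophisticationPoints: number; // 0-25
  diversityPoints: number;      // 0-20
  learningPoints: number;       // 0-15
  integrationPoints: number;    // 0-10
}

const TIER_WEIGHT = 0.30;

function aiNativeScore(s: ComponentScores): number {
  return (
    s.tierPoints * TIER_WEIGHT +
    s.sophisticationPoints +
    s.diversityPoints +
    s.learningPoints +
    s.integrationPoints
  );
}

// Example: Tier 2 (50 pts), advanced prompting (18), 3 tool classes (15),
// self-taught (15), regular-but-not-essential use (7)
const example = aiNativeScore({
  tierPoints: 50,
  sophisticationPoints: 18,
  diversityPoints: 15,
  learningPoints: 15,
  integrationPoints: 7,
}); // 15 + 18 + 15 + 15 + 7 = 70
```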
How an LLM applies a fixed codebook to interview data, and where humans stay in the loop.
Every interview follows the same pipeline. A human conducts and records a 30–60 minute semi-structured interview. Then Claude (Anthropic’s Sonnet 4.6) applies the scoring codebook to the full interview record, extracting 27 structured fields in a single pass: demographics, AI capability assessment, tool awareness, adoption barriers, workflow descriptions, and quantified impact metrics. For research purposes, the key output is the AI-Native Score: the 5-component weighted composite described in Section 2. The human reviews every extraction and can override any field.
The LLM acts as a coder, not an analyst. The rubric defines every level, every threshold, every decision boundary. The LLM’s job is to apply the codebook consistently, finding the interview evidence that maps to each rubric level. It does not invent categories or create novel assessments.
The extraction is a single API call. The system prompt embeds the full rubric and schema; the user message contains only the interview transcript:
```
// System prompt (cached across extractions)
system: "You are an expert qualitative researcher. Extract structured data
  from this interview following the rubric and schema below exactly.
  Return ONLY valid JSON matching the schema."

// Full scoring rubric embedded in system prompt
+ [complete scoring-rubric.md]

// Full JSON schema embedded in system prompt
+ [extraction-schema.json]

// Model config
model: claude-sonnet-4-6
temperature: 0
max_tokens: 16384

// User message: just the transcript
user: "{interviewTranscript}"
```
The system prompt is cached across extractions using Anthropic’s prompt caching, so the rubric and schema are sent to the API once and reused, reducing cost per interview significantly.
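A minimal sketch of how such a request body could be assembled. The field names follow Anthropic's Messages API (including the `cache_control` block that marks the system prompt as cacheable); `RUBRIC_MD`, `SCHEMA_JSON`, and the helper name are placeholders, not the repository's actual code.

```typescript
// Sketch of a Messages API request body with prompt caching.
// RUBRIC_MD and SCHEMA_JSON are placeholders for the real rubric and schema.
const RUBRIC_MD = "...";   // contents of scoring-rubric.md
const SCHEMA_JSON = "..."; // contents of extraction-schema.json

function buildExtractionRequest(transcript: string) {
  return {
    model: "claude-sonnet-4-6",
    temperature: 0,
    max_tokens: 16384,
    system: [
      {
        type: "text",
        text:
          "You are an expert qualitative researcher. Extract structured data " +
          "from this interview following the rubric and schema below exactly. " +
          "Return ONLY valid JSON matching the schema.\n\n" +
          RUBRIC_MD + "\n\n" + SCHEMA_JSON,
        // Marks the identical system prompt as cacheable across calls
        cache_control: { type: "ephemeral" },
      },
    ],
    // Only this part changes between interviews
    messages: [{ role: "user", content: transcript }],
  };
}
```

The body would then be sent via the official SDK or a POST to `/v1/messages`; because only the user message varies per interview, the cached system prompt (rubric plus schema) is reused on every call.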
The output is validated against a strict schema using Zod, a TypeScript runtime validation library. Scores must be integers within defined bounds. Enums must match the allowed values exactly. If validation fails (a score out of range, a missing required field, an unrecognized category), the extraction is rejected and the pipeline automatically retries, up to three attempts. This mechanical validation catches most systematic errors before a human ever sees the data.
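The validate-and-retry control flow looks roughly like this. The real pipeline validates with a Zod schema; this dependency-free sketch hand-rolls two equivalent checks (integer bounds, exact enum match), uses illustrative field names, and simplifies the asynchronous API call to a synchronous callback so the loop is easy to follow.

```typescript
// Dependency-free sketch of the validation-and-retry loop. The real pipeline
// uses a Zod schema; validateExtraction() hand-rolls two equivalent checks.
// Field names are illustrative; firm types are taken from the results table.
const FIRM_TYPES = ["MBB", "Big 4", "Boutique", "Independent",
                    "Government", "GovTech", "VCPE", "Research"];

interface Extraction {
  aiNativeScore: number;
  firmType: string;
}

function validateExtraction(data: Extraction): string[] {
  const errors: string[] = [];
  // Scores must be integers within defined bounds
  if (!Number.isInteger(data.aiNativeScore) ||
      data.aiNativeScore < 0 || data.aiNativeScore > 100) {
    errors.push(`aiNativeScore out of bounds: ${data.aiNativeScore}`);
  }
  // Enums must match the allowed values exactly
  if (!FIRM_TYPES.includes(data.firmType)) {
    errors.push(`unrecognized firmType: ${data.firmType}`);
  }
  return errors;
}

// callModel stands in for the (async, API-calling) extraction step.
function extractWithRetry(callModel: () => Extraction, maxAttempts = 3): Extraction {
  let lastErrors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const data = callModel();
    lastErrors = validateExtraction(data);
    if (lastErrors.length === 0) return data; // passed all checks
  }
  throw new Error(
    `extraction rejected after ${maxAttempts} attempts: ${lastErrors.join("; ")}`,
  );
}
```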
10 interviews, 5 runs each. How stable is the AI coder?
To test whether the LLM-as-coder produces consistent scores, I ran a test–retest experiment. I selected 10 interviews stratified across firm types and score ranges, then ran the full extraction pipeline 5 times per interview with identical prompts. No caching, no seed. Each run is an independent call to the API.
| # | Firm Type | AI-Native Score (mean of 5 runs) |
|---|---|---|
| 1 | Research | 31.6 |
| 2 | MBB | 43.2 |
| 3 | GovTech | 44.6 |
| 4 | MBB | 51.2 |
| 5 | Big 4 | 56.8 |
| 6 | MBB | 58.4 |
| 7 | Independent | 58.8 |
| 8 | Government | 59.6 |
| 9 | Boutique | 63.0 |
| 10 | VCPE | 73.2 |
The AI-Native Score shows excellent reproducibility. The intraclass correlation coefficient (ICC(3,1) = 0.92) exceeds my pre-registered threshold of 0.85 (“excellent” per Cicchetti 1994). Average scoring variation was just ±3.2 points on a 100-point scale, driven primarily by one outlier interview (14-point range) whose record contained ambiguous signals about tool sophistication. Excluding that interview, mean SD drops to 2.7.
Categorical fields show strong agreement across all 50 runs:
| Measure | Statistic | Value | Interpretation |
|---|---|---|---|
| Firm type classification | Fleiss’ κ | 1.00 | Perfect agreement |
| Usage sophistication | Fleiss’ κ | 0.83 | Substantial agreement |
| Capability tier | Fleiss’ κ | 0.67 | Moderate: boundary cases between Tier 2 and 3 |
| AI-Native Score | ICC(3,1) | 0.92 | Excellent consistency |
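For reference, ICC(3,1) follows the standard Shrout–Fleiss two-way ANOVA decomposition over the 10×5 score matrix (interviews × runs). A self-contained sketch of that computation (illustrative, not the repository's analysis code):

```typescript
// ICC(3,1): two-way mixed-effects model, consistency, single measurement.
// ratings[i][j] = score for target i (interview) from rater j (run).
function icc31(ratings: number[][]): number {
  const n = ratings.length;    // targets (interviews)
  const k = ratings[0].length; // raters (runs)
  const all = ratings.flat();
  const grand = all.reduce((a, b) => a + b, 0) / (n * k);

  const rowMeans = ratings.map(r => r.reduce((a, b) => a + b, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    ratings.reduce((a, r) => a + r[j], 0) / n,
  );

  // Sum-of-squares decomposition: total = rows + columns + error
  const ssTotal = all.reduce((a, x) => a + (x - grand) ** 2, 0);
  const ssRows = k * rowMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssCols = n * colMeans.reduce((a, m) => a + (m - grand) ** 2, 0);
  const ssError = ssTotal - ssRows - ssCols;

  const msRows = ssRows / (n - 1);
  const msError = ssError / ((n - 1) * (k - 1));
  return (msRows - msError) / (msRows + (k - 1) * msError);
}
```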
The raw data for all 50 extractions is available in the open-source repository.
What this methodology can and cannot tell you.
I designed this framework to be rigorous within its scope. It has limitations, and I state them openly so readers can judge for themselves.
Everything you need to replicate, extend, or critique this research.
I publish the complete scoring apparatus so flaws can be found and improvements can be made. All artifacts are available in the GitHub repository.
The repository contains the full TypeScript extraction pipeline: scoring rubric, JSON schema, prompt architecture, Zod validation, and a CLI that processes transcripts individually or in batch. Point Claude Code at the repo with a folder of transcripts and it handles everything.
A simplified self-assessment. Answer four questions for a rough estimate of where you land.
This is a simplified self-assessment version of the scoring rubric described in Section 2. It applies the same five components and weights, but uses your own answers rather than evidence extracted from a structured interview. Treat it as a rough orientation, not a precise measurement. A full standalone version, with longer-form questions and LLM-assisted analysis for more accurate scoring, is in development.