Published
Methodology

Executive Summary

Workera's Skills Inferencing System introduces continuous, observation-based skill measurement as a complement to Workera's established point-in-time assessment platform.

Where traditional assessments capture what a person can do under test conditions, Skills Inferencing captures what a person actually does in the course of their daily work. The system operates by collecting evidence from workplace tools — Slack conversations, email communications, meeting transcripts, code contributions, and project management activity — classifying that evidence against a structured behavior taxonomy using an LLM-powered evidence-gathering agent, and then scoring each behavior using a fully deterministic, formula-based scoring engine. The architecture enforces a strict privacy boundary: raw evidence never leaves the individual's device, while only numeric scores are transmitted to the organizational dashboard. The current system supports the Observed signal type (from automated scans of workplace tools). Additional signal types — Manager-Rated and Peer-Rated — are planned but not yet implemented. When combined with the Assessed signal from the Workera assessment platform, the system produces skill profiles that grow more reliable over time. This methodology documents the theoretical foundations, technical architecture, scoring methodology, and operational guidance for deploying Skills Inferencing as part of a comprehensive workforce intelligence program.

1. Introduction

1.1 Why Observation-Based Skill Signals

Traditional skill measurement operates on a test-and-report cycle: an individual completes an assessment at a point in time, receives a score, and that score represents their capability until the next assessment. This model, while psychometrically rigorous, suffers from two structural limitations. First, assessments are episodic — they capture a snapshot but miss the continuous evolution of skills that occurs through daily practice. Second, assessments measure potential under controlled conditions, not application in authentic work contexts.

Observation-based skill signals address both limitations. By continuously monitoring how individuals apply skills in their actual work — the prompts they write, the AI tools they deploy, the strategic recommendations they make, the knowledge they share with colleagues — the system builds a living, evolving skill profile. This approach draws on Evidence-Centered Design (ECD), the psychometric framework developed by Mislevy, Steinberg, and Almond (2003), which holds that valid inferences about competence require a principled chain from observable evidence to latent constructs.

In the ECD framework, every inference must be traceable: from an observable work product (a Slack message explaining transformer architecture) through an evidence rule (this constitutes verbal evidence of the "Describe transformers" behavior) to a student model update (the behavior score moves from 0 to 1). Skills Inferencing implements this chain with formal rigor, ensuring that every score change can be audited back to its originating evidence.

1.2 Relationship to Workera's Assessment System

Skills Inferencing is not a replacement for Workera's AI-native assessment and scoring system; it is a complement. The assessment system provides high-validity, proctored measurement of knowledge and applied skill. Skills Inferencing provides continuous, ambient measurement of skill application in authentic contexts. Together, they form a two-lens model:

Table comparing Assessment System and Skills Inferencing across six dimensions. Measurement type: point-in-time proctored vs continuous ambient. What it measures: what you can do vs what you actually do. Evidence source: standardized items vs workplace artifacts. Cadence: quarterly or on-demand vs roughly every 4 hours during business hours. Scoring model: IRT adaptive vs deterministic rule-based. Composite weight: higher as highest-validity signal vs lower as corroborating signal.

When both signals agree — a person scores well on an assessment and demonstrates the skill in practice — confidence in the inference is high. When they diverge — strong assessment score but no observed application, or vice versa — the gap itself is informative and becomes the basis for targeted development.

1.3 System Model Overview

The Skills Inferencing System comprises two components: an on-device desktop app and a central admin server.

Architecture diagram showing two components. Desktop App — Ambient — includes Scanner, Classifier, Scoring Engine, and UI. Runs on the user's device, scans connected tools, classifies evidence with a large language model, uses a deterministic scorer, high-water marks, and a coach. Syncs scores only with no evidence sent to server. Admin Server provides Organizational Dashboard and Framework Source, covering team scores, behavior gap analysis, program and framework management, and invitations.
Two-component architecture: the Ambient desktop app (evidence collection, classification, and scoring — all on-device) and the admin server (scores, analytics, and per-organization framework configuration).

The Desktop App (shipped as Ambient by Workera) runs on the user's machine. On a recurring schedule — by default a few times a day during business hours (Monday–Friday, 9am–6pm) — it scans the user's connected workplace tools on-device. Connections are made through secure, user-authorized connectors, covering tools such as Slack, Gmail, Google Calendar, Google Docs, Google Drive, GitHub, Linear, Asana, Notion, Granola, Zoom, Gong, and the user's own AI-assistant usage. The set of available connectors is extensible rather than fixed.

For each scan the app gathers recent artifacts, then uses a large language model to classify each artifact against the user's behavior taxonomy and assign an evidence type. Classification is the onlyplace AI judgment is used; the model never assigns scores. After classification, the app merges new evidence into the local profile, runs the deterministic scoring formula across the user's assigned behaviors, applies high-water-mark protections, and syncs the resulting scores (without evidence) to the admin server. It also provides the user-facing interface: profile view, a coach tab with personalized growth recommendations, and an activity feed.

The Admin Server is the backend that stores team-level data: people, the organization's program/behavior framework, daily scores, and capability scores. The desktop app fetches each person's assigned behaviors from this server, so the framework is defined centrally and per organization (see Section 5). The server never receives or stores evidence. The admin dashboard provides organizational analytics, behavior gap analysis, and framework management.

2. Multi-Signal Architecture

2.1 Signal Types and Weighting

The Skills Inferencing System is designed to ingest four distinct signal types, each offering a different lens on the same underlying skill constructs. This multi-signal approach is grounded in the psychometric principle that convergent evidence from independent sources produces more reliable inferences than any single measurement method.

Table comparing four signal types across source, characteristics, strengths, and limitations. Observed: automated scan of workplace tools; continuous, ambient, artifact-based; strengths are authentic context, no test anxiety, captures applied skill; limitations are bias toward visible work and limited tool coverage. Manager-Rated (Planned): manager review via admin dashboard; expert judgment, periodic; strengths are contextual knowledge and performance context; limitations are halo effect, availability bias, and rater calibration. Assessed: Workera assessment platform; standardized, proctored, IRT-scored; strengths are high validity, calibrated difficulty, and standardized; limitations are point-in-time measurement and test conditions differing from work conditions. Peer-Rated (Planned): peer feedback and third-party evidence; social, corroborative; strengths are independent confirmation and capturing collaborative skills; limitations are reciprocity bias and social dynamics.

When multiple signal types are available for a given skill area, they are combined into a composite score. Currently, only the Observed signal is implemented; Assessed, Manager-Rated and Peer-Rated are planned. Each signal contributes according to its relative validity and reliability — Assessed highest (standardized and proctored), then Observed (authentic application), then the Manager- and Peer-Rated signals (contextual and corroborating). The exact weights are calibrated over time and normalized across whichever signals are present; in early deployment, observation may be the only signal, in which case the composite is simply the Observed score. The system becomes progressively more reliable as additional signals are added.

3. Evidence-Centered Observation

3.1 Evidence Sources

The system collects evidence from a broad and growing set of workplace tools, each connected through a secure connector that the user authorizes. (Connectors are added over time; the table below lists the most common categories rather than a fixed set.) The more integrations a user has connected, the broader the evidence coverage. Evidence is gathered for whichever behaviors the organization's framework defines — not only AI skills (see Section 5).

Table comparing four signal types across source, characteristics, strengths, and limitations. Observed: automated scan of workplace tools; continuous, ambient, artifact-based; strengths are authentic context, no test anxiety, captures applied skill; limitations are bias toward visible work and limited tool coverage. Manager-Rated (Planned): manager review via admin dashboard; expert judgment, periodic; strengths are contextual knowledge and performance context; limitations are halo effect, availability bias, and rater calibration. Assessed: Workera assessment platform; standardized, proctored, IRT-scored; strengths are high validity, calibrated difficulty, and standardized; limitations are point-in-time measurement and test conditions differing from work conditions. Peer-Rated (Planned): peer feedback and third-party evidence; social, corroborative; strengths are independent confirmation and capturing collaborative skills; limitations are reciprocity bias and social dynamics.

The specific connected tools within each category expand over time. Coverage varies by which connectors each user has authorized.

3.2 Evidence Collection Methodology

Evidence collection runs on-device, inside the desktop app. On a recurring schedule — by default a few times a day during business hours (Monday–Friday, 9am–6pm; the interval is configurable) — the app sweeps all connected tools, fetches recent artifacts, and passes them to a large language model for classification.

The collection process follows a strict methodology:

  1. Temporal window. Each scan searches for activity since the previous scan (on first run, a short backfill window). This prevents re-processing of evidence and ensures clean deduplication across scans.
  2. Evidence attribution. Each evidence item records the source, exact timestamp, channel or context, raw snippet (when available), and a second-person description framed as a coach addressing the user ("You explained the difference between GPT-4 and Claude for JSON output in #epd-triad").
  3. Behavior mapping. Each evidence item is mapped to the single most relevant behavior ID from the user's assigned framework (configured per organization — see Section 5). The mapping must pass a relevance test: the evidence must directly demonstrate the specific behavior, not merely be tangentially related.
  4. Classification. Each evidence item is tagged with one of five evidence types ([D], [V], [T], [I], [N]) according to the classification taxonomy defined in Section 3.3.
  5. Storage. All new evidence is merged into the local profile on the user's device. The classification step never assigns scores — scoring is a separate, deterministic step (Section 4).

Separation of concerns.The scan-and-classify step is deliberately constrained to evidence collection and classification only. Scoring is performed exclusively by the deterministic engine, also in the desktop app (Section 4). This separation ensures that AI judgment is bounded to classification — a task where LLMs excel — while scoring follows a transparent, auditable formula.

3.3 Evidence Classification Taxonomy

Every evidence item is assigned exactly one of five types, which carry different evidential weight — reflecting the strength of the inference each supports. Each carries a different weight and requires different judgment criteria:

Table defining six evidence categories with code, weight, definition, and key test. Demonstrated, code D, weight 1.0: the person did the thing — a tangible artifact, deployed system, completed deliverable, or observable action exists; key test is whether you can point to the output. Verbal, code V, weight 0.5: the person discussed, explained, or opined about the concept, showing understanding but no tangible output; key test is whether they talked about it or did it. Third-party, code T, weight 0.75: someone else confirmed, credited, or praised the person's skill via review feedback, public attribution, or recommendations; key test is who the source is and how specific they are. Inferred, code I, weight 0.25: circumstantial evidence that suggests capability but does not prove it, such as registration, tool installation, or adjacent activity; key test is whether we are seeing the skill or something near it. Negative, code N, no weight assigned: evidence that contradicts the skill claim including errors, admissions of inability, and rejection of best practices; key test is whether the artifact shows the skill is absent or weak. Not evidence, no code, weight 0: the artifact contains no skill signal — passive receipt, attendance, forwarding, pleasantries, or administrative messages; key test is whether the artifact would exist regardless of the person's skill level.

The single most important distinction is doing vs. talking: producing an artifact ([D]) is materially stronger than describing the work ([V]). A conversation, however knowledgeable, is Verbal — not Demonstrated.

3.4 The Demonstrated vs. Verbal Distinction

This is the single most important distinction in the system. Getting it wrong inflates profiles and undermines validity.

Demonstrated [D] means the person produced something. There is an artifact — code committed, a document written, a system deployed, a presentation delivered, a process established. The evidence exists independently of the person's claim.

Verbal [V] means the person talked about something. They explained a concept in a meeting, discussed an approach in Slack, or described their understanding in an email. The evidence depends entirely on their words.

The fundamental testIf you removed the person's self-description, would the evidence still exist? If yes, it's Demonstrated. If no, it's Verbal.

Example evidence card labeled Demonstrated and Strong from a Slack message. The message reads: just pushed the new RAG pipeline to prod, indexing the entire knowledge base using recursive chunking with 512-token windows and 64-token overlap, retrieval accuracy improved from 62% to 89% on an internal benchmark, with a link to the pull request. Evidence type is D Demonstrated, skill is B-AI.2.2b Implement RAG, strength is Strong, confidence is High. Classifier note: the PR link and production deployment are verifiable artifacts, technical details including chunking parameters and benchmark improvement confirm deep implementation work, classified as unambiguously Demonstrated.
Example evidence card labeled Verbal and Weak from a Slack message. The message reads: just came back from the AI Engineering Summit, the talks on RAG were eye-opening, suggests looking into hybrid search combining dense embeddings with BM25, several speakers mentioned it as a best practice, planning to write up a future post on the approach. Evidence type is V Verbal, skill is B-AI.2.2b Implement RAG, strength is Weak, confidence is Low. Classifier note: the user attended a conference and is relaying information they heard, they have not built hybrid search, going to write up is future intent not current output, classified as Verbal not Demonstrated and weak because the knowledge is secondhand from speakers.

The "NOT Evidence" Filter

The following are explicitly excluded from evidence collection. They appear AI-related but do not demonstrate any skill:

  • Receiving an email, invitation, or message about AI (passive receipt, not active demonstration)
  • Accepting or declining a meeting or calendar event (scheduling is not a skill)
  • Being mentioned in a conversation without contributing (passive presence)
  • Forwarding an article or link without commentary (no original thought demonstrated)
  • Attending a meeting where AI was discussed but the person did not contribute
  • Having a role that involves AI (this may be [I] at best, never [D] or [V])
  • Using AI tools for routine tasks that anyone would do (e.g., basic spell-check, auto-complete)

The fundamental test: "Did this person actively DO or SAY something that required this specific AI skill?" If the answer is "no, it just happened near them," it is not evidence.

The Relevance Test

Before mapping evidence to a behavior, the classification agent must verify direct relevance. The question is: "Does this evidence specifically show THIS skill, or is it just vaguely related to AI?"

Comparison table of bad versus good skill mappings with three examples. First: bad mapping rejects accepting an invite to an AI roundtable mapped to Stay Informed; good mapping accepts writing a Slack post explaining how GPT-4 differs from Claude for JSON output mapped to Compare Frontier Models. Second: bad mapping rejects sending an email about AI partnership mapped to Identify AI Opportunities; good mapping accepts building a scheduled Claude task that automates daily meeting prep mapped to Automate with AI. Third: bad mapping rejects being in a meeting where RAG was discussed mapped to Implement RAG; good mapping accepts teaching transformer architecture to 200 students in CS230 mapped to Describe Transformers.

Second-Person Framing

All evidence descriptions are written in second person, as a coach addressing the individual: "You explained...", "You built...", "You teach...". This serves two purposes: it makes the evidence immediately meaningful when the user reads their profile, and it forces the classification agent to specify exactly what the person did (who, where, what, when) rather than writing abstract summaries.

4. Deterministic Scoring Architecture

4.1 Design Principle: Formula-Based, Not AI Judgment

A deliberate architectural decision separates the system into two domains: AI-powered classification and formula-based scoring. The classification step uses LLM capabilities where they excel — reading natural language, understanding context, and classifying evidence into structured categories. But the scoring itself is entirely deterministic: a fixed set of rules weighs the evidence by type and source and produces a score.

This separation provides three guarantees:

  1. Reproducibility. Given the same evidence, the scoring engine will always produce the same score. There is no stochastic element, no temperature setting, no prompt variation in scoring.
  2. Auditability. Every score can be explained by citing the specific rule that was applied and the evidence counts that triggered it. The system stores the rule number and explanation alongside every score.
  3. Predictability. Users and administrators can understand, in advance, what evidence is needed to reach a given score. This makes the system a developmental tool, not a black box.

4.2 Bloom's Taxonomy Alignment

Behavior scores use a four-level scale aligned to Bloom's Revised Taxonomy (Anderson & Krathwohl, 2001). Each level represents a qualitatively different depth of skill, not merely a quantitative increase:

Table defining a four-level scoring scale from 0 to 3 with Bloom's Taxonomy alignment and descriptions. Score 0, No Evidence, no Bloom's alignment: no evidence of the behavior has been observed, the individual may possess the skill but has not demonstrated it in observable contexts. Score 1, Knowledge, aligned to Remember and Understand: evidence suggests the individual understands the concept and can discuss, explain, or identify the skill area, but has not been observed applying it to produce outcomes. Score 2, Application, aligned to Apply and Analyze: the individual has been observed applying the skill in authentic work contexts, demonstrated evidence exists showing real usage, often corroborated by verbal or third-party evidence. Score 3, Mastery, aligned to Evaluate and Create: the individual demonstrates deep, repeated application of the skill across multiple contexts, evidence comes from multiple sources and often includes independent confirmation from colleagues.

Bloom's verb alignment.Each behavior in the framework is associated with a specific Bloom's verb (Explain, Compare, Design, Implement, Evaluate, etc.) that anchors the expected cognitive level.

4.3 How a Behavior Is Scored

The scoring engine evaluates each behavior independently. It looks only at the behavior's current (non-stale) evidence — how much there is, what types it spans, and how many independent sources it comes from — and applies a fixed set of rules to land on a 0–3 score. The same evidence always yields the same score; there is no model judgment at this stage.

The logic follows a consistent ladder:

  • Negative evidence comes first. Clear counter-evidence pulls the score down — it is the only thing that can.
  • Mastery (3) requires demonstrated work that is both repeated and independently corroborated — multiple artifacts across more than one source, ideally confirmed by someone else.
  • Application (2) requires real demonstrated work, strengthened by a second angle — either independent confirmation or a separate supporting source.
  • Knowledge (1) follows from a single piece of credible evidence — one artifact, a third-party mention, a substantive discussion, or (at most) circumstantial/inferred evidence.
  • No evidence (0) is the default when nothing relevant has been observed.

Two principles shape the ladder: doing outranks talking (demonstrated work counts for more than discussion, which counts for more than circumstantial signals), and convergence matters (the same skill seen in more places, and confirmed by others, scores higher than a single isolated signal). The exact thresholds are tuned over time and are intentionally not published here.

Worked Example

Suppose, after a few weeks of scanning, a behavior has accumulated three pieces of evidence from three different sources: a demonstrated artifact (a system the person built), a verbal contribution (explaining the approach in a meeting), and a third-party mention (a colleague crediting their work). That combination — real demonstrated work, plus independent confirmation, across multiple sources — lands the behavior at 2 (Application). Adding a second demonstrated artifact in a new context would push it toward 3 (Mastery).

4.4 Score Protection: High-Water Marks

Skills, once demonstrated, do not typically disappear. To reflect this, the system keeps a high-water mark for each behavior — the highest score it has reached — and holds the score there even when recent evidence is thinner. A score only drops in response to explicit negative evidence, so users are never penalized for a quiet week.

4.5 Evidence Staleness

Not all evidence ages at the same rate. Each behavior is tagged perishable (fast-moving skills) or durable (slower-changing skills): evidence ages out of a shorter window for perishable behaviors and a longer one for durable behaviors. Once it ages out it no longer counts toward the current score — though the high-water mark keeps the score from dropping for that reason alone. Standing facts such as job titles and ongoing roles never go stale.

5. Skill Framework

5.1 Capability → Skill Area → Behavior Hierarchy

The skill framework uses a three-level hierarchy that moves from broad organizational programs to specific, observable behaviors. Crucially, the framework is configured per organization — there is no single fixed catalog baked into the product. Each organization defines its own programs and behaviors in the admin dashboard, and the desktop app fetches each person's assigned framework from the server at runtime.

  1. Program (capability) is the top-level organizational requirement — for example "AI Native Work Habits" or "Leadership." A program carries a target score on a fixed scale and can be shared across all roles or assigned to specific role types. An organization may run several programs at once.
  2. Skill Area groups related behaviors into thematic clusters within a program (e.g., "Customer Centricity" or "Ownership and Accountability").
  3. Behavior is the atomic unit of measurement. Each behavior is a specific, observable action statement anchored by a Bloom's verb. Behaviors are scored individually on the 0–3 scale.

Behaviors are not limited to AI skills. Because the framework is org-defined, a program can measure general professional competencies — communication, management, customer focus, ownership — alongside (or instead of) AI-specific behaviors.

Example program showing one organization's live configuration with five skill areas. AI Native Work Habits with behavior codes B-AI, Agentic Engineering with behavior codes B-AGE, Customer Centricity with behavior codes B-CUS, One Team Habits with behavior codes B-ONE, and Ownership and Accountability with behavior codes B-OWN.
Example of one organization's configured framework. Skill areas and behaviors are defined per tenant in the admin dashboard; another organization's framework will look entirely different. Behavior IDs are auto-generated when behaviors are created.

5.2 Starter Frameworks & Signature Packs

To help organizations get started, the product ships signature starter packs — pre-authored blueprints (e.g., AI Fundamentals, Agentic Engineering, Staying Current, Leadership, Management, Communication) that an admin can adopt as-is or use as a starting point. A starter pack is just a seed: once adopted, its behaviors live in the organization's framework and can be edited freely. Each behavior — whether from a starter pack or authored from scratch — is designed to be:

  • Observable — demonstrable through workplace artifacts, conversations, or actions that can be captured by connected tools
  • Distinct — non-overlapping with other behaviors in the framework
  • Actionable — something the individual can improve through deliberate practice
  • Universal — relevant across roles, from individual contributors to executives

There is no fixed default catalog — your organization's skill areas and behaviors are whatever its admins configure.

5.3 Framework Customization

Customizing the framework is the normal way the system is used, not an advanced option — every organization configures its own. Administrators work through the admin dashboard to match their organization's skill requirements. Options include:

  • Adding behaviors: New behaviors can be added to any skill area. Each new behavior requires a description, skill-area assignment, and a Bloom's verb anchor (the ID is generated automatically).
  • Removing behaviors: Behaviors that are not relevant to the organization's context can be removed. This affects all users and recalculates scores.
  • Adjusting targets: The program target score can be adjusted to set organizational expectations.
  • Role-specific programs: Programs can be shared across all roles or assigned to specific role types, so different cohorts can be measured against different behavior sets.

When the framework changes, the desktop app detects the update (it checks for framework updates periodically) and notifies the user. New behaviors appear in their profile with a score of 0 and are immediately eligible for evidence collection.

5.4 Skill Listening

The AI landscape evolves rapidly. New tools, techniques, and competencies emerge frequently. Skill Listening is a system capability that monitors industry developments and suggests new behaviors for the framework.

Each skill suggestion includes:

  • Proposed behavior description with Bloom's verb alignment
  • Source — the industry signal that prompted the suggestion (e.g., an emerging standard or a widely-adopted new tool)
  • Rationale — why this behavior is becoming important (e.g., "a new capability is becoming a standard part of how enterprise teams work, and a large share of organizations are now evaluating adoption")

Suggestions enter a review queue in the admin dashboard with three possible statuses: Pending(awaiting admin review), Approved (added to the framework), or Dismissed (rejected by admin). This ensures that framework evolution is guided by industry trends but controlled by organizational judgment.

6. Privacy Architecture

6.1 Design Principle: Evidence Stays Local

The privacy architecture is governed by a single, inviolable principle: raw evidence never leaves the individual's device. The system is designed so that a user's messages, emails, meeting transcripts, code, and other workplace artifacts are processed entirely on their own machine. The organizational dashboard receives only numeric scores — never the underlying evidence.

This principle is not merely a policy choice; it is enforced by the system architecture:

  • The scan-and-classify step runs on the user's machine and writes evidence to local storage only.
  • The desktop app computes scores from the local evidence and stores the full evidence-bearing profile only locally.
  • The server sync function explicitly strips all evidence before transmission.
  • The server has no facility to receive or store evidence remotely — only scores.

6.2 Data Flow

Table showing data privacy architecture across seven data types, indicating what stays on device versus what is sent to server. Slack messages, email content, and meeting transcripts: stays on device processed in-memory, never sent to server. Evidence descriptions such as you explained: stored locally on device, never sent to server. Raw snippets and quotes: stored locally on device, never sent to server. Evidence type tags D, V, T, I, and N: stays on device, never sent to server. Behavior scores 0 through 3: stays on device and also sent to server synced to admin server. Overall percentage score: stays on device and sent to server. Capability score: stays on device and sent to server. Total signals count: stays on device and sent to server.

6.3 Score-Only Sync Model

When the desktop app syncs, it sends only scorable data — the user's behavior scores, plus roll-up figures (overall percentage and capability score). It explicitly excludes all evidence: no descriptions, no raw snippets, no message or document content. What leaves the device is a compact set of numbers, not the underlying artifacts.

This means that even if the admin server were compromised, an attacker would gain access only to numeric scores — not to the individual's messages, emails, or meeting transcripts.

6.4 No Central Evidence Store

Evidence is stored only on the user's own device. The server has no facility to receive or persist evidence — its intake path accepts scores and rejects anything else. There is no central repository of the messages, documents, or transcripts behind a score.

The practical consequence is that a manager viewing their team's scores in the admin dashboard sees: "Taylor: AI Fundamentals 2.3/3.0, Prompt Engineering 1.8/3.0" — but has no access to the specific messages, emails, or artifacts that generated those scores. Only the individual sees their own evidence through the desktop app.

7. Recommended Journey

The following timeline represents the recommended deployment sequence for organizations adopting the Skills Inferencing System. Each phase builds on the previous one, gradually adding signal types.

7.1 Day 0: Align

Activity: Leadership alignment and program onboarding.
Who: L&D leaders, managers, executive sponsors.
What happens: Admin configures the framework in the dashboard — adopting and tailoring starter packs or authoring behaviors from scratch — and makes any customizations needed for the organization's context. Invitations are sent to team members. Organizational goals (e.g., "80% of the engineering team at Application level in Agentic Engineering by Q3") are set as targets.

7.2 Week 1+: Observe

Activity: Automatic evidence scanning begins.
Who: System (automated), with the user's connected tools.
What happens: The desktop app is installed and begins scanning automatically on its recurring schedule. Over the first week, it accumulates evidence from connected sources. Scores begin appearing and evolving. The Activity tab in the desktop app shows scan results, new evidence discovered, and score changes. Users start seeing a living profile that reflects their actual work.

7.3 Ongoing: Coach

Activity: Personalized growth recommendations and practice.
Who: Individual participants.
What happens: The Coach tab surfaces the user's top growth areas (behaviors below mastery), sorted by impact. Each growth area includes a coaching insight explaining why the behavior matters and a "Practice in Claude" button that copies a roleplay prompt to the clipboard. Users can practice any skill by pasting the prompt into Claude, which acts as a roleplay partner (e.g., a CTO asking the user to compare frontier models, or a product designer asking the user to design an agentic workflow).

7.4 Week 2–4: Baseline Assessment

Activity: Workera assessment for baseline measurement.
Who: All participants (or targeted cohorts).
What happens: Participants complete the Workera assessment, producing a standardized, proctored measurement of their AI skills. This assessment signal combines with the observation signal to create a two-lens view. Gaps between assessment and observation are surfaced in the admin dashboard.

7.5 Planned: Manager Review (Not yet implemented)

Activity: Manager-rated signal collection.
Who: Managers, using the admin dashboard.
Status: This signal type is planned but not yet available in the current product.
When implemented: Managers will review their direct reports' skill profiles and provide ratings based on their contextual knowledge. The manager signal (20% weight) will add expert judgment to the composite. The admin dashboard will provide tools for managers to see team-level patterns and identify coaching opportunities.

7.6 Ongoing: Continuous Multi-Signal

Activity: All available signal types operating in concert.
What happens: As the program matures and additional signal types are implemented, composite scores incorporate more perspectives and become more reliable. Currently, the system supports the Observed signal. Assessed, Manager-Rated and Peer-Rated signals are planned for future releases. The goal is a reliable measurement of AI proficiency that can inform team composition, learning investment, and strategic planning.

8. Validity and Intended Use

8.1 Construct Validity

Construct validity — the degree to which the system measures what it claims to measure — is established through several design choices:

  • Bloom's alignment. Each behavior is anchored to a specific Bloom's verb, ensuring that the four-level scoring scale maps to a recognized cognitive taxonomy. A score of 1 (Knowledge) genuinely reflects Remember/Understand-level demonstration, while a score of 3 (Mastery) requires Evaluate/Create-level evidence.
  • ECD traceability. Every score can be traced through a complete Evidence-Centered Design chain: observable work product → evidence classification rule → scoring rule → behavior score. No inference is made without an auditable path.
  • Behavior specificity. Each behavior describes a specific, observable action (not a trait or attitude). This grounds measurement in behavioral indicators rather than abstract constructs.
  • Multi-method convergence. When multiple signal types (observed + assessed + rated) agree on a skill level, the convergent evidence strengthens the validity of the inference beyond what any single method could provide.

8.2 Intended Interpretations

What scores mean:

  • A score of 2 (Application) means: "We have strong, multi-source evidence that this person applies this skill in their actual work."
  • A score of 0 with low evidence coverage means: "We have not observed this skill, but our observation window may be limited."
  • An overall score of 67% means: "Across the framework's behaviors, this person's observed proficiency averages to 67% of the maximum possible."

What scores do NOT mean:

  • A score of 0 does NOT mean the person lacks the skill. It means the system has not observed evidence of it. Skills demonstrated in unmonitored contexts will not be captured.
  • Scores are NOT comparable across individuals with different integration coverage. A person with 6 connected sources will naturally have more evidence opportunities than a person with 2.
  • Scores are NOT employment decisions. They are developmental signals intended to inform growth, not to serve as sole inputs into promotion, compensation, or termination decisions.

8.3 Limitations

The system has known limitations that users and administrators should understand:

  1. Observation bias toward visible work. The system can only capture skills demonstrated through connected tools. A person who mentors others in hallway conversations, demonstrates AI skills in unrecorded meetings, or builds tools on personal machines will have these skills underrepresented. The system has an inherent bias toward digitally visible work.
  2. Source coverage gaps. Not all organizations use the same tools, and not every tool has a connector (e.g., Microsoft Teams is not yet integrated). A team that communicates primarily through un-integrated tools will have lower evidence coverage than a team using well-covered ones like Slack and Granola. Coverage also varies by individual based on which connectors they have authorized.
  3. LLM classification errors. The evidence classification step uses an LLM (Claude) to interpret natural language and assign evidence types. While the classification prompt is carefully structured with strict definitions, examples, and common pitfall warnings, LLMs can make errors. A conversation might be classified as [D] (Demonstrated) when it should be [V] (Verbal), or evidence might be mapped to the wrong behavior. The deterministic scoring engine mitigates this risk by requiring volume and convergence, but individual evidence items may be misclassified.
  4. Temporal sampling. Periodic scans capture a sample of activity, not a complete record. Short-lived messages, quickly-deleted content, or activity outside business hours may be missed.
  5. Cultural and linguistic factors. The system's ability to recognize evidence depends on the language and communication norms of the workplace. Teams that discuss AI informally or using non-standard terminology may have lower recognition rates.

8.4 Complementarity with Assessment

The most reliable skill inferences arise when observation and assessment signals are combined. The following matrix describes the four possible states and their interpretations:

Two-by-two matrix combining observed score and assessment score to produce four interpretations. High observed score with high assessment score: Decision-grade confidence — the person can do it per assessment and does do it per observation, strong signal for both capability and application. High observed score with low assessment score: Practical competence — the person applies the skill effectively in context but may struggle with formal assessment conditions, consider assessment anxiety, item format mismatch, or applied knowledge exceeding theoretical knowledge. Low observed score with high assessment score: Latent capability — the person has the knowledge per assessment but is not applying it in observable contexts, coaching opportunity as the skill exists but is not being deployed, may indicate tool coverage gaps. Low observed score with low assessment score: Development need — neither assessment nor observation indicates the skill, prioritize for learning and development.

9. Program Analytics and Governance

9.1 Admin Dashboard Capabilities

The admin dashboard provides organizational leaders with a view into team-level AI proficiency without exposing individual evidence. The dashboard is organized around four primary views:

  • Home: Organization-wide summary statistics — average AI proficiency score, number of active participants, scan coverage metrics, and trend charts showing progress over time.
  • Skill Requirements: Framework management — view and edit the behavior catalog, review skill listening suggestions, customize the framework for the organization.
  • People & Scores: Individual profile views showing behavior-level scores (but not evidence). Managers can view their direct reports and identify coaching priorities.
  • Signal Configuration: Program settings including scan frequency, integration requirements, and composite weight adjustments.

9.2 Adoption Metrics

The dashboard tracks several key adoption indicators:

Table defining four program health metrics with targets. Activation Rate: percentage of invited users who have completed onboarding and run at least one scan; target is 80% or more within 2 weeks of invitation. Scan Coverage: percentage of active users with at least one scan in the last 7 days; target is 90% or more for engaged users. Integration Depth: average number of connected tools per user; target is 3 or more sources for meaningful coverage. Evidence Density: average number of evidence items per behavior per user; target is enough corroborating items per behavior.

9.3 Behavior Gap Analysis

The admin dashboard surfaces organizational-level behavior gaps: behaviors where the team's average score is below the expected level. This analysis enables L&D leaders to prioritize training investments where they will have the greatest impact. For example, if the team averages 0.8/3.0 on "Build AI business cases" but 2.4/3.0 on "Use AI daily," this suggests that the team is technically competent but needs strategic skill development.

Gap analysis can be filtered by department, team, role, and geography, enabling targeted interventions rather than one-size-fits-all training.

9.4 Access Control Model

The system uses a role-based access control model with two tiers:

Table showing data visibility permissions for two roles. Admin can see all user scores, team analytics, framework management, skill suggestions, and invitation management; cannot see evidence descriptions, raw snippets, or individual messages or artifacts. Individual can see own scores, evidence, coaching recommendations, and activity feed; cannot see other users' scores or evidence or admin analytics.

Authentication uses API keys generated at registration. Each user receives a unique key stored in the desktop app's secure configuration. Admin status is set per-user in the database and grants access to the admin dashboard views.