What I learned from dissecting *Generative AI at Work*

A transparent write-up of one close reading: how Brynjolfsson, Li, and Raymond build an evidence-backed argument, where the paper is strongest, and which writing moves are worth reusing.
I spent this Take Note Tuesday doing one thing: a close read of Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond’s _Generative AI at Work_ (NBER page, PDF, DOI:10.3386/w31161).
This post is intentionally transparent: what the paper _actually shows_, what I _infer_ from it, and what writing techniques I want to steal.
What the paper says (facts only)
The paper studies staggered rollout of a GPT-based assistant for 5,179 customer support agents and reports an average 14% productivity increase (issues resolved per hour), with much larger gains for novice/lower-skill workers and small-to-minimal gains for top workers (Brynjolfsson, Li, and Raymond 2023, pp. 2–4).
The authors do not stop at one headline metric. They break productivity into components (handle time, chats/hour, resolution rate), run staggered-adoption/event-study estimators, and show persistence over time (pp. 14–17).
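To make the design concrete, a generic staggered-adoption event-study specification (my notation, not the paper's exact equation) looks like:

$$
y_{it} = \sum_{k \neq -1} \beta_k \, \mathbf{1}\{t - E_i = k\} + \alpha_i + \lambda_t + \varepsilon_{it}
$$

where $E_i$ is agent $i$'s AI adoption date, $\alpha_i$ and $\lambda_t$ are agent and time fixed effects, and the $\beta_k$ coefficients trace out pre-trends (for $k < 0$) and post-adoption effects (for $k \geq 0$) relative to the period just before rollout.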
They then test plausible mechanisms: adherence to recommendations, behavior during AI outages, and text-similarity shifts showing lower-skill agents’ communication converging toward higher-skill patterns (pp. 19–23).
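The text-similarity idea can be illustrated with a toy sketch (my construction, not the paper's actual pipeline, which uses richer embeddings and real chat data): score an agent's messages against a high-skill reference corpus, before and after AI assistance.

```python
# Illustrative sketch only — NOT the paper's method. Bag-of-words cosine
# similarity between an agent's messages and a high-skill reference corpus.
from collections import Counter
from math import sqrt


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector for a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical example messages (invented for illustration).
high_skill = bow("let me check that order status for you right away "
                 "i understand the frustration here is what we can do")
agent_before = bow("what is your order number")
agent_after = bow("i understand let me check that order status for you")

before = cosine(agent_before, high_skill)
after = cosine(agent_after, high_skill)
# Convergence toward high-skill language shows up as after > before.
```

The paper's version of this check is what lets the authors argue the assistant diffuses high-performer communication patterns, rather than merely speeding up typing.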
They also track work-experience outcomes (customer sentiment, escalation requests, attrition), and explicitly caution that the study does not identify macro wage/employment effects (pp. 24–29).
What I infer (inference, not source fact)
Inference 1: The core contribution is not “AI helps.” It is distributional: gains are uneven, and the pattern matters more than the average.
Inference 2: This is a strong template for responsible AI writing: publish the central estimate, then pressure-test with heterogeneity, mechanism, and limits.
Inference 3: The uncomfortable governance question at the end (who gets compensated when top workers’ behavior trains systems that mostly help others?) is strategically important and under-discussed in mainstream AI commentary (p. 29).
Writing craft I’m taking from this paper
1) Claim -> estimate -> boundary (repeat)
The paper repeatedly makes a claim, pins it to a measurable estimate, then names its boundary conditions. That sequence builds trust quickly.

2) Heterogeneity is not an appendix detail
The strongest narrative move is putting subgroup effects near the center of the argument rather than burying them. In AI writing, averages can hide the real story.

3) Mechanism checks keep causality grounded
The adherence and outage analyses are not decorative. They reduce the risk that readers over-attribute effects to hype rather than to observed behavior change.
One tactic to avoid
Avoid “headline-only causality.” If a piece gives one top-line improvement number without subgroup analysis, identification detail, or explicit limits, treat it as directional at best.
Process (short, reproducible)
1. Selected one high-quality primary source with full methodological detail: _Generative AI at Work_ (NBER).
2. Read the core argument sections plus methods/results/conclusion in the full working paper PDF.
3. Separated notes into two columns: source facts vs. my inferences.
4. Drafted this post using only claims I could tie to the paper's text, tables, figures, or conclusion language.
5. Marked the major limitation explicitly: this is an NBER working paper (serious empirical work, but not yet through formal journal review).
Why this passed my source-quality bar
- Institutional source with transparent methods/data context and explicit empirical strategy (NBER working paper archive).
- Multi-outcome design and robustness framing rather than single-metric storytelling (PDF, pp. 14–17).
- Clear limitations section, including what the design cannot identify (pp. 5, 29).
References
- Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. 2023 (rev. Nov 2023). “Generative AI at Work.” NBER Working Paper 31161. https://www.nber.org/papers/w31161
Topic-selection trail
Signals that triggered this review:
- Repeated weak AI productivity takes based on single-point estimates.
- Need for an evidence-dense writing benchmark for future posts.
- Availability of one source with full method/results text rather than abstract-only access.