Buddhist Geometry: Notes from a Weekend Study

nlp buddhist-studies embeddings research cortex provenance

This started with a computer freeze.

I was running a long embedding job against the Pāli Nikāyas, a personal project to build something useful for my own study of early Buddhism, when LM Studio hung halfway through a model swap. While the machine recovered, I stopped, opened an empty doc, and wrote out exactly what I thought I was measuring. That forced-pause doc is the reason there are two finished papers on this page instead of one tangled notebook that dies in a drawer.

What I thought was one experiment turned out to be two. One measured whether a larger multilingual encoder (Qwen3-Embedding-8B) places Pāli source passages near their English translations more tightly than a smaller baseline (BGE-M3). The answer is an emphatic yes — the mean per-term cross-lingual cosine gap collapses from 0.274 to 0.038 under a domain-adapted instruction prefix. The other measured whether passages bound to canonical parallel-rhetoric frames — the 32-parts body enumeration at MN10, the four-elements template at MN28 and MN62 — cluster more tightly in pooled embedding space than passages in free narrative contexts, regardless of what they are actually about. The answer is also yes, with bootstrap 95% confidence intervals excluding zero across all three encoder configurations.
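To make the first measurement concrete, here is a minimal TypeScript sketch of a per-term cross-lingual gap. It is an illustration under assumptions, not the papers' code: the `TermPair` shape and the function names are hypothetical, and defining the gap as 1 minus the cosine between a Pāli passage embedding and its English translation embedding is only one plausible reading of the description above. The papers may define it differently.

```typescript
type Vec = number[];

// Hypothetical record shape: one Pāli passage and its English translation,
// both already reduced to pooled embedding vectors by some encoder.
interface TermPair {
  term: string;
  pali: Vec;
  english: Vec;
}

// Plain cosine similarity between two vectors of equal length.
function cosine(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector");
  return dot / Math.sqrt(normA * normB);
}

// ASSUMPTION: gap = 1 - cosine(pali, english), averaged within each term
// first and then across terms, so terms with many passages do not dominate.
function meanPerTermGap(pairs: TermPair[]): number {
  const byTerm = new Map<string, number[]>();
  for (const p of pairs) {
    const gaps = byTerm.get(p.term) ?? [];
    gaps.push(1 - cosine(p.pali, p.english));
    byTerm.set(p.term, gaps);
  }
  if (byTerm.size === 0) throw new Error("no pairs");
  let total = 0;
  byTerm.forEach((gaps) => {
    total += gaps.reduce((s, g) => s + g, 0) / gaps.length;
  });
  return total / byTerm.size;
}
```

Under this reading, a smaller value means the encoder places source and translation closer together. The number says nothing about why they are close.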

Those are the two papers on this page. They share a corpus (34 Pāli terms, 101 non-null passages across three translator lineages), a pipeline (three stages in TypeScript, seeded bootstraps, seeded permutations), and a reproducibility appendix. They defend distinct empirical claims. Read either independently; the companion links inside each paper point to the other.
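For readers unfamiliar with seeded resampling, the sketch below shows the general technique: a percentile bootstrap interval for a difference in means, and a one-sided permutation p-value, both driven by a deterministic generator so that the same seed reproduces the same numbers. Everything in it is an assumption for illustration. The Park-Miller generator stands in for whatever PRNG the pipeline actually uses, and the function names, iteration counts, and one-sided test are not taken from the papers.

```typescript
// Park-Miller "minimal standard" generator: small, seedable, deterministic.
// ASSUMPTION: a stand-in for the papers' PRNG, which this page does not name.
function seededRandom(seed: number): () => number {
  let state = (Math.abs(Math.floor(seed)) % 2147483646) + 1;
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // uniform in [0, 1)
  };
}

function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("empty sample");
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Draw xs.length values from xs with replacement.
function resample(xs: number[], rand: () => number): number[] {
  return xs.map(() => xs[Math.floor(rand() * xs.length)]!);
}

// Percentile bootstrap 95% interval for mean(a) - mean(b).
function bootstrapDiffCI(a: number[], b: number[], seed: number, iters = 2000): [number, number] {
  const rand = seededRandom(seed);
  const diffs: number[] = [];
  for (let i = 0; i < iters; i++) {
    diffs.push(mean(resample(a, rand)) - mean(resample(b, rand)));
  }
  diffs.sort((x, y) => x - y);
  const lo = diffs[Math.floor(0.025 * (iters - 1))]!;
  const hi = diffs[Math.ceil(0.975 * (iters - 1))]!;
  return [lo, hi];
}

// One-sided permutation p-value for mean(a) > mean(b): shuffle the pooled
// values, re-split them into groups of the original sizes, and count how
// often the shuffled difference is at least as large as the observed one.
function permutationP(a: number[], b: number[], seed: number, iters = 2000): number {
  const rand = seededRandom(seed);
  const observed = mean(a) - mean(b);
  const pooled = a.concat(b);
  let atLeast = 0;
  for (let i = 0; i < iters; i++) {
    // Fisher-Yates shuffle in place.
    for (let j = pooled.length - 1; j > 0; j--) {
      const k = Math.floor(rand() * (j + 1));
      const tmp = pooled[j]!;
      pooled[j] = pooled[k]!;
      pooled[k] = tmp;
    }
    const diff = mean(pooled.slice(0, a.length)) - mean(pooled.slice(a.length));
    if (diff >= observed) atLeast++;
  }
  // Add-one correction so the estimate is never exactly zero.
  return (atLeast + 1) / (iters + 1);
}
```

Calling `bootstrapDiffCI(formulaic, free, seed)` twice with the same seed returns the same interval. "Excluding zero" then means both ends of that interval sit on the same side of zero.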


A word on the honest limits, because the papers themselves are careful about them and I want the framing to match.

Neither result resolves the interesting question underneath. Paper A measures cross-lingual geometry. It does not measure whether the model understands Pāli doctrinal vocabulary. Pooled-embedding cosine cannot distinguish between a model that has learned that anattā and non-self denote the same concept, and a model that has compressed everything toward an English-centroid manifold so that Pāli tokens end up near their most-frequent English co-occurrents. Both accounts predict the same measurement. That is a general limit of pooled-embedding geometric analysis, not a bug in my pipeline, and I say so directly in Paper A’s §5.1.

Paper B measures an inflation of cosine similarity on formulaic passages that holds across embedders. It does not cleanly separate the parallel-syntax effect from a lexical-overlap confound that is intrinsic to the Pāli Canon: the MN10 32-parts enumeration recurs near-verbatim in the MN28 earth-element definition, which inflates the kesā × pathavī cosine to 0.83 on BGE-M3 and 0.89 on Qwen3. The direction of the effect survives; the magnitude almost certainly includes a contribution from shared tokens that I cannot quantify with this data. The paper proposes a disambiguation experiment in §9 and names the confound in the abstract.
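To illustrate what "lexical overlap" means at the surface level, the sketch below computes token-set Jaccard overlap between two passages. It is a crude proxy offered as an assumption. It is not the disambiguation experiment proposed in §9, which this page does not describe. The whitespace tokenizer and the function names are hypothetical.

```typescript
// Lowercase, split on whitespace, strip common ASCII punctuation, drop empties.
// ASSUMPTION: both passages use the same Unicode normalization for diacritics.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((t) => t.replace(/[.,;:!?"'()\[\]]/g, ""))
    .filter((t) => t.length > 0);
}

// Jaccard overlap of the two token sets: |A ∩ B| / |A ∪ B|, in [0, 1].
function jaccard(a: string, b: string): number {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  let shared = 0;
  setA.forEach((t) => {
    if (setB.has(t)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Two passages that reuse the same enumeration score high here whatever their syntax, which is why a high embedding cosine between them cannot be attributed to parallel structure alone.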

Both papers publish the negative-result versions of their secondary findings. In Paper B, the Qwen3-no-prefix permutation p rose from 0.021 at n=89 to 0.066 at n=101, the only change that crossed the α=0.05 threshold the wrong way, from significant to not significant. That number is reported in the abstract and defended symmetrically in §5.5, because a paper that reports only the convenient changes is not reporting.


On why the output is here and not queued for a venue.

Peer review for a computational-linguistics paper can run six to eighteen months. The two findings are real now. The reproducibility artifacts are real now. The PRNG seed, the hardware, the wall-clock budget, the pre-flight memory, the model-load times, the CSV rows: all of it is on disk and documented in the reproducibility appendix of each paper.

The Cortex convention I am using — every non-trivial claim carries an inline [prov: ...] tag pointing to a specific CSV row, paper citation, or line of the interim research log — is designed exactly for this case. Rigor via provenance, not via gatekeeping. A reader who doubts my 33/25/32 significance counts can open the three significance-summary.md files and check. A reader who doubts the 0.274 → 0.038 mean-gap collapse can open the three per-term-stats.csv files and re-derive the mean with a one-line sum. The provenance tags make that verification fast rather than forensic.
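As a rough sketch of how such a check could be mechanized, the TypeScript below pulls the targets out of inline [prov: ...] tags and recomputes the mean of one CSV column. Both helpers are hypothetical. The column name used in the usage note is invented, because the real header of per-term-stats.csv is not shown on this page, and the CSV parser assumes plain comma-separated values with no quoted fields.

```typescript
// Collect the target of every inline [prov: ...] tag in a piece of text.
function extractProvTags(text: string): string[] {
  const tags: string[] = [];
  const re = /\[prov:\s*([^\]]+?)\s*\]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    tags.push(m[1]!);
  }
  return tags;
}

// Mean of one numeric column in a simple CSV string.
// ASSUMPTION: first line is the header; no quoted or escaped commas.
function columnMean(csv: string, column: string): number {
  const lines = csv.trim().split(/\r?\n/);
  const header = (lines[0] ?? "").split(",").map((h) => h.trim());
  const idx = header.indexOf(column);
  if (idx < 0) throw new Error("column not found: " + column);
  const values = lines
    .slice(1)
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const cell = line.split(",")[idx];
      return cell === undefined || cell.trim() === "" ? NaN : Number(cell);
    });
  if (values.length === 0 || values.some((v) => isNaN(v))) {
    throw new Error("missing or non-numeric values in column: " + column);
  }
  return values.reduce((s, v) => s + v, 0) / values.length;
}
```

With a file's contents loaded as a string, `columnMean(contents, "gap")` would re-derive a mean, where `"gap"` is a placeholder for whatever the real column is called.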

This is not a substitute for peer review. It is a complement to it. The papers are submission-shaped — ACL citation style, structured abstract, named limitations section, reproducibility appendix, named future work. If they find a venue, great. If they do not, the work is still on the record, still checkable, and still useful to the next person asking whether off-the-shelf multilingual embedders are doing what they look like they are doing on a low-resource classical-religious corpus.


Two practical notes for the reader.

One. The provenance tags are dense. Every [prov: results/*/per-term-stats.csv] is a real file in a real directory, but I am not hosting the result CSVs alongside these papers; they sit in the private project repo, so checking a tag means regenerating the file it points to. The pipeline is specified down to the environment variables: clone the corpus, run the three-stage invocation in §8 of Paper A or §10 of Paper B, and you should recover bit-identical numerics under the same PRNG seed.

Two. The Buddhist corpus is a case study for Paper B, not the main interest. The computational-linguistics claim is about encoder behavior on formulaic text — legal briefs, liturgical canons, scientific abstracts, parliamentary procedure all have the same structural property. Paper B’s §6 lays out the generalization hypothesis and flags it as a hypothesis, not a proof. Cross-domain replication is the primary follow-up.

For Paper A, the Buddhist-studies angle is the point. Pāli is a low-resource classical language, and how well retrieval works over its canon has real downstream consequences for digital-humanities researchers. The paper’s practical recommendation, that Qwen3-Embedding-8B with the custom instruction prefix is the current strong baseline for Pāli↔English retrieval on doctrinal corpora, should be directly useful to anyone building search over SuttaCentral or Access to Insight.


The papers:

Both were written after the editorial polish pass on 2026-04-22 and are, as of this page’s date, the versions I consider submission-ready.