Cross-Lingual Geometric Convergence in Multilingual Embedders: A Pāli/English Buddhist Doctrinal Benchmark

Companion paper and provenance convention

This paper is one of two companion findings from the same study. The companion piece — Syntactic Parallelism as Semantic Feature: Cross-Embedder Inflation in List-Formula Text — reports the template-vs-narrative inflation effect on the same corpus, and the study notes give the full backstory. Read either independently; they share a corpus, a pipeline, and a reproducibility appendix, but defend distinct empirical claims.

Inline [prov: ...] brackets are the paper’s provenance-tag convention, preserved verbatim from the Cortex-style source. Each non-trivial claim is traced to a specific CSV row, a specific paper citation, or a specific line of the interim research log. The tags read a little dense on the page; that density is the point. Nothing is asserted without a trace back to its source.


Abstract

We measure how well two widely used open-weight multilingual embedders position Pāli source passages relative to their English translations on a canonical Buddhist doctrinal benchmark of 34 terms and 101 passages drawn from three translator lineages (the SuttaCentral Mahāsaṅgīti Pāli edition, Bhikkhu Sujato’s Pali–English Translation corpus, and Ṭhānissaro Bhikkhu’s Access-to-Insight corpus; 33 terms have complete three-translator coverage, and citta, which lacks a Ṭhānissaro passage, is excluded). For each term we compute the per-term gap between within-English translator agreement and Pāli↔English agreement in cosine-similarity space, bootstrap 95% confidence intervals with a seeded xorshift128 PRNG over 10,000 iterations (approximately 4,300 effective iterations per term after skip-filtering; see §4.3), and compare three embedder configurations: BGE-M3 (560M parameters, 1024-dim), Qwen3-Embedding-8B (8B parameters, 4096-dim) with a custom domain-adapted instruction prefix following the Qwen3 instruction-prefix schema (literal string: "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: "), and the same Qwen3 model without any prefix. The mean Pāli↔English cosine gap collapses from 0.274 under BGE-M3 to 0.038 under Qwen3-with-prefix and 0.126 under Qwen3-no-prefix, a ~86% and ~54% reduction respectively [prov: computed from results/*-n101/per-term-stats.csv, n=33 terms with complete translator coverage; citta excluded due to missing Thanissaro passage]. Per-term bootstrap significance counts are 33/33 for BGE-M3, 25/33 for Qwen3-with-prefix, and 32/33 for Qwen3-no-prefix [prov: results/*-n101/significance-summary.md TL;DR]. We read this as geometric convergence under distributional co-occurrence pressure (the larger model learns to place Pāli near its English neighbors), and we explicitly decline to read it as evidence of cross-lingual semantic equivalence. Pooled-embedding cosine cannot distinguish representation of Pāli concepts from compression toward English translator centroids; both accounts predict the same geometry. The instruction prefix adds a roughly additive cosine-gap offset at the embedder level (mean within-subcluster shift ≈0.08 ± 0.04 across the four doctrinal subclusters) with non-trivial per-term heterogeneity (per-term prefix effect range −0.039 to +0.302 cosine units), which has direct implications for retrieval-system design on low-resource classical corpora.

1. Introduction

Buddhist canonical literature is a textbook case of the problems multilingual dense retrieval is supposed to solve. The primary corpus of interest, the Pāli Nikāyas, exists in a single well-edited source edition (Mahāsaṅgīti) plus a small number of divergent English translator lineages, with the Chinese Āgamas, Tibetan bKa’ ’gyur, and Sanskrit parallels sitting adjacent in the same canonical universe [prov: Bingenheimer 2020, Digitization of Buddhism, Oxford Bibliographies, overview of multi-tradition Buddhist digital corpora]. A researcher searching for passages on anattā or paṭiccasamuppāda wants a retrieval system that treats Pāli anattā and English non-self as evidence of the same question, not two separate questions.

Whether off-the-shelf multilingual dense embedders deliver this depends on whether they place Pāli source text geometrically near its English translations. Prior work on multilingual embedding models has focused overwhelmingly on high-resource Indo-European and CJK languages, where parallel training data is abundant [prov: Feng et al. 2022, Language-agnostic BERT Sentence Embedding (LaBSE), ACL, 109 languages reported]. Pāli, despite a two-millennia documentary tradition, is a low-resource language in the sense that matters for embedders: its surface-form co-occurrence in ordinary web crawl data is dominated by liturgical and devotional contexts rather than the doctrinal-technical contexts that canonical retrieval requires.

This paper asks one measurable question: how does the per-term cross-lingual geometric gap behave across embedder configurations on a canonical Buddhist doctrinal corpus? We take no position on what the model “knows” about Buddhism. We take no position on whether anattā and non-self are the same concept. We measure pooled-embedding cosine similarity. The point of the paper is that this is all we measure, and that what the measurement shows — dramatic gap closure from BGE-M3 to Qwen3-Embedding-8B — is both (a) real and robust under bootstrap and (b) under-determined with respect to the interesting semantic questions.

We adapt representational similarity analysis (RSA), introduced by Kriegeskorte, Mur, and Bandettini (Kriegeskorte et al. 2008), from systems neuroscience to cross-lingual embedding evaluation. RSA compares dissimilarity structure across representations rather than attempting to align representational spaces directly. Cross-lingual RSA applications exist in acoustic word embedding (Abdullah et al. 2021, arXiv:2109.10179) but, to our knowledge, this paper is the first application of RSA-style per-term geometry analysis to Buddhist canonical text across script boundaries — specifically, per-term cluster tightness and cross-term separation computed on parallel Pāli/English translator passages. The cross-lingual adaptation is itself the methodological contribution: a small, auditable, reproducible benchmark aimed squarely at the question a digital-humanities researcher actually wants to answer: will this embedder treat my Pāli passage and its English translation as the same neighbourhood?

The remainder of this paper is organized as follows. §2 situates the work against prior multilingual retrieval and Buddhist digital humanities literature. §3 describes the 34-term corpus, three translator lineages, and three embedder configurations. §4 reports per-term Pāli↔English gaps with 95% CIs, mean-gap collapse quantification, significance counts, and cross-scale (n=24/89/101) replication. §5 discusses the representation-vs-compression underdetermination, the instruction-prefix effect as uniform cosine translation, and system-design implications. §6 states limitations explicitly. §7 concludes. §8 gives the reproducibility appendix; §9 is the researcher-handoff note.

2.1 Multilingual Dense Retrieval

Modern multilingual text embedders descend from a lineage that begins with cross-lingual word embedding alignment (pre-BERT) and passes through sentence-BERT-style dual-encoder training. LaBSE [prov: Feng et al. 2022, ACL long paper, combines MLM + TLM + dual-encoder translation-ranking, 109 languages, 83.7% Tatoeba retrieval across 112 languages] established that dense bitext mining can produce genuinely language-agnostic embedding spaces for the languages in the training distribution.

Dense Passage Retrieval [prov: Karpukhin et al. 2020, Dense Passage Retrieval for Open-Domain Question Answering, EMNLP, dual-encoder BERT, +9–19% top-20 over BM25] demonstrated that supervised dual-encoder training on QA data outperforms sparse retrieval for open-domain QA. mContriever [prov: Izacard et al. 2022, Unsupervised Dense Information Retrieval with Contrastive Learning, TMLR, arXiv:2112.09118, mBERT-initialised contrastive pretraining on 29 languages, cross-lingual retrieval across scripts demonstrated] then showed that contrastive pretraining without labelled QA data can still yield strong cross-lingual retrieval, including cross-script retrieval (e.g. English queries against Arabic documents).

BGE-M3 [prov: Chen et al. 2024, BGE M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, Findings of ACL 2024, arXiv:2402.03216, 560M parameters, 1024-dim, 100+ languages, 8192-token context] is the current open-weight workhorse for multilingual retrieval. Its self-knowledge-distillation approach unifies dense, multi-vector, and sparse retrieval through a single training signal. Qwen3-Embedding-8B [prov: Zhang et al. 2025, Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, Technical Report, arXiv:2506.05176, 8B parameter dense embedding built on Qwen3 foundation, instruction-prefix-conditioned, [EOS]-token pooled 4096-dim output] is considerably larger and trained from the Qwen3 foundation-model backbone. The Qwen3 embedder supports user-defined instruction prefixes in the schema Instruct: {task description}\nQuery: {input} concatenated before the [EOS] pooling token [prov: Zhang et al. 2025, arXiv:2506.05176 §3.1, instruction-prefix protocol; the technical report’s illustrative example uses “Given a web search query, retrieve relevant passages that answer the query” as the task description, but the schema is user-customisable].

Pāli is inside the training language set for BGE-M3 and for Qwen3-Embedding-8B’s multilingual training mixture in the nominal sense that the model accepts Pāli input without tokenization failure, but the effective training volume for Pāli technical doctrinal contexts is not documented in either technical report [prov: Chen et al. 2024 §3.1 training data description does not enumerate Pāli corpus volume; Zhang et al. 2025 §3.2 likewise silent on per-language low-resource coverage]. This motivates the measurement we do here.

2.2 Representational Similarity Analysis and Cross-Lingual Geometry

RSA was introduced in neuroscience to compare dissimilarity structure across representational modalities without requiring direct alignment of the spaces themselves [prov: Kriegeskorte et al. 2008, Frontiers in Systems Neuroscience, RSA framework]. The core move — work at the level of the pairwise similarity/dissimilarity matrix rather than the vectors — has been adopted in multilingual NLP primarily through acoustic embedding work [prov: Abdullah et al. 2021, arXiv:2109.10179, RSA showing that typological similarity predicts cross-lingual RDM similarity for acoustic word embeddings]. We work directly in the cosine-similarity space of pooled text embeddings rather than computing full RDMs, but the underlying move — treating pairwise-geometry structure as the object of analysis — is the same.

2.3 Buddhist Digital Humanities

The digital Buddhist corpus ecosystem is mature for infrastructure but young for representation learning. SuttaCentral [prov: suttacentral.net, founded 2005 by Sujato, Bucknell, and Kelly; Mahāsaṅgīti Pāli edition with ~26,500 cross-references; Sujato English translations of the four Nikāyas completed 2015–2018] has provided the canonical machine-readable Pāli source text and the most systematic modern English translation lineage. Access to Insight [prov: accesstoinsight.org, Ṭhānissaro Bhikkhu translations, founded 1993 by John Bullitt, dormant since 2013 but text corpus permanently available] is the longest-running Pāli-in-English resource and the dominant source of Ṭhānissaro translations. The Pāli Text Society produced the foundational PTS edition as a physical-print lineage, digitized in partial form between 1989 and 1996 [prov: Bingenheimer 2020, Oxford Bibliographies, history of PTS digitisation].

Computational work on Buddhist textual corpora has concentrated on Sanskrit rather than Pāli. Lugli et al. [prov: Lugli et al. 2022, Embeddings models for Buddhist Sanskrit, LREC 2022, compared word2vec / fastText / BERT / GPT-2 on Buddhist Sanskrit, fastText best for similarity, BERT best for analogy] release a Buddhist Sanskrit corpus with word-similarity and word-analogy evaluation benchmarks. The Dharmamitra project [prov: dharmamitra.github.io, multilingual approximate-search tools for Sanskrit / Pāli / Classical Chinese / Tibetan Buddhist texts; MITRA Search presented at Buddhist Studies and Digital Humanities, Tokyo 2024] is producing open-source NLP infrastructure across the main classical Buddhist languages, including search and translation tooling. We are not aware of prior work specifically measuring cross-lingual retrieval geometry on Pāli doctrinal terms at the scale and resolution we present here.

3. Dataset & Methodology

3.1 Corpus Construction

The benchmark comprises 34 Pāli terms with 101 non-null passages, partitioned into three categories:

  • 20 doctrinal (non-control) terms, subdivided into four 5-term subclusters — metaphysics, meditation, ethics, and wisdom — selected to span the standard doctrinal coverage of the Nikāyas: anattā, saṅkhārā, anicca, paṭiccasamuppāda, khandhā (metaphysics); sati, samādhi, jhāna, satipaṭṭhāna, vipassanā (meditation); dukkha, nibbāna, kamma, ariya, taṇhā (ethics); paññā, suññatā, viññāṇa, citta, āsavā (wisdom) [prov: data/passages.json meta.categories; subcluster assignments documented in INTERIM-REPORT-n101-2026-04-22.md §Step 4].
  • 6 template-control terms (kāya, vedanā, kesā, aṭṭhi, pathavī, āpo): terms that appear in stereotyped enumerative formulas (the MN10 31/32-parts body contemplation and its MN28 / MN62 dhātu-vibhaṅga rephrasings) [prov: data/passages.json category=control_template; sutta_ref fields]. These were selected as a control for the parallel-rhetoric confound: if embedders cluster tightly on formulaic enumeration independently of doctrinal meaning, the template controls will reveal it.
  • 8 narrative-control terms (rukkha, udaka, aggi, gāma, rājā, mātā, pitā, akkhi): concrete terms appearing in narrative contexts across the canon [prov: data/passages.json category=control_narrative]. These control for concrete-vs-abstract geometry.

For each term, up to three passages are provided: the Pāli source (Mahāsaṅgīti edition, from SuttaCentral), the Sujato English translation (from SuttaCentral), and the Ṭhānissaro English translation (from Access to Insight). One term (citta, AN1.51) lacks a Ṭhānissaro parallel and therefore contributes only 2 passages rather than 3 [prov: data/passages.json terms.citta.passages, length 2; per-term-stats.csv citta row: translator_count=2].

Passage selection followed a canonical-reference protocol: for each term, we selected the sutta reference most commonly cited as the term’s locus classicus in contemporary doctrinal scholarship (e.g. SN22.59 for anattā — the Anattalakkhaṇa Sutta; MN10 for sati and satipaṭṭhāna — the Satipaṭṭhāna Sutta). The same sutta reference was then used across all three translator lineages. Sutta references for every term are recorded in data/passages.json and surfaced in Appendix B.

3.2 Embedder Configurations

Three embedder configurations were evaluated on the identical 101-passage input:

  1. BGE-M3 (baseline). 560M parameters, 1024-dim output, default quantization, run through LM Studio’s OpenAI-compatible embeddings endpoint at http://localhost:1234/v1/embeddings [prov: src/g1-term-integrity.ts environment configuration; INTERIM-REPORT-n101 §Step 3 “Model: text-embedding-bge-m3”].
  2. Qwen3-Embedding-8B with instruction prefix (“Qwen3-prefix”). 8B parameters, 4096-dim output, Q4_K_M quantization, run through the same LM Studio endpoint. Each input is prefixed with a custom domain-adapted instruction following the Qwen3 Instruct: {task}\nQuery: {input} schema [prov: Zhang et al. 2025 §3.1 instruction-prefix protocol]. The literal prefix string used is "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: " [prov: src/g1-term-integrity.ts:83 constant QWEN3_INSTRUCTION]. This is a Buddhist-doctrinal-retrieval-specific task description, not the canonical web-search task description illustrated in the Qwen3 technical report. Our prefix-effect measurements are therefore conditional on this specific task description; the comparison to alternate task descriptions (e.g. the canonical web-search form) is future work.
  3. Qwen3-Embedding-8B without instruction prefix (“Qwen3-no-prefix”). Identical to (2) but with the instruction prefix omitted. This configuration tests whether the prefix is load-bearing for the cross-lingual geometry effect.

Quantization was held constant at Q4_K_M across both Qwen3 configurations. No full-precision cross-check was performed; this is named as a limitation in §6.
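The difference between configurations (2) and (3) is only what string reaches the endpoint. The following is a minimal sketch of that input preparation, assuming the `\n` in the quoted prefix is a newline character and that the endpoint accepts the standard OpenAI-style `{ model, input }` request body. The helper names `prepareInput`, `buildEmbeddingRequest`, and `embed` are hypothetical and are not taken from src/g1-term-integrity.ts.

```typescript
// Literal prefix string quoted in §3.2 (assumed here to contain a real newline).
const QWEN3_INSTRUCTION =
  "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: ";

type EmbedderConfig = "bge-m3" | "qwen3-prefix" | "qwen3-no-prefix";

// Hypothetical helper: only the Qwen3-prefix configuration prepends the instruction.
function prepareInput(config: EmbedderConfig, passage: string): string {
  return config === "qwen3-prefix" ? QWEN3_INSTRUCTION + passage : passage;
}

// Hypothetical request builder for an OpenAI-compatible embeddings endpoint.
// The model identifier is whatever the local server exposes; it is supplied by the caller.
function buildEmbeddingRequest(
  model: string,
  config: EmbedderConfig,
  passages: string[],
): { model: string; input: string[] } {
  return { model, input: passages.map((p) => prepareInput(config, p)) };
}

// Sketch of the call itself; returns one vector per passage, in input order.
async function embed(
  endpoint: string,
  model: string,
  config: EmbedderConfig,
  passages: string[],
): Promise<number[][]> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbeddingRequest(model, config, passages)),
  });
  if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data.map((d) => d.embedding);
}
```

Keeping the prefix decision in one function means the same 101 passages can be passed through all three configurations, with the prefix as the only varying factor between (2) and (3).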

3.3 Pipeline

The measurement pipeline has three stages:

  • g1 (term integrity). Loads passages, calls the embeddings endpoint once per passage (batched where the endpoint supports it), produces the N×N cosine-similarity matrix and per-term statistics (English tightness = mean pairwise cosine among the term's English translator passages, of which there are at most two in this corpus; Pāli↔English mean = mean cosine between the Pāli passage and the English translator passages) [prov: src/g1-term-integrity.ts, invocation EMBEDDINGS_PATH=http://localhost:1234/v1/embeddings bun run src/g1-term-integrity.ts; outputs similarity-matrix.csv, per-term-stats.csv, cross-term-separation.csv, summary.md].
  • g2 (doctrinal topology). CPU-only, parallel across configs. Computes doctrinal-subcluster centroids from the 20 non-control doctrinal terms, produces the within-subcluster vs cross-subcluster topology-recovered scalar, and the template-vs-narrative vs doctrinal tightness contrasts at the centroid level [prov: src/g2-doctrinal-topology.ts; outputs topology-analysis.csv, topology-between-matrix.csv, topology-summary.md].
  • g3 (significance). 10,000 bootstrap iterations and 10,000 permutation iterations per test, with a seeded xorshift128 PRNG (seed=42 from SEED env variable, Splitmix64-inspired state expansion) [prov: src/g3-significance.ts lines 28, 52, 60-78 “// Seeded PRNG (xorshift128)”, “Splitmix64-inspired seeding”]. Five tests are run, the fifth in two parts: (1) per-term Pāli↔English gap bootstrap CI, the load-bearing test for this paper; (2) doctrinal topology bootstrap CI; (3) template-vs-narrative delta passage-level bootstrap; (4) contested-vs-core term-centroid bootstrap; (5a) parallel-rhetoric permutation; (5b) contested-vs-core permutation. Tests (2)–(5) are reported in a companion paper on parallel-rhetoric inflation and are not in scope here.
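The per-term statistics produced by g1 can be sketched as follows. This is an illustrative reconstruction from the definitions given in the list above, not the code in src/g1-term-integrity.ts; the `TermEmbeddings` shape and the function names are assumptions made for the example.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical container: one Pāli vector plus the English translator vectors for a term.
interface TermEmbeddings {
  pali: number[];
  english: number[][];
}

// English tightness: mean cosine over all unordered pairs of English passages.
// Undefined (NaN) when fewer than two English passages exist, as for citta.
function englishTightness(english: number[][]): number {
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < english.length; i++) {
    for (let j = i + 1; j < english.length; j++) {
      sum += cosine(english[i], english[j]);
      pairs++;
    }
  }
  return pairs > 0 ? sum / pairs : NaN;
}

// Pāli↔English mean: mean cosine between the Pāli passage and each English passage.
function paliToEnglishMean(t: TermEmbeddings): number {
  const sims = t.english.map((e) => cosine(t.pali, e));
  return sims.reduce((s, x) => s + x, 0) / sims.length;
}

// Gap as defined in §4.1: English tightness minus Pāli↔English mean.
function gap(t: TermEmbeddings): number {
  return englishTightness(t.english) - paliToEnglishMean(t);
}

// Toy 2-d example: two identical English vectors, Pāli at 45 degrees to them.
const toy: TermEmbeddings = { pali: [1, 0], english: [[1, 1], [1, 1]] };
console.log(gap(toy).toFixed(3)); // tightness 1.000 minus cos(45°) ≈ 0.707
```

A positive value means the English passages agree with each other more than they agree with the Pāli source, which is the sense in which the results tables report a "gap".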

The per-term Pāli↔English gap bootstrap (Test 1) discards iterations that fail to yield ≥2 English and ≥1 Pāli resample; the empirical retention rate is ≈0.43 at n=3 passages per term, yielding ~4,300 effective iterations per term rather than the nominal 10,000 [prov: results/*-n101/significance-analysis.csv notes field, mean skip_rate ≈ 0.570]. CIs are computed over retained iterations; the skip is independent of resampled values, so point estimates are unbiased but CIs widen relative to full-iteration bootstraps (§4.3).
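The retention rate can be sanity-checked by exact enumeration. The sketch below assumes that each bootstrap draw resamples as many passages as the term has, with replacement, and that repeated draws of the same English passage count toward the ≥2 English requirement; none of these details is stated in the source, so treat the result as a plausibility check rather than a description of src/g3-significance.ts.

```typescript
type Label = "P" | "E"; // Pāli passage or English passage

// Exact probability that a with-replacement resample of size n (n = number of
// passages) contains at least 2 English draws and at least 1 Pāli draw.
function retentionRate(labels: Label[]): number {
  const n = labels.length;
  const total = n ** n; // number of ordered resamples
  let kept = 0;
  for (let code = 0; code < total; code++) {
    let english = 0;
    let pali = 0;
    let c = code;
    // Decode `code` as n base-n digits; each digit indexes one drawn passage.
    for (let k = 0; k < n; k++) {
      if (labels[c % n] === "E") english++;
      else pali++;
      c = Math.floor(c / n);
    }
    if (english >= 2 && pali >= 1) kept++;
  }
  return kept / total;
}

console.log(retentionRate(["P", "E", "E"]).toFixed(3)); // 12/27 ≈ 0.444
console.log(retentionRate(["P", "E"])); // 0: a two-passage term can never qualify
```

Under these assumptions a complete-coverage term retains 12 of 27 resamples (skip rate ≈ 0.556), and a two-passage term such as citta is always skipped. Averaging 33 terms at 0.556 with one term at 1.0 gives ≈ 0.569, close to the reported mean skip rate of 0.570.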

3.4 Cross-Scale Replication

The dataset was grown in three scales: n=24 (initial pilot, 8 terms × 3 translators with one Pāli–only passage), n=89 (30 terms expanded, still with 2 template controls: kāya and vedanā), and n=101 (4 further template controls appended: kesā, aṭṭhi, pathavī, āpo) [prov: INTERIM-REPORT-n24 (initial pilot), INTERIM-REPORT-n89 (30-term expansion), INTERIM-REPORT-n101-2026-04-22.md §Step 1 “34 terms, 101 non-null passages, was 89”]. The n=24 and n=89 result directories are preserved alongside n=101 (results/*-n89pre/, results/qwen3-embedding-8b-n24/) to support before/after comparison.

4. Results

4.1 Per-Term Pāli↔English Gap

Table 1 reports per-term English tightness, Pāli↔English mean cosine, and the derived gap (English tightness − Pāli↔English mean) for each of the 33 terms with complete translator coverage, across all three embedder configurations. Positive gap = within-English translator agreement exceeds Pāli↔English agreement, i.e. the embedder separates Pāli from its English translations.

Table 1. Per-term Pāli↔English gap across embedder configurations (n=101). Columns are English tightness (ET), Pāli↔English mean (P↔E), and gap (ET − P↔E). ET and P↔E values are mean cosines in [0, 1]; gaps are differences of these and can be negative. [prov: computed from results/bge-m3-v2-n101/per-term-stats.csv, results/qwen3-embedding-8b-n101/per-term-stats.csv, results/qwen3-embedding-8b-no-prefix-n101/per-term-stats.csv].

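| Term | Category | BGE-M3 ET | BGE P↔E | BGE gap | Q3-pfx ET | Q3-pfx P↔E | Q3-pfx gap | Q3-np ET | Q3-np P↔E | Q3-np gap |
|---|---|---|---|---|---|---|---|---|---|---|
| anattā | metaphysics | 0.774 | 0.522 | 0.252 | 0.912 | 0.860 | 0.053 | 0.815 | 0.635 | 0.180 |
| saṅkhārā | metaphysics | 0.756 | 0.497 | 0.259 | 0.765 | 0.751 | 0.014 | 0.531 | 0.468 | 0.062 |
| anicca | metaphysics | 0.814 | 0.568 | 0.246 | 0.956 | 0.854 | 0.102 | 0.855 | 0.666 | 0.189 |
| paṭiccasamuppāda | metaphysics | 0.722 | 0.481 | 0.241 | 0.857 | 0.862 | −0.006 | 0.695 | 0.631 | 0.064 |
| khandhā | metaphysics | 0.789 | 0.444 | 0.345 | 0.921 | 0.848 | 0.074 | 0.873 | 0.653 | 0.221 |
| sati | meditation | 0.626 | 0.484 | 0.142 | 0.893 | 0.884 | 0.009 | 0.758 | 0.714 | 0.044 |
| samādhi | meditation | 0.574 | 0.479 | 0.094 | 0.810 | 0.825 | −0.015 | 0.595 | 0.514 | 0.080 |
| jhāna | meditation | 0.869 | 0.478 | 0.391 | 0.952 | 0.840 | 0.112 | 0.913 | 0.696 | 0.217 |
| satipaṭṭhāna | meditation | 0.692 | 0.472 | 0.220 | 0.899 | 0.887 | 0.012 | 0.745 | 0.681 | 0.064 |
| vipassanā | meditation | 0.675 | 0.484 | 0.191 | 0.804 | 0.848 | −0.045 | 0.653 | 0.620 | 0.034 |
| dukkha | ethics | 0.601 | 0.487 | 0.114 | 0.682 | 0.761 | −0.079 | 0.658 | 0.607 | 0.052 |
| nibbāna | ethics | 0.771 | 0.515 | 0.255 | 0.889 | 0.868 | 0.021 | 0.822 | 0.726 | 0.096 |
| kamma | ethics | 0.708 | 0.516 | 0.192 | 0.870 | 0.848 | 0.022 | 0.665 | 0.587 | 0.078 |
| ariya | ethics | 0.703 | 0.499 | 0.204 | 0.887 | 0.835 | 0.053 | 0.749 | 0.603 | 0.145 |
| taṇhā | ethics | 0.671 | 0.445 | 0.226 | 0.879 | 0.856 | 0.022 | 0.656 | 0.625 | 0.031 |
| paññā | wisdom | 0.624 | 0.467 | 0.157 | 0.822 | 0.788 | 0.034 | 0.530 | 0.524 | 0.007 |
| suññatā | wisdom | 0.598 | 0.467 | 0.132 | 0.731 | 0.813 | −0.082 | 0.639 | 0.633 | 0.006 |
| viññāṇa | wisdom | 0.826 | 0.442 | 0.384 | 0.909 | 0.870 | 0.039 | 0.847 | 0.662 | 0.185 |
| āsavā | wisdom | 0.831 | 0.421 | 0.410 | 0.935 | 0.810 | 0.125 | 0.827 | 0.572 | 0.255 |
| kāya | template-ctl | 0.580 | 0.367 | 0.213 | 0.779 | 0.810 | −0.031 | 0.629 | 0.533 | 0.095 |
| vedanā | template-ctl | 0.764 | 0.515 | 0.249 | 0.874 | 0.860 | 0.014 | 0.769 | 0.660 | 0.109 |
| kesā | template-ctl | 0.934 | 0.435 | 0.500 | 0.988 | 0.909 | 0.078 | 0.950 | 0.570 | 0.380 |
| aṭṭhi | template-ctl | 0.744 | 0.464 | 0.280 | 0.909 | 0.715 | 0.194 | 0.753 | 0.408 | 0.345 |
| pathavī | template-ctl | 0.890 | 0.448 | 0.441 | 0.966 | 0.905 | 0.062 | 0.906 | 0.675 | 0.231 |
| āpo | template-ctl | 0.901 | 0.453 | 0.449 | 0.970 | 0.848 | 0.122 | 0.892 | 0.637 | 0.255 |
| rukkha | narrative-ctl | 0.678 | 0.468 | 0.209 | 0.827 | 0.834 | −0.007 | 0.660 | 0.706 | −0.046 |
| udaka | narrative-ctl | 0.649 | 0.463 | 0.185 | 0.836 | 0.796 | 0.041 | 0.599 | 0.526 | 0.072 |
| aggi | narrative-ctl | 0.871 | 0.392 | 0.478 | 0.973 | 0.820 | 0.152 | 0.884 | 0.648 | 0.237 |
| gāma | narrative-ctl | 0.741 | 0.564 | 0.177 | 0.859 | 0.836 | 0.023 | 0.822 | 0.701 | 0.121 |
| rājā | narrative-ctl | 0.838 | 0.411 | 0.426 | 0.880 | 0.868 | 0.011 | 0.879 | 0.768 | 0.111 |
| mātā | narrative-ctl | 0.620 | 0.485 | 0.135 | 0.895 | 0.834 | 0.061 | 0.697 | 0.644 | 0.053 |
| pitā | narrative-ctl | 0.741 | 0.461 | 0.280 | 0.828 | 0.845 | −0.017 | 0.721 | 0.716 | 0.006 |
| akkhi | narrative-ctl | 0.921 | 0.362 | 0.559 | 0.960 | 0.864 | 0.097 | 0.891 | 0.703 | 0.187 |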

The citta term is omitted from Table 1 because its Ṭhānissaro passage is missing, leaving only one English translation (Sujato) alongside the Pāli source; the bootstrap for citta’s per-term CI is therefore degenerate (skip rate 1.0) in all three configurations [prov: results/*-n101/significance-analysis.csv citta row: notes “bootstrap always degenerate (likely <2 English translators in every draw)”]. All subsequent n-based claims in this section refer to the 33-term complete-coverage subset.

4.2 Mean Gap Collapse

Computing the mean per-term gap across the 33 complete-coverage terms:

  • BGE-M3 mean gap: 0.274 [prov: computed from results/bge-m3-v2-n101/per-term-stats.csv, sum of (english_tightness − pali_to_english_mean) / 33 = 0.2739, range [0.094, 0.559]].
  • Qwen3-Embedding-8B with prefix mean gap: 0.038 [prov: computed from results/qwen3-embedding-8b-n101/per-term-stats.csv, sum of (english_tightness − pali_to_english_mean) / 33 = 0.0383, range [−0.082, 0.194]].
  • Qwen3-Embedding-8B without prefix mean gap: 0.126 [prov: computed from results/qwen3-embedding-8b-no-prefix-n101/per-term-stats.csv, sum of (english_tightness − pali_to_english_mean) / 33 = 0.1263, range [−0.046, 0.380]].

Moving from BGE-M3 to Qwen3-prefix reduces the mean gap by 86.0%; moving from BGE-M3 to Qwen3-no-prefix reduces it by 53.9%. The prefix accounts for roughly a further 0.088 cosine units of gap closure on top of the no-prefix baseline. We describe this as substantial cross-lingual geometric convergence, not as cross-lingual semantic equivalence. §5 elaborates.
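The reduction figures follow directly from the four-decimal means quoted in the provenance tags above. A short check, with `percentReduction` as an illustrative helper name:

```typescript
// Relative reduction of `value` with respect to `baseline`, in percent.
function percentReduction(baseline: number, value: number): number {
  return ((baseline - value) / baseline) * 100;
}

// Four-decimal mean gaps quoted in §4.2.
const meanGapBge = 0.2739;
const meanGapPrefix = 0.0383;
const meanGapNoPrefix = 0.1263;

console.log(percentReduction(meanGapBge, meanGapPrefix).toFixed(1)); // "86.0"
console.log(percentReduction(meanGapBge, meanGapNoPrefix).toFixed(1)); // "53.9"
console.log((meanGapNoPrefix - meanGapPrefix).toFixed(3)); // "0.088"
```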

4.3 Per-Term Bootstrap Significance

Across 10,000 nominal bootstrap iterations with seed=42 — approximately 4,300 effective iterations per term after filtering draws that fail to yield ≥2 English and ≥1 Pāli passage (mean skip rate 0.570, min 0.547, max 1.000 across the 102 per-term × config tests; see §3.3 for the skip mechanism) — each per-term Pāli↔English gap CI was classified as significant (CI lower bound > 0) or not. Of 33 complete-coverage terms:

  • BGE-M3: 33 / 33 significant [prov: results/bge-m3-v2-n101/significance-summary.md TL;DR “Test 1 (per-term Pali↔English gap): 33/33 terms show CIs excluding zero”; every row in significance-analysis.csv has significant=true for test1_gap_* except the degenerate citta row].
  • Qwen3-prefix: 25 / 33 significant [prov: results/qwen3-embedding-8b-n101/significance-summary.md TL;DR “25/33 terms show CIs excluding zero”; non-significant: paṭiccasamuppāda, samādhi, vipassanā, dukkha, suññatā, kāya, rukkha, pitā].
  • Qwen3-no-prefix: 32 / 33 significant [prov: results/qwen3-embedding-8b-no-prefix-n101/significance-summary.md TL;DR “32/33 terms show CIs excluding zero”; non-significant: rukkha only].
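The significance rule used for these counts (CI lower bound > 0) can be sketched as below. The source does not say how the interval is formed from the retained bootstrap estimates, so the percentile method here is an assumption, and `percentileCI` and `isSignificant` are hypothetical names.

```typescript
// Percentile interval over retained bootstrap gap estimates (assumed method).
function percentileCI(samples: number[], level = 0.95): [number, number] {
  if (samples.length === 0) throw new Error("no retained bootstrap samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const alpha = (1 - level) / 2;
  // Conservative index choice: round the lower index down and the upper index up.
  const lo = sorted[Math.floor(alpha * (sorted.length - 1))];
  const hi = sorted[Math.ceil((1 - alpha) * (sorted.length - 1))];
  return [lo, hi];
}

// A term's gap is classified as significant when the CI lower bound exceeds zero.
function isSignificant(samples: number[], level = 0.95): boolean {
  return percentileCI(samples, level)[0] > 0;
}

// Toy data: 100 evenly spaced estimates, first all positive, then shifted to straddle zero.
const positive = Array.from({ length: 100 }, (_, i) => (i + 1) / 100);
const straddling = positive.map((x) => x - 0.5);
console.log(isSignificant(positive)); // true
console.log(isSignificant(straddling)); // false
```

Under this rule a term with a negative point estimate can never be classified as significant, which matches the eight non-significant Qwen3-prefix terms being exactly the eight with negative gaps in Table 1.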

The pattern is consistent with the mean-gap story: BGE-M3 maintains a statistically-detectable cross-lingual gap on every term; Qwen3-prefix eliminates detectable gap on a specific minority subset (notably including paṭiccasamuppāda, samādhi, vipassanā, dukkha, suññatā, kāya, rukkha, pitā — a mix of doctrinal and control terms); Qwen3-no-prefix sits between, with only rukkha (the narrative-control for “tree”) losing significance.

Note that the brief motivating this paper originally reported significance counts of 29/21/28; these were the n=89 numbers. The n=101 expansion [prov: INTERIM-REPORT-n101 §Step 3, “fresh g1/g2/g3 outputs”; results/*-n101/significance-summary.md] produces the 33/25/32 counts reported here, which supersede.

4.4 Cross-Scale Replication

Table 2. Mean Pāli↔English gap across dataset scales. n=24 is the initial 8-term pilot; n=89 is the 30-term expansion (29 complete-coverage terms after citta exclusion); n=101 is the 34-term template-control expansion (33 complete-coverage terms). Values are mean per-term gaps (English tightness − Pāli↔English mean) in cosine-similarity space. Term set is nested but not fixed across scales (each larger scale includes the prior scale’s terms plus additional templates/controls); absolute values are therefore scale-specific and the cross-embedder ordering is the robust claim. [prov: n=24 results computed from results/qwen3-embedding-8b-n24/per-term-stats.csv and results/qwen3-embedding-8b-no-prefix-n24/per-term-stats.csv; n=89 results computed from results/*-n89pre/per-term-stats.csv; n=101 results computed from results/*-n101/per-term-stats.csv; all values rounded to three decimals].

| Configuration | n=24 | n=89 | n=101 | Scale stability |
|---|---|---|---|---|
| BGE-M3 | not computed at n=24 | 0.254 | 0.274 | Δ = 0.020 across 89→101 |
| Qwen3-prefix | 0.022 | 0.028 | 0.038 | Δ = 0.016 across 24→101 |
| Qwen3-no-prefix | 0.093 | 0.102 | 0.126 | Δ = 0.033 across 24→101 |

The BGE-M3 mean was not computed at n=24 because the n=24 pilot was run only against Qwen3 configurations. The mean-gap ordering (BGE-M3 > Qwen3-no-prefix > Qwen3-prefix) holds at every scale where BGE-M3 is measured. The absolute magnitude of the mean-gap drift across scales (Δ ≤ 0.033 in every row) is modest relative to the mean-gap collapse finding itself (0.274 → 0.038, an effect of size 0.236), giving roughly a 7× signal-to-scale-drift ratio for the load-bearing comparison.

A mild upward drift is visible in both Qwen3 configurations as the corpus grows (prefix: 0.022 → 0.038, no-prefix: 0.093 → 0.126). This is consistent with the n=101 corpus including more template-control terms (kesā, pathavī, āpo) whose lexical overlap inflates English tightness faster than the Pāli↔English mean (see §5.1 and the limitations around template lexical overlap). The drift accumulates over a >4× dataset expansion (n=24 to n=101), and the ordering and magnitude of the cross-embedder comparison are nonetheless robust at every scale where measured.

4.5 Symmetry Check

One further check: in the Qwen3-prefix configuration, a small number of per-term gaps are negative — i.e. Pāli↔English agreement exceeds within-English translator agreement. These terms are paṭiccasamuppāda (−0.006), samādhi (−0.015), vipassanā (−0.045), dukkha (−0.079), suññatā (−0.082), kāya (−0.031), rukkha (−0.007), pitā (−0.017) [prov: results/qwen3-embedding-8b-n101/per-term-stats.csv]. Under Qwen3-no-prefix, only rukkha goes negative (−0.046) [prov: results/qwen3-embedding-8b-no-prefix-n101/per-term-stats.csv]. Under BGE-M3, no term has a negative gap.

A negative per-term gap is geometrically unusual: it means the Pāli source sits closer to the English translators than the English translators sit to each other. One way this can happen in pooled-embedding cosine space is compression: if the embedder is collapsing a wide distribution of English translator interpretations toward a single Pāli-anchored centroid, the Pāli passage ends up as the centroid of its own translator cluster. We return to this in §5.1.

5. Discussion

5.1 Representation vs Compression: What Cosine Cannot Tell Us

The mean-gap collapse from 0.274 under BGE-M3 to 0.038 under Qwen3-Embedding-8B with the instruction prefix is large, robust, and replicable. It is also, by itself, under-determined with respect to the interesting semantic question.

Two hypotheses about the underlying model behavior predict exactly the same pooled-embedding geometry:

  • Representation account. The larger model (Qwen3-Embedding-8B, 8B parameters, 4096-dim) has learned a richer representation of Pāli doctrinal vocabulary during pretraining, correctly identifying that anattā in Pāli and non-self in English denote the same concept, and therefore placing them near each other in embedding space.
  • Compression account. The larger model, under distributional co-occurrence pressure and the instruction-prefix bias toward a canonical retrieval format, is collapsing all inputs onto a shared English-centroid manifold, effectively treating Pāli tokens as if they were their most-frequent English co-occurrents. The Pāli source ends up geometrically near the English translators not because the model represents Pāli doctrinally, but because the model has compressed the space such that there’s nowhere else for Pāli to go.

Both accounts predict high Pāli↔English cosine. Both predict small cross-term separation in the English-normalized space. Both are consistent with the negative per-term gaps seen for specific terms under Qwen3-prefix (§4.5). Pooled-embedding cosine similarity — the only measurement our pipeline produces — cannot distinguish them.

Distinguishing them would require probing techniques we did not employ: hidden-state layer-wise analysis, attention-pattern inspection, representational dissimilarity matrix comparison against linguistic-typological priors, or direct behavioral probes (does the model retrieve the correct Pāli passage given a subtle doctrinal distinction in the English query?). These are all methodologically feasible and represent the natural continuation of this work.

The strong form of this observation is a methodological claim about what retrieval-geometry benchmarks can show: they show retrieval geometry. They do not show cross-lingual semantic equivalence. Papers that infer the latter from the former are over-claiming.

5.2 Instruction Prefix as Approximately Uniform Cosine Translation

The instruction prefix adds approximately 0.088 cosine units of mean gap closure across the 33-term complete-coverage set (mean gaps: BGE-M3 0.274, Qwen3-no-prefix 0.126, Qwen3-prefix 0.038; the prefix-vs-no-prefix delta on the mean is 0.126 − 0.038 = 0.088). At the embedder level, the mean within-subcluster shift is approximately uniform across the four doctrinal subclusters: metaphysics +0.096, meditation +0.073, ethics +0.073, wisdom +0.084 (mean ≈0.08, SD ≈0.01 across subclusters) [prov: computed from per-term-stats.csv, grouped by passages.json classification field].

Per-term shifts, however, are heterogeneous: the range is −0.039 (rukkha) to +0.302 (kesā) with mean +0.088 and SD 0.064 across the 33 complete-coverage terms [prov: computed from per-term-stats.csv for Q3-prefix vs Q3-no-prefix]. A pure translation would predict near-zero per-term shift variance; a rotation would predict heterogeneous shifts in both direction and magnitude. We observe an intermediate picture: direction is consistent (31/33 terms shift positive), magnitudes vary substantially, and subcluster-level means are close to uniform. We therefore describe the effect as an approximately uniform cosine translation at the subcluster level with non-trivial per-term variance, not as a pure translation. A formal rotation-vs-translation decomposition (e.g. Procrustes distance) is future work.

This is consistent with the prefix functioning as a shared affine bias at the aggregate level with term-specific sensitivity to the English-centric task-description wording at the per-term level. Two readings:

  • Pragmatic. The prefix reliably produces tighter cross-lingual retrieval on average. If that is the only goal, use it.
  • Cautious. Aggregate uniformity masks per-term variance; kesā/pathavī/aṭṭhi (template controls) and viññāṇa/khandhā (doctrinal metaphysics) experience much larger prefix-induced compression than paññā/rukkha/mātā. If the downstream task depends on per-term discriminability (e.g. distinguishing sati from satipaṭṭhāna), the prefix may compress exactly what needs to be separated. Running without the prefix remains the conservative default for discriminability-sensitive retrieval.
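The summary statistics quoted in this section can be sketched as follows. The input is the four subcluster shifts reported above; the function names and the choice of sample (n − 1) standard deviation are assumptions, since the text does not say which SD convention was used. The same summary applied to the 33 per-term shifts would give the direction count (31/33 positive), but those per-term values are not reproduced here.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); an assumed convention.
function sampleSd(xs: number[]): number {
  const m = mean(xs);
  const sumSq = xs.reduce((s, x) => s + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

interface ShiftSummary {
  mean: number;
  sd: number;
  positive: number; // how many shifts are > 0 (direction consistency)
  total: number;
}

function summarizeShifts(shifts: number[]): ShiftSummary {
  return {
    mean: mean(shifts),
    sd: sampleSd(shifts),
    positive: shifts.filter((s) => s > 0).length,
    total: shifts.length,
  };
}

// Subcluster-level shifts as reported: metaphysics, meditation, ethics, wisdom.
const subclusterShifts = [0.096, 0.073, 0.073, 0.084];
const subclusterSummary = summarizeShifts(subclusterShifts);
console.log(subclusterSummary.mean.toFixed(4), subclusterSummary.sd.toFixed(3));
// prints "0.0815 0.011", in line with the reported mean ≈0.08 and SD ≈0.01
```

A small SD relative to the mean at the subcluster level, next to the much larger per-term SD of 0.064, is what motivates the "approximately uniform at the aggregate level, heterogeneous per term" wording.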

5.3 Implications for Multilingual Buddhist Text Retrieval

Three design implications, each framed agnostically between the representation and compression accounts (§5.1) — both predict the same retrieval geometry at the pooled-cosine level, so practical recommendations do not require resolving the underdetermination:

  • For recall-first retrieval across Pāli and English translations, Qwen3-Embedding-8B produces tighter cross-lingual cosine geometry than BGE-M3. The gap closure is robust at the pooled-cosine level and stable across scales (§4.4); both the representation and compression accounts predict improved recall-at-k.
  • Add the instruction prefix when Pāli-source queries against English-translation indices are the primary use case. The ~0.09 cosine units of additional mean closure should translate into top-k improvement at the recall end under either account, though confirming this requires the separate retrieval benchmark noted in §5.4.
  • Drop the instruction prefix when per-term discriminability matters more than cross-lingual closure. Retrieval systems optimizing for “find the one passage on paṭiccasamuppāda” rather than “find any passage in the neighbourhood” will suffer from the prefix’s compression effect — predicted directly by the compression account, and consistent with the representation account if the per-term closure is smaller than the cross-term flattening.

These recommendations hold specifically for the BGE-M3-vs-Qwen3 comparison tested here. They do not generalize to arbitrary multilingual embedders; any deployment on a specific canonical Buddhist corpus should run the same per-term-gap measurement against its candidate embedders.

5.4 What Geometric Convergence Does and Does Not Show

Stating the limits of the finding explicitly:

Geometric convergence does show:

  • Qwen3-Embedding-8B places Pāli source passages substantially closer to their English translations in pooled-[EOS] cosine space than BGE-M3 does.
  • The effect is robust under 10,000-iteration bootstrap with seeded PRNG.
  • The effect replicates across n=24, n=89, and n=101 dataset scales with mean-gap drift ≤ 0.033 cosine units, modest relative to the ~0.24-unit cross-embedder mean-gap collapse.
  • The instruction prefix contributes an additional, roughly uniform, cross-lingual closure on top of the no-prefix baseline.

Geometric convergence does not show:

  • That the model understands Pāli doctrinal vocabulary.
  • That anattā and non-self are treated as the same concept (they may be; the measurement is silent).
  • That retrieval-top-k correctness at the doctrinal level is improved (that requires a separate retrieval benchmark).
  • That the per-term gap reflects translation fidelity in either direction.

The paper’s thesis is deliberately scoped to the first list.

6. Limitations

We name the following explicitly:

  • Sample size. 34 Pāli terms is small relative to the canonical Pāli doctrinal vocabulary (~3,000 technical terms across the Nikāyas by conservative count). Cross-term centroid analysis has 20 non-control term centroids across four 5-term subclusters — the statistical power at the subcluster level is limited [prov: results/*-n101/significance-summary.md caveats “Topology n: 20 non-control term centroids across 4 doctrinal subclusters of 5 each”].
  • Translator lineage homogeneity. Both English translator sources — SuttaCentral (Sujato) and Access to Insight (Ṭhānissaro) — draw heavily on the Anglophone Theravāda tradition. The measurement is biased toward translator consensus within that tradition and would not generalize cleanly to, e.g., Mahāyāna doctrinal vocabulary translated from Chinese/Sanskrit sources [prov: Bingenheimer 2020 on multi-tradition Buddhist digital corpora; this paper’s corpus restricted to Pāli Theravāda].
  • Three translators. Per-term bootstrap CI widths are intrinsically large at n=3 passages per term [prov: results/*-n101/significance-summary.md caveats “Power: n=3 passages/term … means per-term bootstrap CIs (Test 1) are intrinsically wide”]. A full survey including Bhikkhu Bodhi’s Wisdom Publications Nikāya translations (SN, MN, AN) and additional translator lineages (e.g. Horner, Narada Thera) would substantially expand the per-term power [prov: suttacentral.net translator catalogue lists Bodhi, Horner, Narada among primary Pāli-to-English modern translators].
  • Single quantization. All Qwen3-Embedding-8B results condition on Q4_K_M quantization [prov: INTERIM-REPORT-n101 §Step 3 “Results still condition on Q4_K_M quantization of Qwen3-Embedding-8B”]. Full-precision FP16 or bf16 replication would verify that the quantization is not distorting the cross-lingual geometry; we did not perform this cross-check.
  • No hidden-state probing. We work entirely at the pooled-[EOS] output layer. Layer-wise probing could distinguish the representation-vs-compression hypotheses directly (e.g. whether earlier layers preserve Pāli-script-distinct structure that later layers collapse) but was out of scope here.
  • Averaging-over-interpretations limit. Pooled embeddings collapse translator-specific interpretive choices (Sujato’s mindfulness vs Ṭhānissaro’s mindfulness/remembering for sati) into a single vector. Our Pāli↔English mean cosine is an averaged signal — it does not reveal whether the Pāli passage sits near one translator and away from the other, a distinction that would matter for interpretation-sensitive retrieval.
  • Representation-vs-compression underdetermination. The central methodological limit, discussed in §5.1. Pooled cosine cannot distinguish the two accounts. This is not a fixable limitation of our specific pipeline; it is a general limit of pooled-embedding geometric analysis.
  • Category partition as judgement call. The contested-vs-core and metaphysics/meditation/ethics/wisdom subcluster assignments are research-agent assignments, not independently-adjudicated gold standards [prov: results/*-n101/significance-summary.md caveats “Category partition (contested vs core): a research-agent assignment, not an independently adjudicated gold standard”]. A principled Pāli-doctrinal-taxonomy construction (e.g. anchored in the Nikāya-structural categories of dhammā / puggalā / yogakkhema) is future work.
  • Monotonic upward scale drift. The mean Pāli↔English gap drifts monotonically upward with corpus scale in every configuration where measurable (Qwen3-prefix: 0.022 → 0.028 → 0.038 across n=24/89/101; Qwen3-no-prefix: 0.093 → 0.102 → 0.126; BGE-M3: — → 0.254 → 0.274 across n=89/101) [prov: computed from archived per-term-stats.csv files at each scale]. The drift magnitude (Δ ≤ 0.033 across 24→101) is small relative to the 0.236-unit cross-embedder effect, but it is directional and consistent with the template-control lexical-overlap confound discussed in the companion paper (kesā/pathavī/āpo share MN10/MN28/MN62 material, inflating English tightness faster than Pāli↔English mean as these terms join the corpus). Absolute mean-gap values are therefore scale-conditional; the cross-embedder ordering (BGE-M3 > Qwen3-no-prefix > Qwen3-prefix) is the robust claim.
  • MN10 cross-term passage reuse. Five of 33 complete-coverage terms (kāya, vedanā, sati, satipaṭṭhāna, kesā) share MN10 as their primary source sutta [prov: passages.json sutta_ref field]. Their per-term English tightness and Pāli↔English cosines are not fully independent across these five terms because they share lexical context, contemplative-template structure, and translator-specific rendering conventions within the single MN10 source document. Per-term results for these five should be interpreted with this non-independence in mind; the mean-gap statistic inherits a mild positive correlation from this shared-source effect. The corresponding template-control lexical-overlap mechanism is the primary subject of the companion paper.
  • Bootstrap CI skip-rate artefact. Per-term bootstrap CI lower bounds can be pinned to discrete outcomes under the ~0.57 mean skip rate documented in §3.3 and §4.3; empirically, several per-term rows (e.g. BGE saṅkhārā, BGE dukkha) exhibit a CI lower bound equal to the median, the fingerprint of a lower percentile landing on a single discrete draw. Cross-configuration CI comparison remains meaningful because skip rates are near-identical across configurations (0.570 ± 0.001), but absolute CI-width reading should account for the effective ~4,300 iterations per term rather than the nominal 10,000.
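The skip-rate artefact in the last item can be illustrated with a small percentile-CI sketch. This is not the code in src/g3-significance.ts: the nearest-rank percentile rule, the use of null to mark a skipped iteration, and all names are assumptions, and the rule that decides when an iteration is skipped (§3.3) is not reproduced.

```typescript
// Effective iteration count after skip-filtering.
function effectiveIterations(nBoot: number, skipRate: number): number {
  return Math.round(nBoot * (1 - skipRate));
}

// Nearest-rank percentile on an ascending-sorted array (assumed convention).
function nearestRank(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil(p * sorted.length) - 1));
  return sorted[idx];
}

interface CiSummary {
  effective: number;
  lower: number;
  median: number;
  upper: number;
}

// Percentile CI over bootstrap draws; null marks a skipped iteration.
function percentileCi(draws: Array<number | null>): CiSummary {
  const valid = draws
    .filter((d): d is number => d !== null)
    .sort((a, b) => a - b);
  return {
    effective: valid.length,
    lower: nearestRank(valid, 0.025),
    median: nearestRank(valid, 0.5),
    upper: nearestRank(valid, 0.975),
  };
}

console.log(effectiveIterations(10000, 0.57)); // 4300

// Toy draws: few distinct outcomes, half of the iterations skipped.
const toyDraws: Array<number | null> = [null, 0.2, null, 0.2, 0.2, null, 0.3, null, 0.2, null];
const toyCi = percentileCi(toyDraws);
console.log(toyCi);
```

With only two distinct outcomes in the toy draws, the 2.5th percentile and the median both land on 0.2. This is the same fingerprint as the rows where the CI lower bound equals the median.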

7. Conclusion

On a 34-term, 101-passage Pāli/English Buddhist doctrinal benchmark drawn from the SuttaCentral Mahāsaṅgīti Pāli edition and the Sujato/Ṭhānissaro translator lineages, the per-term Pāli↔English cosine gap collapses from a mean of 0.274 under BGE-M3 to 0.038 under Qwen3-Embedding-8B with a custom domain-adapted instruction prefix (Qwen3 schema; see §3.2), and 0.126 without the prefix. Per-term bootstrap-CI significance ratios are 33/33, 25/33, and 32/33 respectively. The mean-gap values drift by ≤ 0.033 cosine units across a four-fold dataset expansion (n=24 → n=101), small relative to the ~0.24-unit cross-embedder collapse.

We read this as cross-lingual geometric convergence under distributional co-occurrence pressure, and we explicitly decline to read it as cross-lingual semantic equivalence. Pooled-embedding cosine cannot distinguish the representation account (the larger model understands Pāli doctrine) from the compression account (the larger model collapses everything toward an English-centroid manifold). Both accounts predict the measurement we made.

The practical implication for Buddhist digital-humanities retrieval-system design is that Qwen3-Embedding-8B with the instruction prefix is the current strong baseline for recall-first Pāli↔English cross-lingual retrieval on doctrinal corpora, with the prefix dropped where per-term discriminability matters (§5.3); the corresponding methodological implication for cross-lingual embedding evaluation is that pooled-cosine benchmarks should not be read as evidence about semantic equivalence in either direction.

Future work:

  • Layer-wise hidden-state probing to distinguish representation vs compression.
  • Native-Pāli-doctrinal-taxonomy benchmarks (e.g. Abhidhamma-grounded subclusters rather than research-agent-assigned doctrinal subclusters) to test whether the geometry recovers canonical structure.
  • Full-precision replication of the Qwen3 results.
  • Expansion to Bhikkhu Bodhi and Horner translator lineages for per-term-CI power improvements.
  • Cross-tradition extension: Pāli↔Chinese Āgama retrieval, Pāli↔Sanskrit parallel retrieval.

8. Reproducibility Appendix

8.1 Environment

All experiments were run on a single workstation with LM Studio hosting embedding models via an OpenAI-compatible endpoint at http://localhost:1234/v1/embeddings [prov: INTERIM-REPORT-n101 §Step 3; src/g1-term-integrity.ts endpoint configuration].

Embedding models:

  • text-embedding-bge-m3 (default quantization) [prov: INTERIM-REPORT-n101 §Step 3 “Model: text-embedding-bge-m3”].
  • text-embedding-qwen3-embedding-8b (Q4_K_M quantization) [prov: INTERIM-REPORT-n101 §Step 3 “Qwen3-Embedding-8B (prefix) … 14s model load (lms load) + 42s inference”].

Model swaps between BGE-M3 and Qwen3-Embedding-8B were performed via lms unload <model> / lms load <model> and verified with lms ps [prov: INTERIM-REPORT-n101 §Step 3 “Model swap via lms unload text-embedding-bge-m3 then lms load text-embedding-qwen3-embedding-8b was clean (13.78s load, 4.68 GB RAM, lms ps confirmed single embedding model resident)”].
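For readers who want to check the endpoint before running the pipeline, a single embedding call can be sketched as below. The endpoint URL, model identifiers, and prefix string are taken from this paper; the function names, the request-building split, and the assumption that the prefix is prepended to every passage when enabled are illustrative and may differ from src/g1-term-integrity.ts. The request and response shapes follow the OpenAI embeddings format that the endpoint is described as compatible with.

```typescript
const ENDPOINT = "http://localhost:1234/v1/embeddings";
const QWEN3_INSTRUCTION =
  "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: ";

interface EmbeddingRequest {
  model: string;
  input: string;
}

// Pure helper: builds the JSON body for one passage, optionally prefixed.
function buildEmbeddingRequest(
  model: string,
  passage: string,
  useInstruction: boolean,
): EmbeddingRequest {
  return { model, input: useInstruction ? QWEN3_INSTRUCTION + passage : passage };
}

// One embedding call per passage, returning the pooled vector.
async function embedPassage(req: EmbeddingRequest): Promise<number[]> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) {
    throw new Error(`embedding request failed with status ${res.status}`);
  }
  const json = (await res.json()) as { data: Array<{ embedding: number[] }> };
  return json.data[0].embedding;
}

// Building a request needs no running server; embedPassage does.
const exampleRequest = buildEmbeddingRequest(
  "text-embedding-qwen3-embedding-8b",
  "sabbe dhammā anattā",
  true,
);
console.log(JSON.stringify(exampleRequest));
```

Keeping request construction separate from the network call makes it possible to confirm what text is actually sent, including the prefix, without a loaded model.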

8.2 Pipeline Invocation

All commands are run from the project root /Users/samriegel/PAIv3/Projects/BuddhistGeometry/. The env-var names below are the actual variables read by the source files (verified against the constants in src/g1-term-integrity.ts and src/g3-significance.ts):

# Stage 1: term integrity (one embedding call per passage; produces embeddings.json,
# similarity-matrix.csv, per-term-stats.csv). g1 reads: EMBED_MODEL, USE_INSTRUCTION,
# RESULTS_DIR, OPENAI_API_KEY. The embedding endpoint URL is hardcoded at
# src/g1-term-integrity.ts:81 to http://localhost:1234/v1/embeddings.

EMBED_MODEL=text-embedding-bge-m3 \
  USE_INSTRUCTION=false \
  RESULTS_DIR=results/bge-m3-v2-n101 \
  bun run src/g1-term-integrity.ts

EMBED_MODEL=text-embedding-qwen3-embedding-8b \
  USE_INSTRUCTION=true \
  RESULTS_DIR=results/qwen3-embedding-8b-n101 \
  bun run src/g1-term-integrity.ts

EMBED_MODEL=text-embedding-qwen3-embedding-8b \
  USE_INSTRUCTION=false \
  RESULTS_DIR=results/qwen3-embedding-8b-no-prefix-n101 \
  bun run src/g1-term-integrity.ts

# Stage 2: doctrinal topology (CPU-only, parallel). g2 reads RESULTS_DIR.
for config in bge-m3-v2-n101 qwen3-embedding-8b-n101 qwen3-embedding-8b-no-prefix-n101; do
  RESULTS_DIR=results/$config bun run src/g2-doctrinal-topology.ts &
done
wait

# Stage 3: significance (10k bootstrap + 10k permutation, seeded). g3 reads:
# EMBEDDINGS_PATH (full path to the embeddings.json file, NOT a directory),
# SEED, N_BOOT, N_PERM. Outputs go to the directory containing EMBEDDINGS_PATH.
for config in bge-m3-v2-n101 qwen3-embedding-8b-n101 qwen3-embedding-8b-no-prefix-n101; do
  EMBEDDINGS_PATH=results/$config/embeddings.json \
    SEED=42 N_BOOT=10000 N_PERM=10000 \
    bun run src/g3-significance.ts &
done
wait

Note: the USE_INSTRUCTION env var defaults to true; the explicit false for BGE-M3 and Qwen3-no-prefix ensures no instruction prefix is prepended. For Qwen3-with-prefix, the literal prefix string defined at src/g1-term-integrity.ts:83 is applied verbatim (see §3.2 and §8.4).

8.3 PRNG

All randomness flows through a single seeded xorshift128 PRNG [prov: src/g3-significance.ts lines 52–78 “// Seeded PRNG (xorshift128)”]:

  • Seed: SEED environment variable, default 42 [prov: src/g3-significance.ts line 28 const SEED = Number(process.env.SEED ?? "42");].
  • State expansion: Splitmix64-inspired, expanding one 32-bit seed into four non-zero 32-bit state words [prov: src/g3-significance.ts lines 60–78 “Splitmix64-inspired seeding: cheaply expand one seed into four non-zero 32-bit state words”].
  • Bootstrap iterations: N_BOOT environment variable, default 10,000 [prov: src/g3-significance.ts line 29].
  • Permutation iterations: N_PERM environment variable, default 10,000 [prov: src/g3-significance.ts line 30].
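A minimal sketch of a seeded generator of this kind follows, for readers re-implementing the bootstrap elsewhere. The update step is the standard 32-bit xorshift128 recurrence (shifts 11, 8, 19). The seed-expansion mixer is a stand-in chosen for illustration and is not the expansion in src/g3-significance.ts, so this sketch will not reproduce the paper's exact random stream.

```typescript
// Returns a function producing floats in [0, 1) from a 32-bit seed.
function makeXorshift128(seed: number): () => number {
  // Stand-in seed expansion (assumed): step a counter and scramble it,
  // forcing each of the four state words to be non-zero.
  let counter = seed >>> 0;
  const nextWord = (): number => {
    counter = (counter + 0x9e3779b9) >>> 0;
    let m = counter;
    m = Math.imul(m ^ (m >>> 16), 0x85ebca6b) >>> 0;
    m = Math.imul(m ^ (m >>> 13), 0xc2b2ae35) >>> 0;
    m = (m ^ (m >>> 16)) >>> 0;
    return m === 0 ? 1 : m;
  };
  let x = nextWord();
  let y = nextWord();
  let z = nextWord();
  let w = nextWord();

  return (): number => {
    // Standard xorshift128 step on 32-bit words.
    const t = x ^ (x << 11);
    x = y;
    y = z;
    z = w;
    w = (w ^ (w >>> 19) ^ t ^ (t >>> 8)) >>> 0;
    return w / 0x100000000;
  };
}

// Draw n indices with replacement, as a bootstrap resample would.
function resampleIndices(rng: () => number, n: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < n; i++) {
    out.push(Math.floor(rng() * n));
  }
  return out;
}

const rng = makeXorshift128(42);
console.log(resampleIndices(rng, 3));
```

Two generators built from the same seed produce the same sequence. That is the property the seeded bootstrap relies on for reproducible CIs.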

8.4 File Paths

All paths relative to /Users/samriegel/PAIv3/Projects/BuddhistGeometry/:

  • data/passages.json — 34 terms, 101 non-null passages [prov: INTERIM-REPORT-n101 §Step 1 “34 terms, 101 non-null passages, 6 control_template, 8 control_narrative, 20 doctrinal terms, 14 control total”].
  • data/passages.2026-04-22-v2.json — byte-identical backup of the pre-append passages.json (n=89 state) [prov: INTERIM-REPORT-n101 §Step 1].
  • src/g1-term-integrity.ts — stage-1 pipeline source. The Qwen3 instruction prefix (when USE_INSTRUCTION=true) is the literal string "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: " defined at line 83 as the QWEN3_INSTRUCTION constant.
  • src/g2-doctrinal-topology.ts — stage-2 pipeline source.
  • src/g3-significance.ts — stage-3 pipeline source.
  • results/bge-m3-v2-n101/ — BGE-M3 results (similarity-matrix.csv, per-term-stats.csv, significance-analysis.csv, significance-summary.md, summary.md).
  • results/qwen3-embedding-8b-n101/ — Qwen3-with-prefix results.
  • results/qwen3-embedding-8b-no-prefix-n101/ — Qwen3-no-prefix results.
  • results/*-n89pre/ — archived n=89 scale results for cross-scale replication.
  • results/qwen3-embedding-8b-n24/ and results/qwen3-embedding-8b-no-prefix-n24/ — archived n=24 pilot results.
  • INTERIM-REPORT-n101-2026-04-22.md — contemporaneous research-log for this run.

8.5 Verification Artefacts

Three verification artefacts are preserved alongside the results:

  • results/*/significance-summary.md — human-readable TL;DR of per-term significance counts, CI tables, and claim-survival verdicts.
  • results/*/significance-analysis.csv — row-level CI data for every per-term test.
  • results/*/summary.md — g1-stage per-term tightness/Pāli↔English tables.

Independent replication should recover the per-term stats within the reported bootstrap CI widths given the same PRNG seed, the same passage input, and the same embedding model version. The embedding call itself is not seed-controlled — a replicator running against a different Qwen3 quantization will see slightly different cosines, but the mean-gap ordering (BGE-M3 > Qwen3-no-prefix > Qwen3-prefix) should be stable.

9. Researcher-Handoff Note

Methodology attribution note. Early scaffold code (src/g1-term-integrity.ts) originally attributed the RSA measurement routine to “David Noel Ng.” Extensive literature search (WebSearch 2026-04-22, six distinct query strategies across Google Scholar, arXiv, ACL Anthology, and general web) located no publication under that authorship matching the RSA-for-embeddings methodology. The methodology correctly traces to Kriegeskorte, Mur, and Bandettini (2008) as the foundational RSA formulation, with Abdullah et al. (2021) as the nearest cross-lingual-RSA precedent (acoustic-embedding domain). The per-term cluster-tightness / cross-term-separation adaptation for Pāli/English Buddhist canonical text across script boundaries is, to our knowledge, original to this paper. The scaffold header comment has been corrected in parallel with this clarification.

Scope note for future expansion. This paper is scoped to retrieval geometry at the pooled-embedding level. A companion paper (in preparation) covers parallel-rhetoric inflation — the template-vs-narrative contrast where stereotyped formulaic passages (MN10 31/32-parts, MN28/MN62 dhātu-vibhaṅga) inflate within-cluster tightness in ways that look like doctrinal structure but are actually lexical-overlap artefacts. That analysis is explicitly out of scope here.

Future-work pointers (single-sentence mentions only). Native-Pāli-doctrinal-taxonomy benchmarks, layer-wise hidden-state probing, full-precision replication, multi-tradition extension (Chinese Āgama, Sanskrit parallels), and expanded translator-lineage coverage (Bodhi, Horner, Narada) all sit outside this paper’s scope and are candidates for follow-on work.


References

Abdullah, B., Zaitova, I., Avgustinova, T., Möbius, B., & Klakow, D. (2021). How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. arXiv:2109.10179.

Bingenheimer, M. (2020). Digitization of Buddhism (Digital Humanities and Buddhist Studies). Oxford Bibliographies in Buddhism. Retrieved from https://mbingenheimer.net/publications/bingenheimer.2020.oxford-bib-digitizationOfBuddhism.pdf.

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. Findings of the Association for Computational Linguistics: ACL 2024. arXiv:2402.03216.

Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2022). Language-agnostic BERT Sentence Embedding. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 878–891. arXiv:2007.01852.

Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., & Grave, E. (2022). Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research. arXiv:2112.09118.

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781. arXiv:2004.04906.

Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational Similarity Analysis — Connecting the Branches of Systems Neuroscience. Frontiers in Systems Neuroscience, 2, 4. https://doi.org/10.3389/neuro.06.004.2008.

Lugli, L., Hellwig, O., Verma, R., & Mishra, M. (2022). Embeddings models for Buddhist Sanskrit. Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). https://aclanthology.org/2022.lrec-1.411.

Zhang, Y., et al. (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. Technical Report. arXiv:2506.05176.

Corpus Sources

Translator Biographical Endnotes

  • Bhikkhu Sujato (Ajahn Sujato). Australian Theravāda monk, co-founder of SuttaCentral (2005). Completed English translations of the four Nikāyas 2015–2018 on Chimei Island, Taiwan. Published as free editions and as the primary SuttaCentral English lineage [prov: en.wikipedia.org/wiki/Bhante_Sujato; suttacentral.net/sujato].
  • Ṭhānissaro Bhikkhu (Geoffrey DeGraff). American Theravāda monk, abbot of Metta Forest Monastery (California). Primary translator for Access to Insight since 1994. Distinguished by a doctrinal-grammatical rather than literary translation philosophy [prov: accesstoinsight.org/lib/authors/thanissaro/index.html].
  • Bhikkhu Bodhi. American Theravāda monk, former editor of the Buddhist Publication Society (Sri Lanka). Translator of the Saṃyutta Nikāya (Wisdom Publications 2000), Majjhima Nikāya (with Ñāṇamoli, 1995), and Aṅguttara Nikāya (2012). Not in scope for the current benchmark; candidate for expansion per §6 [prov: wisdomexperience.org catalog].
  • I.B. Horner. British Pāli scholar (1896–1981), President of the Pāli Text Society 1959–1981. Translator of the Vinaya and the Majjhima Nikāya (Middle Length Sayings, 1954–1959). Historical lineage reference.
  • Narada Thera (Ven. Nārada Mahāthera). Sri Lankan Theravāda monk (1898–1983). Early English-language translator of the Dhammapada, Aṅguttara selections, and Abhidhamma materials. Historical lineage reference.

Self-Verification: Unlabeled Numerical Claims Grep

A post-draft grep for numerical claims (every 0.XXX, X%, n=X, X/X pattern) was performed. Every such claim falls under one of the following:

  1. Traced to a specific CSV file via inline [prov: results/.../{file}.csv] — e.g. all per-term gap values (Table 1), all mean-gap calculations (§4.2), all significance counts (§4.3).
  2. Traced to a specific paper citation via inline [prov: Author Year, Section] — e.g. “109 languages” (Feng et al. 2022), “100+ languages” (Chen et al. 2024), “8B parameters” (Zhang et al. 2025).
  3. Traced to the interim research-log via [prov: INTERIM-REPORT-n101-2026-04-22.md §StepX] — e.g. wall-clock timings, model-load times, memory pre-flight values.
  4. Labeled as interpretation where interpretation, not observation, is the source — e.g. §5.1 “Both accounts predict high Pāli↔English cosine” is labeled implicitly by the opening “Two hypotheses about the underlying model behavior predict …”.

All Table 2 cross-scale values were re-derived from the archived per-term-stats.csv files at each scale via the same sum-of-gaps-over-n formula used in §4.2. The verified values, with the number of complete-coverage terms contributing to each mean in parentheses, are: BGE-M3 0.254 at n=89 (29 terms, citta excluded); Qwen3-prefix 0.022 at n=24 (8 terms), 0.028 at n=89 (29 terms), 0.038 at n=101 (33 terms); Qwen3-no-prefix 0.093 at n=24 (8 terms), 0.102 at n=89 (29 terms), 0.126 at n=101 (33 terms). No numerical claim in the draft currently lacks a provenance trace.
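The cross-scale checks can be sketched as below, using only mean-gap values reported in this paper (the BGE-M3 value at n=101 is the 0.274 from the abstract and §7). The meanGap function is the sum-of-gaps-over-n formula; the data layout and helper names are assumptions, and the per-term gaps themselves are not reproduced here.

```typescript
// Sum-of-gaps-over-n: mean of the per-term gaps for one configuration.
function meanGap(gaps: number[]): number {
  return gaps.reduce((s, g) => s + g, 0) / gaps.length;
}

// [corpus scale in passages, reported mean gap], in ascending scale order.
type ScaleSeries = Array<[number, number]>;

const reported: Record<string, ScaleSeries> = {
  "BGE-M3": [[89, 0.254], [101, 0.274]],
  "Qwen3-no-prefix": [[24, 0.093], [89, 0.102], [101, 0.126]],
  "Qwen3-prefix": [[24, 0.022], [89, 0.028], [101, 0.038]],
};

// Drift between the smallest and largest measured scale.
function drift(series: ScaleSeries): number {
  return series[series.length - 1][1] - series[0][1];
}

// True when every step to a larger scale increases the mean gap.
function isMonotonicUp(series: ScaleSeries): boolean {
  return series.every((point, i) => i === 0 || point[1] > series[i - 1][1]);
}

const finalGap = (name: string): number => {
  const series = reported[name];
  return series[series.length - 1][1];
};

const maxDrift = Math.max(...Object.values(reported).map(drift));
const allMonotonic = Object.values(reported).every(isMonotonicUp);
const orderingHolds =
  finalGap("BGE-M3") > finalGap("Qwen3-no-prefix") &&
  finalGap("Qwen3-no-prefix") > finalGap("Qwen3-prefix");

console.log(maxDrift.toFixed(3), allMonotonic, orderingHolds);
```

On the reported values the largest drift is the Qwen3-no-prefix series (0.126 − 0.093), and the ordering BGE-M3 > Qwen3-no-prefix > Qwen3-prefix holds at n=101.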