Syntactic Parallelism as Semantic Feature: Cross-Embedder Inflation in List-Formula Text
Companion paper and provenance convention
This paper is one of two companion findings from the same study. The companion piece — Cross-Lingual Geometric Convergence in Multilingual Embedders: A Pāli/English Buddhist Doctrinal Benchmark — reports the cross-lingual retrieval-geometry result on the same corpus, and the study notes give the full backstory. The computational-linguistics claim here is about encoder behavior, not about Buddhist doctrine — the Buddhist corpus is a tractable case study, chosen for its closed canon, digital availability, and well-attested formulaic sub-structures.
Inline [prov: ...] brackets are the paper’s provenance-tag convention, preserved verbatim from the Cortex-style source. Each non-trivial claim is traced to a specific CSV row, a specific paper citation, or a specific line of the interim research log. The tags read a little dense on the page; that density is the point. Nothing is asserted without a trace back to its source.
Abstract
Transformer-based sentence encoders are increasingly applied to corpora rich in formulaic syntactic structure (legal briefs, liturgical texts, scientific abstracts, religious canons), yet the interaction between pooled embedding geometry and canonical parallel-rhetoric frames has received little direct measurement. We report a controlled case study using a curated Buddhist-canonical corpus (n=101 passages across 34 terms; 14 control terms partitioned into 6 template-bound and 8 narrative-only) and two architecturally distinct multilingual encoders: BGE-M3 (560M parameters, 1024-dim) and Qwen3-Embedding-8B (8B parameters, 4096-dim, Q4_K_M quantization, evaluated with and without a custom domain-adapted instruction prefix following the Qwen3 instruction-prefix schema; see §4.2). Across all three embedder configurations, within-subcluster mean pairwise cosine similarity is elevated by +0.0587 to +0.0900 cosine units for template-bound controls relative to narrative-only controls [prov: results//significance-analysis.csv line 37, test3_template_minus_narrative, bootstrap median]. Bootstrap 95% confidence intervals exclude zero in all three configurations; permutation-test p-values fall below 0.05 in two of three (p=0.0384 BGE-M3, p=0.0033 Qwen3-prefix, p=0.0663 Qwen3-no-prefix) [prov: results//significance-analysis.csv line 39]. At an earlier n=89 checkpoint, before lexical-overlap-heavy terms were added, the centroid-level delta converged across the two most different encoders to within 0.0006 cosine units (BGE-M3 +0.1626, Qwen3-prefix +0.1631) [prov: results/bge-m3-v2-n89pre/topology-analysis.csv lines 6-7 and results/qwen3-embedding-8b-n89pre/topology-analysis.csv lines 6-7]. A lexical-overlap confound is identified and quantified: the MN10 32-parts body enumeration recurs nearly verbatim in the MN28 earth-element definition, and the tightest template pair reaches a cosine of 0.8315 (BGE-M3, kesā×pathavī), 0.8884 (Qwen3-prefix, kesā×pathavī), and 0.7809 (Qwen3-no-prefix, pathavī×āpo).
We frame the finding as evidence that syntactic parallelism functions as a semantic feature in pooled transformer outputs and hypothesize — as a claim requiring cross-domain validation, not a proven result — that similar inflation should be anticipated in any embedding analysis of formulaic corpora. Disambiguating the parallel-syntax effect from the lexical-overlap confound is the primary target for follow-up work.
1. Introduction
Transformer-based sentence encoders are now standard infrastructure for retrieval, clustering, semantic-similarity scoring, and downstream classification across domains. When applied to corpora with substantial formulaic or repetitive syntactic structure — court opinions with boilerplate preambles, religious canons organized around canonical enumerations, scientific abstracts with the IMRaD template, parliamentary speeches with procedural refrains — an under-examined question arises: does the surface-syntactic regularity of the text leak into pooled semantic embeddings in ways that bias downstream analysis?
This paper reports a controlled measurement of one such effect. Using a curated Buddhist-canonical corpus as a tractable case study (early Pāli suttas from the Majjhima Nikāya and Saṃyutta Nikāya, with English parallel translations from multiple translators), we show that two architecturally-distinct multilingual embedders converge on a consistent inflation of pairwise cosine similarity for terms appearing in canonical parallel-rhetoric list-formula frames, relative to terms appearing only in narrative contexts. The effect survives bootstrap confidence-interval testing in all three configurations we evaluated and permutation label-shuffle testing in two of three.
We position the Buddhist corpus as a case study rather than the paper’s main interest. Buddhist canonical texts happen to offer an unusually clean substrate for this measurement — a closed corpus with well-attested formulaic sub-structures (the 32-parts body enumeration at Majjhima Nikāya 10; the four-elements definitional template at MN28 and MN62; the cemetery-contemplation series at MN119), multiple independent English translations per passage, and machine-readable editions at SuttaCentral and Access to Insight [prov: SuttaCentral.net, Access to Insight www.accesstoinsight.org]. The philological details matter only insofar as they permit matched controls. The computational-linguistics claim is about embedder behavior, not about Buddhist doctrine.
Our contributions are four. First, we provide a reproducible three-configuration (two-encoder) benchmark of the template-vs-narrative cosine-similarity delta on a publicly reconstructible canonical corpus. Second, we demonstrate cross-embedder convergence at the centroid level to within 0.0006 cosine units on an initial sample (n=89), and direction-consistent convergence to within ~0.03 cosine units at full sample (n=101). Third, we identify and quantify a lexical-overlap confound that is intrinsic to canonical list-formula corpora and is not cleanly separable from the parallel-syntax effect in our current data. Fourth, we propose a disambiguation experiment using structurally-parallel-but-lexically-disjoint controls as the natural follow-up.
The practical implication is a caution. Any analyst deploying sentence embedders on a corpus with non-trivial formulaic structure should anticipate that pairwise cosine similarity will be elevated between passages sharing that structure, independent of their propositional content. This is not an embedder bug — the behavior is lawful given how transformer attention pools over syntactically-regular token sequences — but it is an interpretive hazard for downstream analyses that treat cosine as a proxy for semantic relatedness.
2. Related Work
2.1 Transformer encoder biases and embedding geometry
Ethayarajh (2019) established that contextualized representations from BERT, ELMo, and GPT-2 are globally anisotropic and that upper layers produce more context-specific representations, with less than 5% of the variance in a word’s contextualized embedding explained by a static per-word embedding [prov: Ethayarajh 2019, EMNLP-IJCNLP, pages 55–65; arXiv:1909.00512]. The relevance of this finding for the present work is that any claim about cosine similarity in a pooled encoder output must be interpreted against an anisotropic background — differences of 0.05–0.09 cosine units are meaningful only relative to the baseline cosine distribution within the specific encoder.
Hewitt and Manning (2019) showed that syntax trees are recoverable as a linear transformation of BERT and ELMo representations — squared L2 distance in a probe-identified subspace encodes parse-tree distance between tokens [prov: Hewitt & Manning 2019, NAACL-HLT, ACL Anthology N19-1419]. The structural-probe literature establishes that syntactic information is implicitly encoded in deep transformer outputs. Our finding is consistent with a corollary hypothesis that pooled outputs over syntactically-regular inputs will inherit some of that regularity as a pooled-level signal.
2.2 Pooled representation limits and sentence-level encoders
Reimers and Gurevych (2019) introduced Sentence-BERT (SBERT), using siamese fine-tuning to produce sentence embeddings that can be compared with cosine similarity directly [prov: Reimers & Gurevych 2019, EMNLP-IJCNLP, ACL Anthology D19-1410]. Their work established the viability of pooled-then-compared sentence representations but did not specifically examine the interaction between pooling and syntactic regularity of the input.
Gao et al. (2021) showed with SimCSE that contrastive learning objectives regularize anisotropic pretrained embedding space toward uniformity, improving semantic textual similarity (STS) performance substantially [prov: Gao et al. 2021, EMNLP, ACL Anthology 2021.emnlp-main.552]. The SimCSE finding reinforces that pooled embedding geometry is malleable under training objective choices — which in turn means the inflation effect we report is specific to the training regime of the encoders we evaluate, not a universal claim about all pooled transformer outputs.
Practitioner literature comparing [CLS]-token vs mean-pooling for sentence embeddings consistently finds mean-pooling produces higher-quality semantic representations for similarity and clustering tasks [prov: Milvus AI reference documentation, Zilliz vector-database documentation]. Both encoders evaluated in this paper use mean-pooling (BGE-M3) or an analogous last-token representation strategy (Qwen3-Embedding-8B, per the Qwen3 technical report) rather than [CLS]-token extraction [prov: Chen et al. 2024 arXiv:2402.03216, Zhang et al. 2025 arXiv:2506.05176].
2.3 The specific encoders
BGE-M3 is a multilingual encoder supporting 100+ languages, 1024-dimensional output, 8,192 token context, with 560M parameters, trained via self-knowledge distillation over dense, multi-vector, and sparse retrieval objectives simultaneously [prov: Chen et al. 2024, “BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation,” arXiv:2402.03216; ACL Anthology 2024.findings-acl.137]. It achieves state-of-the-art results on multilingual and cross-lingual retrieval benchmarks.
Qwen3-Embedding-8B is the 8B-parameter tier of the Qwen3-Embedding series, with 4096-dimensional output and support for 100+ languages including code. It ranked first on the MTEB multilingual leaderboard as of June 2025 (score 70.58) [prov: Zhang et al. 2025, “Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models,” arXiv:2506.05176]. Its training pipeline has three stages: weakly supervised contrastive pre-training, supervised fine-tuning, and model merging. The instruction-prefix variant prepends a task description to each input; we evaluate both prefix-on and prefix-off configurations.
2.4 Buddhist canonical text processing
Computational work on the Pāli Canon is an established but small subfield. SuttaCentral (founded 2005) provides the most comprehensive digital parallel-aligned corpus of early Buddhist texts across Pāli, Sanskrit, Chinese, and Tibetan [prov: SuttaCentral.net about-page; Bingenheimer 2020, “Digitization of Buddhism,” Oxford Bibliographies]. Access to Insight (1993–2013) remains the primary digital repository of Thanissaro Bhikkhu’s and Bhikkhu Bodhi’s English Pāli translations [prov: www.accesstoinsight.org; Wikipedia entry on Access to Insight]. Recent computational approaches include stylometric analyses for chronological layering [prov: SuttaCentral discourse forum thread “Pali Canon Stylometric Analysis”], machine translation using DharmaNexus and SuttaCentral parallel data [prov: Dharmamitra documentation, dharmamitra.github.io/dharmamitra-guides/dharmanexus], and ByT5-Sanskrit-derived segmentation models adapted for Pāli. None of this prior work specifically addresses embedder-level cosine inflation on formulaic sub-structures, which is the present paper’s contribution.
2.5 Formulaic language in early Buddhist oral composition
Philological study of the Pāli Canon has long recognized its pervasive formulaic and repetitive character. Anālayo (2007) catalogs pericopes, waxing-syllable patterns, and other mnemonic-driven structural elements as constitutive of the oral composition and transmission of Pāli discourses [prov: Anālayo 2007, “Oral Dimensions of Pāli Discourses: Pericopes, Other Mnemonic Techniques, and the Oral Performance Context,” buddhismuskunde.uni-hamburg.de]. Allon (2021) provides the current monograph-length synthesis of early-Buddhist text composition and transmission, emphasizing that these texts are “highly stylized, formally structured, extremely formulaic and repetitive, carefully crafted constructs” in which wording used to describe a given event, concept, or practice is standardized across the corpus [prov: Allon 2021, “The Composition and Transmission of Early Buddhist Texts with Specific Reference to Sutras,” Numata Center for Buddhist Studies; PMC8789374]. The philological consensus that Pāli texts are deeply formulaic is precisely what makes them a clean computational-linguistics testbed — the structural regularity we exploit for controlled measurement is a well-attested feature of the corpus, not a novel claim.
2.6 Formulaic-text analysis in other domains
Legal NLP has produced domain-adapted encoders (LEGAL-BERT, InLegalBERT) explicitly motivated by the observation that legal language is “more formulaic and repetitive than generic language” (Chalkidis et al. 2020). Legal-NLP has also produced specialized encoders for long-form formulaic text (Lawformer, Xiao et al. 2021), indicating that architectural adaptations exist for formula-heavy domains; whether pooled-embedding methods on such corpora show the parallel-rhetoric inflation we report is an open question. Computational Qur’anic studies have used word2vec and BiLSTM architectures over classical Arabic corpora for semantic search and retrieval [prov: Alqahtani & Atwell 2022, “Arabic natural language processing for Qur’anic research: a systematic review,” Artificial Intelligence Review, Springer, doi:10.1007/s10462-022-10313-2; Abdelaali et al. 2019, “QSST: A Quranic Semantic Search Tool based on word embedding,” ScienceDirect].
Structural priming in language models is a further adjacent line of work arguing that pooled and per-token representations inherit syntactic-structural features of their input. Prior studies have probed whether transformer encoders exhibit priming-like behavior where prior-context syntactic structure influences subsequent-token representations (Prasad et al. 2019; Sinclair et al. 2022). These argue that syntactic-structural regularities are absorbed into transformer representations, a cognate mechanism to the one we hypothesize drives template-vs-narrative cosine inflation. To our knowledge, no prior work has directly measured the cosine-inflation effect of formulaic syntactic parallelism on pooled transformer outputs across two architecturally-distinct encoders on a matched template-vs-narrative control design. This is the gap the present paper addresses.
2.7 Methodological inspiration
The analytical scaffold — embedding curated passages, computing per-term cluster tightness, measuring cross-subcluster separation, and deriving significance via bootstrap and permutation tests — adapts representational similarity analysis (RSA; Kriegeskorte et al. 2008) from systems neuroscience to cross-lingual embedding evaluation on Buddhist canonical text. The g1/g2/g3 pipeline we describe in §4 implements a simplified three-stage RSA measurement with no fitness-function optimization.
3. Problem Formulation
We measure the following quantity. Let a corpus of passages $P = \{p_1, \ldots, p_n\}$ be partitioned so that each passage is tagged with a term $t \in T$ and a category $c(t) \in \{\text{template}, \text{narrative}, \text{doctrinal}\}$. Each passage $p_i$ is embedded by a fixed encoder $f$ to a vector $f(p_i) \in \mathbb{R}^d$, normalized to unit L2 norm. The pairwise cosine similarity between passages $p_i, p_j$ is $\langle f(p_i), f(p_j) \rangle$.
For each subcluster $S \subseteq P$ (e.g., all template-control passages), the within-subcluster mean pairwise cosine is the average of $\langle f(p_i), f(p_j) \rangle$ over all unordered pairs $\{i, j\}$ with $p_i, p_j \in S$ and $i \neq j$. The template-minus-narrative delta is the difference between within-subcluster mean cosine for the template-control subcluster and the narrative-control subcluster.
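The definition above translates directly into code. The following is a minimal TypeScript sketch rather than the study's own implementation: the function names (`toUnit`, `inner`, `withinSubclusterMeanCosine`, `templateMinusNarrativeDelta`) are illustrative assumptions, and embeddings are represented as plain number arrays.

```typescript
// Minimal sketch of the Section 3 quantity. All names are illustrative
// assumptions, not identifiers from the study's source code.

type Vec = number[];

/** Scale a vector to unit L2 norm. */
function toUnit(v: Vec): Vec {
  const norm = Math.sqrt(v.reduce((acc, x) => acc + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

/** Inner product; equals cosine similarity when both inputs are unit vectors. */
function inner(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let k = 0; k < a.length; k++) sum += a[k] * b[k];
  return sum;
}

/** Mean of <f(p_i), f(p_j)> over all unordered pairs i < j in one subcluster. */
function withinSubclusterMeanCosine(unitVectors: Vec[]): number {
  const n = unitVectors.length;
  if (n < 2) throw new Error("need at least two passages");
  let total = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      total += inner(unitVectors[i], unitVectors[j]);
    }
  }
  return total / ((n * (n - 1)) / 2);
}

/** Template-minus-narrative delta as defined in the text. */
function templateMinusNarrativeDelta(template: Vec[], narrative: Vec[]): number {
  return withinSubclusterMeanCosine(template) - withinSubclusterMeanCosine(narrative);
}
```

Because the vectors are unit-normalized first, the inner product and the cosine coincide, which is why no division by norms appears inside the pair loop.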
Two null-hypothesis tests are applied:
- Passage-level bootstrap (10,000 iterations, seed=42, xorshift128 PRNG) [prov: results//significance-summary.md “PRNG: xorshift128, seed=42; 10,000 bootstrap iterations”]: passages are resampled with replacement within each subcluster, and the 95% percentile CI for the delta is reported. A CI that excludes zero is evidence at the 95% confidence level that the direction is real. At n=101 the template-vs-narrative bootstrap (Test 3 in the pipeline) draws 18 template and 24 narrative passages with replacement (6 and 24 at the n=89 checkpoint) and does not use the per-term skip filter that the cross-lingual Pāli↔English gap bootstrap (Test 1, reported in the companion paper) applies; Test 3 effective iterations are therefore the full 10,000. For cross-paper consistency, we note that the per-term Pāli↔English gap bootstrap in Test 1 (Paper A) discards iterations that fail to yield ≥2 English and ≥1 Pāli passages per resample, producing ~4,300 effective iterations at n=3 passages per term, a non-skip retention rate of approximately 0.43 [prov: results/-n101/significance-analysis.csv Test 1 notes field, mean skip_rate ≈ 0.570]. This paper’s primary test (Test 3) does not suffer from this skip-rate artefact because its pool (42 control passages) is large enough that the resample always contains enough passages of each label to compute pairwise means.
- Term-level label permutation (10,000 permutations, seed=42): the term-category labels are shuffled across the 14 control terms (6 template + 8 narrative). The permutation p-value is the fraction of permuted deltas greater than or equal to the observed delta.
The hypothesis under test is H1: the template-minus-narrative delta is positive. The null is H0: the delta is zero or negative.
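The one-sided term-level permutation test can be sketched as follows. This is an illustration under stated assumptions, not the g3 stage itself: `permutationPValue` and `meanScoreDelta` are hypothetical names, the statistic is passed in as a callback so that any delta computation can be plugged in, and the random source defaults to `Math.random` (a seeded generator would be supplied for reproducible runs).

```typescript
// Sketch of a one-sided label-permutation test. Names are hypothetical.

function permutationPValue(
  labels: boolean[], // true = template, false = narrative, one entry per term
  statistic: (labels: boolean[]) => number, // e.g. template-minus-narrative delta
  iterations: number,
  rand: () => number = Math.random,
): { observed: number; p: number } {
  const observed = statistic(labels);
  const shuffled = labels.slice();
  let atLeastAsLarge = 0;
  for (let it = 0; it < iterations; it++) {
    // Fisher-Yates shuffle of the label vector; term data stays in place.
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = shuffled[i];
      shuffled[i] = shuffled[j];
      shuffled[j] = tmp;
    }
    if (statistic(shuffled) >= observed) atLeastAsLarge++;
  }
  // Fraction of permuted deltas greater than or equal to the observed delta.
  return { observed, p: atLeastAsLarge / iterations };
}

// Toy statistic for illustration only: difference of mean per-term scores.
// In the study the statistic would be recomputed from the embeddings.
function meanScoreDelta(scores: number[]): (labels: boolean[]) => number {
  return (labels) => {
    let tSum = 0, tCount = 0, nSum = 0, nCount = 0;
    labels.forEach((isTemplate, i) => {
      if (isTemplate) { tSum += scores[i]; tCount++; }
      else { nSum += scores[i]; nCount++; }
    });
    return tSum / tCount - nSum / nCount;
  };
}
```

With 6 template and 8 narrative labels there are 3,003 distinct label assignments, so 10,000 random permutations sample that space with repetition rather than enumerating it.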
The analytical move the paper turns on is cross-embedder comparison. If the effect is real and encoder-agnostic, it should survive in both BGE-M3 and Qwen3-Embedding-8B despite their architectural differences (dimensionality, parameter count, training objective, pooling strategy, tokenizer). If it is an artifact of any one encoder’s training regime, it should appear in at most one.
4. Experimental Setup
4.1 Corpus and term partition
The corpus comprises 101 non-null passages over 34 Pāli-canonical terms [prov: data/passages.json; INTERIM-REPORT-n101-2026-04-22.md “101 non-null passages”]. Each term has up to three passages, one from each of three translator lineages:
- Pāli: Mahāsaṅgīti edition from SuttaCentral [prov: passages.json translator:"pali" with source_url pattern https://suttacentral.net/{ref}/pli/ms].
- Sujato: Bhikkhu Sujato’s English translations of the four Nikāyas from SuttaCentral [prov: passages.json translator:"sujato" with source_url pattern https://suttacentral.net/{ref}/en/sujato].
- Thanissaro: Ṭhānissaro Bhikkhu’s English translations from Access to Insight [prov: passages.json translator:"thanissaro" with source_url pattern https://www.accesstoinsight.org/tipitaka/...].
Two documented substitutions replace Thanissaro in passages where he did not translate the relevant sutta:
- Bhikkhu Bodhi substitutes for Thanissaro on SN22.45 (anicca) and AN3.31 (mātā) [prov: passages.json translator:"bodhi" with notes field “Substituted Bhikkhu Bodhi because Thanissaro did not translate this sutta.”].
- Narada Thera substitutes for Thanissaro on DN31 (pitā) [prov: passages.json translator:"narada" with notes field “Thanissaro has not translated DN31; substituted Narada Thera (the standard ATI translation).”].
Per-term translator counts in passages.json are therefore 34 × pali, 34 × sujato, 31 × thanissaro, 2 × bodhi, 1 × narada (total 102; citta/AN1.51 lacks a third English translator, reducing the non-null count to 101). Terms are partitioned into:
- Doctrinal terms (n=20): 5 each in four subclusters (metaphysics, meditation, ethics, wisdom). These are treated in a companion paper (Paper A, cross-lingual analysis) and are not the subject of the present parallel-rhetoric measurement.
- Control terms (n=14), partitioned into:
  - Template-bound (n=6): kāya (body, MN10 kāyānupassanā; anchors the four-foundations-of-mindfulness template), vedanā (feeling, MN10 vedanānupassanā), kesā (head-hair, MN10 32-parts-of-the-body enumeration), aṭṭhi (bone, MN119 nine-cemetery-contemplations enumeration), pathavī (earth-element, MN28 and MN62 four-elements definitional template), āpo (water-element, MN28 and MN62 four-elements definitional template) [prov: INTERIM-REPORT-n101-2026-04-22.md line 30: “6 control_template (was 2: kāya, vedanā → now also kesā, aṭṭhi, pathavī, āpo)”].
  - Narrative-only (n=8): rukkha (tree), udaka (water, narrative sense), aggi (fire, narrative sense), gāma (village), rājā (king), mātā (mother), pitā (father), akkhi (eye). These appear across varied suttas without canonical parallel-rhetoric frames [prov: INTERIM-REPORT-n101-2026-04-22.md line 31: “8 control_narrative (unchanged)”].
The 14 control terms are the subject of the analysis. Three passages per term (one per translator) yield 42 control passages; the remaining 59 passages come from the 20 doctrinal terms and are not used in the template-vs-narrative delta computation.
4.2 Encoders and configurations
Three configurations are evaluated:
| Config | Model | Params | Dim | Prefix | Quantization |
|---|---|---|---|---|---|
| BGE-M3 | bge-m3 (FP16) | 560M | 1024 | — | default |
| Qwen3-prefix | Qwen3-Embedding-8B | 8B | 4096 | custom domain-adapted prefix (see below) | Q4_K_M |
| Qwen3-no-prefix | Qwen3-Embedding-8B | 8B | 4096 | none | Q4_K_M |
[prov: INTERIM-REPORT-n101-2026-04-22.md §Step 3 table; results/*/summary.md embedding metadata blocks]
BGE-M3 is hosted via LM Studio as text-embedding-bge-m3; Qwen3-Embedding-8B as text-embedding-qwen3-embedding-8b. Both are queried via the OpenAI-compatible embeddings endpoint at http://localhost:1234/v1/embeddings [prov: INTERIM-REPORT-n101-2026-04-22.md line 59-60: “Model swap via lms unload text-embedding-bge-m3 then lms load text-embedding-qwen3-embedding-8b”]. Embeddings are L2-normalized before cosine computation.
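A request against this endpoint might look like the sketch below. The request and response shapes follow the OpenAI-compatible embeddings API; everything else is an assumption made for illustration: the helper names (`buildEmbeddingRequest`, `l2Normalize`, `embedPassages`), the `FetchLike` type, and the choice to send passages as one batch (the pipeline description says only that each passage is posted to the endpoint).

```typescript
// Sketch only. Helper names and the batching choice are hypothetical.

interface EmbeddingRequest { model: string; input: string[] }
interface EmbeddingResponse { data: { index: number; embedding: number[] }[] }

// Structural type for a fetch-like function, so the sketch does not depend on
// one runtime's typings. A runtime's global fetch is expected to fit this shape.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const EMBEDDINGS_URL = "http://localhost:1234/v1/embeddings";

function buildEmbeddingRequest(model: string, passages: string[]): EmbeddingRequest {
  return { model, input: passages };
}

/** Scale a vector to unit L2 norm so that dot products are cosines. */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((acc, x) => acc + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

async function embedPassages(
  fetchFn: FetchLike,
  model: string, // e.g. "text-embedding-bge-m3"
  passages: string[],
): Promise<number[][]> {
  const res = await fetchFn(EMBEDDINGS_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbeddingRequest(model, passages)),
  });
  if (!res.ok) throw new Error(`embedding request failed: HTTP ${res.status}`);
  const payload = (await res.json()) as EmbeddingResponse;
  // Order by index so outputs line up with inputs, then normalize each vector.
  return payload.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((d) => l2Normalize(d.embedding));
}
```

Passing the fetch function in as a parameter keeps the sketch runtime-agnostic and lets the request logic be exercised against a stub without a running model server.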
The Qwen3 prefix-on configuration uses a custom domain-adapted instruction prefix following the Qwen3 Instruct: {task description}\nQuery: {input} schema specified in the Qwen3 technical report [prov: Zhang et al. 2025, arXiv:2506.05176 §3.1]. The literal prefix string, defined at src/g1-term-integrity.ts:83 as the QWEN3_INSTRUCTION constant, is:
"Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: "
This is a Buddhist-doctrinal-retrieval-specific task description, not the canonical web-search task description illustrated in the Qwen3 technical report. Our prefix-on measurements are therefore conditional on this specific domain-adapted task description. The no-prefix configuration feeds passages directly with no task-instruction prepended, which we treat as a more conservative cross-lingual-honest baseline because the instruction prefix is English and may bias the encoder toward treating Pāli inputs as English paraphrases.
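The difference between the two Qwen3 configurations reduces to what string is sent to the encoder. A minimal sketch, in which the constant is quoted from the text above and the helper name `buildEncoderInput` is a hypothetical stand-in for whatever the g1 stage actually does:

```typescript
// The constant value is quoted from the paper; the helper name is illustrative.
const QWEN3_INSTRUCTION =
  "Instruct: Retrieve semantically similar Buddhist doctrinal passages\nQuery: ";

/** Returns the exact string sent to the encoder for one passage. */
function buildEncoderInput(passage: string, usePrefix: boolean): string {
  // Prefix-on: instruction, newline, "Query: ", then the passage.
  // Prefix-off: the passage alone, with nothing prepended.
  return usePrefix ? QWEN3_INSTRUCTION + passage : passage;
}
```

Note that the prefix is English regardless of the passage language, so a Pāli passage in the prefix-on configuration is embedded as a mixed-language input.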
4.3 Pipeline
Three-stage pipeline, all in TypeScript (Bun runtime):
- g1-term-integrity.ts: reads passages.json, posts each passage to the embeddings endpoint, and writes embeddings.json, per-term-stats.csv, similarity-matrix.csv, and cross-term-separation.csv [prov: src/g1-term-integrity.ts file header].
- g2-doctrinal-topology.ts: reads embeddings.json, computes within-subcluster and cross-subcluster pairwise cosine statistics, and writes topology-analysis.csv, topology-between-matrix.csv, and topology-summary.md.
- g3-significance.ts: reads embeddings.json, runs bootstrap and permutation tests, and writes significance-analysis.csv and significance-summary.md.
All three stages are deterministic given seed=42 [prov: results/bge-m3-v2-n101/significance-summary.md line 5: “PRNG: xorshift128, seed=42”].
4.4 Metrics reported
Primary: passage-level bootstrap 95% CI for template-minus-narrative delta (test3 in the significance CSV). Bootstrap resamples all 42 control-passage vectors with replacement within each subcluster and recomputes the delta per iteration; the 2.5th and 97.5th percentiles of the 10,000-iteration distribution are the CI endpoints [prov: results/*/significance-analysis.csv line 37 test3_template_minus_narrative].
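The percentile bootstrap described here can be sketched as below. Several details are assumptions, because the text does not specify them: how the single seed initializes the four xorshift128 state words, how percentile positions are indexed, and the fact that duplicate draws within a resample are kept as ordinary pairs. All function names are illustrative, so the sketch is not expected to reproduce the reported numbers digit for digit.

```typescript
// Sketch of a passage-level percentile bootstrap. Names are hypothetical.

type Embedding = number[]; // assumed already L2-normalized

/** xorshift128 (Marsaglia). ASSUMPTION: only the first state word is seeded. */
function makeXorshift128(seed: number): () => number {
  let x = seed >>> 0, y = 362436069, z = 521288629, w = 88675123;
  return () => {
    const t = x ^ (x << 11);
    x = y; y = z; z = w;
    w = (w ^ (w >>> 19) ^ t ^ (t >>> 8)) >>> 0;
    return w / 4294967296; // uniform in [0, 1)
  };
}

function meanPairwiseCosine(vs: Embedding[]): number {
  if (vs.length < 2) throw new Error("need at least two vectors");
  let total = 0, pairs = 0;
  for (let i = 0; i < vs.length; i++) {
    for (let j = i + 1; j < vs.length; j++) {
      let dot = 0;
      for (let k = 0; k < vs[i].length; k++) dot += vs[i][k] * vs[j][k];
      total += dot;
      pairs++;
    }
  }
  return total / pairs;
}

function resampleWithReplacement<T>(items: T[], rand: () => number): T[] {
  return items.map(() => items[Math.floor(rand() * items.length)]);
}

interface BootstrapResult { median: number; ciLow: number; ciHigh: number; fractionGtZero: number }

function bootstrapDelta(
  template: Embedding[], narrative: Embedding[], iterations: number, rand: () => number,
): BootstrapResult {
  const deltas: number[] = [];
  for (let it = 0; it < iterations; it++) {
    // Resample within each subcluster separately, then recompute the delta.
    const t = resampleWithReplacement(template, rand);
    const n = resampleWithReplacement(narrative, rand);
    deltas.push(meanPairwiseCosine(t) - meanPairwiseCosine(n));
  }
  deltas.sort((a, b) => a - b);
  // ASSUMPTION: simple floor-index percentiles on the sorted distribution.
  const at = (q: number): number => deltas[Math.floor(q * (deltas.length - 1))];
  const positive = deltas.filter((d) => d > 0).length;
  return { median: at(0.5), ciLow: at(0.025), ciHigh: at(0.975), fractionGtZero: positive / deltas.length };
}
```

A call such as `bootstrapDelta(templateVectors, narrativeVectors, 10000, makeXorshift128(42))` would mirror the iteration count and seed quoted above; `fractionGtZero` corresponds to the fraction of iterations with delta greater than zero.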
Secondary: term-level permutation p (test5a). Each iteration randomly reassigns the template/narrative label to the 14 control terms and recomputes the delta [prov: results/*/significance-analysis.csv line 39 test5a_parallel_rhetoric_permutation].
Tertiary: centroid-level (within-subcluster mean pairwise cosine) template-minus-narrative delta. This is computed from topology-analysis.csv directly and is used for the cross-embedder convergence claim at n=89 [prov: results/*-n89pre/topology-analysis.csv].
5. Results
5.1 Three-way template-vs-narrative delta at n=101
The primary result table:
| Config | Bootstrap median | 95% CI | CI excludes 0? | Permutation observed Δ | Permutation p |
|---|---|---|---|---|---|
| BGE-M3 | +0.0587 | [0.0086, 0.1249] | YES | 0.0558 | 0.0384 |
| Qwen3-prefix | +0.0900 | [0.0518, 0.1464] | YES | 0.0924 | 0.0033 |
| Qwen3-no-prefix | +0.0619 | [0.0057, 0.1399] | YES | 0.0596 | 0.0663 |
[prov: results/bge-m3-v2-n101/significance-analysis.csv line 37: “test3_template_minus_narrative,pairwise_mean_cosine_delta_passage_bootstrap,0.058749,0.008632,0.124852,0.988300”] [prov: results/qwen3-embedding-8b-n101/significance-analysis.csv line 37: “test3_template_minus_narrative,…,0.089991,0.051785,0.146352,0.999900”] [prov: results/qwen3-embedding-8b-no-prefix-n101/significance-analysis.csv line 37: “test3_template_minus_narrative,…,0.061940,0.005671,0.139880,0.984100”] [prov: results/bge-m3-v2-n101/significance-analysis.csv line 39: “test5a_parallel_rhetoric_permutation,…,0.055800,,,0.038400”] [prov: results/qwen3-embedding-8b-n101/significance-analysis.csv line 39: “test5a_parallel_rhetoric_permutation,…,0.092425,,,0.003300”] [prov: results/qwen3-embedding-8b-no-prefix-n101/significance-analysis.csv line 39: “test5a_parallel_rhetoric_permutation,…,0.059642,,,0.066300”]
The effect direction is consistent across all three configurations and of the same order of magnitude (approximately +0.06 to +0.09 cosine units). In all three, the bootstrap 95% CI excludes zero. In two of three (BGE-M3, Qwen3-prefix), the permutation p is below 0.05. In one (Qwen3-no-prefix), the permutation p is 0.0663, above the 0.05 threshold but still below 0.10.
The bootstrap fraction of iterations with delta greater than zero is 98.83% for BGE-M3, 99.99% for Qwen3-prefix, and 98.41% for Qwen3-no-prefix [prov: same CSV lines, field 6 “fraction_delta_gt_0”].
5.2 Cross-embedder convergence at n=89 (centroid-level)
At the earlier n=89 checkpoint — before the four lexical-overlap-heavy terms (kesā, aṭṭhi, pathavī, āpo) were added to the template-control partition — the centroid-level (within-subcluster mean pairwise cosine) delta between template and narrative controls was remarkably consistent across the two most architecturally-different encoders:
| Config (n=89) | template_within | narrative_within | Δ template−narrative |
|---|---|---|---|
| BGE-M3 | 0.783078 | 0.620520 | +0.162558 |
| Qwen3-prefix | 0.825595 | 0.662456 | +0.163139 |
[prov: results/bge-m3-v2-n89pre/topology-analysis.csv line 6 (control_template,2,0.783078,…) and line 7 (control_narrative,8,0.620520,…)] [prov: results/qwen3-embedding-8b-n89pre/topology-analysis.csv line 6 (control_template,2,0.825595,…) and line 7 (control_narrative,8,0.662456,…)]
The two encoder deltas differ by 0.000581 cosine units. We note this convergence without overclaiming: at n=89 the template-control subcluster contained only two terms (kāya, vedanā) and thus a single term pair, so the centroid-level delta is computed over just one template-template pair against 28 narrative-narrative pairs. The convergence should therefore be read as showing that direction and magnitude align closely across embedders when the underlying within-template structure is a clean MN10-bound parallel (kāya and vedanā are both nominalized objects of the four-foundations-of-mindfulness contemplative template at MN10), rather than as a general claim that the two encoders always agree this closely.
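The pair counts quoted here (1 and 28) imply that the centroid-level quantity forms pairs between terms, not between individual passages. The text does not spell out the construction, so the sketch below shows one plausible reading as an explicit assumption: each term is represented by the renormalized mean of its passage vectors, and the subcluster value is the mean cosine over all term pairs. The names `centroidOf` and `termLevelMeanCosine` are hypothetical.

```typescript
// ASSUMED reading of "centroid-level": one centroid per term, pairs between terms.

type Vector = number[];

/** Mean of a term's passage vectors, rescaled to unit length. */
function centroidOf(vectors: Vector[]): Vector {
  if (vectors.length === 0) throw new Error("term has no passages");
  const dim = vectors[0].length;
  const mean: number[] = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let k = 0; k < dim; k++) mean[k] += v[k] / vectors.length;
  }
  const norm = Math.sqrt(mean.reduce((acc, x) => acc + x * x, 0));
  if (norm === 0) throw new Error("centroid has zero norm");
  return mean.map((x) => x / norm);
}

/** Mean cosine over all unordered term pairs; also reports the pair count. */
function termLevelMeanCosine(termPassages: Vector[][]): { pairs: number; mean: number } {
  const centroids = termPassages.map(centroidOf);
  let total = 0, pairs = 0;
  for (let i = 0; i < centroids.length; i++) {
    for (let j = i + 1; j < centroids.length; j++) {
      let dot = 0;
      for (let k = 0; k < centroids[i].length; k++) dot += centroids[i][k] * centroids[j][k];
      total += dot;
      pairs++;
    }
  }
  if (pairs === 0) throw new Error("need at least two terms");
  return { pairs, mean: total / pairs };
}
```

Under this reading, 2 terms give 1 pair, 8 terms give 28 pairs, and 6 terms give 15 pairs, which is why a single unusually tight pair can dominate the two-term template value.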
At n=101, after adding four additional template-control terms whose internal lexical overlap is substantial (§5.4 below), the centroid-level deltas diverge somewhat:
| Config (n=101) | template_within | narrative_within | Δ template−narrative |
|---|---|---|---|
| BGE-M3 | 0.693502 | 0.620520 | +0.072982 |
| Qwen3-prefix | 0.758936 | 0.662456 | +0.096480 |
| Qwen3-no-prefix | 0.554221 | 0.456120 | +0.098101 |
[prov: results/bge-m3-v2-n101/topology-analysis.csv line 6-7; results/qwen3-embedding-8b-n101/topology-analysis.csv line 6-7; results/qwen3-embedding-8b-no-prefix-n101/topology-analysis.csv line 6-7]
The n=101 centroid-level deltas span a range of approximately +0.073 to +0.098 across the three configurations: still direction-consistent and still positive, but no longer convergent to the 0.0006 level seen at n=89. This divergence is itself informative (§5.4).
5.3 Bootstrap CI stability across sample sizes
Comparing n=89 to n=101 bootstrap output:
| Config | n=89 median Δ | n=89 95% CI | n=89 CI width | n=101 median Δ | n=101 95% CI | n=101 CI width |
|---|---|---|---|---|---|---|
| BGE-M3 | 0.1206 | [0.0376, 0.3174] | 0.280 | 0.0587 | [0.0086, 0.1249] | 0.116 |
| Qwen3-prefix | 0.1703 | [0.1291, 0.2573] | 0.128 | 0.0900 | [0.0518, 0.1464] | 0.095 |
| Qwen3-no-prefix | 0.2338 | [0.1650, 0.3761] | 0.211 | 0.0619 | [0.0057, 0.1399] | 0.134 |
[prov: INTERIM-REPORT-n101-2026-04-22.md §Step 5 “Three-way CI table — template − narrative delta” lines 100-104]
CI widths tightened by 26–58% as the passage count grew from 89 to 101 (BGE-M3 58%, Qwen3-no-prefix 36%, Qwen3-prefix 26%), as expected from the larger sample. Point estimates dropped by 47–74%: the n=89 numbers were upward-biased by the single, very tight MN10 kāya-vedanā pair, whose contribution is diluted across the 15 template-template pairs available at n=101 (6 choose 2 = 15).
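The relative changes can be recomputed from the table above. In the sketch below the numeric values are copied from that table; the type and variable names (`ConfigRow`, `ciRows`, `relativeReduction`, `ciChanges`) are illustrative.

```typescript
// Recompute relative CI-width and point-estimate changes from the table values.

interface Interval { median: number; lo: number; hi: number }
interface ConfigRow { config: string; n89: Interval; n101: Interval }

const ciRows: ConfigRow[] = [
  { config: "BGE-M3",
    n89: { median: 0.1206, lo: 0.0376, hi: 0.3174 },
    n101: { median: 0.0587, lo: 0.0086, hi: 0.1249 } },
  { config: "Qwen3-prefix",
    n89: { median: 0.1703, lo: 0.1291, hi: 0.2573 },
    n101: { median: 0.0900, lo: 0.0518, hi: 0.1464 } },
  { config: "Qwen3-no-prefix",
    n89: { median: 0.2338, lo: 0.1650, hi: 0.3761 },
    n101: { median: 0.0619, lo: 0.0057, hi: 0.1399 } },
];

/** Fractional reduction from `before` to `after` (0.25 means a 25% drop). */
function relativeReduction(before: number, after: number): number {
  return (before - after) / before;
}

const ciChanges = ciRows.map((r) => ({
  config: r.config,
  widthReduction: relativeReduction(r.n89.hi - r.n89.lo, r.n101.hi - r.n101.lo),
  medianReduction: relativeReduction(r.n89.median, r.n101.median),
}));
```

Each entry of `ciChanges` holds the two fractions for one configuration, so the ranges across configurations can be read off by taking the minimum and maximum of each field.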
5.4 Lexical-overlap confound
The four template-control terms added between n=89 and n=101 are not lexically independent of each other. The MN10 32-parts enumeration, in which kesā (head-hair) is the lead term, is reused nearly verbatim in the MN28 definition of the internal earth-element (pathavī) and in part in the MN62 four-elements teaching to Rāhula [prov: INTERIM-REPORT-n101-2026-04-22.md line 134: “The 32-parts list in kesā (MN10) is reused almost verbatim in pathavī (MN28) as the definition of the internal earth-element”]. The 11-fluids subset of the body-parts enumeration appears inside the MN62 internal water-element (āpo) definition.
This produces substantial inter-template lexical overlap, visible in the similarity matrix:
| Pair | BGE-M3 cosine | Qwen3-prefix cosine | Qwen3-no-prefix cosine |
|---|---|---|---|
| kesā × pathavī | 0.8315 | 0.8884 | — |
| pathavī × āpo | — | — | 0.7809 |
| kesā × aṭṭhi | (loosest template pair with kesā) | — | — |
[prov: results/bge-m3-v2-n101/topology-analysis.csv line 6 “tightest_pair,tightest_cosine” field = “kesā×pathavī,0.831541”] [prov: results/qwen3-embedding-8b-n101/topology-analysis.csv line 6 “tightest_pair,tightest_cosine” = “kesā×pathavī,0.888389”] [prov: results/qwen3-embedding-8b-no-prefix-n101/topology-analysis.csv line 6 “tightest_pair,tightest_cosine” = “pathavī×āpo,0.780878”] [prov: INTERIM-REPORT-n101-2026-04-22.md lines 136-139 “kesā ↔ pathavī is the tightest template pair in every config…”]
kesā×pathavī is the tightest template pair in the two encoder configurations (BGE-M3, Qwen3-prefix) that use the MN10 reference pool most aggressively for multilingual alignment. For Qwen3-no-prefix, the tightest is pathavī×āpo, the two four-elements terms that share both MN28 and MN62 as source passages and therefore share the strongest passage-level lexical content. In all three configurations, the top template pair involves terms whose Pāli passage material overlaps by multiple sentences.
This means the template-minus-narrative inflation we report at n=101 reflects at least two distinct effects which the present data cannot cleanly separate:
- The parallel-rhetoric hypothesis (H1): passages appearing in structurally-parallel list-formula frames produce more similar pooled embeddings, because attention over syntactically-regular input sequences produces structurally-regular pooled outputs.
- The lexical-overlap confound (H0-alternative): passages that share multi-sentence lexical content produce more similar pooled embeddings for the mundane reason that they share tokens.
The question whether H1 contributes over and above H0-alternative is not resolved in the current data. The n=89 numbers, where the sole template-template pair (kāya × vedanā at MN10) does not share substantial verbatim body-text between the two terms (kāya and vedanā are indexed by the same MN10 structural template but the Buddha’s prose descriptions of them are distinct), provide weaker but cleaner evidence for H1. The n=101 numbers provide higher statistical power but at the cost of the added confound.
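One way to put a number on the H0-alternative side of this comparison is a surface-overlap score between passage pairs. The sketch below uses word n-gram Jaccard overlap; it is an illustrative measure chosen here, not the statistic used in the study, and the whitespace tokenizer, the function names, and the toy strings are all assumptions.

```typescript
// Lowercase and split on whitespace; a deliberately naive tokenizer (assumption).
function tokens(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Set of contiguous word n-grams in a passage.
function ngrams(text: string, n: number): Set<string> {
  const toks = tokens(text);
  const out = new Set<string>();
  for (let i = 0; i + n <= toks.length; i++) {
    out.add(toks.slice(i, i + n).join(" "));
  }
  return out;
}

// Jaccard overlap of n-gram sets: |A ∩ B| / |A ∪ B|; 0 when both sets are empty.
function ngramJaccard(a: string, b: string, n = 3): number {
  const setA = ngrams(a, n);
  const setB = ngrams(b, n);
  let shared = 0;
  setA.forEach((g) => {
    if (setB.has(g)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Toy strings (illustrative only, not corpus passages).
const listA = "kesā lomā nakhā dantā taco";
const listB = "kesā lomā nakhā dantā maṃsaṃ";
const overlap = ngramJaccard(listA, listB, 3);
console.log(overlap); // 2 shared trigrams out of 4 distinct ones
```

A pair with high cosine but near-zero n-gram overlap would be the interesting case for H1; a pair where both are high cannot distinguish the two hypotheses.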
5.5 Symmetric reporting of n=89 → n=101 permutation-p transitions
The n=89 → n=101 transition affected all three configurations’ permutation p-values, not only Qwen3-no-prefix. The full picture, disclosed symmetrically:
| Config | n=89 perm p | n=101 perm p | Direction |
|---|---|---|---|
| BGE-M3 | 0.1793 | 0.0384 | Upgrade (fail → pass at α=0.05) |
| Qwen3-prefix | 0.0410 | 0.0033 | Improvement (pass → stronger pass) |
| Qwen3-no-prefix | 0.0210 | 0.0663 | Regression (pass → fail at α=0.05) |
[prov: results/bge-m3-v2-n89pre/significance-analysis.csv line 35: test5a_parallel_rhetoric_permutation,...,0.179300; results/qwen3-embedding-8b-n89pre/significance-analysis.csv line 35: ...,0.041000; results/qwen3-embedding-8b-no-prefix-n89pre/significance-analysis.csv line 35: ...,0.021000; n=101 values from the respective -n101/significance-analysis.csv line 39 rows]
The n=89 → n=101 transition simultaneously moved BGE-M3 from non-significant to significant on the permutation test (0.1793 → 0.0384), tightened Qwen3-prefix (0.0410 → 0.0033), and moved Qwen3-no-prefix from significant to non-significant (0.0210 → 0.0663). One config upgraded, one held-and-improved, one regressed. The symmetric picture matters because it reveals that the permutation test is itself unstable under sample-size expansion in both directions — it is not the case that the n=89 state was a uniformly stronger baseline that n=101 partially degraded.
The bootstrap CIs, by contrast, exclude zero uniformly across all three configurations at both n=89 and n=101 [prov: INTERIM-REPORT-n101-2026-04-22.md §Step 5 three-way CI table; results/-n89pre/significance-analysis.csv line 33; results/-n101/significance-analysis.csv line 37]. We treat the bootstrap CI as the primary load-bearing evidence because its behavior is consistent across the n=89→n=101 transition for all three configurations, while the permutation-p transitions move in different directions for different configurations.
The combinatorial explanation for permutation-p volatility under label-pool expansion — that at n=89 the permutation null shuffles 10 control-term labels (2 template + 8 narrative) while at n=101 it shuffles 14 (6 + 8), producing a larger and more diverse null distribution — is consistent with the symmetric picture: more extreme shuffled deltas become more available in either direction as the label pool grows, producing both upgrades (BGE-M3) and regressions (Qwen3-no-prefix) depending on where the observed delta sits in the null. The permutation test is directionally informative but not independent of the bootstrap CI, and we treat its movements as corroborating rather than decisive.
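The size of the label pool can be made concrete with a small count. The sketch below assumes the null is built by reassigning term-level template/narrative labels with the group sizes held fixed, as the paragraph above describes; it does not reproduce g3-significance.ts, and the names are hypothetical.

```typescript
// Number of ways to choose k template labels among n control terms.
// Each intermediate product is an exact integer for the small n used here.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  k = Math.min(k, n - k);
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return Math.round(result);
}

const assignmentsN89 = choose(10, 2); // 2 template + 8 narrative
const assignmentsN101 = choose(14, 6); // 6 template + 8 narrative

// Smallest p-value an exact test over distinct assignments could return.
const minPN89 = 1 / assignmentsN89;
const minPN101 = 1 / assignmentsN101;

console.log(assignmentsN89, assignmentsN101); // 45 3003
console.log(minPN89.toFixed(4), minPN101.toFixed(5));
```

Under this assumption only 45 distinct labelings exist at n=89, so a Monte Carlo p-value there can only land near a multiple of 1/45 (about 0.022), whereas the 3003 labelings at n=101 give a much finer-grained null. That coarseness is a further reason to read the n=89 permutation values cautiously.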
We flag the Qwen3-no-prefix regression specifically because it is the only direction-change that crosses the α=0.05 threshold in the wrong direction; it is inconvenient for our claim and is not explained away by the combinatorial reasoning alone. It remains an open question whether a different no-prefix configuration or a richer control-term partition would restore n=89-level permutation significance. This is flagged in §7 Limitations.
5.6 Effect size contextualization
A within-subcluster mean pairwise cosine difference of +0.06 to +0.09 is small in absolute cosine terms but large relative to the within-subcluster spread of the narrative controls (BGE-M3 narrative within_std = 0.0728, Qwen3-prefix = 0.0650, Qwen3-no-prefix = 0.0918) [prov: results/*-n101/topology-analysis.csv line 7 field 4 within_std]. Expressed in standard-deviation units, the template-minus-narrative delta is 0.67–1.39σ relative to that spread (BGE-M3: 0.0587/0.0728 = 0.81σ; Qwen3-prefix: 0.0900/0.0650 = 1.39σ; Qwen3-no-prefix: 0.0619/0.0918 = 0.67σ). This is comparable to effect sizes reported as “moderate” in cross-lingual retrieval evaluation literature and is well above our rough estimate of ~0.01 cosine for within-term intra-translator-pair variance on the same passages. That ~0.01 figure is an order-of-magnitude estimate, not a formally computed noise floor; a rigorous within-term translator-pair variance statistic is future work.
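The standard-deviation-unit conversion is a plain ratio and can be rechecked from the four-decimal figures quoted in this section. The sketch below does only that; the structure and names are illustrative. From the rounded inputs the Qwen3-prefix ratio evaluates to roughly 1.385, so its second decimal place depends on the unrounded CSV values.

```typescript
interface ConfigStats {
  delta: number; // template-minus-narrative delta at n=101
  narrativeStd: number; // narrative-control within_std at n=101
}

// Values as quoted in the text above (rounded to four decimals).
const stats: Record<string, ConfigStats> = {
  "BGE-M3": { delta: 0.0587, narrativeStd: 0.0728 },
  "Qwen3-prefix": { delta: 0.09, narrativeStd: 0.065 },
  "Qwen3-no-prefix": { delta: 0.0619, narrativeStd: 0.0918 },
};

// Effect size expressed in units of the narrative-control spread.
function sigmaUnits(s: ConfigStats): number {
  return s.delta / s.narrativeStd;
}

const sigma: Record<string, number> = {};
for (const [name, s] of Object.entries(stats)) {
  sigma[name] = sigmaUnits(s);
  console.log(`${name}: ${sigma[name].toFixed(3)}σ`);
}
```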
6. Generalizability Discussion
The finding is reported on a Buddhist-canonical corpus, but the mechanism we hypothesize is encoder-general. We distinguish two claim levels:
Strong, empirically-supported claim (from our measurements): Across two architecturally-distinct multilingual encoders (BGE-M3, Qwen3-Embedding-8B), evaluated in three configurations on a Buddhist-canonical corpus (n=101, 14 control terms), the within-subcluster mean pairwise cosine similarity is elevated by +0.06 to +0.09 cosine units for template-bound controls relative to narrative-only controls, with bootstrap 95% CI excluding zero in all three configurations and permutation p below 0.05 in two of three.
Hypothesis, pending cross-domain validation (NOT proven by our data): Similar inflation should be anticipated in any corpus with canonical list-formula syntactic parallelism. Candidate domains include: legal briefs with boilerplate preambles and formulaic statutory recitations; liturgical texts (Catholic Mass, Anglican Book of Common Prayer, Jewish prayer-cycle) with canonical refrains; scientific abstracts following the IMRaD template; Qur’anic sūrahs with recurring refrains and oath-formulas; Vedic hymn-formulas; parliamentary procedure with standard motions.
6.1 Mechanism hypothesis
Transformer self-attention computes each token-level representation as a weighted mixture over the tokens in the input. Mean pooling averages these representations; last-token pooling, used by decoder-style embedders like Qwen3, instead reads out the final position, which has itself attended over the preceding sequence. When two passages share a syntactic parallelism (the same sequence of POS frames, the same clause structure, the same enumeration pattern), the attention-weight distributions over their token sequences should themselves be structurally similar, because attention heads trained on general text tend to allocate attention by syntactic role as much as by lexical content (Hewitt & Manning 2019 [prov: Hewitt & Manning 2019, N19-1419] show that syntactic information is linearly recoverable from contextual word representations). Pooled outputs therefore inherit the structural similarity of their inputs as a pooled-level signal.
This mechanism is consistent with but not proven by our finding. A direct mechanistic experiment — attention-head-level ablation, or comparing encoders trained explicitly to be syntax-blind against standard encoders — is outside the scope of the present paper.
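For readers less familiar with the two pooling strategies named in the mechanism, the toy sketch below contrasts them on a handful of made-up token vectors. It is a simplified illustration: real encoders also apply attention masks and usually L2-normalize the pooled vector, and the function names are hypothetical.

```typescript
// Mean pooling: average the per-token vectors dimension by dimension.
function meanPool(tokenVecs: number[][]): number[] {
  const dim = tokenVecs[0].length;
  const out = new Array<number>(dim).fill(0);
  for (const v of tokenVecs) {
    for (let d = 0; d < dim; d++) {
      out[d] += v[d] / tokenVecs.length;
    }
  }
  return out;
}

// Last-token pooling: read out the final position only.
function lastTokenPool(tokenVecs: number[][]): number[] {
  return tokenVecs[tokenVecs.length - 1].slice();
}

// Three toy 2-dim token representations (invented values).
const toyTokens: number[][] = [
  [1, 0],
  [0, 1],
  [1, 1],
];

const pooledMean = meanPool(toyTokens); // [2/3, 2/3]
const pooledLast = lastTokenPool(toyTokens); // [1, 1]
console.log(pooledMean, pooledLast);
```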
6.2 Why the Buddhist corpus makes a good case study
- Corpus closure and digital availability: the Pāli Canon is a bounded, digitized corpus with parallel translations; there is no open-ended document-selection bias.
- Well-attested formulaic sub-structures: Anālayo (2007) and Allon (2021) document the pericope system in detail; we did not need to discover the template structure, only to select terms that instantiate it.
- Multiple independent English translations per Pāli source: this provides within-term variance (three translator versions per passage) that serves as a natural control for translator-idiosyncrasy effects.
- Cross-lingual stress test: Pāli↔English evaluation simultaneously exercises the encoder’s multilingual capability, allowing us to compare prefix-on vs prefix-off Qwen3 configurations and examine encoder-family effects.
We are careful not to claim that the magnitude of the effect (+0.06 to +0.09) will transfer to other corpora — the exact magnitude likely depends on the syntactic-parallelism density, lexical overlap, and training-corpus representation of the target domain. What we conjecture transfers is the sign and shape of the effect: template-bound content will sit closer in pooled embedding space than narrative-only content, across encoders, at a magnitude that is detectable on samples of order n~100 and meaningful relative to narrative-content within-cluster variance.
6.3 Implications for downstream analyses
Analysts using sentence embedders on formulaic corpora should:
- Anticipate elevated cosine similarity between structurally-parallel passages that is not reducible to propositional-content similarity. Clustering and retrieval-based pipelines will preferentially pull together parallel-formula content, independent of whether the formulas are semantically synonymous.
- Construct matched controls where possible: if the downstream question is “how similar are these two concepts,” the comparison must include a narrative-frame baseline, not just a template-frame comparison.
- Compare across at least two encoders to separate encoder-specific training artifacts from shared structural effects.
- Treat permutation tests with caution when the control-term pool is small. Bootstrap over passages (with replacement) is more robust than permutation over terms.
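The matched-control comparison recommended above reduces to a difference of two within-group means. The sketch below shows that statistic on toy vectors; the vectors stand for whatever unit an analysis pools over (passages or term centroids), the function names are hypothetical, and the numbers are invented.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Mean cosine over all unordered pairs within one group.
function meanPairwiseCosine(vecs: number[][]): number {
  let sum = 0;
  let count = 0;
  for (let i = 0; i < vecs.length; i++) {
    for (let j = i + 1; j < vecs.length; j++) {
      sum += cosine(vecs[i], vecs[j]);
      count++;
    }
  }
  if (count === 0) throw new Error("need at least two vectors per group");
  return sum / count;
}

// Template-frame tightness minus narrative-frame baseline.
function templateMinusNarrative(template: number[][], narrative: number[][]): number {
  return meanPairwiseCosine(template) - meanPairwiseCosine(narrative);
}

// Toy groups (invented values): a tight template group and a looser narrative group.
const toyTemplate = [[1, 0], [1, 0.1], [0.9, 0]];
const toyNarrative = [[1, 0], [0, 1], [1, 1]];
const toyDelta = templateMinusNarrative(toyTemplate, toyNarrative);
console.log(toyDelta.toFixed(4));
```

Reporting the narrative baseline alongside the template figure is the point of the design: a template-group mean on its own says little about how much of the similarity is attributable to the frame.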
7. Limitations
7.1 Single philological corpus
The measurement is conducted on one corpus family (Pāli Canon with English parallel translations). The generalization claim in §6 is a hypothesis, not a proof. Replication on legal corpora (court opinions with formulaic boilerplate), liturgical corpora (comparable canonical refrains), or scientific-abstract corpora (IMRaD structural regularity) is needed before the effect can be claimed as encoder-general. This is flagged as the primary extension target.
7.2 Lexical-overlap confound is not cleanly isolated
As discussed in §5.4, the n=101 template-control partition contains terms whose Pāli source passages share multi-sentence verbatim content (the MN10 32-parts enumeration recurs in the MN28 pathavī definition; the 11-fluids subset recurs in the MN62 āpo definition). We cannot separate, in the current data, the contribution of pure parallel-rhetoric structure from the contribution of raw lexical overlap. The n=89 pair, which lacks this verbatim overlap, points in the same direction (§5.4), so we regard the sign of the effect as less exposed to the confound than its magnitude, which is inflated by some unmeasured amount. A clean disambiguation requires template-bound terms whose Pāli passages are structurally parallel but lexically disjoint. §9 proposes such an experiment.
7.3 Permutation p regression for Qwen3-no-prefix
The no-prefix configuration’s permutation p regressed from 0.0210 at n=89 to 0.0663 at n=101, crossing the 0.05 threshold in the wrong direction despite a simultaneous tightening of the bootstrap CI. We disclose this without claiming to have explained it away. The combinatorial argument (larger control-term pool produces more diverse shuffled deltas, including more extremes by chance) is suggestive but not conclusive. A reader who weights permutation p over bootstrap CI would reach a weaker conclusion than we do; a reader who agrees with our prioritization of the bootstrap CI as primary evidence reaches the conclusion we report. We flag this as a methodological judgment call.
7.4 Only two encoders
Our cross-embedder comparison spans exactly two model families (BGE-M3 and Qwen3-Embedding), three configurations. A stronger cross-encoder generalization claim would require additional families — e.g., multilingual-E5, Cohere-embed-multilingual-v3, OpenAI text-embedding-3-large, Nomic-embed-multilingual. The two encoders we evaluate are architecturally dissimilar (560M vs 8B parameters; 1024 vs 4096 dimensions; self-knowledge-distillation vs three-stage contrastive pretraining + supervised + merging) but are both open-weight and both trained with heavy multilingual retrieval objectives. Proprietary encoders with substantially different training regimes (e.g., task-tuned for semantic textual similarity rather than retrieval) may show different effect magnitudes.
7.5 Only two parallel-formula families tested
The six template-bound control terms span exactly two formula families: the four-foundations-of-mindfulness template (kāya, vedanā from MN10) and the body/element enumeration family (kesā 32-parts, aṭṭhi cemetery, pathavī earth-element, āpo water-element). A richer test would include additional structurally-distinct formula families — the iddhipāda (four bases of spiritual power) enumeration, the five-hindrances (pañca nīvaraṇā) list, the seven factors of enlightenment (satta bojjhaṅgā) — to test whether the effect is specific to body/element lists or general across Pāli pericope structures. This is future work.
7.6 Quantization and FP16
Qwen3-Embedding-8B was evaluated at Q4_K_M quantization, not full precision. Full-precision evaluation could shift the effect magnitude by small amounts; we do not expect a direction change but cannot rule out magnitude changes. BGE-M3 was evaluated at default LM Studio quantization (approximately FP16). [prov: INTERIM-REPORT-n101-2026-04-22.md line 155: “Results still condition on Q4_K_M quantization of Qwen3-Embedding-8B and BGE-M3 at default quantization. No full-precision cross-check.”]
7.7 Translator-set non-exhaustiveness
Three translators per passage is a sample; the full space of published English translations of the Pāli Canon is larger. Translator-idiosyncrasy effects could in principle drive additional variance not captured by our sample. The within-translator bootstrap accounts for this in the uncertainty estimate but does not eliminate the possibility of systematic bias.
7.8 Category-partition researcher-agency dependence
The template-vs-narrative partition of the 14 control terms is a researcher judgment call informed by standard Pāli canonical scholarship (pericope-bound terms vs freely-distributed narrative terms). It is not an independently-adjudicated gold-standard partition [prov: INTERIM-REPORT-n101-2026-04-22.md line 153: “Category partition (contested vs core, doctrinal subcluster assignments) remains a research-agent call, not an independently adjudicated gold standard”]. Shifting one or two boundary terms could move the effect by a small but non-zero amount.
8. Conclusion
We have measured a consistent cross-embedder inflation of pairwise cosine similarity for passages containing terms bound to canonical parallel-rhetoric list-formula syntactic frames, relative to passages containing narrative-only control terms. Using two architecturally-distinct multilingual encoders (BGE-M3 560M/1024-dim and Qwen3-Embedding-8B 8B/4096-dim) on a curated Buddhist-canonical corpus (n=101 passages, 14 control terms), the effect spans +0.06 to +0.09 cosine units across configurations. Bootstrap 95% CIs exclude zero in all three configurations; permutation p is below 0.05 in two of three.
The measurement is scoped to one corpus family, two encoders, three configurations, and two parallel-formula families. The cross-embedder convergence at an earlier n=89 checkpoint (within 0.0005 cosine units at the centroid level, between the two most architecturally-different encoders) provides evidence that the effect is not specific to encoder architecture, but this should not be overinterpreted: at n=89 the template partition contained a single MN10-bound pair, and the convergence may be partly a feature of that specific pair’s lexical profile.
The generalization to legal, liturgical, and scientific-abstract corpora is flagged as a hypothesis requiring cross-domain validation, not a proven claim. The lexical-overlap confound at n=101 means the effect magnitude we report includes a contribution from simple shared lexical content that is not cleanly separable from the parallel-syntax effect hypothesis.
For practitioners applying sentence embedders to formulaic corpora, the practical implication stands: anticipate elevated cosine similarity between structurally-parallel passages, construct matched narrative-frame controls, and compare across at least two encoders.
9. Future Work
The primary follow-up experiment is a clean disambiguation of the parallel-syntax effect from the lexical-overlap confound. The design: identify 4–6 additional Pāli template-bound control terms that are structurally parallel but lexically disjoint from each other. Candidate families:
- iddhipāda (four bases of spiritual power): chanda, viriya, citta, vīmaṃsā. Each occurs in the same four-step structural template (desire → effort → intent → investigation as bases of supernormal powers), but the accompanying prose descriptions differ.
- pañca nīvaraṇā (five hindrances): kāmacchanda, vyāpāda, thīna-middha, uddhacca-kukkucca, vicikicchā. The structural template is the standard five-element list, but the accompanying sensory/affective descriptions are distinct per hindrance.
- satta bojjhaṅgā (seven factors of enlightenment): sati, dhammavicaya, viriya, pīti, passaddhi, samādhi, upekkhā. Same seven-element template, seven distinct factor descriptions. Note: sati, viriya, samādhi overlap with the doctrinal term set and must be excluded from the control partition to avoid category leakage.
If the template-vs-narrative inflation persists for lexically-disjoint-but-structurally-parallel controls, the parallel-syntax hypothesis is strengthened. If it vanishes, the effect at n=101 is attributable to lexical overlap and the generalization claims in §6 collapse.
Secondary extensions:
- Cross-domain replication: run the same pipeline on (a) U.S. Supreme Court opinion corpus partitioned by formulaic-preamble-bound terms vs free-narrative terms; (b) Catholic Mass text corpus partitioned by formulaic-refrain-bound terms vs homily-content terms; (c) PubMed abstract corpus partitioned by IMRaD-section-bound terms vs free-text-bound terms.
- Attention-head ablation: use mechanistic interpretability tools (e.g., attention-pattern clustering, causal patching) to identify whether specific attention heads in BGE-M3 and Qwen3-Embedding-8B are disproportionately responsible for the pooled-output inflation on parallel-syntax inputs. If so, syntax-blind fine-tuning variants could be a practical mitigation.
- Full-precision control: rerun Qwen3-Embedding-8B at full precision (no quantization) to verify that the effect magnitude does not shift with quantization.
- Additional encoder families: extend to multilingual-E5, Cohere-embed-multilingual-v3, Nomic-embed-multilingual, and at least one decoder-based embedder in the newer generation (e.g., GTE-Qwen2-7B or successor). A stronger generalization claim requires 4+ encoder families.
10. Reproducibility Appendix
10.1 Data and code
- Corpus: data/passages.json (34 terms, 101 non-null passages). Structure: {meta: {...}, terms: {term_id: {category, passages: [{translator, source_mn_ref, text}]}}}.
- Scripts: src/g1-term-integrity.ts, src/g2-doctrinal-topology.ts, src/g3-significance.ts. All TypeScript, Bun runtime.
- Results directories (n=101): results/bge-m3-v2-n101/, results/qwen3-embedding-8b-n101/, results/qwen3-embedding-8b-no-prefix-n101/.
- Results directories (n=89): results/bge-m3-v2-n89pre/, results/qwen3-embedding-8b-n89pre/, results/qwen3-embedding-8b-no-prefix-n89pre/.
- Interim reports: INTERIM-REPORT-n101-2026-04-22.md (primary reference for numerics), INTERIM-REPORT-n89-2026-04-22.md (earlier checkpoint).
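The corpus structure listed above can be written down as TypeScript types. This is a sketch inferred from the one-line structure description: the field types, the nullability of text (suggested by the phrase "non-null passages"), and the helper name are assumptions, and the toy category and reference values are placeholders.

```typescript
interface Passage {
  translator: string;
  source_mn_ref: string;
  text: string | null; // assumed nullable, since the corpus counts "non-null passages"
}

interface Term {
  category: string;
  passages: Passage[];
}

interface Corpus {
  meta: Record<string, unknown>;
  terms: Record<string, Term>;
}

// Count passages that actually carry text.
function countNonNullPassages(corpus: Corpus): number {
  let n = 0;
  for (const term of Object.values(corpus.terms)) {
    for (const p of term.passages) {
      if (p.text !== null) n++;
    }
  }
  return n;
}

// Toy corpus with placeholder values (not real corpus content).
const toyCorpus: Corpus = {
  meta: {},
  terms: {
    "example-term": {
      category: "example-category",
      passages: [
        { translator: "translator-a", source_mn_ref: "MN10", text: "placeholder text" },
        { translator: "translator-b", source_mn_ref: "MN10", text: null },
        { translator: "translator-c", source_mn_ref: "MN10", text: "another placeholder" },
      ],
    },
  },
};

console.log(countNonNullPassages(toyCorpus)); // 2
```

Under Bun, something like `await Bun.file("data/passages.json").json()` should yield an object that can be passed to `countNonNullPassages`, provided the file matches this assumed shape; on the real corpus the expected count would be 101.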
10.2 Script invocations
# Stage 1 — embedding generation (per config)
bun run src/g1-term-integrity.ts --model bge-m3 --output results/bge-m3-v2-n101
bun run src/g1-term-integrity.ts --model qwen3-embedding-8b --prefix-on --output results/qwen3-embedding-8b-n101
bun run src/g1-term-integrity.ts --model qwen3-embedding-8b --prefix-off --output results/qwen3-embedding-8b-no-prefix-n101
# Stage 2 — topology (CPU-only, <1s per config)
bun run src/g2-doctrinal-topology.ts --results results/bge-m3-v2-n101
# (repeat for other two configs)
# Stage 3 — significance (10k bootstrap + 10k permutation per test; ~60s per config)
bun run src/g3-significance.ts --results results/bge-m3-v2-n101 --seed 42 --n-bootstrap 10000 --n-permutation 10000
# (repeat for other two configs)
[prov: INTERIM-REPORT-n101-2026-04-22.md §Step 5 “Wall clock: 178s total (10k bootstrap + 10k permutation per test × 3 configs in parallel)“]
10.3 PRNG and determinism
- PRNG: xorshift128, JavaScript implementation, seed=42.
- Bootstrap: 10,000 iterations per test.
- Permutation: 10,000 permutations per test.
- CI type: 95% percentile CI (2.5th and 97.5th percentiles of the bootstrap distribution).
[prov: results/bge-m3-v2-n101/significance-summary.md line 5]
Re-running with seed=42 reproduces the reported numerics bit-identically. Changing the seed changes the bootstrap and permutation draws but should not change the directional conclusions.
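The determinism property can be illustrated with a seeded generator feeding a percentile bootstrap. The sketch below uses the standard Marsaglia xorshift128 update, but the way the seed is expanded into the four state words, the nearest-rank percentile rule, and all names are assumptions, so it will not reproduce the paper's numerics; it only shows why a fixed seed gives identical intervals.

```typescript
// xorshift128 over four 32-bit words. The seed expansion is an assumption:
// the seed fills the first word and the rest use Marsaglia's published defaults.
function makeXorshift128(seed: number): () => number {
  let x = seed >>> 0 || 1;
  let y = 362436069;
  let z = 521288629;
  let w = 88675123;
  return () => {
    const t = x ^ (x << 11);
    x = y;
    y = z;
    z = w;
    w = (w ^ (w >>> 19) ^ (t ^ (t >>> 8))) >>> 0;
    return w / 4294967296; // uniform in [0, 1)
  };
}

// Nearest-rank percentile on an ascending-sorted array (one of several conventions).
function percentile(sorted: number[], q: number): number {
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.floor(q * sorted.length)));
  return sorted[idx];
}

// 95% percentile bootstrap CI for an arbitrary statistic.
function bootstrapCI(
  sample: number[],
  stat: (xs: number[]) => number,
  iters: number,
  seed: number,
): [number, number] {
  const rand = makeXorshift128(seed);
  const stats: number[] = [];
  for (let b = 0; b < iters; b++) {
    const resample = sample.map(() => sample[Math.floor(rand() * sample.length)]);
    stats.push(stat(resample));
  }
  stats.sort((a, b) => a - b);
  return [percentile(stats, 0.025), percentile(stats, 0.975)];
}

const mean = (xs: number[]): number => xs.reduce((s, v) => s + v, 0) / xs.length;

// Toy cosine values (invented).
const toySample = [0.61, 0.58, 0.72, 0.66, 0.7, 0.55, 0.63];
const ciA = bootstrapCI(toySample, mean, 2000, 42);
const ciB = bootstrapCI(toySample, mean, 2000, 42);
console.log(ciA, ciB); // identical, because the seed is the same
```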
10.4 LM Studio setup
- Runtime: LM Studio local inference server, OpenAI-compatible API at http://localhost:1234/v1/embeddings.
- Models loaded:
  - text-embedding-bge-m3 (default quantization, approximately FP16).
  - text-embedding-qwen3-embedding-8b at Q4_K_M quantization.
- Memory: 30.5 GB available preflight; Qwen3-Embedding-8B resident size ~4.68 GB; model load ~13.78s cold.
- Model-swap procedure: lms unload text-embedding-bge-m3 → lms load text-embedding-qwen3-embedding-8b → verify via lms ps.
[prov: INTERIM-REPORT-n101-2026-04-22.md §Step 3 “Three g1 runs (n=101)” table and surrounding paragraphs]
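A request against the endpoint listed above can be sketched as follows. It assumes the server follows the OpenAI embeddings request and response shape (a `model` and `input` in the body, a `data` array of `{index, embedding}` objects in the reply); the helper names are hypothetical, and this is not the code in g1-term-integrity.ts. Any instruction prefix would have to be prepended to the input strings by the caller.

```typescript
interface EmbeddingRequest {
  model: string;
  input: string[];
}

interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Build the JSON body for an OpenAI-style embeddings call.
function buildEmbeddingRequest(model: string, texts: string[]): EmbeddingRequest {
  return { model, input: texts };
}

// Return embeddings ordered by `index` so they line up with the inputs.
function parseEmbeddingResponse(res: EmbeddingResponse): number[][] {
  return [...res.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

// POST the texts to the local server and return one vector per input.
async function embed(
  texts: string[],
  model: string,
  url = "http://localhost:1234/v1/embeddings",
): Promise<number[][]> {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbeddingRequest(model, texts)),
  });
  if (!response.ok) {
    throw new Error(`embedding request failed with status ${response.status}`);
  }
  return parseEmbeddingResponse((await response.json()) as EmbeddingResponse);
}
```

With the server running and a model loaded, a call such as `await embed(["some passage"], "text-embedding-bge-m3")` would be expected to return a single 1024-dimensional vector.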
10.5 Wall-clock budget
Full three-configuration pipeline:
| Stage | BGE-M3 | Qwen3-prefix | Qwen3-no-prefix | Total |
|---|---|---|---|---|
| g1 embedding | 3s | 56s | 50s | 109s |
| g2 topology | <1s | <1s | <1s | <3s |
| g3 significance | ~60s | ~60s | ~60s | 178s parallel |
[prov: INTERIM-REPORT-n101-2026-04-22.md §Step 3, §Step 4, §Step 5 wall-clock rows]
Full cold-start reproduction budget: approximately 5 minutes on a workstation with ≥32 GB RAM and LM Studio pre-installed.
10.6 Corpus provenance
Pāli source passages are drawn from SuttaCentral’s digital editions of the Majjhima Nikāya, Saṃyutta Nikāya, Aṅguttara Nikāya, and Dīgha Nikāya (Mahāsaṅgīti edition) [prov: SuttaCentral.net, accessed 2026-04]. English translations are drawn from SuttaCentral for Bhikkhu Sujato’s Pali–English translations and from Access to Insight (www.accesstoinsight.org) for Ṭhānissaro Bhikkhu’s translations [prov: suttacentral.net/sujato; www.accesstoinsight.org]. Two documented substitutions replace Ṭhānissaro where he did not translate the relevant sutta: Bhikkhu Bodhi for SN22.45 (anicca) and AN3.31 (mātā), and Narada Thera for DN31 (pitā) [prov: passages.json translator:"bodhi" and translator:"narada" with documented notes fields]. All translations are used within fair-use bounds for academic analysis.
Specific sutta references:
- MN10 Satipaṭṭhāna Sutta (four foundations of mindfulness; body-as-body, feelings-as-feelings; 32-parts-of-the-body enumeration).
- MN28 Mahāhatthipadopama Sutta (greater elephant-footprint simile; four-elements template).
- MN62 Mahārāhulovāda Sutta (greater exhortation to Rāhula; four-elements template with extended 11-fluids water-element subset).
- MN119 Kāyagatāsati Sutta (mindfulness of the body; cemetery-contemplations).
Each passage in the corpus is tagged with its source sutta reference; see data/passages.json field source_mn_ref.
10.7 Citation resolution notes
The scaffold code (g1-term-integrity.ts) originally attributed the RSA measurement routine to “David Noel Ng.” Literature search (WebSearch 2026-04-22, six distinct query strategies) located no publication under that authorship matching the RSA-for-embeddings methodology. The methodology correctly traces to Kriegeskorte et al. (2008) — now cited in §2.7 — with this paper’s cross-lingual adaptation on Buddhist canonical text as an original contribution. The scaffold header comment has been corrected in parallel.
Structural-priming precedents (Prasad et al. 2019; Sinclair et al. 2022) and the Lawformer long-document encoder (Xiao et al. 2021) are cited in §2.6 as adjacent work. No prior work was located measuring cosine-inflation-by-parallel-syntax on pooled transformer outputs across matched template-vs-narrative control designs; if such work exists and was missed, it should be added in future revisions.
Code-annotation fix (2026-04-22, g3-significance.ts): The permutation-test CSV annotation at src/g3-significance.ts previously hardcoded n_template_terms=2;n_narrative_terms=8 in the test5a_parallel_rhetoric_permutation row’s notes field, a stale value from the n=89-era template partition. The permutation logic itself correctly operated on all 14 control terms at n=101 (6 template + 8 narrative); only the CSV annotation was stale. The code has been updated to emit live counts (n_template_terms=${nTemplateTerms};n_narrative_terms=${nNarrativeTerms}); the next g3 run on each of the three n=101 result directories will regenerate significance-analysis.csv with the correct annotation. The paper’s numerical claims and body text are unaffected — all claims correctly state “14 control terms (6 template + 8 narrative)” in §3, §4.1, and elsewhere. Results directories currently contain the stale annotation; a simple bun run src/g3-significance.ts on each directory refreshes it without changing any numerical value.
References
- Allon, M. (2021). The Composition and Transmission of Early Buddhist Texts with Specific Reference to Sutras. Numata Center for Buddhist Studies, University of Hamburg. PMC8789374.
- Alqahtani, M., & Atwell, E. (2022). Arabic natural language processing for Qur’anic research: a systematic review. Artificial Intelligence Review, Springer. doi:10.1007/s10462-022-10313-2.
- Anālayo, Bhikkhu. (2007). Oral Dimensions of Pāli Discourses: Pericopes, Other Mnemonic Techniques, and the Oral Performance Context. Numata Center for Buddhist Studies.
- Bingenheimer, M. (2020). Digitization of Buddhism (Digital Humanities and Buddhist Studies). Oxford Bibliographies.
- Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out of Law School. Findings of EMNLP 2020.
- Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216. Findings of ACL 2024, ACL Anthology 2024.findings-acl.137.
- Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. EMNLP-IJCNLP 2019, pages 55–65, ACL Anthology D19-1006. arXiv:1909.00512.
- Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP 2021, ACL Anthology 2021.emnlp-main.552.
- Hewitt, J., & Manning, C. D. (2019). A Structural Probe for Finding Syntax in Word Representations. NAACL-HLT 2019, ACL Anthology N19-1419.
- Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational Similarity Analysis — Connecting the Branches of Systems Neuroscience. Frontiers in Systems Neuroscience, 2, 4. https://doi.org/10.3389/neuro.06.004.2008.
- Prasad, G., van Schijndel, M., & Linzen, T. (2019). Using Priming to Uncover the Organization of Syntactic Representations in Neural Language Models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 66–76. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1007.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019, ACL Anthology D19-1410. arXiv:1908.10084.
- Sinclair, A., Jumelet, J., Zuidema, W., & Fernández, R. (2022). Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations. Transactions of the Association for Computational Linguistics, 10, 1031–1050. https://doi.org/10.1162/tacl_a_00504.
- Xiao, C., Hu, X., Liu, Z., Tu, C., & Sun, M. (2021). Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents. AI Open, 2, 79–84. https://doi.org/10.1016/j.aiopen.2021.06.003.
- Zhang, Y., et al. (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176.
Online corpora:
- SuttaCentral. https://suttacentral.net (accessed 2026-04).
- Access to Insight. https://www.accesstoinsight.org.
- Dharmamitra / DharmaNexus. https://dharmamitra.github.io/dharmamitra-guides/dharmanexus.