About this Blog
This blog is the result of my experimentation in applying various statistics to the apparatus data of Novum Testamentum Graece (NTG, 27th edition) [Nestle and Aland]. This process of experimentation has been driven largely by my curiosity to understand the inter-relationships, special characteristics and history of the surviving texts of the New Testament. My personal goal throughout has been to place the extant forms of the ancient text in historical sequence with insight as to how the forms may have arisen. My method has been to apply computers for both data-management using a relational database (Microsoft SQL Server 2008) and statistics using the R programming language. My experience has so far proven to me the value of these computerized studies to the area of NT textual criticism both in controlling bias and in automating what would otherwise be a painstaking and error-prone task of data-entry.
There are others who have preceded me in this endeavor and whose work has inspired and directed my own. An extremely useful examination of Multivariate Statistical Analysis for Manuscript Classification as applied to the New Testament has been performed by J. C. Thorpe [Thorpe]. Wieland Willker has done a significant amount of statistical work in particular on the gospels using Principal Component Analysis (PCA) [Willker]. In addition, Timothy J. Finney has done a thorough survey of a variety of statistical methods particularly applied to the book of Hebrews [Finney].
I do not believe that computers can ever replace the human element in the study of New Testament texts. In my statistical work, I view computers as merely another tool to assist with the traditional theological, historical and philological tools of textual criticism. In fact, the traditional tools are essential for the accurate interpretation of the statistics. Moreover, the traditional tools ought to guide the statistical inquiry in meaningful and productive directions. It is my hope that the statistics presented here will assist other able textual critics by providing an objective big picture and by eliminating some of the grunt work that traditional methods require: tallying, managing and especially comprehending a massive store of agreements and disagreements between witnesses.
Early New Testament Witnesses
The process of analysis must start with actual data from the witnesses to the New Testament text, whether manuscripts, versions or church fathers. Due to my personal interest in early manuscripts as well as for practical reasons of availability, I’ve limited my study to witnesses included in the apparatus of the 27th edition of Novum Testamentum Graece (NTG). Generally, these are manuscripts from the first millennium A.D. or they are later manuscripts preserving early texts. Included also are groups of manuscripts cited in the apparatus that tend to agree collectively, such as the Vetus Latina, the Vulgate, Sahidic Coptic, the Peshitta or the Majority Text.
I’ve also limited my study for practicality to variants from the NTG apparatus. This excludes many minor variants, such as patterns of orthography, which may have value in geographical and historical placement of a witness. In spite of this limitation, NTG provides a data set large enough to lend some significance to the results. For example, NTG documents more variants than UBS 4. The numbers of variants and readings by sections of the NT are:
| New Testament Section | Number of Variants | Number of Readings |
|---|---|---|
| Paul (including Hebrews) | 2438 | 4376 |
| General Epistles (excluding Hebrews) | 895 | 1684 |
Parsing the Apparatus
To avoid the time-consuming and error-prone process of manual data entry, I automated the data import process by constructing a parser to read the apparatus data from an electronic form of the NTG apparatus into a relational database of witnesses and variants. This was possible because the NTG apparatus is sufficiently structured to permit machine interpretation. For example, witnesses appear in the same order arranged by type with symbols that are uniquely identifiable. Additionally, versification is provided inline along with symbols for each type of variant. As a precautionary measure, I added an intermediate step of rendering the output of the parser in a clear format for visual verification before committing to the database. Usually, the parser produced consistent output that required very few corrections. The result was apparatus data in a relational database schema.
Importing the apparatus into a relational database has facilitated data preparation, allowing me to cover a lot of ground quickly, since all of the vectors for statistical processing are automatically constructed from the database and never have to be touched by hand. In addition to generating vectors for statistical analysis, the database supports other, simpler queries: percentage of agreement between witnesses, readings supported by only one witness (singular support) or only two witnesses (dual support), and witness comparisons over a range of chapters.
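As a rough illustration of a percentage-of-agreement query, here is a minimal sketch in Python using an in-memory SQLite database rather than SQL Server. The `support` table, its columns, and the sample sigla and readings are hypothetical and do not reflect the schema behind this blog. The sketch also assumes that a witness has an explicit row at every variant where it is extant.

```python
import sqlite3

def build_demo_db():
    """Create a toy database; the table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE support (witness TEXT, variant INTEGER, reading TEXT)"
    )
    rows = [
        ("P46", 1, "a"), ("01", 1, "a"), ("03", 1, "b"),
        ("P46", 2, "a"), ("01", 2, "b"), ("03", 2, "b"),
        ("01", 3, "a"), ("03", 3, "a"),  # P46 has no row at variant 3 (lacuna)
    ]
    conn.executemany("INSERT INTO support VALUES (?, ?, ?)", rows)
    return conn

def percent_agreement(conn, w1, w2):
    """Percentage of shared variants at which two witnesses read the same."""
    shared, agree = conn.execute(
        """
        SELECT COUNT(*), SUM(a.reading = b.reading)
        FROM support AS a
        JOIN support AS b ON a.variant = b.variant
        WHERE a.witness = ? AND b.witness = ?
        """,
        (w1, w2),
    ).fetchone()
    if not shared:
        return None  # the two witnesses never overlap
    return 100.0 * agree / shared

conn = build_demo_db()
print(percent_agreement(conn, "P46", "01"))
```

Only variants where both witnesses have a row are counted, so a lacuna in one witness neither helps nor hurts the percentage.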
The apparatus of NTG has some peculiarities, owing to its conciseness, that must be considered. First, there are two forms of the apparatus: (1) the positive apparatus, where witnesses both for and against the edited text are cited, and (2) the negative apparatus, where only dissent from the edited text is cited. Second, there are first-order witnesses (which are always cited) and second-order witnesses (which are cited only when they differ from the Majority Text).
Positive and Negative Apparatus
I’ve addressed the issue of the positive and negative apparatus by accepting the data without change. Alternatively, I could have replaced the negative apparatus with a positive apparatus by filling out the apparatus with additional readings in support of the edited text and assigning the unassigned first-order witnesses to the additional readings. However, I opted against converting the entire data set to a positive apparatus, since adding readings where the bulk of witnesses already agree would not have added much new information. In other words, noting the dissenting witnesses captures the desired information of a group of witnesses agreeing together, implicitly assigning the other witnesses by their absence to outside the dissenting group.
First- and Second-Order Witnesses
I’ve addressed the issue of the first- and second-order witnesses by restricting the analysis to first-order witnesses only. This decision reflects my personal interest in the earliest manuscripts, which are usually categorized as first-order witnesses on the basis of their age. In a few cases, such as Majuscule 037 (Δ) in Mark, where my curiosity about the manuscript encouraged me to analyze it, I converted the manuscript to a first-order witness in my data set. I did this by assigning the manuscript the Majority Text reading at every positive-apparatus location where it is both extant and not explicitly cited, essentially using the definition of a second-order witness to extrapolate the data set.
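The extrapolation of a second-order witness can be sketched as follows. This is a minimal Python illustration, not the actual database procedure: the function name, the dictionary layout and the sample readings are all assumptions made for the example.

```python
def promote_to_first_order(cited, majority, extant_variants):
    """Fill in a second-order witness from the Majority Text.

    cited           -- dict: variant id -> reading explicitly cited for the witness
    majority        -- dict: variant id -> Majority Text reading
                       (positive-apparatus locations only)
    extant_variants -- set of variant ids where the witness is extant
    All three structures are hypothetical stand-ins for database queries.
    """
    filled = dict(cited)  # explicit citations always win
    for variant, majority_reading in majority.items():
        if variant in extant_variants and variant not in cited:
            filled[variant] = majority_reading
    return filled

# Toy data: the witness is cited at variant 1, silent at 2, lacking at 3.
result = promote_to_first_order(
    cited={1: "b"},
    majority={1: "a", 2: "a", 3: "b"},
    extant_variants={1, 2},
)
print(result)
```

Variant 3 is left out entirely because the witness is not extant there, which keeps a lacuna from being counted as agreement with the Majority Text.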
I’ve handled the more specific version symbols by falling back to the corresponding more inclusive symbols where the specific versions are not cited. For example, in preparing the data, the symbol it (Vetus Latina) takes the readings cited for it and, where it is not cited, the readings cited for latt (the entire Latin tradition). Likewise, the symbol syp (the Peshitta) is taken to include readings for sy (the entire Syriac tradition) where syp is not specified. Finally, the symbol sa (Sahidic) is taken to include readings for co (the entire Coptic tradition) where sa is not specified.
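The fallback from a specific version symbol to its more inclusive symbol can be sketched in a few lines of Python. The three symbol pairs come from the paragraph above; the function name and the per-variant `citations` dictionary are hypothetical.

```python
# Specific symbol -> more inclusive symbol, as described in the text.
FALLBACK = {"it": "latt", "syp": "sy", "sa": "co"}

def resolve_version_reading(symbol, citations):
    """Return the reading for a version symbol at one variant.

    citations -- dict: symbol -> reading, for a single variant
                 (a hypothetical structure for this sketch)
    """
    if symbol in citations:
        return citations[symbol]          # the specific symbol is cited
    parent = FALLBACK.get(symbol)
    if parent is not None:
        return citations.get(parent)      # fall back to the inclusive symbol
    return None                           # no citation available

print(resolve_version_reading("it", {"latt": "a"}))
print(resolve_version_reading("syp", {"syp": "b", "sy": "a"}))
```

A specific citation always takes precedence, so the inclusive symbol is consulted only when the specific one is silent.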
Statistics for Analysis and Comparison
Thorpe has enumerated some of the motivators for applying multivariate statistical analysis to comparing the variants between New Testament witnesses, listing objectivity, rigor and comprehensiveness [Thorpe, 4-8]. Assuming a valid experiment, objectivity aims at control of the inevitable biases that affect any one observer, at least for the computational phase of the investigation. Rigor entails a formalization of methodology that allows the experiment to be repeated and verified by others. Comprehensiveness refers to the ability to handle large data sets as conveniently as small data sets, which is more the result of computerization than statistical method. The purpose of multivariate analysis is to uncover (within the limitations of each method) the meaningful patterns in data sets that would otherwise be overwhelming for the observer to consider. Thorpe rightly cautions that the usefulness of statistics in textual criticism is limited by both the correct formation of hypotheses and the correct interpretation of the results [Thorpe, 10], underscoring the necessity of traditional tools for the accurate analysis of the results.
Although many statistical tools are available, I have been successful using the R programming language. I have used a number of add-on packages for R, including RODBC for database connectivity, Cairo for rendering Unicode fonts, vegan for some non-linear analysis, cluster for cluster analysis and both bpca and rgl for 3D graphics.
I’ve prepared the data for statistical analysis in the form of a vector for each witness, where the length of each vector equals the number of readings. For each position in the vector, if the witness supports the reading with the corresponding index, ‘1’ is inserted at that position. If the witness does not support the reading, ‘0’ is inserted. For all analyses except PCA, if the witness is absent at the index of the reading, ‘NA’ is inserted. For PCA, ‘0’ is inserted to indicate either absence or non-support of a reading.
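The encoding rule can be illustrated with a short Python sketch. Python's `None` stands in for a missing value here, and the function name, argument layout and toy readings are assumptions made for the example.

```python
def encode_witness(supported, extant_variants, readings, for_pca=False):
    """Build one witness vector with an entry per reading.

    readings        -- ordered list of (variant id, reading) pairs
    supported       -- set of (variant id, reading) pairs the witness supports
    extant_variants -- set of variant ids where the witness is extant
    for_pca         -- if True, absence is encoded as 0 instead of None
    """
    vector = []
    for variant, reading in readings:
        if (variant, reading) in supported:
            vector.append(1)       # witness supports this reading
        elif variant in extant_variants or for_pca:
            vector.append(0)       # non-support (or absence, for PCA)
        else:
            vector.append(None)    # witness is absent here
    return vector

readings = [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
general = encode_witness({(1, "a")}, {1}, readings)
pca = encode_witness({(1, "a")}, {1}, readings, for_pca=True)
print(general)
print(pca)
```

The two outputs differ only at the positions for variant 2, where the toy witness is lacking.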
Principal Component Analysis (PCA)
The theory and mathematics behind Principal Component Analysis (PCA) are complex and beyond the scope of this blog [see Wikipedia PCA]. There is substantial coverage of PCA and its application in books and throughout the web. The PCA implementation used in this blog comes from the R programming language. The usage is described in the R documentation [R Foundation, 1388].
PCA is a technique to represent high-dimensional data in terms of a smaller set of uncorrelated variables [Thorpe, 39-42]. PCA can be used to detect relationships between a high number of witnesses by identifying the components that account for most of the variation and representing the variables in terms of those components. PCA should be used with caution with New Testament witnesses because there is no way to encode absence where a witness has a lacuna. When PCA is applied to textual analysis, the variables are the witnesses and the observations are the readings of the witnesses over a range of the text.
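To make the idea concrete, here is a minimal pure-Python sketch that finds the first principal component of a toy data set, with witnesses as variables and readings as observations. It uses power iteration on the covariance matrix purely for illustration; it is not the R routine used for this blog, and the three witnesses and their vectors are invented.

```python
import math

def covariance_matrix(columns):
    """Sample covariance between variables (one list per witness)."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def first_component(cov, iterations=200):
    """Dominant eigenvalue and eigenvector of cov by power iteration."""
    k = len(cov)
    v = [1.0 + 0.1 * i for i in range(k)]  # fixed, arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(a * b for a, b in zip(v, cv)) / sum(a * a for a in v)
    return eigenvalue, v

# Invented data: A and B always agree, C always takes the other reading.
witnesses = {
    "A": [1, 0, 1, 0, 1, 0],
    "B": [1, 0, 1, 0, 1, 0],
    "C": [0, 1, 0, 1, 0, 1],
}
cov = covariance_matrix(list(witnesses.values()))
variance, loadings = first_component(cov)
share = variance / sum(cov[i][i] for i in range(len(cov)))
print(dict(zip(witnesses, loadings)), share)
```

In this toy case, A and B receive loadings of the same sign and C the opposite sign, and the first component accounts for all of the variance because the three vectors are perfectly correlated.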
PCA of the Gospels has been performed by Wieland Willker for all variants discussed in his Online Textual Commentary (407 for Matthew, 353 for Mark, 425 for Luke and 339 for John). PCA of the book of Hebrews has been performed by Finney for the variants in UBS 4.
Multidimensional Scaling (MDS)
The theory and mathematics of Multidimensional Scaling (MDS) are covered elsewhere [see Wikipedia MDS]. The MDS implementation used in this blog (known as classical or metric MDS) comes from the R programming language. The usage is described in the R documentation [R Foundation, 1151]. Both MDS and Cluster Analysis require a computed distance matrix [R Foundation, 1188].
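A distance matrix over witness vectors might be computed along the following lines. The choice of distance is an assumption made for this sketch: the proportion of disagreeing positions, counted only where both witnesses are present, with `None` marking absence. The distance measure actually used for the blog's results is not specified here.

```python
def distance(u, v):
    """Proportion of positions that differ, skipping missing entries."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return None  # the witnesses never overlap
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(vectors):
    """vectors: dict name -> list of 1/0/None. Returns (names, matrix)."""
    names = sorted(vectors)
    matrix = [[distance(vectors[r], vectors[c]) for c in names] for r in names]
    return names, matrix

# Invented vectors for three hypothetical witnesses.
names, matrix = distance_matrix({
    "A": [1, 0, 1, None],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 1],
})
for name, row in zip(names, matrix):
    print(name, row)
```

Skipping positions where either witness is missing keeps a fragmentary witness from appearing artificially close to, or far from, the others.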
Like PCA, MDS is a technique of dimension reduction. The goal of MDS is to represent high-dimensional data in fewer dimensions while preserving the original distances between the variables as closely as possible [Finney, 4.3]. MDS of the book of Hebrews has been performed by Finney for the variants in UBS 4.
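The following pure-Python sketch shows the steps of classical MDS on a tiny invented distance matrix: square the distances, double-center them, and scale the leading eigenvectors by the square roots of their eigenvalues. Power iteration with deflation is used only to keep the example self-contained, and it assumes the leading eigenvalues are positive and that no distance is missing; it is not the R routine used for this blog.

```python
import math

def double_center(dist):
    """B = -1/2 * J * D^2 * J for a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total)
             for j in range(n)] for i in range(n)]

def top_eigenpairs(matrix, k, iterations=500):
    """Leading eigenpairs by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed starting vector
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        # Remove the component just found before looking for the next one.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return pairs

def classical_mds(dist, dims=2):
    """Coordinates whose distances approximate the given matrix."""
    coords = [[] for _ in dist]
    for lam, vec in top_eigenpairs(double_center(dist), dims):
        scale = math.sqrt(max(lam, 0.0))
        for point, component in zip(coords, vec):
            point.append(component * scale)
    return coords

# Three invented items lying on a line at positions 0, 1 and 3.
dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
line = classical_mds(dist, dims=1)
print(line)
```

Because the toy distances come from points on a line, a single dimension reproduces them exactly, up to reflection and shift.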
Cluster Analysis
The goal of cluster analysis is to break a set of variables into subsets related by some measure of similarity [Wikipedia CA]. The clustering implementation used in this blog is an agglomerative hierarchical function from the R programming language. The usage is described in the R documentation [R Foundation, 1239].
Hierarchical cluster analysis identifies clusters either by successively working up from the individual nodes and coalescing related nodes to form clusters (agglomerative) or by working from the collection as a whole and splitting off dissimilar nodes (divisive). The result of the analysis is a tree structure, or dendrogram, in which items become more closely related in the direction from the root to the branches to the leaves. While the result resembles a stemma, it’s important to note that the tree structure arrived at is statistical rather than genealogical (though statistical similarity may result from genealogical correspondence). Hierarchical cluster analysis in various forms has been performed by Finney for the book of Hebrews for the variants in UBS 4.
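The agglomerative approach can be sketched in plain Python as below. Average linkage is an assumption made for the example (other linkage rules exist and may give different trees), and the three witnesses and their distances are invented; this is not the R function used for this blog.

```python
def agglomerate(names, dist):
    """Repeatedly merge the two closest clusters (average linkage).

    Returns a list of (left members, right members, merge height).
    """
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average of all pairwise distances between the two clusters.
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = sum(pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]],
                       height))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

names = ["A", "B", "C"]
dist = [[0.0, 0.1, 0.8],
        [0.1, 0.0, 0.6],
        [0.8, 0.6, 0.0]]
merges = agglomerate(names, dist)
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 3))
```

The list of merges, read from first to last, is the dendrogram from the leaves up: the lower the merge height, the more similar the items being joined.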
K-means Partitioning is a form of cluster analysis where the number of partitions is selected in advance and the variables are partitioned to minimize the sum of dissimilarities within each partition [Wikipedia KM]. The clustering implementation used in this blog comes from the R programming language. The usage is described in the R documentation [R Foundation, 2214]. It has been applied to the book of Hebrews by Finney.
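Partitioning into a preselected number of groups can be illustrated with the toy Python sketch below. Because it works from a dissimilarity matrix, it uses medoids (representative items) rather than computed means, which is a substitution made for this example, and it finds the best medoids by exhaustive search, which is practical only for a handful of items. The distances are invented and this is not the R function used for this blog.

```python
from itertools import combinations

def partition_cost(dist, medoids):
    """Sum of each item's dissimilarity to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def best_partition(dist, k):
    """Try every set of k medoids and keep the cheapest (toy sizes only)."""
    n = len(dist)
    medoids = min(combinations(range(n), k),
                  key=lambda ms: partition_cost(dist, ms))
    groups = {m: [] for m in medoids}
    for i in range(n):
        nearest = min(medoids, key=lambda m: dist[i][m])
        groups[nearest].append(i)
    return groups

# Four invented items forming two tight pairs: (0, 1) and (2, 3).
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0]]
groups = best_partition(dist, 2)
print(sorted(groups.values()))
```

Unlike the hierarchical method, the number of groups has to be chosen before the analysis is run, so it is usually worth comparing the results for several values of k.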
The real contribution of statistics ought to be improved understanding of the patterns of variation in the texts of the New Testament leading to an improved understanding of the history of the text. It’s been my desire for some time to share these results on a blog in order to contribute in some way to this understanding.
[Nestle and Aland] Nestle, Eberhard, Erwin Nestle, Kurt Aland et al. Novum Testamentum Graece. 27. Aufl., rev. Stuttgart: Deutsche Bibelstiftung, 1993.
[R Foundation] R Foundation for Statistical Computing. R: A Language and Environment for Statistical Computing – Reference Index. (1999-2010): 1388. Accessed August 4, 2011. http://cran.r-project.org/doc/manuals/fullrefman.pdf.
[Thorpe] J. C. Thorpe. “Multivariate Statistical Analysis for Manuscript Classification.” TC: A Journal of Biblical Textual Criticism 7 (2002). Accessed August 4, 2011. http://purl.org/TC/vol07/Thorpe2002.html.
[Willker] Wieland Willker. “Principal Component Analysis from all variants discussed in An Online Textual Commentary on the Greek Gospels.” (2009). Accessed August 4, 2011. http://www-user.uni-bremen.de/~wie/TCG/PCA/index.html.
[Wikipedia MDS] Wikipedia. “Multidimensional Scaling.” Accessed August 3, 2011. http://en.wikipedia.org/wiki/Multidimensional_scaling.
[Wikipedia PCA] Wikipedia. “Principal Component Analysis.” Accessed August 3, 2011. http://en.wikipedia.org/wiki/Principal_component_analysis.