A Quantitative Linguistic Approach to Wittgenstein's Tractatus logico-philosophicus

Is there any relation between the content of Wittgenstein's work and the syntactic formulation of the statements that compose it?

The Tractatus Logico-Philosophicus stands as one of philosophy's most structurally distinctive works. Through a computational linguistic analysis (the full methodology is described at the end), this study examines how the work's structural and linguistic features align with its philosophical content, and looks for patterns that support and exemplify Wittgenstein's arguments about language and logic.

I wrote this essay as the final project for the Linguistics course of the Bachelor's Degree in Humanities, taught by M. J. Castellà. It aims to help explain why content is so closely tied to form in Wittgenstein's Tractatus.

Instructions for the project:

  1. The work must be prepared using Claude or ChatGPT (free access version) as an assistant and must deal with some of the subjects of the assignment and will have an essay or, at least, argumentative nature. It must answer a question that the student formulates and that will be refined with a series of dialogues with ChatGPT and with classic bibliographical resources, until obtaining a text that sufficiently satisfies the student's inquisitiveness. The final text may incorporate parts elaborated by the AI, parts written by the student himself and parts of other works conveniently cited.
  2. From now on, the parts that have been written by the LLM will be put in italics. My contributions will be written in regular font.

Research question

Is there any relation between the content of Wittgenstein's work and the syntactic formulation of the statements that compose it? What tendencies and uses of language does he show in his own work, and how do they correlate with the content?

The analysis employs computational tools to examine both lexical and grammatical patterns across the text's hierarchical proposition levels. This includes assessment of lemma distribution, type-token ratios, syntactic complexity, and parts of speech distribution, providing quantitative insights into the text's linguistic architecture.

Before starting, it is useful to briefly define these terms. A lemma is the base or canonical form of a word, as it appears in the dictionary: the uninflected form, without verbal conjugation or plural or gender endings.

A token is each individual occurrence of a word in a text, regardless of its form. For example, the phrase "the cat and the black cats" contains 6 tokens, because each word counts as one. A type is each distinct word form, so the same phrase has 5 types ("the" occurs twice).
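The difference between tokens, types, and lemmas can be checked on the example phrase with a minimal sketch. It assumes plain whitespace tokenisation and a hand-written lemma table (`LEMMA_TABLE` is hypothetical and stands in for a real lemmatizer).

```python
# Toy example: whitespace tokenisation of the example phrase.
phrase = "the cat and the black cats"

tokens = phrase.split()   # every occurrence counts as one token
types = set(tokens)       # distinct word forms

# Hypothetical hand-written lemma table; a real pipeline would use a lemmatizer.
LEMMA_TABLE = {"cats": "cat"}
lemmas = {LEMMA_TABLE.get(token, token) for token in tokens}

print(len(tokens), len(types), len(lemmas))  # 6 5 4
```

"Cat" and "cats" are two types but share a single lemma, which is why lemma counts are lower than type counts.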

The depth of the syntactic tree is a measure that indicates the levels of syntactic dependency in a sentence. In a syntax tree, each word (token) is connected to a "head" on which it syntactically depends. Take the sentence "The cat eats fresh fish".

"Eats" is the root (depth 0). "Cat" and "fish" depend on "eats" (depth 1), while "the" depends on "cat" and "fresh" depends on "fish" (depth 2).

For each word, we count how many "jumps" it takes to reach the root. A higher average depth indicates more syntactically complex sentences; a lower depth indicates simpler, more linear structures.

Subordination rates indicate the proportion of subordinate clauses relative to the total number of clauses. The subordination rate is an important indicator of a text's syntactic complexity, writing style, and formality. In the case of the Tractatus, these rates can help us understand Wittgenstein's argumentative structure.

With these terms defined, let's turn to the results of the analysis.
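Before moving on to the results, both measures can be made concrete with a small sketch. It is not the code used for the analysis. The heads and dependency labels are written by hand (a parser would normally supply them), the determiner is assumed to attach to its noun, so that "The" sits at depth 2, and the set of labels counted as subordinate clauses is an assumption based on common dependency label names. Coordinated clauses are ignored for simplicity.

```python
# Hand-written heads for "The cat eats fresh fish" (assumed analysis).
HEADS = {
    "The": "cat",
    "cat": "eats",
    "eats": None,    # root of the sentence
    "fresh": "fish",
    "fish": "eats",
}

def depth(word, heads):
    """Count the jumps from a word up to the root."""
    jumps = 0
    while heads[word] is not None:
        word = heads[word]
        jumps += 1
    return jumps

depths = {word: depth(word, HEADS) for word in HEADS}
mean_depth = sum(depths.values()) / len(depths)

# Assumed set of dependency labels that introduce a subordinate clause.
SUBORDINATE_DEPS = {"advcl", "acl", "acl:relcl", "ccomp", "xcomp", "csubj"}

def subordination_rate(dep_labels):
    """Subordinate clauses divided by all clauses (main + subordinate)."""
    subordinate = sum(1 for dep in dep_labels if dep in SUBORDINATE_DEPS)
    main = sum(1 for dep in dep_labels if dep.lower() == "root")
    total = main + subordinate
    return subordinate / total if total else 0.0

# Hand-labelled "I think that the cat sleeps": one main, one subordinate clause.
labels = ["nsubj", "root", "mark", "det", "nsubj", "ccomp"]
rate = subordination_rate(labels)

print(mean_depth)  # 1.2
print(rate)        # 0.5
```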

The structural analysis shows a carefully crafted hierarchical organization that appears far from arbitrary. Beginning with seven fundamental propositions at level 0, containing 72 tokens and a syntactic depth of 1.38, the text expands through middle levels before returning to seven propositions in its final level.

Level 1 expands to 25 propositions while maintaining similar syntactic depth (1.42), primarily serving to define and clarify the fundamental propositions. Level 2 marks a significant expansion with 120 propositions and the highest syntactic depth (1.52), suggesting increased complexity in argumentation. The text reaches its maximum extension at level 3 with 242 propositions and 1,944 tokens, though with slightly decreased syntactic depth (1.38). Level 4 contracts to 117 propositions while showing the highest subordination ratio (0.20), indicating complex syntactic structures despite reduced depth.

Finally, level 5 returns to seven propositions with minimal syntactic depth (1.14), creating a symmetrical structure that mirrors Wittgenstein's views on language's limits.
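How a proposition is assigned to a level can be sketched as follows. This assumes that the level equals the number of digits after the decimal point in the proposition number (so 1 is level 0 and 2.014 is level 3), and that each proposition sits on its own line in the form "number text". Both are assumptions about the input, and the names `proposition_level` and `split_propositions` are illustrative.

```python
import re
from collections import Counter

# Assumed line format: "<number> <text>", e.g. "1.1 The world is ..."
PROPOSITION_RE = re.compile(r"^(\d+(?:\.\d+)?)\s+(.+)$")

def proposition_level(number):
    """Level = number of digits after the decimal point (assumption)."""
    _, _, decimals = number.partition(".")
    return len(decimals)

def split_propositions(lines):
    """Return (number, level, text) for every line that looks like a proposition."""
    propositions = []
    for line in lines:
        match = PROPOSITION_RE.match(line.strip())
        if match:
            number, text = match.groups()
            propositions.append((number, proposition_level(number), text))
    return propositions

sample = [
    "1 The world is all that is the case.",
    "1.1 The world is the totality of facts, not of things.",
    "1.11 The world is determined by the facts, and by their being all the facts.",
    "2.014 Objects contain the possibility of all situations.",
]
propositions = split_propositions(sample)
per_level = Counter(level for _, level, _ in propositions)
```

Applied to the whole text, `per_level` would give the number of propositions at each level, which is the kind of count reported above.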

Lexical distribution patterns reflect the text's philosophical aims. Nouns and determiners dominate, comprising 45.5% of all words, while verbs maintain a notably low 8.6% frequency. This distribution suggests a style focused on description and definition rather than action or narrative. The high presence of prepositions (13.6%) indicates complex syntactic relationships, while the limited use of adjectives (5.9%), pronouns (5.5%), and adverbs (5.1%) points to a preference for direct, unembellished expression. The remaining percentage is punctuation, which I have also taken into account for the analysis.

The text's lexical density shows an interesting evolution across levels. Early sections display higher lemma counts, with levels 0 and 1 averaging 3.14 and 3.4 lemmas per proposition respectively. This count decreases towards level 5, reaching 2.28 lemmas, while lexical density paradoxically increases to its maximum value of 1.0. This pattern suggests a progression from elaborate explanation to increasingly precise, economical expression.

Figure: Depth of syntactic tree and index of subordination by proposition level

Figure: General distribution of grammatical categories (%)

Figure: Word frequency

Type-Token Ratio (TTR) and its corrected version (CTTR) show consistently high values across all levels, with a peak at level 1 (TTR of 0.98). These high ratios indicate diverse vocabulary with minimal repetition, characteristic of technical-philosophical writing. Notably, level 5 shows maximum density (1.0) with minimal lemmas, reflecting highly precise language use in the Tractatus's final propositions.
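For reference, the three ratios can be written out using their usual textbook definitions: TTR = types / tokens, CTTR = types / sqrt(2 × tokens), and Herdan's ratio = log(types) / log(tokens). The sketch below is illustrative and assumes the input is a list of lemmas; it is not necessarily the exact code behind the figures above.

```python
import math

def ttr(tokens):
    """Type-token ratio: distinct items divided by all items."""
    return len(set(tokens)) / len(tokens)

def cttr(tokens):
    """Corrected TTR: types divided by the square root of twice the tokens."""
    return len(set(tokens)) / math.sqrt(2 * len(tokens))

def herdan_ttr(tokens):
    """Herdan's log ratio; undefined for a single token, so return 0.0 there."""
    if len(tokens) < 2:
        return 0.0
    return math.log(len(set(tokens))) / math.log(len(tokens))

sample = "the cat and the black cats".split()   # 6 tokens, 5 types
short = ["world", "case"]                       # 2 tokens, 2 types

print(round(ttr(sample), 3), round(cttr(sample), 3), round(herdan_ttr(sample), 3))
print(ttr(short))  # 1.0
```

The second example shows why plain TTR is unreliable for very short texts: any proposition without a repeated word scores 1.0, whatever its vocabulary.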

Semantic analysis reveals concentrated usage of philosophical terminology. "Proposition" appears 103 times, followed by "sign" (39 occurrences) and "world" (31 occurrences). These frequencies cluster into three distinct semantic fields: logic and language (proposition, sign, expression), ontology (world, object, reality), and epistemology (representation, meaning, thought). Technical-philosophical terms like tautology, a priori, and variable appear frequently, alongside references to philosophers like Frege and Russell.
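Frequency counts of this kind can be obtained with the standard library's `collections.Counter`. The lemma list below is a toy input made up for illustration, not the real data.

```python
from collections import Counter

# Toy lemma list; the real input would be the lemmatized text of the Tractatus.
toy_lemmas = ["proposition", "sign", "proposition", "world", "sign", "proposition"]

frequencies = Counter(toy_lemmas)
top_lemmas = frequencies.most_common(3)   # most frequent lemmas first

print(top_lemmas)  # [('proposition', 3), ('sign', 2), ('world', 1)]
```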

The predominant verbs are epistemic and logical in nature (show, represent, express, signify) or related to intellectual action (comprehend, understand, think), reinforcing the text's logical-philosophical character and its focus on relationships between language, thought, and reality.

The quantitative findings point to a close correspondence between form and content. The decreasing syntactic complexity, coupled with increasing lexical density, suggests a deliberate movement from detailed exposition to precise, economical expression. This pattern aligns with Wittgenstein's philosophical journey toward clarity and his views on the limits of expressible thought.

In this visualization I tried to represent Wittgenstein's logical atomism and his theory of objects and states of affairs. Proposition 2.0121 says: "If things can occur in states of affairs, this possibility must be in them from the beginning." So I used points to represent these objects, because objects are the simplest elements in the logical field. The concentric, continuous movement reflects 2.014: "Objects contain the possibility of all situations." The interconnected nature of the points reflects proposition 2.0271: "Objects are what is unalterable and subsistent; their configuration is what is changing and unstable."

The central clustering and radial connections represent what Wittgenstein describes in proposition 2.03: "In a state of affairs objects fit into one another like the links of a chain." Also, in 4.22: "An elementary proposition consists of names. It is a nexus, a concatenation, of names." So, I chose to represent this ecosystem in that way. This visual representation of lemmatic relationships in the text demonstrates what Wittgenstein explains in 3.144: "States of affairs can be described but not named. (Names are like points; propositions like arrows—they have sense)".

The structure itself becomes a demonstration of Wittgenstein's philosophical argument that language has inherent logical structure that reflects reality. As shown in the visualization, each lemma (represented by points) connects to others in ways that mirror the "logical structure" Wittgenstein describes in his philosophical arguments.

Conclusions

The symmetrical structure, particularly the seven propositions at both first and last levels, appears intentionally crafted to reflect the text's philosophical arguments about language's boundaries. The progression from complex to precise expression mirrors Wittgenstein's goal of delineating what can and cannot be said.

This integration of form and content is not coincidental but demonstrates Wittgenstein's principle that "The structure of the proposition stands in internal relations to the structure of its sense" (4.431).

This analysis demonstrates that the Tractatus's linguistic structure is not merely a vehicle for its philosophical content but an integral part of its philosophical expression. The careful balance of structural elements, the evolution of lexical density, and the distribution of semantic fields all support and enhance Wittgenstein's philosophical arguments about language and logic. The text's form becomes inseparable from its content, embodying the very principles it seeks to elucidate about the nature of language and meaning.

As for the workflow with Claude 3.5: I did the textual analysis and the visualizations on my own. Claude was very good at describing the data from my Python analysis and relating it to my initial question. It also made mistakes, though. For example, I calculated the type-token ratio on very short texts (propositions of one or two sentences in most cases), so the results are not truly conclusive; yet when I asked whether this metric was appropriate, it said yes. To compensate for the effect of text length, I used the corrected type-token ratio. Claude was also not very good at finding citations beyond those I provided.

I now realize that this syntactic and stylistic analysis would be more meaningful if I could compare it with other textual genres, since on its own it may only confirm what was already suspected (a symmetrical, nominalized structure with simple sentences or, at most, nominal subordinate clauses). For future analyses and projects, I would like to take a more comparative approach.

1. Data Preparation

Required Libraries:

  • pandas: For data manipulation
  • numpy: For numerical operations
  • spaCy: For natural language processing
  • re: For regular expressions
  • nltk: For additional text processing

Initial Setup:

  • Load language model (Spanish)
  • Import necessary text files: Wittgenstein, L. (1921). Tractatus Logico-Philosophicus (Spanish Wikisource Collaborative Translation). Wikisource. Retrieved from https://es.wikisource.org/
  • Setup text preprocessing functions

2. Text Preprocessing

  • Convert text to lowercase
  • Remove punctuation and numbers
  • Remove stop words
  • Lemmatization of words
  • Clean and tokenize text
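The preprocessing steps can be sketched in plain Python. The stop-word set and lemma table below are tiny hypothetical stand-ins for the resources a library such as spaCy or nltk would provide; they cover only the sample sentence.

```python
import re

# Hypothetical miniature resources, just large enough for the sample sentence.
STOP_WORDS = {"el", "la", "los", "las", "de", "es", "no"}
LEMMA_TABLE = {"hechos": "hecho", "cosas": "cosa"}

def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|\d", " ", text)   # punctuation and numbers become spaces
    tokens = text.split()                     # whitespace tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMA_TABLE.get(t, t) for t in tokens]

lemmas = preprocess("1.1 El mundo es la totalidad de los hechos, no de las cosas.")
print(lemmas)  # ['mundo', 'totalidad', 'hecho', 'cosa']
```

Tokenisation happens before stop-word removal here because the stop-word filter needs individual words to work on.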

3. Lexical Analysis

Key Metrics:

  • Type-Token Ratio (TTR)
  • Herdan TTR (HTTR)
  • Corrected TTR (CTTR)
  • Total lemmas count
  • Unique lemmas count
  • Hapax legomena (words appearing only once)
  • Lemma length statistics
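Two of the metrics listed above, hapax legomena and lemma length statistics, can be sketched with the standard library alone. The lemma list is a toy input, and measuring length over unique lemmas (rather than over every occurrence) is an assumption.

```python
from collections import Counter
from statistics import mean

# Toy lemma list for illustration only.
toy_lemmas = ["world", "fact", "world", "thing", "case"]

counts = Counter(toy_lemmas)
total_lemmas = len(toy_lemmas)    # all occurrences
unique_lemmas = len(counts)       # distinct lemmas

# Hapax legomena: lemmas that occur exactly once.
hapax = sorted(lemma for lemma, n in counts.items() if n == 1)

# Length statistics over the unique lemmas (assumption).
lengths = [len(lemma) for lemma in counts]
mean_length = mean(lengths)

print(total_lemmas, unique_lemmas, hapax, mean_length)
```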

4. Grammatical Analysis

Analyzed Features:

  • Part-of-speech distribution
  • Syntactic tree depth
  • Subordination proportions
  • Sentence structure analysis

5. Data Export and Visualization

  • Export results to CSV files
  • Generate JSON structure for syntactic trees
  • Create dataframes for analysis
  • Statistical visualization options
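The export step might look like the sketch below. It uses the standard `csv` and `json` modules in place of pandas and writes to an in-memory buffer instead of a file. The rows reuse the proposition counts and syntactic depths reported in the essay (level 4 is left out because its depth is not stated). The column names and the tree layout are assumptions.

```python
import csv
import io
import json

# Per-level figures as reported in the essay.
ROWS = [
    {"level": 0, "propositions": 7, "syntactic_depth": 1.38},
    {"level": 1, "propositions": 25, "syntactic_depth": 1.42},
    {"level": 2, "propositions": 120, "syntactic_depth": 1.52},
    {"level": 3, "propositions": 242, "syntactic_depth": 1.38},
    {"level": 5, "propositions": 7, "syntactic_depth": 1.14},
]

# Write the table as CSV into a string buffer (a file handle works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["level", "propositions", "syntactic_depth"])
writer.writeheader()
writer.writerows(ROWS)
csv_text = buffer.getvalue()

# Hand-written heads for "The cat eats fresh fish" (assumed analysis).
HEADS = {"The": "cat", "cat": "eats", "eats": None, "fresh": "fish", "fish": "eats"}

def build_tree(heads, word):
    """Nest each word under its head so the tree can be serialised as JSON."""
    children = [build_tree(heads, w) for w, h in heads.items() if h == word]
    return {"word": word, "children": children}

root = next(word for word, head in HEADS.items() if head is None)
tree_json = json.dumps(build_tree(HEADS, root), indent=2)
```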

About me: I'm Carlos Albaladejo, a student of Journalism and Humanities. I wrote this essay for the Linguistics course at UPF. 01/12/2024.