Multiple Classification and Authorship of the Hebrew Bible

Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis I did using function word counts from texts by Shakespeare, Austen, and others.  I began to wonder whether I could do something similar with the component books of the Torah (the first five books of the Hebrew Bible).

A very cursory reading of the Documentary Hypothesis suggests that the only books of the Torah thought to have been written by a single author each are Vayikra (Leviticus) and Devarim (Deuteronomy).  The remaining three appear to be a confusing hodgepodge compiled from multiple authors.  I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

The theory behind the following analysis (borrowed from the English textual world, of course) seems to be this: when someone writes a book, they write in a very particular style.  If you want to detect that style statistically, it is convenient to do so using function words.  Function words (“and”, “also”, “if”, “with”, etc.) must be used regardless of content, and therefore show up throughout the text being analyzed.  Each author uses each of these function words in a distinct number/proportion, and authors are therefore distinguishable by their profiles of usage.
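To make the idea concrete, a per-text profile of function-word proportions can be sketched in a few lines.  The English word list below is purely illustrative, not the list used in the analysis:

```python
from collections import Counter

# Toy illustration of a function-word "profile": the proportion of a
# text's tokens accounted for by each function word.
FUNCTION_WORDS = {"and", "also", "if", "with", "the", "of"}

def profile(text):
    """Return the proportion of tokens matching each function word."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens)
    return {w: counts[w] / total for w in FUNCTION_WORDS}
```

Two authors' profiles can then be compared directly, since the vector of proportions is content-independent.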

With that in mind, I started my journey.  The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which Hebrew function words to look for.  For the Torah text, I relied on the inimitable  They hired good, rational web developers to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting HTML page numbers from their source code.
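The scraping loop itself is simple in outline.  Here is a hedged sketch — the URL pattern below is hypothetical, since the real site's page numbering has to be read off its own HTML source:

```python
from urllib.request import urlopen

# Hypothetical base URL -- substitute the actual site's pattern.
BASE = "https://example.org/tanach"

def chapter_urls(book, n_chapters):
    """Build one URL per perek (chapter) of a given book."""
    return [f"{BASE}/{book}/{i}" for i in range(1, n_chapters + 1)]

def fetch_chapter(url):
    """Download one chapter page; extracting the Hebrew text from the
    HTML is left to the rest of the scraper."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

urls = chapter_urls("bereishit", 50)  # Bereishit has 50 chapters
```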

Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English.  However, I found a good 32 of them, as listed below:

Transliteration Hebrew Function Word Rough Translation Word Count
al עַל Upon 1262
el אֶל To 1380
asher אֲשֶׁר That 1908
ca_asher כַּאֲשֶׁר As 202
et אֶת (Direct object marker) 3214
ki כִּי For/Because 1030
col וְכָל + כָּל + לְכָל + בְּכָל + כֹּל All 1344
ken כֵּן Yes/So 124
lachen לָכֵן Therefore 6
hayah_and_variants תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה Be 819
ach אַךְ But 64
byad בְּיַד By 32
gam גַם Also/Too 72
mehmah מֶה + מָה What 458
haloh הֲלֹא Was not? 17
rak רַק Only 59
b_ad בְּעַד For the sake of 5
loh לֹא No/Not 1338
im אִם If 332
al2 אַל Do not 217
ele אֵלֶּה These 264
haheehoo הַהִוא + הַהוּא That 121
ad עַד Until 324
hazehzot הַזֶּה + הַזֹּאת + זֶה + זֹאת This 474
min מִן From 274
eem עִם With 80
mi מִי Who 703
oh אוֹ Or 231
maduah מַדּוּעַ Why 10
etzel אֵצֶל Beside 6
heehoo הִוא + הוּא + הִיא Him/Her/It 653
az אָז Thus 49

This list is not exhaustive, but it is definitely not small!  My one hesitation in compiling it concerns the Hebrew word for “and”.  “And” takes the form of a single letter attached to the beginning of a word (a “vav”, marked with a different vowel sound depending on its context), which I was afraid to try to extract: if I tried to count it, I might mistakenly count other vavs that are a valid part of a word with a different meaning.  It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.
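One partial workaround for matching pointed Hebrew text — which still would not solve the prefixed-vav problem — is to strip the vowel points (Unicode combining marks) before comparing tokens.  A sketch, assuming simple whitespace tokenization (maqaf-joined words would need extra handling):

```python
import unicodedata

def strip_points(text):
    """Remove niqqud and cantillation (Unicode combining marks), so
    pointed and unpointed spellings of the same word compare equal."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def count_word(text, word):
    """Count whitespace-delimited tokens whose consonantal skeleton
    equals that of `word`, ignoring punctuation at token edges."""
    target = strip_points(word)
    count = 0
    for token in text.split():
        token = strip_points(token).strip(".,;:\u05c3")  # 05C3 = sof pasuq
        if token == target:
            count += 1
    return count
```

This matches exact whole tokens only, so a vav-prefixed form like וְעַל would not be miscounted as עַל — but by the same token, it cannot count the prefix “and” itself.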

Anyhow, the following is the structure of the Torah:

‘Chumash’ / Book Number of Chapters
‘Bereishit’ / Genesis 50
‘Shemot’ / Exodus 40
‘Vayikra’ / Leviticus 27
‘Bamidbar’ / Numbers 36
‘Devarim’ / Deuteronomy 34

Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

> m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Chumash

You can see that the books are not entirely different in terms of the word counts of their component chapters, with the possible exception of Vayikra, which tends towards shorter chapters.

After writing a Python script to count the above words within each chapter of each book, I loaded the results into R and split them into training and testing samples:

> torah$randu = runif(187, 0, 1)
> torah.train = torah[torah$randu <= .4,]
> torah.test = torah[torah$randu > .4,]

For this analysis, it seemed that using Random Forests made the most sense.  However, I wasn’t quite sure whether I should use the raw counts or the proportions, so I tried both.  After whittling down the variables in both models, here are the final training model definitions:
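The “proportions” variant is just each word’s count divided by the chapter’s total word count.  A minimal sketch, with illustrative word names rather than the actual dataset columns:

```python
# Convert raw per-chapter function-word counts to proportions by
# dividing each count by the chapter's total word count.
def to_proportions(counts, total_words):
    return {word: n / total_words for word, n in counts.items()}

raw = {"al": 4, "et": 10, "loh": 2}   # illustrative chapter counts
props = to_proportions(raw, total_words=200)
```

Proportions control for chapter length, which matters here since (as the histograms above suggest) chapter lengths differ somewhat across books.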

torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

> importance(torah.rf)

 Word MeanDecreaseAccuracy MeanDecreaseGini
hayah 31.05139 5.979567
heehoo 20.041149 4.805793
loh 18.861843 6.244709
mah 18.798385 4.316487
al 16.85064 5.038302
caasher 15.101464 3.256955
et 14.708421 6.30228
asher 14.554665 5.866929
oh 13.585169 2.38928
el 13.010169 5.605561
gam 5.770484 1.652031
ki 5.489 4.005724
haheehoo 2.330776 1.375457

> importance(torah.rf.props)

Word MeanDecreaseAccuracy MeanDecreaseGini
asher_1 37.074235 6.791851
heehoo_1 29.87541 5.544782
al_1 26.18609 5.365927
el_1 17.498034 5.003144
col_1 17.051121 4.530049
hayah_1 16.512206 5.220164
loh_1 15.761723 5.157562
ele_1 14.795885 3.492814
mi_1 12.391427 4.380047
gam_1 12.209273 1.671199
im_1 11.386682 2.651689
oh_1 11.336546 1.370932
mah_1 9.133418 3.58483
caasher_1 5.135583 2.059358

It’s funny that the results, from a raw-numbers perspective, show hayah, the Hebrew verb “to be”, at the top of the list.  That’s the same result as in the Shakespeare et al. analysis!  Having established that all variables in each model had some effect on the classification, the next task was to test each model on the testing sample and see how well each chumash/book of the Torah could be classified by that model:

> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")

> xtabs(~torah.test$chumash + torah.test$pred.chumash)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            4            5          2         8          7
       'Bereishit'           1           14          1        14          2
       'Devarim'             1            2         17         0          1
       'Shemot'              2            4          4         9          2
       'Vayikra'             5            0          4         0          5

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286

> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            0            5          0        13          8
       'Bereishit'           1           16          0        13          2
       'Devarim'             0            2         11         4          4
       'Shemot'              1            4          2        13          1
       'Vayikra'             3            3          0         0          8

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

So, from the perspective of the raw number of times each function word was used, Devarim (Deuteronomy) seemed to be the most internally consistent, with 81% of its chapters in the testing sample correctly classified.  Interestingly, from the perspective of the proportion of times each function word was used, Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) each had over 50% of their chapters correctly classified in the testing sample.
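For readers unfamiliar with R, prop.table(x, 1) above simply divides each row of the contingency table by its row sum, turning counts into per-book classification rates.  A minimal sketch of the same operation:

```python
# Replicate R's prop.table(xtab, 1): divide each row of a
# contingency table by its row sum.
def row_proportions(table):
    out = []
    for row in table:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

toy = [[17, 2, 2], [1, 14, 6]]  # rows = true book, cols = predicted book
```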

I’m tempted to say, at the least, that there is evidence that Devarim was written within one stylistic framework, and potentially by one distinct author.  From a proportions point of view, Shemot and Vayikra also show an internal consistency suggestive of something close to one stylistic framework, or possibly a distinct author, for each book.  I’m definitely skeptical of my own analysis, but what do you think?

The last part of this analysis comes from a suggestion given to me by a friend: once I had modelled the unique profiles of function words within each book of the Torah, I should apply that model to some later texts.  Apparently one idea is that the “Deuteronomist source” was also behind the writing of Joshua, Judges, and Kings.  If the same author was behind all four books, then when I test my models on these three books, their chapters should tend to be classified as Devarim/Deuteronomy, more so than as the other books.

As above, the plot below shows the distribution of word counts per chapter by book, for comparison’s sake:

> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Prophets Book

Interestingly, chapters in these later books seem to be a bit longer, on average, than those in the Torah.

Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the five books of the Torah, and tested my Random Forests Torah models on them.  As you can see below, the result was anything but clear on that matter:

> xtabs(~neviim$chumash + neviim$pred.chumash)
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           3            7          7         6          1
      'Judges'           2           11          2         6          0
      'Kings'            0            8          3        10          1

> xtabs(~neviim$chumash + neviim$pred.chumash.props)
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           2            8          6         7          1
      'Judges'           0            9          2         9          1
      'Kings'            0            7          6         7          2

I didn’t even bother to re-express these tables as proportions, because it’s quite clear that none of the prophetic books I analyzed was consistently classified into any one category.  Looking at these tables, I don’t yet see any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these later texts.


I don’t think this has been a full enough analysis.  There are a few things about it that bother me, or make me wonder, that I’d love input on:

  1. As mentioned above, I left the Hebrew “and” out of this analysis.  I’d like to know how to extract counts of the Hebrew “and” without also counting the Hebrew letter “vav” where it doesn’t signify “and”.
  2. Similar to my exclusion of “and”, there are a few one-letter prepositions that I have not included as individual predictor variables: namely ל, ב, כ, and מ, signifying “to”, “in”/“with”, “like”, and “from”.  How do I count these without also counting the same letters where they begin a different word and don’t carry those meanings?
  3. Is it more valid to consider the results of the analysis done on the word frequencies as proportions (individual word count divided by the total number of words in the chapter), or are both approaches valid?
  4. Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis?  I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

All that being said, I’m intrigued that when you look at the raw numbers of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent.  If you’d like, you can look at the Python code that I used to scrape the website here: python code for scraping, although please forgive the rudimentary coding!  You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the dataset that I collected for the prophetic text word counts here: Neviim data set.  By all means, do the analysis yourself and tell me how to do it better 🙂


12 thoughts on “Multiple Classification and Authorship of the Hebrew Bible”

  1. I think a word-by-word analysis has a long legacy but is a relatively crude strategy, similar to comparing the percentage of amino acids in a protein sequence. It might be fun to use tools which search for quotes of arbitrary length from one text in another, for example, or better yet, do the same with fuzzy matching.

    • I’m open to different strategies for analysis, but I’m afraid I don’t understand the logic behind what you propose. How would it help to search for quotes of arbitrary length from one text in another text? When the content varies so widely between texts, how could one expect to find much of a match between the two?

      • My hypothesis is that a deuteronomic text, which is a ‘restatement’ in some sense, could show interesting patterns of aggregation/redaction while perhaps the other texts wouldn’t draw from it in a concerted fashion.

      • So you should expect to see more similarities between strings of arbitrary length in Deuteronomy and the other texts than between strings of arbitrary length amongst other texts?

      • I would expect to see that strings of moderate length (3 words to two sentences, for example) would be represented in a later redaction such that it drew from each of the earlier texts at a rate at which they did not draw from each other; you are right, this could be confounded by shared subject matter. It would probably need to be done on a per-chapter basis to have a sufficient population of texts for hypothesis testing – this is always an issue I’ve had with the investigation of key historical documents: sample size.

  2. Who Wrote the Bible by Friedman has a pretty good description of the documentary hypothesis, the analysis that led to it, as well as annotated passages that label which author is which. It’s important with some books to look by perek because authorship largely alternates. The E and J sources, presumably because they merged so much earlier and were largely the same oral tradition with two recorders separated geographically, often switch off at the verse level.

    • Thank you for this comment! I would love to try to redo the analysis with a list that distinguishes authorship at the chapter level. Does the Friedman book show authorship at the level of verse, considering what you say about the E and J sources?

  3. Applause to your ambitious efforts.

    Regarding the identification of “and” as well as other words mentioned as problematic… If you haven’t yet done so, consider trying to use regular expressions along with the grep() family of functions to solve this piece of the puzzle. Regular expressions are quite adept at identifying words containing specific patterns and character combinations.

    • Can I use regular expressions with Hebrew letters? Furthermore, how do I distinguish between one-letter word-initial morphemes and single letters that are simply part of a word? The morphology of Hebrew seems to make it a bit more difficult to parse automatically (unless I’m missing some really simple and elegant piece of logic here!).

  4. Hi, very interesting post. Have you ever tried any measure of distance between the vectors representing the WHOLE vocabulary of the different books? I tried something similar with intertextual distance, applied to a corpus of Italian novels, using distance + clustering techniques.
