This is based on a related blog post, written in Morisien. This is both a translation and an elaboration, particularly in the method section.
The goal was to produce a list of all of the words in the Nouvo Testaman an Morisien, and to sort them in order of decreasing frequency of use. It turns out that there are 171,060 words in that edition of the New Testament, with only 4,143 distinct words.
First of all, this was a personal project, in the context of learning the Morisien language. It is also an exercise in computational linguistics, like one that I undertook with a colleague forty years ago, described in the paper "A Word Frequency Count of Mormon English". Finally, it's a way to compare the New Testament in Morisien with the latest dictionary of the Morisien language.
The method consisted of finding the text on the Internet, extracting all of the words, and counting the distinct words together with how many times each one was used in the full text. There were problems with homonyms and with verb forms, since most verbs in Morisien have two forms. These had to be looked up manually in the dictionary and combined into a single entry. Finally, the completed work was shared on the Internet.
By searching, I discovered the content on-line, at https://www.bible.com/versions/344 and from there, looked up the Bible Society of Mauritius, phoned them, and eventually located their office, where I purchased 12 printed copies on the morning of September 29, 2014. A brief look at copyright notices convinced me that I would not be in violation by counting the words. Any copy I would have of the book would be for personal, non-commercial use. My finished product will be just a list of words in a particular order, and will be freely available. Anyone with the time and willingness could reproduce the list by following the steps outlined here.
The first thing that I produced was the file map.txt, which contains three columns: the three-letter abbreviation used by the bible.com website to refer to the books of the New Testament, the abbreviation used by the Morisien edition, and finally the full name of each book in that edition. I created this file manually, based on observations of the web pages and on consulting the printed copies. Along the way, I corrected one printing error, which appears both in the printed version and on the Internet: the second epistle of John is labeled simply "Zn" instead of the correct "2 Zn".
The file map.txt is shown here, followed by the file chapters.txt, also built manually, which holds the number of chapters in each book.
mat Mt Bonn Nouvel dapre Matie
mrk Mk Bonn Nouvel dapre Mark
luk Lk Bonn Nouvel dapre Lik
jhn Zn Bonn Nouvel dapre Zan
act Zis Zistwar Bann Apot
rom Rom Let pou Romin
1co 1 Ko Premie let pou Korintien
2co 2 Ko Deziem let pou Korintien
gal Ga Let pou Galat
eph Ef Let pou Efezien
php Fil Let pou Filipien
col Kol Let pou Kolosien
1th 1 Tes Premie let pou Tesalonisien
2th 2 Tes Deziem let pou Tesalonisien
1ti 1 Tim Premie let pou Timote
2ti 2 Tim Deziem let pou Timote
tit Tit Let pou Tit
phm Flm Let pou Filemon
heb Eb Let pou Ebre
jas Zak Let Zak
1pe 1 Pi Premie let Pier
2pe 2 Pi Deziem let Pier
1jn 1 Zn Premie let Zan
2jn 2 Zn Deziem let Zan
3jn 3 Zn Trwaziem let Zan
jud Zid Let Zid
rev Rev Revelasion
mat 28
mrk 16
luk 24
jhn 21
act 28
rom 16
1co 16
2co 13
gal 6
eph 6
php 4
col 4
1th 5
2th 3
1ti 6
2ti 4
tit 3
phm 1
heb 13
jas 5
1pe 5
2pe 3
1jn 5
2jn 1
3jn 1
jud 1
rev 22
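As a quick sanity check (a suggestion on my part, not part of the 2014 workflow), the chapter counts in chapters.txt can be totalled; the result should be 260, the number of chapters in the New Testament:
awk '{ total += $2 } END { print total }' chapters.txt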
"Scraping" is the technical term for obtaining content
from a web page by downloading the source code of the page
and extracting the desired information.
The following program downloads the text of all of the
chapters.
Rather than fire up a database, I chose to store the text
of each chapter in a file, named for the chapter number,
and stored in a folder named for the book code name
(column one of map.txt
).
for b in `jot '' 1 27`
do
BOOK_COD=`sed -n ${b}p ../map.txt | cut -f 1`
CHAP_CNT=`grep ^$BOOK_COD ../chapters.txt | cut -f 2`
echo "$BOOK_COD $CHAP_CNT"
mkdir -p $BOOK_COD
for c in `jot '' 1 $CHAP_CNT`
do
curl -k "https://www.bible.com/bible/344/$BOOK_COD.$c.ntkm"\
| grep class=.verse.v1 >$BOOK_COD/$c.html
done
done
In this script:
- b will take on the numbers 1 through 27 in the loop comprising lines 2-12. These lines will be executed 27 times, once for each of the books.
- BOOK_COD will be assigned the value in the first column of the bth line of the file map.txt.
- CHAP_CNT will take on the value in the second column of the corresponding line of the file chapters.txt.
- The mkdir command creates a folder for each book: mat, mrk, ..., rev.
- c will take on the numbers 1 through however many chapters there are in the current book, in the loop comprising lines 8-11. The loop body will be executed once for each chapter in the current book.
- The curl command gets the specified chapter page from the internet. The first page fetched will be mat.1.ntkm and the very last will be rev.22.ntkm. These URLs match the ones in use by the bible.com domain at the time I scraped the pages.
- The output is piped through grep, which selects just the one line containing the pattern class=.verse.v1, because inspection of the files showed that all verses of each chapter are contained in that one (very long) line of source code. This line of source code is saved in the folder named by the book code, as a file named by the chapter number with the .html file extension.
I ran this program on October 28, 2014, starting at 09:13:04 and ending at 09:44:08 (the date and time stamps for Matthew chapter one and Revelation chapter twenty-two, respectively). I remember wondering at the time whether anyone would ever check the logs and notice the sequential pattern of access. It is the same pattern that would be seen in the logs if someone set out to read the entire New Testament and visited the chapters one by one, starting at Matthew and continuing to the end. However, this hypothetical reader would certainly not finish the reading in half an hour!
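A simple completeness check (again a suggestion, not something I ran at the time) compares the number of files downloaded into each book folder against the expected chapter count:
# compare downloaded file count per folder with the expected chapter count
for b in `cut -f 1 ../map.txt`
do
echo "$b: `ls $b | wc -l` files, `grep ^$b ../chapters.txt | cut -f 2` chapters"
done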
Now begins the work of extracting each verse from its chapter. I wrote a script to do this work for each downloaded chapter. This script grew gradually, as I learned about the content of the various chapters and their markup. The finished version is presented here.
The job of this script is to start with the single line file downloaded from the internet and produce a corresponding text file with all of the HTML tags removed, and with one line per verse.
case $# in
2) ;;
*) echo Usage $0 book chapter; exit ;;
esac
cat $1/$2.html\
| sed -e 's/^ *<div class="label">/Sapit /'\
| perl -p -e 's/(?=<span.class="verse.v[^>]*><span.class="label)/\n/g'\
| perl -p -e 's/(?=<div.class="q[12]*">)/ /g'\
| sed -e 's/<span class="heading">[^>]*>[^>]*>//g'\
| sed -e 's/&#822[01];/"/g'\
| sed -e "s/&#821[67];/'/g"\
| sed -e "s/&#8211;/-/g"\
| sed -e 's/<[^>]*>//g'\
| sed -e '/^ *$/d'\
| sed -e 's/ *$//'\
| sed -e 's/^[1-9][0-9]*/& /'\
>$1/$2.txt
The first argument, known as $1 in the rest of the script, is the code name of the book. The second argument, known as $2 throughout the script, is the chapter number. The perl command on line 7 splits the verses apart by inserting a line break (the newline character, \n) before each one. I ran this script for each of the chapters, and examined the output, adjusting until it worked correctly (except, see below).
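To process every chapter, a driver loop along these lines does the job (a sketch; clean.sh is a placeholder name for the script above, which I have not named here):
# run the clean-up script on every chapter of every book
# (clean.sh is a placeholder file name)
for b in `cut -f 1 ../map.txt`
do
CHAP_CNT=`grep ^$b ../chapters.txt | cut -f 2`
for c in `jot '' 1 $CHAP_CNT`
do
sh clean.sh $b $c
done
done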
Next, I will show the smallest chapter, 1st John chapter 1, as it was downloaded from the internet, and then as it was after running the script. Three of the patterns are highlighted. The one highlighted in light blue is replaced by "Sapit ", and the one highlighted in light green is left intact (to later be removed in line 13) but marks where a line break will occur in the output text file. The one highlighted in pink is simply entirely removed.
This is the file named 1jn/1.html
This is the file named 1jn/1.txt
An alert reader will have noticed that the script fails to capture the introductory text at the start of each book. This was a deliberate decision, as I wanted to include only the words of the (modern) translation of the ancient text.
Unfortunately, there were other things that caused problems with the script. Thirty-eight of the 260 chapters had one or more footnotes, and these were not consistently coded, so I could not modify the script easily to remove them. I opted to manually edit these HTML files so that they would work with the script shown above.
These are the chapters that were edited manually: mat 6, mat 18, mat 21, mrk 1, mrk 4, mrk 7, mrk 9, mrk 11, mrk 16, luk 8, luk 16, luk 22, luk 23, jhn 8, act 2, act 8, act 15, act 24, act 28, rom 1, rom 16, 1co 1, 1co 11, 2co 6, gal 1, gal 2, eph 1, eph 5, php 3, heb 9, jas 4, rev 1, rev 2, rev 4, rev 7, rev 8, rev 9, rev 17
Once I had a little practice, I was able to edit each chapter in a minute or two.
For the record, here is the order in which the edits were completed: php 3, mrk 4, eph 5, mat 18, mat 6, mat 21, mrk 1, mrk 11, mrk 16, mrk 7, mrk 9, heb 9, jas 4, rev 2, rev 4, rev 8, rev 9, rev 1, rev 17, rev 7, eph 1, gal 1, gal 2, 2co 6, 1co 11, 1co 1, rom 16, rom 1, act 28, act 15, act 8, act 2, act 24, jhn 8, luk 8, luk 16, luk 22, luk 23
Other than these manual edits, an ambitious reader who set out to check my work could follow these same steps and come up with the same list of words. The guideline that I used in dealing with the footnotes was to retain all of the verses that correspond to verses in my reference work, the New Testament of the King James version of the Bible.
In retrospect, I see that I could have counted the words without doing this step, but my intention here is to document what I actually did.
case $# in
1) ;;
*) echo Usage $0 book; exit ;;
esac
ls $1/*.txt\
| grep "^$1/[1-9]"\
| sort -n -t / -k 2\
>$1/map.txt
for i in `cat $1/map.txt `
do
grep -H "^[1-9][0-9]*" $i\
| sed -e "s/.txt//"\
-e "s/\// /"\
-e "s/\(:[1-9][0-9]*\)./\1 /";\
done >$1/all.txt
The first argument, known as $1 in the rest of the script, is the code name of the book. Lines 5-8 build a numerically sorted list of the chapter files and save it as map.txt in the book folder. The loop then sets i to the folder slash chapter file name. The grep -H option prefixes each line with the file name and a colon, and the sed commands reshape that prefix into a book and chapter:verse reference, separated from the verse text by a tab character (rendered as a space in the listing above). The result is saved as all.txt in the book folder.
Here is the map.txt file for the book 1jn.
1jn/1.txt
1jn/2.txt
1jn/3.txt
1jn/4.txt
1jn/5.txt
And here is the first line of the first chapter of the book 1jn, as it appears at different stages of the script: after each of lines 11-14, respectively. (The last step replaces the space after the verse number with a tab, which is why the final two lines look identical here.)
1jn/1.txt:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn/1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn 1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn 1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
>all.txt
for b in `cut -f 1 ../map.txt `; do
cat $b/all.txt >>all.txt; done
cut -f 2 all.txt\
| sed -e "s/ -/ /g"\
| tr "])\?,;:\"'.\!([" " "\
| tr -s " " "\n"\
>words.txt
sort -u -f words.txt >unique.txt
for i in `cat unique.txt `; do
echo -n "$i "
grep -i "^$i$" words.txt | wc -l | tr -d " "
done >wfc.tsv
In this script:
- The first line creates an empty file named all.txt.
- The first loop runs with b set to each of the book codes, concatenating each book's all.txt file (created by the previous script) into the combined all.txt file, appending to what is already there from the previous books.
- The cut, sed, and tr pipeline takes the second column of the all.txt file (that is, everything after the tab character), turns punctuation into spaces, squeezes runs of spaces into single line breaks, and writes one word per line into words.txt.
- The sort command writes one copy of each distinct word (-u), compared without regard to case (-f), into unique.txt.
- The final loop processes each word in the unique.txt file, counting (with grep -i) how many times it appears in the words.txt file.
- Each word and its count are written to wfc.tsv.
Here are the first couple of lines of the file all.txt. This file has 7933 lines in all, one per verse.
mat 1:1 Ala lalis bann anset Zezi Kris, desandan David ek Abraam.
mat 1:2 Abraam ti papa Izaak...
And here are the first ten lines of the files words.txt, unique.txt, and wfc.tsv. These files have, respectively, 171,060, 4,455, and 4,455 lines: hence 171,060 words in all, of which 4,455 are distinct (before the two forms of each verb are combined, a later step which brings the distinct count down to 4,143).
Ala
lalis
bann
anset
Zezi
Kris
desandan
David
ek
Abraam
12000
a
Aaron
Abadon
abandone
abandonn
abat
Abba
Abe
Abel
12000 12
a 16
Aaron 5
Abadon 1
abandone 3
abandonn 25
abat 6
Abba 3
Abe 19
Abel 4
This last script was run in the early morning hours of November 19, 2014, as the four files (all.txt, words.txt, unique.txt, and wfc.tsv) bear these timestamps: 03:48:01, 03:49:55, 03:50:21, and 03:58:39. The reader will notice that it took over 8 minutes to count the number of occurrences of each word.
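In hindsight, the per-word grep loop makes a full pass over words.txt for every one of the 4,455 unique words. A single pass with standard tools would have done the same job in seconds (a sketch, not what I ran in 2014; note that its output puts the count before the word):
# sort the words, count adjacent duplicates ignoring case, sort by descending count
sort -f words.txt | uniq -ci | sort -rn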
The final file, wfc.tsv, has the file extension "tsv", for "tab-separated values". This is one of the formats which can be imported into a spreadsheet. I elected to publish the list as a Google Docs spreadsheet. After creating the spreadsheet, I uploaded the file wfc.tsv and named the tab "Raw data".
Once in spreadsheet form, I could add formulas to do some interesting things, such as summing the counts so as to verify that they add up to 171,060, which they do.
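The same total can be checked at the command line before uploading; a one-liner equivalent to the spreadsheet sum, given that the second column holds the counts:
# sum the counts in column 2; should print 171060
awk -F "\t" '{ total += $2 } END { print total }' wfc.tsv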
This was a very labor-intensive manual task, completed over a period of several days. It involved looking up each word in the dictionary, a printed paper book. When I located a verb, I would then search down the list for its second form and combine the two forms. Later on, I wrote a script that helped to automate the combining.
The work of looking up the verbs was done within Google Docs. The spreadsheet for verbs can be seen here. Look at the tab named "with page number."
After having filled in this spreadsheet by looking through the dictionary, I would download the tab, rename it raw.tsv, and run this script.
The script has four phases. Line 1 creates a verbs file from the downloaded spreadsheet, with each line containing the long form, a tab character, and finally the short form of the verb. Lines 2-4 create a subset of the file unique.txt consisting of only the words which are not in the verbs file. Lines 5-8 create a word frequency file for the non-verbs. Lines 9-12 create a word frequency file for the verbs.
cut -f 1,2 raw.tsv >verbs
tr "A-Z\t" "a-z\n" <verbs\
| sort\
| comm -i -2 -3 unique.txt - >uwodv.txt
for i in `cat uwodv.txt `; do
echo -n "$i "
grep -i "^$i$" words.txt | wc -l | tr -d " "
done >wwodvfc.tsv
for v in `cat verbs | tr '\t' '|'`; do
echo -n "$v" | sed -e "s/|/, /" -e "s/$/ /" | tr -d '\n'
(grep -iE "^($v) " wfc.tsv | cut -f 2; echo + p) | dc 2>/dev/null
done >verbs.tsv
In this script:
- Line 1 creates the verbs file.
- Lines 2-4 produce uwodv.txt, a sort of acronym for "unique without double verbs."
- Lines 5-8 loop over the words in uwodv.txt and do the processing in lines 6-7, counting each word's occurrences in the words.txt file.
- The results go into the wwodvfc.tsv file, for "words without double verbs frequency count."
- Lines 9-12 loop over the lines of the verbs file, with the tab character replaced by a vertical bar character, and do the processing in lines 10-11.
- Line 11 adds together the counts of the two verb forms found in the wfc.tsv file. This is done by searching for either form (the forms separated by the vertical bar), taking only the count (there will be zero, one, or two count values), and piping these, followed by the plus sign character and the letter 'p', into the desk calculator dc, which does the addition (with any error messages discarded); a worked example follows below.
- The results go into the verbs.tsv file.
Once this script is run, I would upload the last two files into the same tab of the word frequency spreadsheet, and use the spreadsheet tools to sort the combined list.
This work was completed in the early morning hours of November 28, 2014.
The Nouvo Testaman an Morisien begins with these ten words, Ala lalis bann anset Zezi Kris desandan David ek Abraam , roughly, "This is the list of ancestors of Jesus Christ, descendant of David and Abraham." It ends with these ten words, Vini Segner Zezi Lagras Segner Zezi res avek zot tou , roughly, "Come Lord Jesus. The grace of the Lord Jesus be with you all." Even in this small sample of twenty words, some words are repeated. It is the same for the entire text. There are one hundred and seventy-one thousand and sixty (171,060) words in the text of the Nouvo Testaman. But only four thousand one hundred forty-three (4143) distinct words were used.
Of those twenty words, from the start and end of the entire text, there are only seventeen distinct words. Those seventeen words are used several times in the entire text and their frequencies vary. As an example, the word lalis (list) appears only five times in the entire text. On the other hand, the word bann (plural marker) appears nearly four thousand times in the text of the Nouvo Testaman an Morisien.
I will show here a few of the most frequent words: the thirty-two words which appear at least one thousand times in the entire text. Among them, these thirty-two words account for fully half of all the words in the book! This is typical of word frequency lists (see the Wikipedia page "Word lists by frequency").
The complete list can be found here (see the tab "raw data combining most verbs"). It has four columns: first, each word or pair of verb forms; second, how many times that word appears in the text; third, the running count of words of the text so far; finally, the percentage of the words of the text used up to that point.
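The third and fourth columns could also be computed outside the spreadsheet; a sketch with awk, assuming the sorted word/count pairs are in a file called combined.tsv (a placeholder name):
# append a running total and a percentage-of-171,060 column to each row
# (combined.tsv is a placeholder for the downloaded, sorted list)
awk -F "\t" '{ total += $2; printf "%s\t%d\t%d\t%.2f%%\n", $1, $2, total, 100 * total / 171060 }' combined.tsv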
I would like to compare the words with the dictionary: that is, make a list of words from the text which are not in the dictionary. This has been done for the book of Matthew (see the tab "Matie me pa Diksioner"). The problem is that I must look up each word, and there are thousands of words. It would be preferable to obtain a digital version of the dictionary and write a program.
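Given such a digital version, the program would be short; a sketch, assuming a hypothetical file diksioner.txt with one dictionary headword per line:
# lowercase both lists, then report words of the text absent from the dictionary
# (diksioner.txt is hypothetical; no digital version exists)
tr "A-Z" "a-z" <diksioner.txt | sort -u >dik-sorted.txt
tr "A-Z" "a-z" <unique.txt | sort -u | comm -2 -3 - dik-sorted.txt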
After writing the blog post, I met with the author of the dictionary, who explained to me that the University of Mauritius, which paid for the work, cannot obtain a copyright because the dictionary is just a list of words. So they are not releasing a digital version.
© Bruce Conrad 2015 ntkm@sanbachs.com May 23 through 26, 2015