This is based on a related blog post, written in Morisien. This is both a translation and an elaboration, particularly in the method section.
The goal was to produce a list of all of the words in the Nouvo Testaman an Morisien, and to sort them in order of decreasing frequency of use. It turns out that there are 171,060 words in that edition of the New Testament, with only 4,143 distinct words.
First of all, this was a personal project, in the context of learning the Morisien language. It is also an exercise in computational linguistics, like one that I undertook with a colleague forty years ago, described in the paper "A Word Frequency Count of Mormon English". Finally, it's a way to compare the New Testament in Morisien with the latest dictionary of the Morisien language.
The method consisted of finding the text on the Internet, extracting all of the words, and counting the distinct words together with how many times each one was used in the full text. There were problems with homonyms and with verb forms, since most verbs in Morisien have two forms. These had to be looked up manually in the dictionary and combined into a single entry. Finally, the completed work was shared on the Internet.
By searching, I discovered the content on-line, at https://www.bible.com/versions/344 and from there, looked up the Bible Society of Mauritius, phoned them, and eventually located their office, where I purchased 12 printed copies on the morning of September 29, 2014. A brief look at copyright notices convinced me that I would not be in violation by counting the words. Any copy I would have of the book would be for personal, non-commercial use. My finished product will be just a list of words in a particular order, and will be freely available. Anyone with the time and willingness could reproduce the list by following the steps outlined here.
The first thing that I produced was the file map.txt, which contains three columns: the three-letter abbreviation used by the bible.com website to refer to the books of the New Testament, the abbreviation used by the Morisien edition, and finally the full name of each book in that edition. I created this file manually, based on observations of the web pages and on consulting the printed copies. Along the way, I corrected one printing error, which appears both in the printed version and on the Internet: the second epistle of John is labeled simply "Zn" instead of the correct "2 Zn".
The file map.txt is shown here, followed by the file chapters.txt, also built manually, which holds the number of chapters in each book.
mat Mt Bonn Nouvel dapre Matie
mrk Mk Bonn Nouvel dapre Mark
luk Lk Bonn Nouvel dapre Lik
jhn Zn Bonn Nouvel dapre Zan
act Zis Zistwar Bann Apot
rom Rom Let pou Romin
1co 1 Ko Premie let pou Korintien
2co 2 Ko Deziem let pou Korintien
gal Ga Let pou Galat
eph Ef Let pou Efezien
php Fil Let pou Filipien
col Kol Let pou Kolosien
1th 1 Tes Premie let pou Tesalonisien
2th 2 Tes Deziem let pou Tesalonisien
1ti 1 Tim Premie let pou Timote
2ti 2 Tim Deziem let pou Timote
tit Tit Let pou Tit
phm Flm Let pou Filemon
heb Eb Let pou Ebre
jas Zak Let Zak
1pe 1 Pi Premie let Pier
2pe 2 Pi Deziem let Pier
1jn 1 Zn Premie let Zan
2jn 2 Zn Deziem let Zan
3jn 3 Zn Trwaziem let Zan
jud Zid Let Zid
rev Rev Revelasion
mat 28
mrk 16
luk 24
jhn 21
act 28
rom 16
1co 16
2co 13
gal 6
eph 6
php 4
col 4
1th 5
2th 3
1ti 6
2ti 4
tit 3
phm 1
heb 13
jas 5
1pe 5
2pe 3
1jn 5
2jn 1
3jn 1
jud 1
rev 22
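As a quick sanity check (a suggestion on my part, not part of the 2014 workflow), the chapter counts in chapters.txt can be totalled; the result should be 260, the number of chapters in the New Testament:
awk '{ total += $2 } END { print total }' chapters.txt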
"Scraping" is the technical term for obtaining content
from a web page by downloading the source code of the page
and extracting the desired information.
The following program downloads the text of all of the
chapters.
Rather than fire up a database, I chose to store the text
of each chapter in a file, named for the chapter number,
and stored in a folder named for the book code name
(column one of map.txt
).
for b in `jot '' 1 27`
do
BOOK_COD=`sed -n ${b}p ../map.txt | cut -f 1`
CHAP_CNT=`grep ^$BOOK_COD ../chapters.txt | cut -f 2`
echo "$BOOK_COD $CHAP_CNT"
mkdir -p $BOOK_COD
for c in `jot '' 1 $CHAP_CNT`
do
curl -k "https://www.bible.com/bible/344/$BOOK_COD.$c.ntkm"\
| grep class=.verse.v1 >$BOOK_COD/$c.html
done
done
In this script:
- b will take on the numbers 1 through 27 in the loop comprising lines 2-12. These lines will be executed 27 times, once for each of the books.
- BOOK_COD will be assigned the value in the first column of the bth line of the file map.txt.
- CHAP_CNT will take on the value in the second column of the corresponding line of the file chapters.txt.
- The mkdir command creates a folder for each book: mat, mrk, ..., rev.
- c will take on the numbers 1 through however many chapters there are in the current book, in the loop comprising lines 8-11. The loop body will be executed once for each chapter in the current book.
- The curl command gets the specified chapter page from the internet. The first page fetched will be mat.1.ntkm and the very last will be rev.22.ntkm. These URLs match the ones in use by the bible.com domain at the time I scraped the pages.
- The output is piped through grep, which selects just the one line containing the pattern class=.verse.v1, because inspection of the files showed that all verses of each chapter are contained in that one (very long) line of source code. This line of source code is saved in the folder named by the book code, as a file named by the chapter number with the .html file extension.
I ran this program on October 28, 2014, starting at 09:13:04 and ending at 09:44:08 (the date and time stamps for Matthew chapter one and Revelation chapter twenty-two, respectively). I remember wondering at the time whether anyone would ever check the logs and notice the sequential pattern of access. It is the same pattern that would be seen in the logs if someone set out to read the entire New Testament and visited the chapters one by one, starting at Matthew and continuing to the end. However, this hypothetical reader would certainly not finish the reading in half an hour!
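A simple completeness check (again a suggestion, not something I ran at the time) compares the number of files downloaded into each book folder against the expected chapter count:
# compare downloaded file count per folder with the expected chapter count
for b in `cut -f 1 ../map.txt`
do
echo "$b: `ls $b | wc -l` files, `grep ^$b ../chapters.txt | cut -f 2` chapters"
done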
Now begins the work of extracting each verse from its chapter. I wrote a script to do this work for each downloaded chapter. This script grew gradually, as I learned about the content of the various chapters and their markup. The finished version is presented here.
The job of this script is to start with the single line file downloaded from the internet and produce a corresponding text file with all of the HTML tags removed, and with one line per verse.
case $# in
2) ;;
*) echo Usage $0 book chapter; exit ;;
esac
cat $1/$2.html\
| sed -e 's/^ *<div class="label">/Sapit /'\
| perl -p -e 's/(?=<span.class="verse.v[^>]*><span.class="label)/\n/g'\
| perl -p -e 's/(?=<div.class="q[12]*">)/ /g'\
| sed -e 's/<span class="heading">[^>]*>[^>]*>//g'\
| sed -e 's/&#822[01];/"/g'\
| sed -e "s/&#821[67];/'/g"\
| sed -e "s/&#8211;/-/g"\
| sed -e 's/<[^>]*>//g'\
| sed -e '/^ *$/d'\
| sed -e 's/ *$//'\
| sed -e 's/^[1-9][0-9]*/& /'\
>$1/$2.txt
The first argument, known as $1 in the rest of the script, is the code name of the book. The second argument, known as $2 throughout the script, is the chapter number. The perl command on line 7 splits the verses apart by inserting a line break (the newline character, \n) before each one. I ran this script for each of the chapters, and examined the output, adjusting until it worked correctly (except, see below).
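To process every chapter, a driver loop along these lines does the job (a sketch; clean.sh is a placeholder name for the script above, which I have not named here):
# run the clean-up script on every chapter of every book
# (clean.sh is a placeholder file name)
for b in `cut -f 1 ../map.txt`
do
CHAP_CNT=`grep ^$b ../chapters.txt | cut -f 2`
for c in `jot '' 1 $CHAP_CNT`
do
sh clean.sh $b $c
done
done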
Next, I will show the smallest chapter, 1st John chapter 1, as it was downloaded from the internet, and then as it was after running the script. Three of the patterns are highlighted. The one highlighted in light blue is replaced by "Sapit ", and the one highlighted in light green is left intact (to later be removed in line 13) but marks where a line break will occur in the output text file. The one highlighted in pink is simply entirely removed.
This is the file named 1jn/1.html
This is the file named 1jn/1.txt
An alert reader will have noticed that the script fails to capture the introductory text at the start of each book. This was a deliberate decision, as I wanted to include only the words of the (modern) translation of the ancient text.
Unfortunately, there were other things that caused problems with the script. Thirty-eight of the 260 chapters had one or more footnotes, and these were not consistently coded, so I could not modify the script easily to remove them. I opted to manually edit these HTML files so that they would work with the script shown above.
These are the chapters that were edited manually: mat 6, mat 18, mat 21, mrk 1, mrk 4, mrk 7, mrk 9, mrk 11, mrk 16, luk 8, luk 16, luk 22, luk 23, jhn 8, act 2, act 8, act 15, act 24, act 28, rom 1, rom 16, 1co 1, 1co 11, 2co 6, gal 1, gal 2, eph 1, eph 5, php 3, heb 9, jas 4, rev 1, rev 2, rev 4, rev 7, rev 8, rev 9, rev 17
Once I had a little practice, I was able to edit each chapter in a minute or two.
For the record, here is the order in which the edits were completed: php 3, mrk 4, eph 5, mat 18, mat 6, mat 21, mrk 1, mrk 11, mrk 16, mrk 7, mrk 9, heb 9, jas 4, rev 2, rev 4, rev 8, rev 9, rev 1, rev 17, rev 7, eph 1, gal 1, gal 2, 2co 6, 1co 11, 1co 1, rom 16, rom 1, act 28, act 15, act 8, act 2, act 24, jhn 8, luk 8, luk 16, luk 22, luk 23
Other than these manual edits, an ambitious reader who set out to check my work could follow these same steps and come up with the same list of words. The guideline that I used in dealing with the footnotes was to retain all of the verses that correspond to verses in my reference work, the New Testament of the King James version of the Bible.
In retrospect, I see that I could have counted the words without doing this step, but my intention here is to document what I actually did.
case $# in
1) ;;
*) echo Usage $0 book; exit ;;
esac
ls $1/*.txt\
| grep "^$1/[1-9]"\
| sort -n -t / -k 2\
>$1/map.txt
for i in `cat $1/map.txt `
do
grep -H "^[1-9][0-9]*" $i\
| sed -e "s/.txt//"\
-e "s/\// /"\
-e "s/\(:[1-9][0-9]*\)./\1 /";\
done >$1/all.txt
The first argument, known as $1 in the rest of the script, is the code name of the book. Lines 5-8 build a numerically sorted list of the chapter files and save it as map.txt in the book folder. The loop then sets i to the folder slash chapter file name. The grep -H option prefixes each line with the file name and a colon, and the sed commands reshape that prefix into a book and chapter:verse reference, separated from the verse text by a tab character (rendered as a space in the listing above). The result is saved as all.txt in the book folder.
Here is the map.txt file for the book 1jn.
1jn/1.txt
1jn/2.txt
1jn/3.txt
1jn/4.txt
1jn/5.txt
And here is the first line of the first chapter of the book 1jn, as it appears at different stages of the script: after each of lines 11-14, respectively. (The last step replaces the space after the verse number with a tab, which is why the final two lines look identical here.)
1jn/1.txt:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn/1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn 1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
1jn 1:1 Nou anons zot ki seki ti la depi komansman, seki nou finn tande, ...
>all.txt
for b in `cut -f 1 ../map.txt `; do
cat $b/all.txt >>all.txt; done
cut -f 2 all.txt\
| sed -e "s/ -/ /g"\
| tr "])\?,;:\"'.\!([" " "\
| tr -s " " "\n"\
>words.txt
sort -u -f words.txt >unique.txt
for i in `cat unique.txt `; do
echo -n "$i "
grep -i "^$i$" words.txt | wc -l | tr -d " "
done >wfc.tsv
In this script:
- The first line creates an empty file named all.txt.
- The first loop runs with b set to each of the book codes, concatenating each book's all.txt file (created by the previous script) into the combined all.txt file, appending to what is already there from the previous books.
- The cut, sed, and tr pipeline takes the second column of the all.txt file (that is, everything after the tab character), turns punctuation into spaces, squeezes runs of spaces into single line breaks, and writes one word per line into words.txt.
- The sort command writes one copy of each distinct word (-u), compared without regard to case (-f), into unique.txt.
- The final loop processes each word in the unique.txt file, counting (with grep -i) how many times it appears in the words.txt file.
- Each word and its count are written to wfc.tsv.
Here are the first couple of lines of the file all.txt. This file has 7933 lines in all, one per verse.
mat 1:1 Ala lalis bann anset Zezi Kris, desandan David ek Abraam.
mat 1:2 Abraam ti papa Izaak...
And here are the first ten lines of the files words.txt, unique.txt, and wfc.tsv. These files have, respectively, 171,060, 4,455, and 4,455 lines: hence 171,060 words in all, of which 4,455 are distinct (before the two forms of each verb are combined, a later step which brings the distinct count down to 4,143).
Ala
lalis
bann
anset
Zezi
Kris
desandan
David
ek
Abraam
12000
a
Aaron
Abadon
abandone
abandonn
abat
Abba
Abe
Abel
12000 12
a 16
Aaron 5
Abadon 1
abandone 3
abandonn 25
abat 6
Abba 3
Abe 19
Abel 4
This last script was run in the early morning hours of November 19, 2014, as the four files (all.txt, words.txt, unique.txt, and wfc.tsv) bear these timestamps: 03:48:01, 03:49:55, 03:50:21, and 03:58:39. The reader will notice that it took over 8 minutes to count the number of occurrences of each word.
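In hindsight, the per-word grep loop makes a full pass over words.txt for every one of the 4,455 unique words. A single pass with standard tools would have done the same job in seconds (a sketch, not what I ran in 2014; note that its output puts the count before the word):
# sort the words, count adjacent duplicates ignoring case, sort by descending count
sort -f words.txt | uniq -ci | sort -rn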
The final file, wfc.tsv, has the file extension "tsv", for "tab-separated values". This is one of the formats which can be imported into a spreadsheet. I elected to publish the list as a Google Docs spreadsheet. After creating the spreadsheet, I uploaded the file wfc.tsv and named the tab "Raw data".
Once in spreadsheet form, I could add formulas to do some interesting things, such as summing the counts so as to verify that they add up to 171,060, which they do.
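The same total can be checked at the command line before uploading; a one-liner equivalent to the spreadsheet sum, given that the second column holds the counts:
# sum the counts in column 2; should print 171060
awk -F "\t" '{ total += $2 } END { print total }' wfc.tsv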
This was a very labor-intensive manual task, completed over a period of several days. It involved looking up each word in the dictionary, a printed paper book. When I located a verb, I would then search down the list for its second form and combine the two forms. Later on, I wrote a script that helped to automate the combining.
The work of looking up the verbs was done within Google Docs. The spreadsheet for verbs can be seen here. Look at the tab named "with page number."
After having filled in this spreadsheet by looking through the dictionary, I would download the tab, rename it raw.tsv, and run this script.
The script has four phases. Line 1 creates a verbs file from the downloaded spreadsheet, with each line containing the long form, a tab character, and finally the short form of the verb. Lines 2-4 create a subset of the file unique.txt consisting of only the words which are not in the verbs file. Lines 5-8 create a word frequency file for the non-verbs. Lines 9-12 create a word frequency file for the verbs.
cut -f 1,2 raw.tsv >verbs
tr "A-Z\t" "a-z\n" <verbs\
| sort\
| comm -i -2 -3 unique.txt - >uwodv.txt
for i in `cat uwodv.txt `; do
echo -n "$i "
grep -i "^$i$" words.txt | wc -l | tr -d " "
done >wwodvfc.tsv
for v in `cat verbs | tr '\t' '|'`; do
echo -n "$v" | sed -e "s/|/, /" -e "s/$/ /" | tr -d '\n'
(grep -iE "^($v) " wfc.tsv | cut -f 2; echo + p) | dc 2>/dev/null
done >verbs.tsv
In this script:
- Line 1 creates the verbs file.
- Lines 2-4 produce uwodv.txt, a sort of acronym for "unique without double verbs."
- Lines 5-8 loop over the words in uwodv.txt and do the processing in lines 6-7, counting each word's occurrences in the words.txt file.
- The results go into the wwodvfc.tsv file, for "words without double verbs frequency count."
- Lines 9-12 loop over the lines of the verbs file, with the tab character replaced by a vertical bar character, and do the processing in lines 10-11.
- Line 11 adds together the counts of the two verb forms found in the wfc.tsv file. This is done by searching for either form (the forms separated by the vertical bar), taking only the count (there will be zero, one, or two count values), and piping these, followed by the plus sign character and the letter 'p', into the desk calculator dc, which does the addition (with any error messages discarded); a worked example follows below.
- The results go into the verbs.tsv file.
Once this script is run, I would upload the last two files into the same tab of the word frequency spreadsheet, and use the spreadsheet tools to sort the combined list.
This work was completed in the early morning hours of November 28, 2014.
The Nouvo Testaman an Morisien begins with these ten words, Ala lalis bann anset Zezi Kris desandan David ek Abraam , roughly, "This is the list of ancestors of Jesus Christ, descendant of David and Abraham." It ends with these ten words, Vini Segner Zezi Lagras Segner Zezi res avek zot tou , roughly, "Come Lord Jesus. The grace of the Lord Jesus be with you all." Even in this small sample of twenty words, some words are repeated. It is the same for the entire text. There are one hundred and seventy-one thousand and sixty (171,060) words in the text of the Nouvo Testaman. But only four thousand one hundred forty-three (4143) distinct words were used.
Of those twenty words, from the start and end of the entire text, there are only seventeen distinct words. Those seventeen words are used several times in the entire text and their frequencies vary. As an example, the word lalis (list) appears only five times in the entire text. On the other hand, the word bann (plural marker) appears nearly four thousand times in the text of the Nouvo Testaman an Morisien.
I will show here a few of the most frequent words: the thirty-two words which appear at least one thousand times in the entire text. Among them, these thirty-two words account for fully half of all the words in the book! This is typical of word frequency lists (see the Wikipedia page "Word lists by frequency").
The complete list can be found here (see the tab "raw data combining most verbs"). It has four columns: first, each word or pair of verb forms; second, how many times that word appears in the text; third, the running count of words of the text so far; finally, the percentage of the words of the text used up to that point.
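The third and fourth columns could also be computed outside the spreadsheet; a sketch with awk, assuming the sorted word/count pairs are in a file called combined.tsv (a placeholder name):
# append a running total and a percentage-of-171,060 column to each row
# (combined.tsv is a placeholder for the downloaded, sorted list)
awk -F "\t" '{ total += $2; printf "%s\t%d\t%d\t%.2f%%\n", $1, $2, total, 100 * total / 171060 }' combined.tsv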
I would like to compare the words with the dictionary: that is, make a list of words from the text which are not in the dictionary. This has been done for the book of Matthew (see the tab "Matie me pa Diksioner"). The problem is that I must look up each word, and there are thousands of words. It would be preferable to obtain a digital version of the dictionary and write a program.
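Given such a digital version, the program would be short; a sketch, assuming a hypothetical file diksioner.txt with one dictionary headword per line:
# lowercase both lists, then report words of the text absent from the dictionary
# (diksioner.txt is hypothetical; no digital version exists)
tr "A-Z" "a-z" <diksioner.txt | sort -u >dik-sorted.txt
tr "A-Z" "a-z" <unique.txt | sort -u | comm -2 -3 - dik-sorted.txt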
After writing the blog post, I met with the author of the dictionary, who explained to me that the University of Mauritius, which paid for the work, cannot obtain a copyright because the dictionary is just a list of words. So they are not releasing a digital version.
© Bruce Conrad 2015 ntkm@sanbachs.com May 23 through 26, 2015