Lecture 2, Thursday Sept. 1:
Words, tokenization, tagged text
This lecture will consider
- some basic linguistics concepts related to words
- the processes of tokenization and normalization
- tagged text
 
Presentation
Recordings
Mandatory readings
Jurafsky and Martin, Speech and Language Processing, 3rd ed. (edition of 12 Jan. 2022!)
- Ch. 2 Regular expressions, etc.
  - Sec. 2.0
  - Sec. 2.2 Words
  - Sec. 2.3 Corpora
  - Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
- Ch. 8 Sequence Labelling ...
  - Sec. 8.1 and 8.2
 
 
- Ch. 3, sec. 6 Normalizing Text
- Ch. 3, sec. 8 Segmentation
- Ch. 5, sec. 1 Using a tagger
- Ch. 5, sec. 2 Tagged corpora
 
Wikipedia
Recommended reading
Wikipedia
Probabilities - background and tutorial
Last year's slides and the readings below indicate what we expect you to know about probabilities. Many of you have a background in probabilities, but some of you may lack it. If anybody is interested, we will arrange a tutorial on probabilities sometime between Fri. Sept. 2 and Wed. Sept. 7. We can decide on a time in the lecture on Sept. 1. (Sept. 1 at 14:00 turned out not to be an option.) If you are interested, you may send me (jtl) an email indicating possible times.
Presentation
Readings
OpenIntro (3rd ed.; in the 4th ed., add one to the chapter numbers)
- Ch. 2, "Probability", sec. 2.1-2.4
- Ch. 3, "Distributions of random variables":
  - Sec. 3.3.1 Bernoulli distribution
  - Sec. 3.4.1 Binomial distribution