CLT 255 Take-home exam - Autumn 2007

Please answer in English or Finnish.

Submit your results in PDF format to the course teacher and the assistant before December 14, 2007.

1 - Discrete and continuous distributions

a) What are discrete and continuous distributions?

b) Describe three distributions and give examples of how they appear in statistical language processing.

c) Describe mean, variance, median and quantile.

d) Is it possible that the mean equals the median? If so, what does that tell us about the distribution?
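
As an illustration of the quantities named in c) and d), here is a minimal sketch in Python using only the standard library (the exam itself expects R). The sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical sample: occurrences of some word in nine text chunks.
counts = [2, 3, 3, 4, 5, 5, 5, 6, 9]

mean = statistics.mean(counts)                 # arithmetic average
variance = statistics.variance(counts)         # sample variance (n - 1 denominator)
median = statistics.median(counts)             # middle value of the sorted sample
quartiles = statistics.quantiles(counts, n=4)  # three cut points: Q1, Q2, Q3

# For a sample that is symmetric around its centre, mean and median coincide.
symmetric = [1, 2, 3, 4, 5]
symmetric_mean = statistics.mean(symmetric)
symmetric_median = statistics.median(symmetric)

print(mean, variance, median, quartiles)
print(symmetric_mean == symmetric_median)
```

In the first sample the single large value 9 moves the mean relative to the median, while the symmetric sample has both equal to 3.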

2 - Data processing and plotting

Download a sample English text corpus from Project Gutenberg. (Grimm's Fairy Tales is preferred; other texts are fine if they contain approximately 100,000 words.) Split the text into 100 chunks.

For each chunk, calculate the frequency (number of occurrences) of the words "of", "have" and "old". (You may treat singular and plural word forms, as well as upper-case and lower-case forms, as different words.) For each of these three words, plot the frequency in each chunk as a word frequency histogram. (Display the chunk-wise word frequency on the y-axis and the chunk number on the x-axis.)
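
The chunking and counting step can be sketched as follows, in Python with the standard library only (the exam itself expects R, and plotting is left out). The helper names `split_into_chunks` and `chunk_frequencies`, the tokenising regular expression and the toy text are all assumptions made for illustration; a real run would read the downloaded corpus and use 100 chunks.

```python
import re

def split_into_chunks(words, n_chunks):
    """Split a word list into n_chunks consecutive, nearly equal parts."""
    size, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional word each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks

def chunk_frequencies(chunks, target):
    """Count exact, case-sensitive occurrences of target in every chunk."""
    return [chunk.count(target) for chunk in chunks]

# Toy text standing in for the corpus; three chunks instead of 100.
text = "The old man and the son of the old king have gold . Kings of old have sons"
words = re.findall(r"[A-Za-z']+", text)  # assumed tokenisation: runs of letters
chunks = split_into_chunks(words, 3)

freq_of = chunk_frequencies(chunks, "of")
freq_have = chunk_frequencies(chunks, "have")
freq_old = chunk_frequencies(chunks, "old")
print(freq_of, freq_have, freq_old)
```

Because whole tokens are compared, "gold" is not counted as "old", and "Kings" and "king" remain different words, in line with the note above.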

3 - Zipf's law

Use the same data set as in assignment 2. Create a data frame with the vectors "word", "rank" and "n", where "word" is the word, "n" is its total frequency, and "rank" is its position in the frequency order, so that rank=1 is the most frequent word, rank=2 is the second most frequent, and so on. For ranks from 1 to 500, demonstrate Zipf's law: plot the logarithmic frequency (log(n)) on the y-axis and the logarithmic rank (log(rank)) on the x-axis.

For a theoretical reference, generate a corresponding frequency vector m = C / rank and plot it in the same picture as the real frequency against rank. Pick a reasonable value of C, for instance 7000.
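
The rank table and the reference curve m = C / rank can be sketched as below, in Python with the standard library (the exam itself expects an R data frame, and plotting is omitted). The function names and the alphabetical tie-breaking between equally frequent words are assumptions, not part of the assignment.

```python
import math
from collections import Counter

def rank_frequency_table(words):
    """Return rows with 'word', 'rank' and 'n', most frequent word first."""
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"word": w, "rank": r, "n": n}
            for r, (w, n) in enumerate(ordered, start=1)]

def zipf_reference(ranks, C):
    """Theoretical frequencies m = C / rank."""
    return [C / r for r in ranks]

# Toy word list standing in for the corpus.
table = rank_frequency_table("a a a a b b c".split())
ranks = [row["rank"] for row in table]

# Points for the log-log plot: (log(rank), log(n)).
log_points = [(math.log(row["rank"]), math.log(row["n"])) for row in table]
reference = zipf_reference(ranks, C=7000)
print(table)
print(reference)
```

On log-log axes the reference m = C / rank is a straight line with slope -1, since log(m) = log(C) - log(rank).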

4 - Data manipulation, cross tabulation

Use the same data set as in assignment 3.

For each word w, "freq(c,w)" is the frequency of w in chunk c. The frequency-wise chunk count "nchunks(n,w)" is the number of chunks in which the word w occurs exactly n times.

Assume that nchunks(n,w) follows a Poisson distribution. For each of the words, compute the lambda value, that is, the estimated average frequency of the word w per chunk. Calculate the theoretical distributions in appropriate vectors and plot them. Next, calculate the text-based values of nchunks(n,w). Plot these values in the same picture as the theoretical values, using the lines() function.
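
A minimal sketch of the Poisson comparison, in Python with the standard library (the exam itself expects R, and plotting is omitted). Lambda is estimated as the mean per-chunk frequency; the per-chunk frequencies and all function names are hypothetical.

```python
import math
from collections import Counter

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def observed_nchunks(freqs):
    """nchunks(n, w): number of chunks in which the word occurs exactly n times."""
    counts = Counter(freqs)
    return [counts.get(n, 0) for n in range(max(freqs) + 1)]

def expected_nchunks(freqs):
    """Estimate lambda as the mean frequency and scale the pmf by the chunk count."""
    lam = sum(freqs) / len(freqs)
    expected = [len(freqs) * poisson_pmf(n, lam) for n in range(max(freqs) + 1)]
    return lam, expected

# Hypothetical freq(c, w) values for one word in ten chunks.
freqs = [0, 1, 1, 2, 0, 3, 1, 2, 1, 1]
observed = observed_nchunks(freqs)
lam, expected = expected_nchunks(freqs)
print(lam, observed, [round(e, 2) for e in expected])
```

The expected values do not sum to the number of chunks exactly, because the Poisson distribution also puts some probability on counts larger than the observed maximum.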

5 - Statistical testing

a) Use the frequency/chunk data from assignment 4. Are the words Poisson-distributed chunk-wise? Compare the real count values and the theoretical distribution with the ks.test() function. You may need the jitter() function to add a small amount of random noise to the real count values, because tied values trigger a warning in ks.test().

b) Use the frequency/rank data from assignment 3. Does the distribution follow Zipf's law? Examine the chisq.test() function implemented in R. Calculate a vector p of theoretical probabilities and use the rescale.p=TRUE option. Compare the real word frequencies with the theoretical probabilities in the rank range from 20 to 40. What is the result for the rank range from 10 to 20?
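
To make the two test statistics concrete, the sketch below computes them by hand in Python with the standard library. It is not a substitute for R's ks.test() and chisq.test(): no p-values are computed, the Kolmogorov-Smirnov statistic is only approximate for discrete data, and the function names and sample values are hypothetical.

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def ks_statistic_poisson(freqs, lam):
    """Largest gap between the empirical CDF and the Poisson CDF."""
    size = len(freqs)
    d, cdf = 0.0, 0.0
    # Both CDFs are step functions jumping at integers, so checking integers suffices.
    for k in range(max(freqs) + 1):
        cdf += poisson_pmf(k, lam)
        ecdf = sum(1 for f in freqs if f <= k) / size
        d = max(d, abs(ecdf - cdf))
    return d

def zipf_chi_square(observed, first_rank):
    """Chi-square statistic against probabilities proportional to 1 / rank."""
    ranks = range(first_rank, first_rank + len(observed))
    weights = [1 / r for r in ranks]
    # Normalise the weights to sum to 1, mirroring the effect of rescale.p=TRUE.
    probs = [w / sum(weights) for w in weights]
    total = sum(observed)
    expected = [total * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1  # statistic and degrees of freedom

freqs = [0, 1, 1, 2, 0, 3, 1, 2, 1, 1]  # hypothetical per-chunk counts
d = ks_statistic_poisson(freqs, sum(freqs) / len(freqs))

stat_fit, df = zipf_chi_square([600, 300, 200], first_rank=1)  # exactly Zipfian
stat_flat, _ = zipf_chi_square([300, 300, 300], first_rank=1)  # flat counts
print(d, stat_fit, stat_flat, df)
```

A statistic near zero means the observed counts match the expected ones closely; the flat counts give a much larger value than the exactly Zipfian ones.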

-- SeppoNyrkko - 29 Nov 2007

Topic revision: r2 - 2007-12-15 - SeppoNyrkko