HyClt255s2007homeExam (2007-12-15, SeppoNyrkko)

Please answer in English or Finnish.

Submit your results in PDF format to the course teacher and the assistant before December 14, 2007.

1. a) What are discrete and continuous distributions?

b) Describe three distributions and give examples of how they appear in statistical language processing.

c) Describe the mean, variance, median and quantile.

d) Is it possible for the mean to equal the median? What does that imply?
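As a small illustration of the quantities named in c) and d), the following Python sketch computes them for two made-up samples using only the standard library. The sample values and variable names are assumptions chosen for the example; the course itself works in R, so this is only a sketch of the definitions, not a model answer.

```python
import statistics

# Two made-up samples (hypothetical values, not corpus data).
symmetric = [1, 2, 3, 4, 5]     # evenly spread around the centre
skewed = [1, 2, 3, 4, 100]      # one large outlier

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    variance = statistics.variance(data)        # sample variance (divides by n - 1)
    quartiles = statistics.quantiles(data, n=4)  # three cut points: Q1, Q2, Q3
    print(name, "mean:", mean, "median:", median,
          "variance:", variance, "quartiles:", quartiles)
```

In the symmetric sample the mean and the median coincide, while the single outlier in the skewed sample pulls the mean far away from the median.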

2. Download a sample English text corpus from Project Gutenberg. (Grimm's Fairy Tales is preferred; other texts are fine if they contain approximately 100,000 words.) Split the text into 100 chunks.

For each chunk, calculate the frequency (number of occurrences) of the words "of", "have" and "old". (You may treat singular and plural forms, as well as upper-case and lower-case forms, as different words.) For each of these three words, plot the per-chunk frequencies as a word frequency histogram, with the chunk-wise word frequency on the y-axis and the chunk number on the x-axis.
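The chunking and counting step can be sketched as follows. This is a minimal Python illustration using only the standard library; the function names `split_into_chunks` and `chunk_frequencies`, the tokenizing regular expression and the tiny stand-in text are all assumptions made for the example. A real run would read the downloaded corpus and use 100 chunks, and the plots would be drawn with the plotting tool of your choice.

```python
import re
from collections import Counter

def split_into_chunks(tokens, n_chunks=100):
    """Split a token list into n_chunks consecutive chunks of near-equal size."""
    size, extra = divmod(len(tokens), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional token each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(tokens[start:end])
        start = end
    return chunks

def chunk_frequencies(chunks, words):
    """Return {word: [count in chunk 0, count in chunk 1, ...]}."""
    counts = [Counter(chunk) for chunk in chunks]
    return {w: [c[w] for c in counts] for w in words}

# Tiny stand-in text (hypothetical); replace with the contents of the corpus file.
text = "the king of the land and the queen of old have an old cat " * 50
tokens = re.findall(r"[A-Za-z']+", text)  # case is kept, so "Of" and "of" differ
chunks = split_into_chunks(tokens, n_chunks=10)
freq = chunk_frequencies(chunks, ["of", "have", "old"])

for word, per_chunk in freq.items():
    print(word, per_chunk)
```

Each list in `freq` holds the y-values of one histogram, with the list index serving as the chunk number on the x-axis.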

3. Use the same data set as in assignment 2. Create a data frame with the vectors "word", "rank" and "n", where "word" is the word, "n" is its total frequency, and "rank" is its position in descending frequency order, so that rank=1 is the most frequent word, rank=2 the second most frequent, and so on. For ranks 1 to 500, demonstrate Zipf's law: plot the logarithmic frequency (log(n)) on the y-axis against the logarithmic rank (log(rank)) on the x-axis.

For a theoretical reference, generate a corresponding frequency vector m = C / rank and plot it in the same figure as the observed frequency vs. rank curve. Pick a reasonable value of C, for instance 7000.
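The rank table and the reference curve m = C / rank can be sketched in Python as below. The helper names `rank_table` and `zipf_reference`, the toy token list and the toy value C = 6 are assumptions for illustration only; with the real corpus the table would be cut to ranks 1 to 500 and C chosen as suggested above.

```python
import math
from collections import Counter

def rank_table(tokens):
    """Return rows (word, rank, n) in descending frequency order; rank starts at 1."""
    ordered = Counter(tokens).most_common()
    return [(word, rank, n) for rank, (word, n) in enumerate(ordered, start=1)]

def zipf_reference(ranks, C):
    """Theoretical frequencies m = C / rank."""
    return [C / r for r in ranks]

# Toy token list (hypothetical); replace with the corpus tokens.
tokens = "a a a a a a b b b c c d".split()
table = rank_table(tokens)
ranks = [rank for _, rank, _ in table]
m = zipf_reference(ranks, C=6)

# Points for the log-log plot: x = log(rank), y = log(n).
log_points = [(math.log(rank), math.log(n)) for _, rank, n in table]

for (word, rank, n), ref in zip(table, m):
    print(word, rank, n, ref)
```

If the observed frequencies follow Zipf's law, the points in `log_points` fall close to a straight line with slope -1, parallel to the log of the reference vector.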

4. Use the same data set as in assignment 3.

For each word, the frequency "freq(c,w)" is the number of occurrences of word w in chunk c. The frequency-wise chunk count "nchunks(n,w)" is the number of chunks that contain the word w exactly n times.

Assume that nchunks(n,w) follows a Poisson distribution. For each of the words, compute the lambda value, that is, the estimated average frequency of the word w per chunk. Calculate the theoretical distributions into appropriate vectors and plot them. Next, calculate the text-based values of nchunks(n,w). Plot these values in the same figure as the theoretical values, using the lines() command.
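The comparison between observed and theoretical chunk counts can be sketched as follows, taking lambda to be the mean frequency per chunk and the expected number of chunks with n occurrences to be the number of chunks times the Poisson probability of n. The function names and the made-up per-chunk frequencies are assumptions for the example, not values from the corpus.

```python
import math
from collections import Counter

def poisson_pmf(n, lam):
    """Poisson probability P(X = n) = exp(-lam) * lam^n / n!."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def observed_nchunks(freqs):
    """nchunks(n, w): number of chunks in which the word occurs exactly n times."""
    counts = Counter(freqs)
    return [counts[n] for n in range(max(freqs) + 1)]

def expected_nchunks(freqs):
    """Expected chunk counts under a Poisson model with lambda = mean frequency."""
    lam = sum(freqs) / len(freqs)
    return [len(freqs) * poisson_pmf(n, lam) for n in range(max(freqs) + 1)]

# Made-up freq(c, w) values for one word over ten chunks (hypothetical data).
freqs = [0, 1, 1, 2, 0, 1, 3, 1, 0, 1]
observed = observed_nchunks(freqs)
expected = expected_nchunks(freqs)

for n, (o, e) in enumerate(zip(observed, expected)):
    print(n, o, round(e, 3))
```

Plotting `observed` and `expected` against n in the same figure shows how closely the word follows the Poisson assumption.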

5. a) Use the frequency/chunk data from assignment 4. Are the words Poisson-distributed chunk-wise? Compare the observed counts with the theoretical distribution using the ks.test() function. You may need the jitter() function to add a small amount of random noise to the observed counts, which breaks ties between equal values.

b) Use the frequency/rank data from assignment 3. Does the distribution follow Zipf's law? Examine the chisq.test() function implemented in R. Calculate a vector p of theoretical probabilities and use the rescale.p=TRUE option. Compare the observed word frequencies with the theoretical probabilities in the rank range from 20 to 40. What is the result for the rank range from 10 to 20?
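To make the goodness-of-fit comparison in b) concrete, the sketch below computes the Pearson chi-square statistic by hand in Python. It assumes Zipf probabilities proportional to 1 / rank, rescaled to sum to 1 over the chosen rank range, which mirrors what rescaling the probability vector does. The function names and the toy counts are hypothetical, and the p-value is left out because the standard library has no chi-square distribution function; in R, chisq.test() reports it directly.

```python
def zipf_probabilities(ranks):
    """Probabilities proportional to 1 / rank, rescaled to sum to 1."""
    weights = [1 / r for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]

def pearson_chi_square(observed, probs):
    """Return (statistic, degrees of freedom) for observed counts vs. probabilities."""
    total = sum(observed)
    expected = [total * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Toy example (hypothetical counts) for ranks 1, 2 and 4.
ranks = [1, 2, 4]
probs = zipf_probabilities(ranks)

perfect = [40, 20, 10]   # exactly proportional to 1 / rank
skewed = [50, 10, 10]    # deviates from the Zipf proportions

print(pearson_chi_square(perfect, probs))
print(pearson_chi_square(skewed, probs))
```

A statistic near zero means the observed frequencies match the Zipf proportions in that rank range, while a large statistic relative to the degrees of freedom speaks against the fit.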

-- SeppoNyrkko - 29 Nov 2007


Topic revision: r2 - 2007-12-15 - SeppoNyrkko

