Additional Comments

Additional comments related to material from the class. If anyone wants to convert this to a blog, let me know. These additional remarks are for your enjoyment, and will not be on homeworks or exams. These are just meant to suggest additional topics worth considering, and I am happy to discuss any of these further.

Tuesday, December 13. The last class -- so sad! Lots of fun topics today, from some card tricks to gambling.
- Lots of great card tricks. The one I did first is called the Kruskal Count Card Trick (but I had forgotten the name and boy did it take a while to find!). Links to explanations about why it works are here. Jeff Lagarias has a nice analysis here. Turns out this has applications in cryptography.
- The second trick is the 5 card trick of Fitch Cheney (scroll down at this great site). A great write-up is here.
- We then talked about the mathematics of blackjack and gambling. Here are some good links.
  - We discussed card counting and blackjack; click here for the basic strategy for blackjack (see also the wikipedia article). It's important to keep the system simple and easy to use and implement. Several papers on the subject are linked to below:
  - Baldwin et al.: Optimal Strategy in Blackjack (paper here)
  - Thorp's original article on blackjack (his book is available here: Thorp's book)
  - Thorp's article on the Kelly criterion in blackjack and other gambling situations.
  - classmates' summaries of these articles are available here
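
To get a feel for why the Kruskal Count works so often, here is a minimal simulation sketch. It assumes one common convention (ace counts 1, number cards count their face value, and J/Q/K count 5); the notes above do not say which convention was used in class, so treat `faces=5`, the helper names, and the number of trials as illustrative choices.

```python
import random

def card_values(faces=5):
    """One 52-card deck as step sizes: ace = 1, 2-10 at face value, J/Q/K = faces (assumed)."""
    ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, faces, faces, faces]
    return ranks * 4

def final_position(deck, start):
    """Follow the chain from index start: jump ahead by the value of the
    current card until the next jump would run off the end of the deck."""
    pos = start
    while pos + deck[pos] < len(deck):
        pos += deck[pos]
    return pos

def success_rate(trials=2000, faces=5, seed=0):
    """Fraction of shuffles in which a secret start among the first 10 cards
    and the magician's start at the first card end on the same card."""
    rng = random.Random(seed)
    deck = card_values(faces)
    hits = 0
    for _ in range(trials):
        rng.shuffle(deck)
        secret = final_position(deck, rng.randrange(10))
        magician = final_position(deck, 0)
        hits += (secret == magician)
    return hits / trials

print(success_rate())
```

Once the two chains land on a common card they stay together forever, which is why the trick succeeds far more often than intuition suggests. Try `faces=10` to see how larger step sizes hurt the magician.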

Thursday, December 8. We discussed mathematical modeling and probability today. It's a vast subject, and this is meant to just give you the briefest of intros to this enormously important area.
- We started with modeling whale populations. A variant of the whale population is online here (I lectured on this in differential equations, and that day I had a slightly different choice of when whales reproduce).
- We saw how important linear algebra can be in these problems. Further, we saw that products of matrices enter. If the matrices are all the same, life is good. We can often diagonalize and attack quickly. If the matrices have variable entries, however, it's harder (but more realistic). In this case we need more powerful techniques. This leads to Random Matrix Theory, one of my favorite subjects of research. There are lots of surveys; here are two nice ones.
  - Hayes: The Spectrum of Riemannium: a light description of the connection between random matrix theory and number theory (there are a few minor errors in the presentation, basically to simplify the story). This is a quick read, and gives some of the history.
  - Firk and Miller: Nuclei, primes and the Random Matrix connection: a survey paper on the history of the subject, including both the nuclear physics experiments and the theoretical calculations.
- We talked about Markov or stochastic matrices. The wikipedia entry is a nice start. If the matrix has all positive entries, the eigenvector associated to the largest eigenvalue has all positive entries, and this is physically important. This is known as the Perron-Frobenius theorem. Sometimes this is called a transition matrix. These arise in a variety of problems (think orbital states in atoms, for example). The analysis is a lot easier when the matrix elements are fixed (see here for more on random matrices).
- Here's a fun deterministic system that looks random, Conway's The Game of Life (the Wikipedia article is here). This leads to the beautiful theory of Cellular Automata.
- As we talked about Bacon and Erdos numbers, figured I'd put a few bits about these. It's actually a very interesting and challenging problem to compute Bacon numbers!
  - Bacon numbers (also see the Oracle of Bacon)
  - Erdos numbers (my number is 2 through http://www.ams.org/mathscinet-getitem?mr=2815206 and http://www.ams.org/mathscinet-getitem?mr=880469 (though my colleague has multiple papers one could use); MathSciNet has a searchable feature to find this: go to http://www.ams.org/mathscinet/freeTools.html and click on collaboration distance). (My Einstein number is 4.) There's a nice website at Oakland University on Erdos Numbers.
  - Erdos - Bacon numbers
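
The Perron-Frobenius statement for Markov matrices can be seen numerically by repeatedly multiplying a probability vector by the transition matrix. The sketch below is a plain power iteration; the 3-state matrix `P` is a made-up example (not from class), chosen only so that every entry is positive and every row sums to 1.

```python
def step(P, v):
    """One step of the chain: returns the row vector v P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iterations=200):
    """Power iteration starting from the uniform distribution."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = step(P, v)
    return v

# Hypothetical transition matrix: all entries positive, each row sums to 1.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

pi = stationary(P)
print(pi)
```

The limiting vector has all positive entries and is unchanged by another multiplication by `P`, so it is an eigenvector with eigenvalue 1. When the matrices vary from step to step, no such simple iteration is available, which is where the harder theory enters.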

Tuesday, December 6. Today was a payoff day. After developing a lot of the general theory of probability, we were able to use it to solve and analyze problems of practical import, specifically, Benford's law of digit bias. My slides are available here.
- Before describing Benford's law, first, The Van Halen Brown M&M story (see also here). I love this -- it does a great job of showing the importance of checking conditions!
- Several good papers: Hill's The first digit phenomenon; Nigrini's I've got your number.
- We saw that small data sets can be misleading. For example, there were fewer 9s than predicted for the first 60 terms in the sequence {2^n}, but we saw that this was due to the fact that 2^10 is approximately 10^3, and thus the set {leading digit of 2^n base 10} is almost, but not quite, periodic with period 10. We saw periodic behavior in powers of π, due to the fact that π¹⁷⁵ is almost a power of 10. The convergence to Benford's law is controlled by how well approximated an irrational number is by rationals; this is a fascinating topic, and worthy of further study and thought. We measure how well approximated irrationals are by rationals by seeing how large a denominator we need to get a given order of accuracy. This leads to the irrationality exponent (or irrationality measure); in fact, this idea is used to prove that Liouville numbers are transcendental numbers. If you would like to know more about these, let me know and I'll provide Chapter 5 of my book.
- The key ingredient in proving many systems are Benford is to show that if x_n is the original data set, then y_n = log_10 x_n is equidistributed modulo 1. How do we prove this? If x_n = a^n for some fixed a, then y_n = n log_10 a. A theorem of Kronecker (generalized by Weyl) states that n alpha mod 1 is equidistributed if and only if alpha is irrational (in addition to the analysis and number theory proofs, there is also an ergodic proof). For some problems, it isn't enough to know that it becomes equidistributed, but we also need to know how rapidly it becomes equidistributed; in many instances this is answered by the theory of linear forms in logarithms. This is frequently related to how well certain irrationals are approximated by rationals. In my paper with Alex Kontorovich on the 3x+1 problem, the key step in proving Benford behavior was showing that log_10 2 has finite irrationality exponent (we bounded it by about 10⁶⁰², a very large but also a very finite number!).
  - Click here for my paper with Alex Kontorovich on 3x+1 and Benford (as well as zeta(s)).
- To determine if the observed data is well described by our prediction, it is common to use a chi-square test (click here for a nice online chi-square calculator). There is a lot of beautiful theory on such tests; my favorite involves structural zeros (what happens when certain events cannot be observed, such as a tie in a non-Selig sanctioned baseball game). If you are interested, let me know and I can send you some papers which discuss the theory; it is briefly mentioned in my baseball paper.
- The proof of denseness of n alpha mod 1 for alpha irrational is significantly easier than equidistribution, involving Dirichlet's Pigeonhole Principle.
- We showed linear recurrence relations are Benford (or we mostly showed this) so long as the largest root of the characteristic polynomial exceeds 1. A nice exercise is to do this calculation rigorously; this is done in Chapter 9 of my book.
- For more on the hydrology data and Benford's law, see my paper with Mark Nigrini (and see the references there for Mark Nigrini's papers on tax fraud). Our newest paper with a new Benford test just appeared (the mathematics is proved in a separate paper, available here).
- Another way to attack Benford's law is via the Central Limit Theorem modulo 1. I prove this in detail in this paper. The proof given of the CLT modulo 1 is not the most general result possible, as we will assume the Y_i's have finite variances -- this is not needed, as is shown in our paper! The proof is a bit harder (not surprisingly), but our friend the Cauchy distribution is not forbidden!
  - There are other generalizations of the central limit theorem. One particularly nice version involves Haar measure. Consider the set of N x N unitary matrices U(N), or its subgroups the orthogonal matrices and the symplectic matrices. It turns out there is a way to define a probability measure on these spaces (this is the Haar measure), and there are generalizations of the central limit theorem in these contexts: The n-fold convolution of a regular probability measure on a compact Hausdorff group G converges to normalized Haar measure in the weak-star topology if and only if the support of the distribution is not contained in a coset of a proper normal closed subgroup of G.
- For convenience, the following is a collection of the papers I've written on Benford's law. As you can tell, I love the subject. There are many problems that are very amenable to undergraduate investigations; if you want to try your hand at research, let me know.
  - Benford's law, values of L-functions and the 3x+1 problem (with Alex Kontorovich), Acta Arithmetica. (120 (2005), no. 3, 269–297). pdf.
  - Benford's Law applied to hydrology data - results and relevance to other geophysical data (with Mark Nigrini), Mathematical Geology (39 (2007), no. 5, 469--490). pdf
  - The Modulo 1 Central Limit Theorem and Benford's Law for Products (with Mark Nigrini), International Journal of Algebra. (2 (2008), no. 3, 119--130). pdf
  - Order statistics and Benford's law (with Mark Nigrini), International Journal of Mathematics and Mathematical Sciences (Volume 2008 (2008), Article ID 382948, 19 pages, doi:10.1155/2008/382948) pdf
  - Chains of distributions, hierarchical Bayesian models and Benford's Law (with D. Jang, J. U. Kang, A. Kruckman and J. Kudo), Journal of Algebra, Number Theory: Advances and Applications. (volume 1, number 1 (March 2009), 37--60) pdf
  - Data diagnostics using second order tests of Benford's Law (with Mark Nigrini), Auditing: A Journal of Practice and Theory. (28 (2009), no. 2, 305--324. doi: 10.2308/aud.2009.28.2.305) MSWord file
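
The remark about small data sets and powers of 2 is easy to check by machine. The sketch below tallies leading digits of 2^n using exact integer arithmetic and compares them with the Benford probabilities log_10(1 + 1/d); the function names and the choice of 5000 terms for the larger run are illustrative.

```python
import math
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def observed_frequencies(count):
    """Leading-digit frequencies of 2^1, ..., 2^count (exact integers)."""
    tally = Counter(leading_digit(2 ** n) for n in range(1, count + 1))
    return {d: tally[d] / count for d in range(1, 10)}

def benford(d):
    """Benford probability of leading digit d."""
    return math.log10(1 + 1 / d)

for count in (60, 5000):
    freq = observed_frequencies(count)
    print(count, [(d, round(freq[d], 4), round(benford(d), 4)) for d in range(1, 10)])
```

Among the first 60 powers of 2 only 2^53 starts with a 9, so the observed frequency is 1/60, well below the Benford value of about 0.046; with more terms the near-periodicity washes out and the frequencies settle close to the predictions.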

Tuesday, November 29. We discussed sample means and variances, covariances, and the secretary / marriage problem.
- The sample mean should converge to a good estimate of the population mean, at least in many situations; similarly the sample variance should estimate the population variance. Note the last formula there, where there is division by n-1 and not n. Based on our conversation in class, this shouldn't be surprising as we know that the `degrees of freedom' cannot be n. Why? While we can estimate the mean with just one observation, we can't estimate deviations with just one value -- there are no deviations with just one observation! If you want more, read the following comment (an expanded version of this).
- There is a deep and rich theory of sums of normal random variables (and their squares), which is described in greater detail in a statistics class.
  - The sample mean is defined by X = Sum_{i = 1 to N} X_i / N and the sample variance by S² = Sum_{i = 1 to N} (X_i - X)² / (N-1). The main theorem is that, for independent normal random variables with variance σ², the quantity (N-1) S² / σ² has a chi-square distribution with N-1 degrees of freedom. It is not immediately clear why we divide by N-1 and not N; after all, there are N data points, and we do divide by N for the variance of a finite set of data. There are valid statistical reasons for this (wanting an unbiased estimator; I strongly urge you to read the wikipedia entry, as there is a nice bit on the proof, using (what else) adding zero; see also Cochran's theorem). I use the following heuristic to explain why it's N-1 and not N; namely, consider the extreme case of N=1. In this case, while one observation can be used to estimate the true mean, it is absurd to think one observation can be used to estimate the true variance! The reason is that we need to look at differences, at fluctuations about the mean, to get a handle on the variance -- how can we do this with just one data point?
  - A major theorem is that the sample mean and sample variance are independent. This is not at all clear from the definition (as the sample variance involves the mean). This leads to studying the statistic t = (X - μ) / (S / sqrt(N)); this is known as the t-statistic and has the t-distribution with N-1 degrees of freedom (here μ is the mean of the identically distributed normal random variables). As N tends to infinity this converges to the standard normal, but it is very useful for finite N when we have independent Gaussian random variables with unknown variance.
- We then talked about covariances, which are needed to compute variances of sums of dependent random variables. We saw how the formula reduced to our old result under independence. We constantly used linearity of expectation, but were very careful never to say the expected value of a product is the product of the expected values, as that only holds under independence. Good upper and lower bounds for the covariance come from the Cauchy-Schwarz inequality. A good statistic is to look at a normalized covariance, called the correlation coefficient (remember correlation does not imply causation!).
- There's a really nice wikipedia article on the secretary / marriage problem. Somewhat related to this is the German tank problem.
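
The division by N-1 rather than N can be checked exactly on a toy example. The sketch below takes a hypothetical population (the six faces of a fair die), enumerates every ordered sample of size 3 drawn with replacement, and averages the two candidate variance estimators using exact fractions; the names `average` and `variance` and the parameter `ddof` are illustrative.

```python
from fractions import Fraction
from itertools import product

def average(values):
    """Exact average of a sequence of integers or Fractions."""
    return Fraction(sum(values)) / len(values)

def variance(xs, ddof):
    """Sum of squared deviations from the mean, divided by len(xs) - ddof."""
    m = average(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

population = [1, 2, 3, 4, 5, 6]            # hypothetical: one fair die
sigma2 = variance(population, ddof=0)      # true population variance

samples = list(product(population, repeat=3))
mean_s2_unbiased = average([variance(s, ddof=1) for s in samples])
mean_s2_biased = average([variance(s, ddof=0) for s in samples])

print(sigma2, mean_s2_unbiased, mean_s2_biased)
```

Averaged over all samples, dividing by N-1 reproduces the population variance exactly, while dividing by N comes out too small by the factor (N-1)/N.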

Tuesday, November 22. Today was an introduction to mathematical modeling.
Today's lecture serves two purposes. While it does review many of the concepts from probability, more importantly it introduces many of the key ideas and challenges of mathematical modeling. Most students of 342 won't be computing probabilities and/or integrals later in life (though you never know!); however, almost surely you'll have a need to model, to try and describe a complex phenomenon in a tractable manner.
- Sabermetrics is the `science' of applying math/stats reasoning to baseball. There is a simple formula known as the log-5 method; a better formula is the Pythagorean Won - Loss formula (someone linked my paper deriving this from a reasonable model to the wikipedia page), the topic of today's lecture. ESPN, MLB.com and all sites like this use the Pythagorean win expectation in their expanded series. My derivation is a nice exercise in multivariable calculus and probability.
- In general, it is sadly the case that most functions do not have a simple closed form expression for their anti-derivative. Thus integration is orders of magnitude harder than differentiation. One of the most famous functions that cannot be integrated in closed form is exp(-x²), which is related to calculating areas under the normal (or bell or Gaussian) curve. We do at least have good series expansions to approximate it; see the entry on the erf (or error) function.
  - Earlier in the semester we mentioned that the anti-derivative of ln(x) is x ln(x) - x; it is a nice exercise to compute the anti-derivative for (ln(x))ⁿ for any positive integer n. For example, if n=4 we get 24 x - 24 x ln(x) + 12 x ln(x)² - 4 x ln(x)³ + x ln(x)⁴.
- Another good distribution to study for sabermetrics would be a Beta Distribution. We've seen an example already this semester when we looked at the Laffer curve from economics. I would like to try to modify the Weibull analysis from today's lecture to Beta distributions. The resulting integrals are harder -- if you're interested please let me know.
- Today we discussed modeling, in particular, the interplay between finding a model that captures the key features and one that is mathematically tractable. While we used a problem from baseball as an example, the general situation is frequently quite similar. Often one makes simplifying assumptions in a model that we know are wrong, but lead to doable math (for us, it was using continuous probability distributions in general, and in particular the three parameter Weibull). For more on these and related models, my baseball paper is available here; another interesting read might be my marketing paper for the movie industry (which is a nice mix of modeling and linear programming, which is the linear algebra generalization of Lagrange multipliers).
  - One of the most important applications of finding areas under curves is in probability, where we may interpret these areas as the probability that certain events happen. Key concepts are:
    - Probability distribution
    - Mean or Expected Value
    - Standard Deviation
    - Independence
    - Skewness and kurtosis (for the hypercompetitive students who really want to compare themselves to the class)
  - The more distributions you know, the better chance you have of finding one that models your system of interest. Weibulls are frequently used in survival analysis. The exponential distribution occurs in waiting times in lines as well as prime numbers.
  - In seeing whether or not data supports a theoretical contention, one needs a way to check and see how good of a fit we have. Chi-square tests are one of many methods.
  - Much of the theory of probability was derived from people interested in games of chance and gambling. Remember that when the house sets the odds, the goal is to try and get half the money bet on one team and half the money on the other. Not surprisingly, certain organizations are very interested in these computations. Click here for some of the details on the Bulger case.
  - Any lecture on multivariable calculus and probabilities would be remiss if it did not mention how unlikely it is to be able to derive closed form expressions; this is why we study Monte Carlo integration later. For example, the normal distribution is one of the most important in probability, but there is no nice anti-derivative. We must resort to series expansions; that expansion is so important it is given a name: the error function.
  - I strongly urge you to read the pages where we evaluate the integrals in closed form. The methods to get these closed form expressions occur frequently in applications. I particularly love seeing relations such as 1/c = 1/a + 1/b; you may have seen this in resistors in parallel or perhaps the reduced mass from the two body problem (masses under gravity). Extra credit to anyone who can give me another example of quantities with a relation such as this.
  - Click here for a clip of Plinko on the Price I$ Right, or here for a showcase showdown.
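
For the exercise on the anti-derivative of (ln(x))ⁿ mentioned above, integrating by parts gives the recursion I_n = x (ln x)ⁿ - n I_{n-1} with I_0 = x. The sketch below turns that recursion into coefficients and checks the n = 4 answer numerically; the function names are illustrative, and the numerical derivative is only a sanity check, not a proof.

```python
import math

def log_power_coeffs(n):
    """Coefficients c[0..n] with  integral of ln(x)^n dx = x * sum_k c[k] * ln(x)^k."""
    coeffs = [1]                                   # n = 0: the anti-derivative of 1 is x
    for m in range(1, n + 1):
        coeffs = [-m * c for c in coeffs] + [1]    # I_m = x ln(x)^m - m I_{m-1}
    return coeffs

def antiderivative(n, x):
    """Evaluate the closed form anti-derivative of ln(x)^n at x > 0."""
    log_x = math.log(x)
    return x * sum(c * log_x ** k for k, c in enumerate(log_power_coeffs(n)))

print(log_power_coeffs(4))

# Sanity check: a central difference of the anti-derivative should be close to ln(x)^4.
x, h = 3.0, 1e-5
slope = (antiderivative(4, x + h) - antiderivative(4, x - h)) / (2 * h)
print(slope, math.log(x) ** 4)
```

The coefficients for n = 4 are 24, -24, 12, -4, 1 in increasing powers of ln(x), matching the expression quoted above.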

Thursday, November 17. We proved some of the key properties of a chi-square random variable and discussed applications.
- The big item is that the mean of a chi-square random variable with nu degrees of freedom is nu, and the variance is 2nu. There are lots of ways to do this. We did one proof by brute force integration, using the Gamma function to recognize the integral. We then did a slick proof by studying the case where nu=2, which is an exponential distribution. We could easily compute the mean and variance here by integrating by parts. From this, we can get the mean of a chi-square with one degree of freedom by using linearity of expectation; this is a lot cleaner than doing that calculation directly, as there we needed to be very clever in our change of variables (u = sqrt(x)). We then used linearity of expectation again to get general integer nu. Again, while this proof only works for integer nu, it is nice how we can avoid these painful integrals by knowing just the exponential.
- We talked a bit about the exponential distribution and its applications. We discussed how there are two different normalizations, each with its supporters. It comes down to what you view as the primary perspective. Interestingly, Wikipedia does the opposite from class, but lists ours as the alternative parametrization. This is similar to some of the issues we have in the definition of the Gamma function and the chi-square density -- we can't have one normalization that's great for all problems, and have to make choices.
- Why spend so much time on chi-square? As mentioned in the last additional comments, a big application of this material is to the Chi-square test. Here's a link to a nice online chi-square calculator for such tests. For this test to work, we need to make a lot of assumptions about how the errors are distributed (normal, independent). These assumptions are not always correct, but if they hold they do lead to a mathematically tractable problem (and that's what really matters, right?).
- At the end of the day we talked a bit about the Goodness of Fit test. This shows us that we cannot always have independent errors, as sometimes there are constraints (such as the sum of the errors is zero if we're assigning data to a fixed set of boxes). Another example of issues such as this is structural zeros, which we'll discuss on Tuesday. Other similar issues are sample variance woes....
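
The claims that a chi-square with nu degrees of freedom has mean nu and variance 2nu can be sanity-checked by crude numerical integration. The sketch below assumes the standard density x^(nu/2 - 1) e^(-x/2) / (2^(nu/2) Γ(nu/2)), which the notes above do not write out, and uses a simple midpoint rule; the cutoff at 100 and the step count are arbitrary illustrative choices, and nu = 1 is left out because the density blows up at the origin.

```python
import math

def chi_square_density(x, nu):
    """Assumed standard chi-square density with nu degrees of freedom, for x > 0."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

def moment(k, nu, upper=100.0, steps=50_000):
    """Midpoint-rule approximation of E[X^k] on (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** k * chi_square_density(x, nu)
    return total * h

def mean_and_variance(nu):
    m1 = moment(1, nu)
    m2 = moment(2, nu)
    return m1, m2 - m1 ** 2

for nu in (2, 3, 5):
    print(nu, mean_and_variance(nu))
```

The printed pairs should sit very close to (nu, 2 nu), which is a useful check on the normalization constant as well as on the moments.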

Tuesday, November 15. We discussed more properties of the chi-square distribution, especially ways to prove that we have probability distributions by using the method of normalization to `see' that certain integrals must be 1.
- We could bypass a lot of the theory of normalization constants by knowing properties of the Beta distribution, but of course our approach allows us to prove properties about the Beta distribution!
- We avoid a lot of the difficulties with the Change of Variables formula, in particular, high dimensional spherical coordinates. Consider the hypersphere. The article there includes both the coordinates as well as the formulas for the hypervolume and hyper-surface area.
- A big application of this material is to the Chi-square test. Here's a link to a nice online chi-square calculator for such tests.

Thursday, November 10. The Method of Least Squares is a beautiful method applicable to a variety of problems. By choosing our metric for measuring errors well, we can use tools from calculus and linear algebra to write down closed form expressions for the best fit parameters. If the errors are normally distributed, we can even write down formulas for the distributions of the best fit values. The best fit value of the parameters depends on how we choose to measure errors. It is very important to think about how you are going to measure / model, as frequently people reach very different conclusions because they have different starting points / different metrics.
- The Method of Least Squares is one of my favorites in statistics (click here for the Wikipedia page, and click here for my notes). The Method of Least Squares is a great way to find best fit parameters. Given a hypothetical relationship y = a x + b, we observe values of y for different choices of x, say (x1, y1), (x2, y2), (x3, y3) and so on. We then need to find a way to quantify the error. It's natural to look at the observed value of y minus the predicted value of y; thus it is natural that the error should be Sum_{i=1 to n} h(yi - (a xi + b)) for some function h. What is a good choice? We could try h(u) = u, but this leads to sums of signed errors (positive and negative), and thus we could have many errors that are large in magnitude canceling out. The next choice is h(u) = |u|; while this is a good choice, it is not analytically tractable as the absolute value function is not differentiable. We thus use h(u) = u^2; though this assigns more weight to large errors, it does lead to a differentiable function, and thus the techniques of calculus are applicable. We end up with a very nice, closed form expression for the best fit values of the parameters.
- Unfortunately, the Method of Least Squares only gives these closed form answers for relations that are linear in the unknown parameters. As a great exercise, try to find the best fit values of a and c to y = c/x^a (for definiteness you can think of this as the force due to two unit masses that are x units apart). When you take the derivative with respect to a and set that equal to zero, you won't get a tractable equation that is linear in a to solve. Fortunately there is a work-around. If we change variables by taking logarithms, we find ln(y) = ln(c/x^a); using logarithm laws this is equivalent to ln(y) = -a ln(x) + ln(c); setting Y = ln(y), X = ln(x), A = -a and b = ln(c) this is equivalent to Y = A X + b, which is exactly the formulation we need! This example illustrates the power of logarithms; it allows us to transform our data and apply the Method of Least Squares.
- There are many examples of power laws in the world. Many of my favorites are related to Zipf's law. The frequency of the most common words in English is a fascinating problem (click here for the data; see also this site); this works for other languages as well, for the size of the most populous cities, ...; if you consider more general power laws, you also get Benford's law of digit bias, which is used by the IRS to detect tax fraud (the link is to an article by a colleague of mine on using Benford's law to detect fraud). The power law relation is quite nice, and initially surprising to many. My Mathematica program analyzing this is available here. See also this paper by Gabaix for Zipf's law and the growth of cities. As a nice exercise, you should analyze the growth of city populations (you can get data on both the US and the world from Wikipedia).
- We discussed Kepler's Three Laws of Planetary Motion (the Wikipedia article is very nice). Kepler was proudest (at least for a long time) of Mysterium Cosmographicum (I strongly urge you to read this; yes, the same Kepler whom we revere today for his understanding of the cosmos also advanced this as a scientific theory -- times were different!).
- Finally, a theme of the past two days is the importance of how we choose to measure things; how we model and how we judge the model's prediction will greatly affect the answer. In a similar spirit, I thought I would post a brief note about Oulipo, a type of mathematical poetry (this is a link to the Wikipedia page, which has links to examples). There was a nice article about this recently in Math Horizons (you can view the article here). This is a nice example of the intersection of math and the arts, and discusses how the structure of a poem affects the output, and what structures might lead to interesting works.
- Since we did Family Feud, here's one of the best people ever for the final.....
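
The least squares discussion above, including the logarithm trick for power laws, can be sketched in a few lines. The closed form for the line y = a x + b comes from setting the two partial derivatives of the sum of squared errors to zero; the data below are made up and noiseless (generated from y = 3/x^2), purely to show that the log-log fit recovers the exponent and the constant.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b

def fit_power_law(xs, ys):
    """Fit y = c / x^a by fitting the line ln(y) = -a ln(x) + ln(c)."""
    slope, intercept = fit_line([math.log(x) for x in xs],
                                [math.log(y) for y in ys])
    return -slope, math.exp(intercept)

xs = [1, 2, 3, 4, 5]
ys = [3 / x ** 2 for x in xs]     # hypothetical noiseless data: a = 2, c = 3
a, c = fit_power_law(xs, ys)
print(a, c)
```

One design point worth remembering: the log-log fit minimizes squared errors in ln(y), not in y, so with noisy data it answers a slightly different question than a direct nonlinear fit would.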

Tuesday, November 8. The big concept today is the Gamma function, generalizing the factorial function. We saw several uses, from the moments of the standard normal to defining chi-square random variables.
- We considered the Gamma function, which generalizes the standard factorial function. We gave a proof of its functional equation, Γ(s+1) = sΓ(s); this allows us to take the Gamma function (initially defined only when the real part of s is positive) and extend it to be well-defined for all s other than the non-positive integers. For more on the Gamma function and another proof of the value of Γ(1/2), see my (sadly handwritten) lecture notes. This approach uses the Beta distribution.
- One nice application of the Gamma function and normalization constants is a proof of Wallis' formula, which says π/2 = (2·2 / 1·3) (4·4 / 3·5) (6·6 / 5·7) ···. I have a proof which is mostly elementary (see my article in the American Mathematical Monthly). Not surprisingly, the proof uses one of my favorite techniques, the theory of normalization constants (caveat: it does have one advanced ingredient from measure theory, namely Lebesgue's Dominated Convergence Theorem).
- We talked about chi-square random variables (the square of a standard normal is a chi-square with 1 degree of freedom). This is very important in finding best fits to data, and leads to the chi-square test. We talked a bit about polar and spherical coordinates. We'll eventually see how these arise in finding the densities of chi-square random variables with 2 or 3 degrees of freedom, and then discuss how to avoid using them by the theory of normalization constants. You can click here to see how messy the coordinates become!
- There are nice formulas for the volume and area of n-dimensional spheres. Interestingly, there are connections between how many spheres can be packed into a given space and codes in information theory!
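
Several of the Gamma function facts from today can be checked numerically with Python's built-in `math.gamma`. The sketch below tests the functional equation, the value Γ(1/2) = sqrt(π), partial products of Wallis' formula, and the standard volume formula π^(n/2) / Γ(n/2 + 1) for the unit ball in n dimensions (that formula is stated here as an assumption, since the notes only link to it).

```python
import math

def wallis_product(terms):
    """Partial product of (2k)(2k) / ((2k-1)(2k+1)) for k = 1..terms."""
    product = 1.0
    for k in range(1, terms + 1):
        product *= (2 * k) * (2 * k) / ((2 * k - 1) * (2 * k + 1))
    return product

def unit_ball_volume(n):
    """Volume of the unit ball in n dimensions (assumed standard formula)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

s = 2.5
print(math.gamma(s + 1), s * math.gamma(s))       # functional equation
print(math.gamma(0.5) ** 2, math.pi)              # Gamma(1/2)^2 = pi
print(wallis_product(100_000), math.pi / 2)       # slow convergence to pi/2
print(unit_ball_volume(2), unit_ball_volume(3))   # pi and 4 pi / 3
```

The Wallis product converges slowly (the error after N terms is roughly π/(8N)), so it is a lovely formula but a poor way to compute π.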

Friday, November 4. All good things must come to an end, and today ends our proofs of the standard Central Limit Theorem. One can generalize it further by weakening the assumptions (we can allow the random variables to have different distributions, though independence is clearly important, as we do not expect X + X + ... + X to converge to a normal distribution in general). Unfortunately the moment generating function need not always exist, which is why it is advantageous to use the Fourier transform approach. In the literature the Fourier transform of a probability density is called the characteristic function of the density, and always exists. If M_X(t) = E[e^(tX)] is the moment generating function and φ_X(t) is the characteristic function, then φ_X(t) = M_X(-2πit), so the two are related.
- We started out by reviewing why the convolution of two densities is the density of the sum of the corresponding random variables. This property is the reason convolutions play such an important role in the theory. The Fourier transform of a convolution is the product of the Fourier transforms. This converts a very difficult integral into the product of two Fourier transforms, and frequently these integrals can be evaluated. The difficulty is that, at the end of the day, we must then invert, and to prove the Fourier Inversion Theorem is no trivial task. Proving our error estimates for the integrals that converge to the convolution involved either Taylor's theorem with remainder or the Mean Value Theorem.
- Another nice and useful property of the Fourier transform is that the derivative of the Fourier transform is the Fourier transform of the original function multiplied by -2πix; this is very useful in solving differential equations. In particular, if p is our density and FT[p](y) is the Fourier transform at y, then FT[p]'(0) = -2πi E[X] and FT[p]''(0) = (-2πi)^2 E[X^2] = -4π² E[X^2], so up to these constants the derivatives at the origin are the moments. One formulation of quantum mechanics replaces position and momentum with differential operators; in this interpretation, the famous uncertainty principle is just a statement about a function and its Fourier transform! (See here for the physics explanation of the uncertainty principle.) Note the Taylor series expansion of FT[p] near the origin depends on the mean and the variance; if we normalize those appropriately, the `shape' of the distribution is not seen until we get to the third order term in the expansion. The absence of these shape parameters in the linear and quadratic terms of the Taylor expansion is what is responsible for the universality.
- The Central Limit Theorem has a rich history and numerous applications. What makes it so powerful and applicable is that the assumptions are fairly weak, essentially finite mean, finite variance, and something about the higher moments. The natural question is what exactly do we mean by convergence? There are several different notions.
- For us, we are just showing that the moment generating function converges to the moment generating function of the standard normal, with the rate of convergence depending on the third moment (or fourth moment if the third moment vanishes; note the fourth moment is never zero). As many distributions have zero third moment, the fourth moment frequently controls the speed. This is why instead of looking at the kurtosis (fourth moment) we often look at the excess kurtosis, which is the kurtosis of our random variable minus the kurtosis of the standard normal. This is because it is this difference that frequently controls the speed of convergence.
- A classic result about how rapidly we have convergence to the standard normal is the Berry-Esseen Theorem.
- Taylor series played a key role in our proofs; the idea is that we can locally replace a complicated function by a simpler function, so long as we can control the error estimates.
- We discussed the probabilities of the standard normal taking on values in certain ranges (or outside these ranges). There are many different conventions used; click here for one such table.
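To make the table of standard normal probabilities mentioned above concrete, here is a minimal sketch in Python (an assumption; no code was used in class). It writes the cumulative distribution function in terms of the error function from the standard library; the function names are made up for illustration.

```python
import math

def standard_normal_cdf(z: float) -> float:
    """CDF of the standard normal, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k: float) -> float:
    """Probability a standard normal is within k standard deviations of 0."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)

for k in (1, 2, 3):
    # Prints the familiar values, roughly 68%, 95% and 99.7%.
    print(f"P(|Z| <= {k}) = {prob_within(k):.4f}")
```

Different tables report P(Z <= z), P(0 <= Z <= z) or P(|Z| <= z); all of them can be read off from `standard_normal_cdf`.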

Tuesday, November 1. We talked about some general properties of moment generating functions, and sketched the proof of the CLT in the special case of sums of Poisson random variables, seeing how important standardization is for convergence issues.
- It is worth thinking about why I made a mistake in class about the variance of the Poisson. The mean and the standard deviation are supposed to be in the same units, so if the mean is λ then shouldn't the standard deviation be λ, because if the variance were λ then the standard deviation would be λ^{1/2} and that would have the wrong units, right? Wrong. For an exponential with density f(x) = λ exp(-λx) the mean and standard deviation are both 1/λ, and we can see that this is the correct λ dependence by scale issues: we exponentiate λx, so λx must be unitless; thus if x is in meters, say, then λ is in 1/meters, and this is the correct λ dependence for the mean and standard deviation. What goes wrong for the Poisson? Remember the probability mass function there is f(n) = λ^n e^{-λ} / n!; here λ is alone in the exponential and is thus unitless! This means we can't use the unit analysis to say that the standard deviation and the mean have the same λ dependence.
- We computed the moment generating function of the standard normal, seeing that it is exp(t^2/2). The key step in the proof is completing the square (there are lots of nice examples on the Wikipedia entry). It takes a while to see how to simplify algebra / how to write algebra in a good way. When we have something like -x^2/2 + xt and we know we want the argument of the exponential to be negative, it is natural to write it as -(1/2)(x^2 - 2tx), and this is screaming at us to add 0 via t^2 - t^2, giving -(1/2)(x - t)^2 + t^2/2.
- A classic result about how rapidly we have convergence to the standard normal is the Berry-Esseen Theorem.
- Taylor series played a key role in our proofs; the idea is that we can locally replace a complicated function by a simpler function, so long as we can control the error estimates.
- We discussed the probabilities of the standard normal taking on values in certain ranges (or outside these ranges). There are many different conventions used; click here for one such table.
- Another key ingredient in our proof was the exponential function, in particular its series expansion.
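As a sanity check on the moment generating function of the standard normal computed above, here is a minimal numerical sketch in Python (an assumption; the midpoint rule, the truncation width and the function name are choices made for illustration, not the proof from class). It integrates exp(tx - x^2/2) / sqrt(2π) directly and compares with exp(t^2/2).

```python
import math

def normal_mgf_numeric(t: float, half_width: float = 12.0, steps: int = 20_000) -> float:
    """Approximate E[exp(tZ)] for Z standard normal using the midpoint rule.

    Completing the square shows the integrand is a Gaussian bump centered at
    x = t, so we only integrate over [t - half_width, t + half_width].
    """
    a = t - half_width
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h          # midpoint of the i-th subinterval
        total += math.exp(t * x - 0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, normal_mgf_numeric(t), math.exp(t * t / 2.0))
```

The two printed columns should agree to many digits; the centering of the window at x = t is exactly the completing-the-square step in numerical form.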

Thursday, October 27. We talked a bit more about generating functions. We saw how they lead to a proof of an explicit formula for any Fibonacci number; this is known as Binet's formula. We've been looking at probability generating functions, and saw the generating function of X+Y is the product of the generating functions of X and Y if the two variables are independent. This suggests how they will be of use in proving the central limit theorem. Interestingly, many of the standard probability functions / random variables have very nice generating functions.
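Binet's formula is easy to test numerically. The sketch below is in Python (an assumption) and uses the indexing F_0 = 0, F_1 = 1, which may differ from the convention used in class; the function names are made up for illustration.

```python
import math

def fib_by_recurrence(n: int) -> int:
    """F_0 = 0, F_1 = 1, F_{k+1} = F_k + F_{k-1}, computed by iterating the recurrence."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_binet(n: int) -> int:
    """Binet's formula F_n = (phi^n - psi^n) / sqrt(5), rounded to the nearest integer."""
    sqrt5 = math.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0   # golden mean
    psi = (1.0 - sqrt5) / 2.0   # its conjugate, of absolute value less than 1
    return round((phi ** n - psi ** n) / sqrt5)

print([fib_binet(n) for n in range(10)])
```

Because the formula is evaluated in floating point, it only matches the exact recurrence for moderate n (the check here stops at 50); for large n one would want exact arithmetic.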

Tuesday, October 25. Well, it is possible to prove the central limit theorem in a special case, but it is painful! Today's lecture is meant to highlight the power of generating functions, and motivate studying them. Our proof was a long calculation involving Stirling's formula, and we saw that reasonable looking approximations had issues. The best way to estimate (N+k)^(N+k) was through logarithms and Taylor series expansions. We talked at the end of class about how, if we choose the numbers a_n and b_m to be the probabilities that X is n (for a) and Y is m (for b), then G_{X+Y}(s) = G_X(s) G_Y(s) when X and Y are independent (we'll prove other formulas on Thursday). The proofs have much in common with Calc I and Calc II. Namely, we spend a lot of time doing some algebra to show G_{X+Y}(s) = G_X(s) G_Y(s) once; the advantage is that once we have done it, we can simply use the result in later problems. For example, if asked to differentiate x cos(x) we don't write down the definition of the derivative, but rather we use the product rule. The reason is that it is advantageous to do the calculation once in general, get the result, and then in the future jump directly to that point for the function of interest. It is similar for moment generating functions; we spend the time now doing the calculations so we can just apply these results later.
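To see G_{X+Y}(s) = G_X(s) G_Y(s) in action, here is a minimal sketch in Python (an assumption). A generating function is stored as its list of coefficients, so multiplying generating functions is just multiplying polynomials; the value p = 1/3, the number of trials and the function name are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def pgf_product(a, b):
    """Coefficients of G_X(s) * G_Y(s), where a[n] = Prob(X = n) and b[m] = Prob(Y = m)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for n, a_n in enumerate(a):
        for m, b_m in enumerate(b):
            out[n + m] += a_n * b_m   # the s^n * s^m term contributes to s^(n+m)
    return out

p = Fraction(1, 3)
bernoulli = [1 - p, p]        # G(s) = (1 - p) + p s
n_trials = 5
binomial = [Fraction(1)]      # generating function of the constant 0
for _ in range(n_trials):
    binomial = pgf_product(binomial, bernoulli)

print(binomial)
```

The coefficients of the product are the Binomial(5, 1/3) probabilities, which is the statement that a sum of independent Bernoulli random variables is binomial.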

Thursday, October 20. We saw a remarkable difference between sums of Poisson random variables and sums of normal random variables: sums of independent Poisson are Poisson! We talked a lot about the best way to `see' the algebra (looking at the expression and seeing how to add 0, as we did for E[X^2], or to multiply by 1, as we did for the sum of Poisson is Poisson).
- We looked at sums of independent Poisson random variables (after we learn about moment generating functions, it's a nice calculation to show the Poisson tends to the normal as lambda tends to infinity; click here for a handout with the details of this calculation). The proof technique there used many ingredients in typical analysis proofs. Specifically, we Taylor expand, use common functions, and somehow argue that the higher order terms do not matter in the limit with respect to the main term (though they crucially affect the rate of convergence).
  - The Central Limit Theorem has a rich history and numerous applications. What makes it so powerful and applicable is that the assumptions are fairly weak, essentially finite mean, finite variance, and something about the higher moments. The natural question is what exactly do we mean by convergence? There are several different notions.
- We discussed a sketch of the proof of Stirling's formula, which is very useful in estimating binomial coefficients. We gave a poor mathematician's analysis of the size of n!; the best result is Stirling's formula which gives n! is about n^n e^{-n} sqrt(2 pi n) (1 + error of size 1/(12n) + ...). We obtained our upper and lower bounds by using the comparison method in calculus (basically the integral test); we could get a better result by using a better summation formula, say Simpson's method or Euler-Maclaurin. We might return to Simpson's method later in the course, as one proof of it involves techniques that lead to the creation of low(er) risk portfolios! Ah, so much that we can do once we learn expectation..... Of course, our analysis above is not for n! but rather log(n!) = log 1 + ... + log n; summifying a problem is a very important technique, and one of the reasons the logarithm shows up so frequently. If you are interested, let me know as this is related to research of mine on Benford's law of digit bias.
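Here is a minimal sketch in Python (an assumption) comparing log(n!) with Stirling's formula and with one version of the integral-test bounds. The bounds used here, n log n - n + 1 <= log(n!) <= (n+1) log(n+1) - n, come from comparing the sum log 1 + ... + log n with integrals of log t; the exact bounds written in class may have been slightly different.

```python
import math

def log_factorial(n: int) -> float:
    """log(n!) via the log-gamma function, since n! = Gamma(n + 1)."""
    return math.lgamma(n + 1)

def log_stirling(n: int) -> float:
    """Logarithm of the main term n^n e^{-n} sqrt(2 pi n)."""
    return n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)

def integral_test_bounds(n: int):
    """Lower and upper bounds for log(n!) from integrating log t."""
    lower = n * math.log(n) - n + 1
    upper = (n + 1) * math.log(n + 1) - n
    return lower, upper

for n in (10, 100, 1000):
    ratio = math.exp(log_factorial(n) - log_stirling(n))
    print(n, ratio, 1 + 1 / (12 * n), integral_test_bounds(n))
```

Working with logarithms avoids overflow in n^n, and the printed ratio shows the 1/(12n) correction directly.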

Tuesday, October 18. We saw applications of memoryless processes today, talked about standardization, saw how Chebyshev's theorem implies the law of averages, and talked a bit about what `large' means.
- We then discussed the geometric series formula. The standard proof is nice; however, for our course the `basketball' proof is very important, as it illustrates a key concept in probability. Specifically, if we have a memoryless game, then frequently after some number of moves it is as if the game began again. This is how we were able to quickly calculate the probability that the first shooter wins, as after both miss it is as if the game just started.
- The geometric series formula only makes sense when |r| < 1, in which case 1 + r + r^2 + ... = 1/(1-r); however, the right hand side makes sense for all r other than 1. We say the function 1/(1-r) is a (meromorphic) continuation of 1+r+r^2+.... This means that they are equal when both are defined; however, 1/(1-r) makes sense for additional values of r. Interpreting 1+2+4+8+.... as -1 or 1+2+3+4+5+... as -1/12 actually DOES make sense, and arises in modern physics and number theory (the latter is zeta(-1), where zeta(s) is the Riemann zeta function)!
- We talked about standardizing a random variable, sending X to (X - E[X]) / StDev(X). This allows us to compare apples and apples. Note of course not all random variables can be standardized; the Cauchy distribution for instance cannot. We only compute tables of the standard normal; by standardizing we can deduce the probabilities of any normal random variable from a table of probabilities of the standard normal. This is similar to the change of base formula for logarithms. Knowing log_b(x) = log_c(x) / log_c(b), if we know logarithms base c we then know them base b, and thus it suffices to create just one table of logarithms.
- To prove the average (X1 + ... + XN) / N of independent, identically distributed random variables with finite mean and variance converges to the random variable's mean is not too bad; one can do this by applying Chebyshev's Theorem. If, however, we want to know the rate of convergence, we need more than Chebyshev (ok, if we want to know a bit more on the rate, a better bound on the rate); this is the content of the Central Limit Theorem. We'll discuss rates of convergence in detail later, and we'll see that they are controlled by the third moment (or the fourth moment if the third vanishes). The third moment is called skewness, the fourth is called kurtosis. Actually, when the third moment vanishes it is excess kurtosis that's more useful; we'll see more on this if we look at the Taylor series expansion of the logarithm of the moment generating function.
- We saw some interesting facts about the Cauchy random variable. Another ingredient in the proof of the one-dimensional change of variable formula was that if g(h(y)) = y then h'(y) = 1 / g'(h(y)). This is a nice application of the chain rule to inverse functions (as g(h(y)) = y and h(g(x)) = x, we say g and h are inverses of each other). We used this relation to find the derivative of the arctangent function. When we first encounter such functions in Calc I or Calc II, they seem un-natural, primarily chosen to provide tests of how well you have mastered differentiation. These functions, however, do naturally arise in many applications. My favorite examples are in determining the cumulative distribution function (and hence the normalization constant) for a Cauchy random variable (which has density 1 / (π(1+x^2))). Distributions such as the Cauchy are terrific for testing how general results are in probability and statistics; I have a nice paper using a distribution which is a variant of the Cauchy to show the limitations of the famous Cramer-Rao inequality for determining optimal statistical tests. I've also seen analogues arise in nuclear physics. The second occurrence of arctangent today was in the change of variable formulas from polar to Cartesian.
- Economics: the standard random walk hypothesis seems to have lost most of its supporters, though there are variants (and I'm not familiar with all); see also the efficient market hypothesis and technical analysis, and all the links there. (There are also many good links on the wikipedia page on Eugene Fama). Two famous books (with different conclusions) are Malkiel's A random walk down wall street and Mandelbrot-Hudson's The (mis)behavior of markets (a fractal view of risk, ruin and reward). Some interesting papers if you want to read more:
  - Mandelbrot: Variation on certain speculative prices (a must read!)
  - Fama: Mandelbrot and Stable Paretian Hypothesis
  - Fama: Random Walks Stock Prices
- For more on randomness, check out The Black Swan by Taleb (amazon.com page here, wikipedia page here).
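Returning to the memoryless `basketball' argument earlier in this entry, here is a minimal sketch in Python (an assumption). It supposes the first shooter makes each shot with probability p, the second with probability q, they alternate, and the first basket wins; the symbols p and q and the function names are illustrative, not necessarily the notation from class. If x is the probability the first shooter wins, then after both miss the game restarts, so x = p + (1-p)(1-q) x.

```python
def first_shooter_wins(p: float, q: float) -> float:
    """Solve x = p + r x with r = (1 - p)(1 - q), the chance both shooters miss."""
    r = (1.0 - p) * (1.0 - q)
    return p / (1.0 - r)

def first_shooter_wins_partial(p: float, q: float, rounds: int) -> float:
    """Partial sum of the geometric series p + p r + p r^2 + ... over the given rounds."""
    r = (1.0 - p) * (1.0 - q)
    return sum(p * r ** k for k in range(rounds))

print(first_shooter_wins(0.5, 0.5))               # both shooters make half their shots
print(first_shooter_wins_partial(0.5, 0.5, 50))   # the series converges to the same value
```

The first function is the restart argument and the second is the geometric series summed term by term; that they agree is the geometric series formula.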

Friday, October 14. We discussed more on linearity of expectation.
- As you can tell, I love linearity of expectation (this is a link to notes I've written on the subject).
- It is not immediately clear what the right order of magnitude is as to how long you need to wait before you are essentially assured of having two of each prize (or more generally k of each prize). As a nice exercise, prove that as c tends to infinity, with probability tending to 1 you are assured of having at least two of each prize if you wait as long as 2 c H_c, where H_c = 1 + 1/2 + 1/3 + ... + 1/c is the cth harmonic number. Can you replace the constant 2 with something smaller? (We know it must be at least 1; would 1 + ε work for any ε > 0?)
- We also talked about TeX: I've put the templates and other info online at http://web.williams.edu/Mathematics/sjmiller/public_html/math/handouts/latex.htm. You can also download a video of the lecture I gave: http://web.williams.edu/Mathematics/sjmiller/public_html/math/LaTexMathematica/LaTeXIntroLecture.MP4. Remember if you download the template file you also need to download the image file yl.eps, or at least comment it out in the TeX code with a %.
- I've also put on the webpage links to TeX presentations, as well as a Mathematica tutorial.
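For the prize problem above, linearity of expectation gives the expected wait for one of each prize. The sketch below is in Python (an assumption) and supposes each of the c prizes is equally likely on every draw and draws are independent (also an assumption about the setup); the function names are made up. With i distinct prizes in hand, a new one arrives with probability (c - i)/c, so the expected wait for it is c/(c - i).

```python
from fractions import Fraction

def harmonic(c: int) -> Fraction:
    """H_c = 1 + 1/2 + ... + 1/c, computed exactly."""
    return sum((Fraction(1, k) for k in range(1, c + 1)), Fraction(0))

def expected_draws_for_one_of_each(c: int) -> Fraction:
    """Add the expected waits c/(c - i) for the (i+1)st new prize, i = 0, ..., c - 1."""
    return sum((Fraction(c, c - i) for i in range(c)), Fraction(0))

for c in (2, 5, 10, 50):
    print(c, float(expected_draws_for_one_of_each(c)), float(c * harmonic(c)))
```

The two columns agree, which is the identity that the expected wait is c H_c; this only addresses one of each prize, not the two-of-each question posed above.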

Thursday, October 13. We talked a bit about moments, and how sadly moments do not always uniquely determine a probability distribution. We then talked about variances, Chebyshev's theorem, and the prize problem.
- We discussed the similarities between how Taylor coefficients uniquely determine a nice function and how moments uniquely determine a nice probability distribution. It is sadly not the case that a sequence of moments uniquely determines a probability distribution; fortunately in many applications some additional conditions will hold for our function which will ensure uniqueness. For the non-uniqueness of Taylor series, the standard example to use is f(x) = exp(-1/x^2) if x is not zero and 0 otherwise. To compute the derivatives at 0 we use the definition of the derivative and L'Hopital's rule. We find all the derivatives are zero at zero; however, our function is only zero at zero. We will see analogues of this example when we study the proof of the Central Limit Theorem.
- It is important that the integrals and sums in the moments converge absolutely; if they didn't, then our answers would depend on how we tend to infinity. For example, consider the Cauchy distribution, with density 1 / (pi(1+x^2)), and try to compute its mean. Let g be any function such that g(A) is larger than A. Assume A is large, so for |x| at least A the integrand x / (pi(1+x^2)) is basically 1/(pi x). If we integrate from -A to g(A), the contributions from -A to A cancel as the integrand is odd, and we get essentially Integral_{x = A to g(A)} dx / (pi x) = (1/pi) log( g(A) / A). If g(A) = 2A then we would get essentially log(2) / pi, but if g(A) = A^2 then we get log(A) / pi, which tends to infinity with A, and there is no way to have some finite interpretation.
- We proved Chebyshev's theorem, one of the gems of probability. The natural scale to measure fluctuations about the mean is the standard deviation (the square-root of the variance). Chebyshev's theorem gives us bounds on how likely it is to be more than k standard deviations from the mean. The good thing about this result is that it works for any random variable with finite mean and variance; the bad news is that because it works for all such distributions, its results are understandably much weaker than results tailored to a specific distribution (we will see later that its predictions for binomial(n,p) are orders of magnitude worse than what is true). This is similar in spirit to the difference between Divide and Conquer and Newton's Method for finding roots of functions; Divide and Conquer is relatively slow (taking about 10 iterations to gain another 3 decimal digits accuracy), while Newton's Method doubles the number of decimal digits each iteration! Why is there such a pronounced difference? The reason is that Divide and Conquer only assumes continuity, while Newton's Method also requires differentiability. Thus it is not surprising that we can do better with stronger assumptions.
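To see how much Chebyshev's theorem gives away for a specific distribution, here is a minimal sketch in Python (an assumption) comparing the bound 1/k^2 with the exact probability for a Binomial(100, 1/2) random variable; the parameters and function names are arbitrary illustrative choices.

```python
from math import comb, sqrt

def binomial_prob_far_from_mean(n: int, p: float, k: float) -> float:
    """Exact Prob(|X - np| >= k * StDev(X)) for X ~ Binomial(n, p)."""
    mean = n * p
    sd = sqrt(n * p * (1.0 - p))
    return sum(
        comb(n, j) * p ** j * (1.0 - p) ** (n - j)
        for j in range(n + 1)
        if abs(j - mean) >= k * sd
    )

def chebyshev_bound(k: float) -> float:
    """Chebyshev: the probability of being at least k standard deviations away is at most 1/k^2."""
    return 1.0 / k ** 2

for k in (2, 3, 4):
    print(k, binomial_prob_far_from_mean(100, 0.5, k), chebyshev_bound(k))
```

The exact probabilities fall off much faster than 1/k^2, which is the price of a bound that holds for every distribution with finite mean and variance.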

Tuesday, October 4. After reviewing CDFs and finding the density function of transforms of a random variable, we proved expectation is linear. We'll see time and time again how important this can be. For more information, here are some notes of mine on the subject: linearity of expectation.
- We did a few more examples of the power of binary indicator random variables and linearity. We used it to derive the formulas for the mean (our discussion of k^2 will later be seen to give the variance) of a binomial(n,p) random variable by writing it as a sum of independent Bernoulli(p) random variables. We can of course derive these values by differentiating identities (the link is to a handout of mine with more examples). It is worth remarking that many of the identities in combinatorics are proved by showing that two different ways of counting the same thing are equivalent, and then if we evaluate one we get the other for free.
- Markov's inequality is great in that we can almost always use it (we just need a mean to exist and the random variable to only take on non-negative values), but because it applies so widely its bounds are horrible. It's a nice exercise to see if given an a > 0 you can find a distribution that satisfies the conditions and has a sharp bound (in other words, the inequality is an equality for that a and that distribution). For example, if we look at 10,000 tosses of a fair coin then E[X] (the expected number of heads) is 5000, so if a = 7500 then Markov's inequality means we have a probability of at most 5000/7500 = 2/3 of having at least that many heads; the actual probability is about 10^(-570). When there is this much of a difference between the truth and our bound, we need to return to the blackboard. We'll do this in an upcoming lecture on the Central Limit Theorem.
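The coin example above can be checked exactly with integer arithmetic. Here is a minimal sketch in Python (an assumption; the function names are made up). The true probability is far too small for floating point, so the code reports its base-10 logarithm.

```python
from math import comb, log10

def markov_bound(mean: float, a: float) -> float:
    """Markov: for a non-negative random variable, Prob(X >= a) <= E[X] / a."""
    return mean / a

def log10_prob_at_least(n: int, a: int) -> float:
    """log10 of Prob(at least a heads in n fair tosses), using exact integers."""
    term = comb(n, a)
    tail = 0
    for k in range(a, n + 1):
        tail += term
        term = term * (n - k) // (k + 1)   # C(n, k+1) from C(n, k); the division is exact
    return log10(tail) - n * log10(2)      # divide the count by the 2^n equally likely outcomes

print(markov_bound(5000, 7500))
print(log10_prob_at_least(10_000, 7_500))
```

The second number printed is the exponent of the true probability, to be compared with the 2/3 that Markov's inequality provides.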

Friday, September 30. In addition to doing a few problems, we discussed the dangers of interchanging orders of operation (it's not always the case that you can switch orders of integration), and some advice on how to understand what a problem is asking.
- Limit exchange: one of the hardest parts of mathematics is justifying interchanging two operations; today we looked at when the probability of a limit is the limit of the probabilities. To give some sense that we must sometimes be careful, we considered non-negative functions f_n(x) converging to zero pointwise but always integrating to 1 (let f_n(x) be the triangle function from 1/n to 3/n, taking on the value n at 2/n). It is not always permissible to interchange a limit and an integral (see the Dominated Convergence Theorem or the Monotone Convergence Theorem from analysis for some situations where this may be done); similarly it is not always possible to interchange orders of integration (see Fubini's Theorem for when this may be done), and we can only sometimes interchange a derivative and a multidimensional integral (see here for some conditions on when we may). The main take-away is that we must be careful interchanging probabilities and limits, but this shouldn't be surprising. For example, we do not expect to be able to interchange most operations: sqrt(a+b) in general is not sqrt(a) + sqrt(b).
  - Click here for a video by Cameron on how he applies Fubini's theorem to change the order of operations (he does a double sum instead of a double integral, but the principle is the same).
- General comment: it's important to be able to take complex information and sift to the relevant bits. A great example is the song I'm my own Grandpa. Listen to it and try to graph all the relations and see that he really is his own grandfather (with no incest!).
  - A solution is here (don't view this until you try to graph it!). Actually, this is a MUCH better illustration of the relationships.

Thursday, September 29. Today we saw how to find the probability density of a sum of random variables. The formula is a lot simpler if the variables are independent. We looked at the sum of two uniform random variables, and the discrete analogue from rolling two dice to build intuition. We ended with an introduction to the mean.
- As a first case, we considered X1 + X2 with each Xi ~ Uniform(0,1). To get a feeling for the answer, we looked at rolling two fair dice and the distribution of the resulting sums. We found Prob(R1 + R2 = k) = (6 - |k-7|)/36 for 2 <= k <= 12 and 0 otherwise. This is a triangle, it's symmetric about the mean, the density is largest at the mean, .... It is unlikely that these features depend on the dice having 6 sides, and thus it is reasonable to expect X1 + X2 to be a triangle supported in [0,2] with maximum density at the mean of 1.
- We proved this by using convolutions and then brute force integration. Convolutions are incredibly powerful and useful in probability, and provide a very useful way to explore many problems. The convolution is defined by (f1 * f2)(x) = Integral_{t = -oo to oo} f1(t) f2(x-t)dt. If fi is the density of Xi and X1, X2 are independent, this is the density of X1+X2. We proved this by using the cumulative distribution function of Y = X1+X2 (which was a double integral) and then differentiating. The key step was interchanging the derivative and the integral. In general we cannot interchange orders of operations (sqrt(a+b) is typically not sqrt(a) + sqrt(b)), but sometimes we're fortunate (click here for a nice article on Wikipedia on when this is permissible).
- There is enormous structure behind convolutions of probability distributions. Let f be the density function for the random variable X, and g the density function for the random variable Y. As X+Y = Y+X, we find f * g = g * f (i.e., the operation is commutative), and f * (g * h) = (f * g) * h (the operation is associative). Convolution is also closed (if f and g are densities, so is f * g). Note this is beginning to look like a group; namely, we have a collection of objects (in this case, probability densities or maps from the reals to the reals) and a way to combine them (convolution) that is closed, associative, and even commutative. If we just had an identity element and inverses, we would have a group (a commutative group, in fact). Groups occur throughout the sciences and the world; two of my favorites are the Rubik's cube and the Monster group. As there is a lot of structure in groups, it's natural to ask whether or not we can find an identity element and inverses.
  - The identity element is not hard to find. We define the Dirac delta functional δ(x) as follows: for any probability density f(x), Integral_{x = -oo to oo} f(x)δ(x) dx = f(0). One may view δ(x) as the density corresponding to a unit point mass located at 0; similarly we would have Integral_{x = -oo to oo} f(x) δ(x-a) dx = f(a), corresponding to a unit point mass at a. We have actually seen Dirac delta functionals before. For example, let X be Bernoulli(p). This means Prob(X=1) = p, Prob(X=0) = 1-p and any other x has Prob(X=x) = 0. If we let f(x) denote the probability mass function, we have f(x) = p δ(x-1) + (1-p) δ(x). It turns out that the Dirac delta functional (which does integrate to 1, which can be seen by taking f(x) = 1 in Integral_{x = -oo to oo} f(x) δ(x-a) dx) acts as the identity. We now show f * δ = f. We have (f * δ)(x) = Integral_{t = -oo to oo} f(t) δ(x-t) dt = f(x).
  - Thus the only obstacle in whether or not we have a group (with group operation given by convolution) is whether or not there is an inverse. Is there? Perhaps there is an inverse if we restrict the types of probability distributions we study (for example, maybe we only look at densities defined on a compact interval).
- The convolution of two densities is the density of the sum of the corresponding random variables. This property is the reason convolutions play such an important role in the theory. Later, when we study the Central Limit Theorem, we'll talk about how one can get tractable integrals from these convolutions. The trick is to use the Fourier Transform, as the Fourier transform of a convolution is the product of the Fourier transforms. This converts a very difficult integral into the product of two Fourier transforms, and frequently these integrals can be evaluated. The difficulty is that, at the end of the day, we must then invert, and to prove the Fourier Inversion Theorem is no trivial task.
- We ended the day by introducing the concept of expectation or expected value of a random variable (also called the mean or the average value). This is one of the central concepts in the course, and it is amazing how many problems reduce to understanding expectations of random variables. We will see in Tuesday's class how properties of expectation aid us greatly in applications. For example, consider a Binomial(n,p) random variable X (so X is the number of heads in n tosses of a coin which is heads with probability p). The sum we MUST evaluate for the average is Sum_{k = 0 to n} k (n choose k) p^k (1-p)^{n-k}. While it should be clear that this must be just np (each coin has probability p of landing on heads, and we have n of them), this must be proved. We'll discuss two different techniques to do this on Tuesday (differentiating identities and linearity).
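The dice formula and the triangle density for X1 + X2 discussed above can both be checked with a few lines. This is a minimal sketch in Python (an assumption); the discrete part is exact, while the continuous part uses a crude midpoint rule, so it is only approximate. All function names are made up for illustration.

```python
from fractions import Fraction

def convolve_pmfs(f, g):
    """Mass function of the sum of two independent discrete random variables."""
    out = {}
    for x, px in f.items():
        for y, py in g.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmfs(die, die)

def uniform_density(x: float) -> float:
    """Density of Uniform(0,1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def density_of_sum(x: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of Integral f1(t) f2(x - t) dt over 0 <= t <= 1, where f1 = 1."""
    h = 1.0 / steps
    return sum(uniform_density(x - (i + 0.5) * h) for i in range(steps)) * h

print([str(two_dice[k]) for k in range(2, 13)])
print([round(density_of_sum(x), 3) for x in (0.0, 0.5, 1.0, 1.5, 2.0)])
```

The first line of output is the discrete triangle (6 - |k-7|)/36 and the second traces out the continuous triangle on [0,2] with its peak at 1.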

Tuesday, September 27. We continued our analysis of random variables. In particular, we discussed the CDF method to find the probability density function of Y = g(X) in terms of the pdf of X. A key ingredient is the fundamental theorem of calculus, which allows us to bypass doing some difficult (if not impossible) integration in general. We also discussed joint distributions, and how to take a non-negative function and rescale it to make it a density. This last concept is extremely valuable in probability, especially in an area I study, Random Matrix Theory, which is a nice intersection of math, nuclear physics and probability. Click here for a survey article I wrote with my physics mentor.
- The cumulative distribution function is one of the key tools of the subject, and gives a sense of why continuous random variables are easier to analyze than discrete; namely, for continuous we have the Fundamental Theorem of Calculus at our disposal to pass from a cumulative distribution function to a density; we do not have differentiation available in the discrete case. Note that a cumulative distribution function does not determine a unique density; however, it almost does so, as any two densities must integrate to the same value on any interval. (The technical jargon is to say that the density is determined up to a function which is zero almost everywhere.) If there is interest, let me know and I'll talk a bit about the basics of measure theory (and show that almost no numbers are rational in the sense of measure).
- As most integrals cannot be evaluated in closed form, it's worth mentioning some of the powerful numerical techniques. One of the most important is very probabilistic, and is called Monte Carlo integration; the paper introducing the Monte Carlo method has been hailed by some as one of the (if not the) most influential papers of the 20th century. It gives really good results on numerically evaluating integrals. Specifically, if N is large and we choose N points uniformly, we can simultaneously assert that with extremely high probability (such as at least 1 - N^{-1/2}) the error is extremely small (at most N^{-1/4}). If you want to know more, please see me -- there are a variety of applications from statistics to mathematics to economics to .... Below are links to two papers on the subject to give you a little more info:
  - Metropolis: The Beginning Of The Monte Carlo Method
  - Metropolis and Ulam: The Monte Carlo Method
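Here is a minimal sketch of Monte Carlo integration in Python (an assumption): choose N points uniformly in [a, b] and average the function values. The integrand x^2 on [0,1], the fixed seed and the function name are arbitrary illustrative choices.

```python
import random

def monte_carlo_integral(f, a: float, b: float, n: int, seed: int = 0) -> float:
    """Estimate Integral_{a to b} f(x) dx as (b - a) times the average of f at n uniform points."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

for n in (100, 10_000, 100_000):
    estimate = monte_carlo_integral(lambda x: x * x, 0.0, 1.0, n)
    print(n, estimate, abs(estimate - 1.0 / 3.0))
```

The error typically shrinks like N^{-1/2}, which is slow in one dimension but does not get worse as the dimension grows; that is the main reason the method is so widely used.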

Friday, September 23. In the review class we discussed Russell's paradox and special relativity. Some more items.
- Russell's paradox shows that we didn't even understand what it meant to be a set or an element of a set! Another famous paradox is the Banach - Tarski paradox, which tells us that we don't understand volumes! It basically says if you assume the Axiom of Choice, you can cut a solid sphere into 5 pieces, and reassemble the five pieces to get two completely solid spheres of the same size as the original! While it is rare to find these paradoxes in mathematics, understanding them is essential. It is in these counter-examples that we find out what is really going on. It is these examples that truly illuminate how the world is (or at least what our axioms imply). Most people use the Zermelo-Fraenkel axioms, abbreviated ZF. If you additionally assume the Axiom of Choice, it's called ZFC or ZF+C. Not all problems in mathematics can be answered yea or nay within this structure. For example, we can quantify sizes of infinity; the natural numbers are much smaller than the reals; is there any set of size strictly between? The assertion that there is no such set is called the Continuum Hypothesis, and my mathematical grandfather (the advisor of one of my thesis advisors), Paul Cohen, proved it is independent (i.e., you may either add it to your axiom system or not; if your axioms were consistent before, they are still consistent). Cohen was a student of Zygmund, the namesake of Room 416. In a real analysis course, one develops the notation and machinery to put calculus on a rigorous footing. In fact, several prominent people criticized the foundations of calculus, such as Bishop Berkeley; his famous attack, The Analyst, is available here. It wasn't until decades later that good notions of limit, integral and derivative were developed. Most people are content to stop here; however, see also Abraham Robinson's work in Non-standard Analysis. He is one of several mathematicians we'll encounter this semester who have been affiliated with my Alma Mater, Yale. Another is the great Josiah Willard Gibbs.
- We also discussed faster than light travel and special relativity. I may have just been proved wrong. A student just emailed me about a potentially groundbreaking, 3 year experiment at CERN where some neutrinos appear to have travelled faster than light. Click here or here for some stories.

Thursday, September 22. Today we started with a fun example where whoever chose an integer from 0 to 100 that was closest to half the class average wins a prize. This is a nice way to see dependent random variables in action. There's a lot of interplay here between math and psychology, trying to figure out how other people will play based on how you'll play. This leads to Game Theory, and even more connections with math! This is also a good way to think about reasonableness of answers.
- We then studied Bernoulli and Binomial random variables. A big result is that a binomial random variable is the sum of independent Bernoulli random variables. We thought a bit about what is a reasonable answer for the mean or expected number of successes in a Binomial process. The binomial distribution is a special case of the more general multinomial distribution; many of the properties of the multinomial can be obtained by repeated applications of the binomial distribution. For example, say we have the unimaginatively named candidates A, B, C and D running for office. We may initially break them into two groups: A and not A; we then further divide not A into B and not B, then not B is divided into C and not C. The binomial coefficients are replaced with multinomial coefficients: here (n | k1, k2, ..., kj) means n! / (k1! k2! ... kj!), with each ki a non-negative integer such that k1+...+kj = n. As we're not going to say much about the multinomial distribution in class, I thought it would be a good choice for the additional comments.
  - One application (but by no means the most important!) of multinomials is figuring out how many different words you can make when you rearrange the letters of MISSISSIPPI. If you feel this isn't important, consider instead base pairs from biology -- this tells us how many different strands we can have!
  - We proved that the multinomial probabilities do give us a density -- they are clearly non-negative, but do they sum to 1? The proof is quite nice, and it uses one of my favorite techniques, multiplying by 1, MANY times. It is important to get a sense of how these results are proved. The trick is to look for binomial or multinomial coefficients -- this is why we multiplied by (n-t)!/(n-t)!. We then had Sum_{e = 0 to n-t} (n-t choose e); we rewrote this by multiplying by 1^e 1^{n-t-e} and then recognized this as (x+y)^m where x=y=1 and m=n-t. Thus we could evaluate the e sum by using the binomial theorem, and then another application of the binomial theorem completed the job. Remember how important it was to have the sums correct -- t was independent of e and the t! could be brought out of the sum; however, h was not as h = n-t-e. There are many symbolic programs available to prove binomial identities; if you would like a copy of a Mathematica program that does this, just let me know (click here for some of the theory).
- We will go over distribution functions and finding the distribution function and density of random variables that are functions of other random variables in greater detail on Tuesday. The idea is that if G is a nice function and we know the (cumulative) distribution function of X, then we should know the (cumulative) distribution function of Y = G(X); similarly, if we know the probability density of X then we should know the probability density of Y = G(X). We will do all this again slowly for our exponential example and in general. The key input in the analysis is the Fundamental Theorem of Calculus; for us, the version we need is: Let F(x) = Int_{t = -oo to x} f(t) dt; then F'(x) = f(x). While we have talked about how the anti-derivative is not unique, there is a `natural' choice of a continuous density f.
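Returning to the multinomial coefficients from the earlier comments: here is a minimal Python sketch that counts the rearrangements of MISSISSIPPI and checks numerically that the multinomial probabilities sum to 1. The function names and the probabilities used for the four candidates are made up for illustration; this is a brute-force check, not the proof from class.

```python
from itertools import product
from math import factorial


def multinomial(counts):
    """Multinomial coefficient n! / (k1! k2! ... kj!), where n = sum(counts)."""
    result = factorial(sum(counts))
    for k in counts:
        result //= factorial(k)  # each division is exact
    return result


# MISSISSIPPI has 11 letters: one M, four I's, four S's and two P's.
mississippi_words = multinomial([1, 4, 4, 2])


def multinomial_pmf(counts, probs):
    """Probability of seeing exactly counts[i] outcomes of type i."""
    weight = 1.0
    for k, p in zip(counts, probs):
        weight *= p ** k
    return multinomial(counts) * weight


def total_probability(n, probs):
    """Add the multinomial probabilities over all (k1, ..., kj) with sum n."""
    total = 0.0
    for counts in product(range(n + 1), repeat=len(probs)):
        if sum(counts) == n:
            total += multinomial_pmf(counts, probs)
    return total


if __name__ == "__main__":
    print(mississippi_words)
    # Hypothetical support for candidates A, B, C and D among 6 voters.
    print(total_probability(6, [0.1, 0.2, 0.3, 0.4]))
```

The brute-force sum is only practical for small n, but it is a quick sanity check that the indices in the sums are set up correctly.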

Tuesday, September 20. We continued our discussion of the basic relations in probability. The big theorem is Bayes' Theorem, which shows us how certain conditional probabilities can be used to compute harder conditional probabilities. Bayesian inference is a huge field in statistics. Here's a nice medical example (similar to the one we did in class). See here for more comments on Bayes in medicine. We also talked about independence.
- The study of Independence is one of the central themes in probability. While many real world or mathematical processes are not independent, frequently one can build a good model by assuming independence. Later in the semester we'll see how we can use this to model iterates of the 3x+1 map or to predict the answers to many problems in number theory (such as the number of distinct prime factors certain special numbers have). Other examples include the probability a number is square-free. For independence it is essential that all combinations be independent; as we saw in class, pairwise independence does not imply independence.
- We talked a bit about trying to find the definition of independence. The correct definition states that events {A_i}_{i in I} are independent if Prob( intersection_{j in J} A_j) = prod_{j in J} Prob(A_j) for any J a subset of I. For example, if I = {1,2,3} then J could be {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, or {1,2,3}. We can rephrase the question to: assume we have events such that Prob(A intersect B intersect C) = Prob(A) Prob(B) Prob(C), and all these events have positive probability. Must A and B be independent? Extra credit for a proof or disproof.
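Two of the ideas from this class can be checked with exact arithmetic in a few lines of Python. The sketch below first applies Bayes' Theorem to a medical test, and then verifies the standard two-coin example where the events are pairwise independent but not independent. The prevalence, sensitivity and false positive rate are hypothetical numbers chosen for illustration (they are not the ones from class), and the two-coin events are one common choice of example.

```python
from fractions import Fraction
from itertools import product


def posterior(prior, sensitivity, false_positive_rate):
    """Prob(disease | positive test), computed with Bayes' Theorem."""
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical numbers: 1% of people are sick, the test catches 99% of the
# sick, and 5% of healthy people test positive anyway.
prob_sick_given_positive = posterior(
    Fraction(1, 100), Fraction(99, 100), Fraction(5, 100)
)

# Sample space: two tosses of a fair coin, all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event, given as a true/false function on outcomes."""
    favorable = sum(1 for w in outcomes if event(w))
    return Fraction(favorable, len(outcomes))


def first_heads(w):
    return w[0] == "H"


def second_heads(w):
    return w[1] == "H"


def tosses_agree(w):
    return w[0] == w[1]


events = [first_heads, second_heads, tosses_agree]
pairs = [(events[0], events[1]), (events[0], events[2]), (events[1], events[2])]

# Every pair satisfies the product rule ...
pairwise_independent = all(
    prob(lambda w: x(w) and y(w)) == prob(x) * prob(y) for x, y in pairs
)
# ... but the triple does not.
triple_product_rule = (
    prob(lambda w: all(e(w) for e in events))
    == prob(events[0]) * prob(events[1]) * prob(events[2])
)

if __name__ == "__main__":
    print(prob_sick_given_positive)
    print(pairwise_independent, triple_product_rule)
```

Using `Fraction` instead of floats keeps every probability exact, so the product rule can be tested with `==`.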

Thursday, September 15. We discussed probabilities of various events, induction, inclusion/exclusion, basic combinatorics, ... A lot for one class!
- We discussed the inclusion / exclusion principle, one of my favorite methods in general and especially important in probability as it is very easy to accidentally double count events. One of the more interesting uses of this principle is in Brun's sieve, where he uses inclusion-exclusion to show that there cannot be too many twin primes. Perhaps the strangest application of this is that this is how the famous Pentium Bug was discovered! This was essentially a half a billion dollar whoopsie (and this was back in 1995). Here's a nice link to a story about it (let me know if you can't view it from JSTOR).
- One way to derive inclusion / exclusion is to use Mathematical Induction (one common image for induction is that of following dominoes); see also the appendix from my probability book.
- Combinatorics: we discussed (n choose r). Most of the combinatorics we'll do involves this and n!. One nice application from today is proving the Binomial Theorem (I must admit to remembering its mention in a Holmes story).
- We say an ordering of n objects is a derangement if nothing returns to where it started. The probability an ordering is a derangement is about 1/e if n is large.
- We mentioned the QWERTY keyboard (see also this article on other common items around us and how they came to be). There are many applications to knowing letter frequencies, especially the probability, given one letter, that the next letter takes on each value. These frequencies are used to break simple cryptographic ciphers that involve permuting the 26 letters. See for instance the wikipedia article on frequency analysis, as well as a downloadable program to perform the analysis.
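The derangement claim above is a nice place to see inclusion / exclusion in action. The number of orderings of n objects that fix at least a given set of k positions is (n-k)!, so inclusion / exclusion gives the number of derangements as the alternating sum over k of (n choose k) (n-k)!. The Python sketch below (function names are my own) compares this with a brute-force count for small n and with 1/e.

```python
from itertools import permutations
from math import comb, exp, factorial


def derangements_inclusion_exclusion(n):
    """Count derangements of n objects using inclusion / exclusion."""
    return sum((-1) ** k * comb(n, k) * factorial(n - k) for k in range(n + 1))


def derangements_brute_force(n):
    """Count derangements by checking all n! orderings (small n only)."""
    return sum(
        1
        for perm in permutations(range(n))
        if all(perm[i] != i for i in range(n))
    )


def derangement_probability(n):
    """Probability that a uniformly random ordering is a derangement."""
    return derangements_inclusion_exclusion(n) / factorial(n)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, derangements_inclusion_exclusion(n), derangements_brute_force(n))
    print(derangement_probability(12), exp(-1))
```

The probability equals the partial sum of the series for e^{-1} up to the 1/n! term, which is why the agreement with 1/e is already excellent for modest n.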

Tuesday, September 13. Today we discussed some of the basic concepts of probability. We started by reviewing some of the definitions (σ-field (many books use the word algebra instead of field), probability measure and probability space). The point is that not every subset is an admissible event (in other words, not all subsets are assigned a probability). For the most part this is no problem, as points, intervals, squares et cetera provide a rich theory. The general case requires advanced analysis, in particular measure theory / Lebesgue integration. These technicalities are important in avoiding the Banach-Tarski paradox, which is due to the Axiom of Choice (which allows us to construct non-measurable sets); it is for this reason that I only believe in the Countable Axiom of Choice. For the specific points of today's class and related topics, here are some additional comments / readings. We talked a bit about what it means to choose an element uniformly at random on a circular or square dart board. We cannot deal with uncountable unions (see the wikipedia entries on countable and uncountable sets). If you want to learn even more about countable and uncountable, see the draft appendix from the probability book. For the purposes of our class, we really only need to worry about finite and countable. We have good intuition on what a finite set is; the quick definition of countable is that it can be placed in a one-to-one correspondence with the positive integers. In other words, we have a first element, a second element, and so on. It turns out that almost every real number is irrational; further, almost no numbers are algebraic (that is, roots of a non-zero polynomial with integer coefficients). The standard proof is Cantor's diagonalization argument.

Thursday, September 8. Here are some additional links to topics discussed today.

I strongly urge you to read the graduation speech here -- wonderful advice!
We discussed the Birthday Problem (Wikipedia gives the Taylor expansion argument from taking logarithms) and its generalization to Pluto. This is but one of many possible generalizations. What if we ask for how many people we need to have at least a 50% chance that at least three will share a birthday? Or that there will be at least two pairs of people sharing birthdays? Questions like these are great extra credit / challenge problems: if you're interested, just let me know.
The double-plus-one strategy is but one of many overlaps between probability and gambling. Other famous recent ones include card counting in blackjack. There are many references; see Thorp's original article as well as his book. Another fun read is Bringing Down The House.
Benford's law of digit bias is one of my favorite research topics (if anyone is interested, I might also have accessible projects here). If time and interest permit, I'll show you how you can prove this digit bias in a variety of interesting systems. I was interviewed by the Wall Street Journal about applying Benford's law to detect fraud in the Iranian elections (click here for articles on the Iranian elections).
Click here if you want to know more about the log5 method, namely which of the (p +/- pq) / (p + q +/- 2pq) models the probability that team A beats team B. The `derivation' is a nice exercise in elementary probability theory, if you buy the modeling assumptions. As you'll see throughout the course and beyond, one of the most difficult issues in the real world is deciding what are the important and irrelevant factors.
If you want to see details about the paper for the movie industry, click here, while for my sabermetrics paper (which we may discuss in the class), click here.
Again, click here for additional comments about the objectives for the course, including some entertaining and educational videos about the times we live in and the importance of asking the right questions.
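For the Birthday Problem mentioned above, here is a minimal Python sketch of the computation. It assumes all days are equally likely and ignores leap years; the number of days in the year is a parameter, so the same code handles the generalization to a planet with a much longer year. The function names are my own, and the approximation is the standard one that comes from taking logarithms and keeping the first term of the Taylor expansion.

```python
from math import exp


def prob_shared_birthday(people, days=365):
    """Probability that at least two of the people share a birthday."""
    prob_all_distinct = 1.0
    for i in range(people):
        # Person i+1 must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    return 1.0 - prob_all_distinct


def approx_shared_birthday(people, days=365):
    """Taylor approximation: 1 - exp(-people (people - 1) / (2 days))."""
    return 1.0 - exp(-people * (people - 1) / (2 * days))


def people_needed(days=365, target=0.5):
    """Smallest group size whose chance of a shared birthday reaches target."""
    people = 1
    while prob_shared_birthday(people, days) < target:
        people += 1
    return people


if __name__ == "__main__":
    print(people_needed())
    print(prob_shared_birthday(23), approx_shared_birthday(23))
```

Passing a different value of `days` to `people_needed` answers the same question for any year length; the answer grows roughly like the square root of the number of days.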