Additional Comments

Additional comments related to material from the class. If anyone wants to convert this to a blog, let me know. These additional remarks are for your enjoyment, and will not be on homeworks or exams. These are just meant to suggest additional topics worth considering, and I am happy to discuss any of these further.

Thursday, December 10.
- We discussed card counting and blackjack; click here for the basic strategy for blackjack (see also the wikipedia article). It's important to keep the system simple and easy to use and implement. Several papers on the subject are linked to below:
  - Baldwin et al.: Optimal Strategy in Blackjack (paper here)
  - Thorp's original article on blackjack (his book is available here: Thorp's book)
  - Thorp's article on the Kelly criterion in blackjack and other gambling situations.
  - classmates' summaries of these articles are available here
- We discussed the birthday problem (which has numerous generalizations and applications, including the birthday attack in cryptography).
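For anyone who wants to play with the birthday problem, here is a minimal Python sketch of the computation. It assumes 365 equally likely birthdays and independence between people (a standard simplification that is not spelled out above), and the function name is just one I made up for illustration.

```python
def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming `days` equally likely, independent birthdays."""
    if n > days:
        return 1.0  # pigeonhole: a repeated birthday is forced
    prob_all_distinct = 1.0
    for k in range(n):
        # the (k+1)st person must avoid the k birthdays already taken
        prob_all_distinct *= (days - k) / days
    return 1.0 - prob_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(shared_birthday_probability(n), 4))
```

The complement (everyone has a distinct birthday) is the easy quantity to compute, which is why the code builds that product first and subtracts from 1 at the end.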
Tuesday, December 8. We finished using geometric random variables to model baseball games. This is a first example from a rich field, sabermetrics (the art/science of applying math to baseball). It is quite difficult to obtain closed form expressions; they are quite rare, and one should celebrate one's good fortune whenever one occurs. If we don't have closed form expressions, we are forced to run the simulations again.
- We used geometric random variables to model runs scored and runs allowed. There are some problems with this model, but amazingly it does lead to a clean answer at the end of the day. The model is defined in terms of a decay parameter; the probability of scoring n+1 runs is always p times the probability of scoring n runs. We end up with a complicated answer in terms of p and q (the decay parameters of the two teams). Amazingly, after some algebra we get a very clean formula in terms of the means of the two geometric random variables, RS for X (runs scored) and RA for Y (i.e., how many runs team X allows, which is how many runs team Y scores). There are many ways to try to simplify the algebra; the key observation is that RS = p/(1-p) and RA = q/(1-q). Noting this, we multiply the numerator and denominator of our answer by 1/((1-p)(1-q)), as this allows us to obtain expressions involving RS and RA. This is one of the most important things to learn, namely how to multiply by 1 or add zero to clean up the algebra. While Mathematica can simplify the `ugly' expression obtained by replacing all p's with RS/(1+RS) and q's with RA/(1+RA), it is much faster to multiply by 1 as stated above, and I feel more illuminating.
- It is also worth noting that the final formula does have many properties we would like: it's between 0 and 1, as RS increases so does the winning percentage, there is no chance of winning if RS = 0 or RA goes to infinity, if RS=RA the winning percentage is 50%, .... You should ALWAYS try to do these simple heuristics to get a sense of a formula's reasonableness.
- A better estimator of a team's winning percentage is RS^γ/(RS^γ + RA^γ). Originally γ was taken to be 2 (and sadly ESPN still uses that!); however, recent research shows taking γ to be about 1.8 does a better job (this is what MLB.com uses). This is called the Pythagorean Won-Loss Formula, and is very useful in terms of predicting future performance. (For example, if a team is playing below their predicted ability, they might not write off the season and might trade prospects for a veteran to help with a playoff push; if they are not underperforming, they may write off the season and save their rookies for next year.)
- For more on the subject, see the links below:
  - Check out my paper on the Pythagorean Won-Loss Theorem
  - Another approach is the binary good day / bad day model, called the log5 method; click here for a paper by me on the subject.
  - Many more links on my independent study's homepage.
- Economics: the standard random walk hypothesis seems to have lost most of its supporters, though there are variants (and I'm not familiar with all); see also the efficient market hypothesis and technical analysis, and all the links there. (There are also many good links on the wikipedia page on Eugene Fama). Two famous books (with different conclusions) are Malkiel's A random walk down wall street and Mandelbrot-Hudson's The (mis)behavior of markets (a fractal view of risk, ruin and reward). Some interesting papers if you want to read more:
  - Mandelbrot: Variation on certain speculative prices (a must read!)
  - Fama: Mandelbrot and Stable Paretian Hypothesis
  - Fama: Random Walks Stock Prices
- For more on randomness, check out The Black Swan by Taleb (amazon.com page here, wikipedia page here). Several members of the class have recommended this book highly, and from reading excerpts on the web I understand why.
- For more on fractal geometry, click here. We did the Koch snowflake; another popular one is the Cantor set. See here for fractal dimensions. To actually compute pictures of items like the Mandelbrot set, one needs to iterate polynomials. This can lead to the fascinating subject of efficient algorithms; when I wrote such programs years ago on what would now be considered a `slow' computer, I had to use Horner's algorithm to get things to run in a reasonable time.
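To make the remark about Horner's algorithm concrete, here is a small Python sketch. It is only an illustration and not the program I wrote back then: the escape-time test, the radius 2 and the cap of 50 iterations are assumptions chosen for the example, as are the function names.

```python
def horner(coeffs, z):
    """Evaluate a polynomial at z using Horner's rule.

    coeffs lists the coefficients from the highest degree down to the
    constant term, so [2, -3, 1] means 2 z^2 - 3 z + 1.  A polynomial of
    degree d costs only d multiplications this way.
    """
    result = 0
    for c in coeffs:
        result = result * z + c
    return result


def escape_time(c, max_iter=50, radius=2.0):
    """Iterate z -> z^2 + c starting from z = 0.

    Returns the first step at which |z| exceeds radius, or max_iter if it
    never does (c is then treated as lying in the Mandelbrot set at this
    resolution).
    """
    z = 0j
    for step in range(max_iter):
        z = horner([1, 0, c], z)  # z^2 + 0*z + c
        if abs(z) > radius:
            return step
    return max_iter
```

For the quadratic z^2 + c the savings are tiny, but for a polynomial of high degree Horner's rule avoids recomputing powers of z at every step, and that is where the speed-up comes from.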
Thursday, December 3. One reason I enjoy additive number theory so much is that many of its problems are simply stated (though frequently the techniques to analyze them are quite involved). Today's problem on comparing the size of sumsets to difference sets is typical of many of the problems in the field. There are certain regions where the analysis can be handled with standard techniques encountered in courses early in undergraduate study. It's worth reviewing the techniques used to study this problem, as it's a great summary of what we've done and how they can be used.
- We consider a binomial model, where each integer in {1,...,N} is in A with probability p(N). We assumed in class that p(N) = N^-δ, δ in (1/2, 1); we'll see later why this assumption was needed. To see the phase transition we need to study all choices of δ in (0,1) and not just in (1/2, 1); unfortunately those other regions require recent advanced bounds towards strong concentration (this is a link to a great recent paper on the subject by Van Vu), and thus cannot be covered in a first course on probability. (For more on this subject, see the Wikipedia entry on Chernoff bounds.)
- The first step of the proof was to estimate the size of A. We used binary indicator random variables to study the size of a randomly chosen A. We have X_i = 1 if i is in A (which happens with probability N^(-δ)) and X_i = 0 otherwise. By linearity of expectation (this is a link to notes I've written on the subject), if X = X_1 + ... + X_N then E[X] = N E[X_i] for any i, or E[X] = N^(1-δ). As the X_i are independent, the variance is N p(N) (1 - p(N)), which is approximately N^(1-δ); the standard deviation is thus about sqrt(N^(1-δ)).
- Thus a typical A has about N^(1-δ) elements. We need to quantify `how close'. We could use the Central Limit Theorem, as we have a binomial with a large N; however, Chebyshev's inequality more than suffices. We have Prob(|X - N^(1-δ)| > .5 N^(1-δ)) < 4 / N^(1-δ). To see this, note that the standard deviation is at most sqrt(N^(1-δ)), and thus we are a HUGE number of standard deviations away, namely at least .5 sqrt(N^(1-δ)). For example, if N = 10^100 and δ = 4/5, then we are at least 5,000,000,000 standard deviations away, and thus the probability is quite negligible.
- The next step was to compute how many candidates we have for new sums and new differences; this is counting the number of pairs (m,n) with m < n. We exclude the diagonal case of pairs (m,m) as there are few of these.
- The final step is showing that very few of the pairs give the same sum or difference. This required some way to count how many m, n, m', n' there are such that all are in A and n-m = n'-m', say. We proceeded using binary indicator random variables again, and this time we had to use covariance as the variables were dependent. A nice exercise is to prove the claim Var(U+V) <= 2 Var(U) + 2 Var(V).
- For more details on these questions, see the following papers: When almost all sets are difference dominated; Constructing MSTD sets.
- We ended today by seeing an application of geometric random variables to model run production in baseball. I personally don't believe that this is the right model, but it is mathematically tractable and leads to a nice prediction (which we'll see on Tuesday). I think using Weibulls is better, as I do in the following paper.
- Finally, as an aside I mentioned fast primality testing. A deterministic, fast primality test was developed a few years ago by a computer scientist and his two undergraduates; this is one of the only examples I know of low-hanging fruit being missed for so long. See the references at the end of the link above for more information. The original paper is available here; I believe this link is to the version published in the Annals. If anyone wants to know some interesting stories about the paper, its publication and its impact, let me know.
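Returning to the sumset and difference set problem, the binomial model is easy to explore numerically. The Python sketch below is only an experiment under stated assumptions: the values of N and δ, the fixed seed and the function names are choices I made for illustration, and the brute-force pair counting is only practical for small sets.

```python
import random


def random_subset(N, delta, rng):
    """Each integer in {1, ..., N} is included independently with
    probability N**(-delta)."""
    p = N ** (-delta)
    return [k for k in range(1, N + 1) if rng.random() < p]


def sumset_and_difference_sizes(A):
    """Return (|A + A|, |A - A|) by brute force over all pairs."""
    sums = {a + b for a in A for b in A}
    diffs = {a - b for a in A for b in A}
    return len(sums), len(diffs)


if __name__ == "__main__":
    rng = random.Random(2009)  # fixed seed so that runs are repeatable
    N, delta = 100_000, 0.75
    A = random_subset(N, delta, rng)
    n = len(A)
    s, d = sumset_and_difference_sizes(A)
    print("expected size N^(1-delta):", round(N ** (1 - delta), 1))
    print("|A| =", n)
    print("|A+A| =", s, "  largest possible n(n+1)/2 =", n * (n + 1) // 2)
    print("|A-A| =", d, "  largest possible n(n-1)+1 =", n * (n - 1) + 1)
```

When A is sparse, almost all pairs give distinct sums and distinct differences, so both counts sit close to their largest possible values; as there are about twice as many candidate differences as candidate sums, the difference set is typically larger.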

Tuesday, December 1. The theme of today's and Thursday's lectures is going to be approximation theory. The goal is to replace complex expressions with simpler ones which are readily evaluated; in order to have a result and not just a heuristic, though, we must be able to control the error terms. You've seen examples along these lines before; we use Taylor series to replace complicated functions with simple polynomials (usually constant, linear or quadratic) (a special version of Taylor's theorem is the Mean Value Theorem). This only works, of course, if we can control the error. In our analysis today, there were several places where we said the terms were so small, even when summed, that they could be ignored. In proving the Modulo 1 Central Limit Theorem (this is a link to the paper), one of the key steps was played by Poisson Summation, which allowed us to replace a slowly converging sum with a rapidly converging one. We went from summands of size exp(-πn²/N) to summands of size exp(-πNn²). Note that the latter summands are quite small once n does not equal zero, and lead to an error that can be dominated by a geometric series.
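The gain from Poisson Summation can be seen numerically. The sketch below checks the special case Σ exp(-πn²/N) = sqrt(N) Σ exp(-πNn²), with both sums over all integers n; this is the identity for a Gaussian, which I am using here as a stand-in for the sums that arise in the proof, and the function names are my own.

```python
import math


def slow_sum(N, terms):
    """Sum of exp(-pi n^2 / N) over |n| <= terms.
    For large N the summands decay slowly, so many terms are needed."""
    return sum(math.exp(-math.pi * n * n / N) for n in range(-terms, terms + 1))


def fast_sum(N, terms):
    """sqrt(N) times the sum of exp(-pi N n^2) over |n| <= terms.
    For large N every summand with n != 0 is astronomically small."""
    return math.sqrt(N) * sum(
        math.exp(-math.pi * N * n * n) for n in range(-terms, terms + 1)
    )


if __name__ == "__main__":
    N = 100
    print("slow side,   5 terms each way:", slow_sum(N, 5))
    print("slow side, 200 terms each way:", slow_sum(N, 200))
    print("fast side,   1 term  each way:", fast_sum(N, 1))
```

With N = 100 the slow side is still well short of its limit after five terms in each direction, while on the fast side the n = 0 term alone already gives the answer to many decimal places.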
- The details of the error estimates can be found in two places. See Chapter 9 of my book (An Invitation to Modern Number Theory) (it's page 36 of the handout, which is page 232) for the calculation of the probability of |X| > σ^(1+δ) when X ~ N(0,σ). Chebyshev's inequality or theorem says that since this is σ^δ standard deviations away from the mean, the probability is at most 1/σ^(2δ). The actual probability is significantly less. It isn't surprising that the probability is so much smaller than what Chebyshev gives; the normal has extremely rapid decay, and Chebyshev is supposed to hold for any distribution. In our proof, we had two changes of variables. The first was to let u = x/σ. This converted the problem to finding the area under the standard normal at distance at least σ^δ from the origin. The second was to let w = u - σ^δ or u = w + σ^δ. This allowed us to exploit the fact that we are integrating over large u. We could have used the Cauchy-Schwarz inequality to do a little better, but we already have a good estimate which suffices for many applications.
- Notice that in the argument we used Taylor's theorem to replace the complicated exp(-π(x+n)²/N) with the simpler exp(-πn²/N); we then showed the error term had a minuscule contribution, and then used Poisson Summation to finish the argument.
- The additive number theory topic is a fascinating, accessible subject. I particularly enjoy the fact that there are two different heuristics one can use to try to decide if there should be more sum-dominated or difference-dominated sets. One argument is that x+x and y+y give different sums but x-x and y-y both give 0; this supports the claim that sets should be sum-dominated. On the other hand, addition is commutative and subtraction is not, and thus x+y = y+x but x-y and y-x are distinct. The question becomes: for a randomly chosen A, are we more likely to have diagonal terms like x+x (there are n choose 1 or n of these if A has n elements) or non-diagonal terms such as x-y (there are n choose 2 or about n²/2 of these); clearly it is the latter that should win. I will discuss an open problem related to difference equations and constructing explicit examples of sum-dominated sets on Thursday.
- My paper with Hegarty explores other models for these questions, where the probability of choosing a k in {1,...,N} is independent of k but depends on N. Depending on how fast the probability decays with N, we see different behavior, and there is a critical threshold (or perhaps a phase transition is a better phrasing) where fascinating behavior happens (the wikipedia article has several examples of these).
- Phase transitions are frequently hard to study, but they are where the action is and are extremely important. Examples range from population dynamics to the solid-liquid-gas charts we grew up on to the birth of the large component in graph theory (the paper linked here is one of the most important in the field; see also this paper by Erdos and Renyi). If you are interested in seeing wonderful applications of probabilistic methods, read or skim these papers! If you want to write a paper with me on this, you can have an Erdos number of 4 (which should be lowerable to 3 when I get a moment to finish a project with a senior colleague).
- Erdos numbers are lots of fun to compute (MathSciNet (choose collaboration distance under free tools) will do this), and lead to fascinating questions about how to search complex spaces for answers. It's similar to the Kevin Bacon number (both are based on the small world phenomenon / six degrees of separation) (there's also the Erdos-Bacon number, which is finite for very few people). An interesting paper is here; you can play the Kevin Bacon game at the Oracle of Bacon.
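Returning to the error estimates discussed above, it is instructive to see just how much smaller the true normal tail is than what Chebyshev guarantees. The Python sketch below assumes X is normal with mean 0 and standard deviation σ and uses the standard library's complementary error function; the function names and sample values are mine.

```python
import math


def chebyshev_bound(sigma, delta):
    """Chebyshev: P(|X| > k standard deviations) <= 1/k^2, here with
    k = sigma**delta."""
    k = sigma ** delta
    return min(1.0, 1.0 / (k * k))


def normal_tail(sigma, delta):
    """Exact P(|X| > sigma**(1 + delta)) for X normal with mean 0 and
    standard deviation sigma.  After the change of variables u = x / sigma
    this is the standard normal mass beyond k = sigma**delta, which equals
    erfc(k / sqrt(2))."""
    k = sigma ** delta
    return math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    for sigma in (4.0, 25.0, 100.0):
        print(sigma, chebyshev_bound(sigma, 0.5), normal_tail(sigma, 0.5))
```

Chebyshev only decays polynomially in σ^δ while the normal tail decays like a Gaussian, so the gap between the two columns grows very quickly.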

Tuesday, November 24. Today was a payoff day. After developing a lot of the general theory of probability, we were able to use it to solve and analyze problems of practical import, specifically, Benford's law of digit bias.
- Several good papers: Hill's The first digit phenomenon; Nigrini's I've got your number.
- We saw that small data sets can be misleading. For example, there were fewer 9s than predicted for the first 60 terms in the sequence {2^n}, but we saw that this was due to the fact that 2^10 is approximately 10^3, and thus the set {leading digit of 2^n base 10} is almost, but not quite, periodic with period 10. We saw periodic behavior in powers of π, due to the fact that π¹⁷⁵ is almost a power of 10. The convergence to Benford's law is controlled by how well approximated an irrational number is by rationals; this is a fascinating topic, and worthy of further study and thought. We measure how well approximated irrationals are by rationals by seeing how large of a denominator we need to get a given order of accuracy. This leads to irrationality exponents or measure; in fact, this idea is used to prove that Liouville numbers are transcendental numbers. If you would like to know more about these, let me know and I'll provide Chapter 5 of my book.
- The key ingredient in proving many systems are Benford is to show that if x_n is the original data set, then y_n = log_10 x_n is equidistributed modulo 1. How do we prove this? If x_n = a^n for some fixed a, then y_n = n log_10 a. A theorem of Kronecker (generalized by Weyl) states that n alpha mod 1 is equidistributed if and only if alpha is irrational (in addition to the analysis and number theory proofs, there is also an ergodic proof). For some problems, it isn't enough to know that it becomes equidistributed, but we also need to know how rapidly it becomes equidistributed; in many instances this is answered by the theory of linear forms in logarithms. This is frequently related to how well certain irrationals are approximated by rationals. In my paper with Alex Kontorovich on the 3x+1 problem, the key step in proving Benford behavior was showing that log_10 2 had finite irrationality exponent (we bounded it by about 10⁶⁰², a very large but also a very finite number!).
  - Click here for my paper with Alex Kontorovich on 3x+1 and Benford (as well as zeta(s)).
- To determine if the observed data is well described by our prediction, it is common to use a chi-square test (click here for a nice online chi-square calculator). There is a lot of beautiful theory on such tests; my favorite involves structural zeros (what happens when certain events cannot be observed, such as a tie in a non-Selig sanctioned baseball game). If you are interested, let me know and I can send you some papers which discuss the theory; it is briefly mentioned in my baseball paper.
- The proof of denseness of n alpha mod 1 for alpha irrational is significantly easier than equidistribution, involving Dirichlet's Pigeonhole Principle (the proof is sketched in the accompanying slides for today).
- We showed linear recurrence relations are Benford (or we mostly showed this) so long as the largest root of the characteristic polynomial exceeds 1. A nice exercise is to do this calculation rigorously; this is done in Chapter 9 of my book.
- For more on the hydrology data and Benford's law, see my paper with Mark Nigrini (and see the references there for Mark Nigrini's papers on tax fraud). Our newest paper with a new Benford test just appeared (the mathematics is proved in a separate paper, available here).
- Finally, we ended with a discussion of what the Central Limit Theorem modulo 1 looks like. I prove this in detail in this paper. We will discuss the proof of Poisson Summation on Tuesday, but will not prove it. (If you want to see a proof, let me know and I'll give you the relevant sections from my book on Fourier analysis). The proof we'll give of the CLT modulo 1 is not the most general result possible, as we will assume the Y_i's have finite variances -- this is not needed, as is shown in our paper! The proof is a bit harder (not surprisingly), but our friend the Cauchy distribution is not forbidden!
  - There are other generalizations of the central limit theorem. One particularly nice version involves Haar measure. Consider the set of N x N unitary matrices U(N), or its subgroups the orthogonal matrices and the symplectic matrices. It turns out there is a way to define a probability measure on these spaces (this is the Haar measure), and there are generalizations of the central limit theorem in these contexts: The n-fold convolution of a regular probability measure on a compact Hausdorff group G converges to normalized Haar measure in weak-star topology if and only if the support of the distribution is not contained in a coset of a proper normal closed subgroup of G.
- For convenience, the following is a collection of the papers I've written on Benford's law. As you can tell, I love the subject. There are many problems that are very amenable to undergraduate investigations; if you want to try your hand at research, let me know.
  - Benford's law, values of L-functions and the 3x+1 problem (with Alex Kontorovich), Acta Arithmetica. (120 (2005), no. 3, 269–297). pdf.
  - Benford's Law applied to hydrology data - results and relevance to other geophysical data (with Mark Nigrini), Mathematical Geology (39 (2007), no. 5, 469--490). pdf
  - The Modulo 1 Central Limit Theorem and Benford's Law for Products (with Mark Nigrini), International Journal of Algebra. (2 (2008), no. 3, 119--130). pdf
  - Order statistics and Benford's law (with Mark Nigrini), International Journal of Mathematics and Mathematical Sciences (Volume 2008 (2008), Article ID 382948, 19 pages, doi:10.1155/2008/382948) pdf
  - Chains of distributions, hierarchical Bayesian models and Benford's Law (with D. Jang, J. U. Kang, A. Kruckman and J. Kudo), Journal of Algebra, Number Theory: Advances and Applications. (volume 1, number 1 (March 2009), 37--60) pdf
  - Data diagnostics using second order tests of Benford's Law (with Mark Nigrini), Auditing: A Journal of Practice and Theory. (28 (2009), no. 2, 305--324. doi: 10.2308/aud.2009.28.2.305) MSWord file
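If you want to experiment with Benford's law yourself, the powers of 2 discussed above make a good first test. The Python sketch below compares observed leading digit frequencies with the Benford probabilities log_10(1 + 1/d); the function names and the choice of how many powers to examine are mine.

```python
import math
from collections import Counter


def benford_probability(d):
    """Benford's law: probability that the leading digit is d (1 <= d <= 9)."""
    return math.log10(1 + 1 / d)


def leading_digit(m):
    """Leading base-10 digit of a positive integer m."""
    return int(str(m)[0])


def leading_digit_frequencies(base, count):
    """Observed frequencies of leading digits among base^1, ..., base^count."""
    counts = Counter(leading_digit(base ** n) for n in range(1, count + 1))
    return {d: counts[d] / count for d in range(1, 10)}


if __name__ == "__main__":
    for count in (60, 1000):
        freqs = leading_digit_frequencies(2, count)
        print("first", count, "powers of 2")
        for d in range(1, 10):
            print("  ", d, round(freqs[d], 4), round(benford_probability(d), 4))
```

Python's exact integer arithmetic means there is no rounding issue in reading off the leading digit; comparing the run with 60 terms to the run with 1000 terms shows how a small data set can mislead.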

Thursday, November 19. All good things must come to an end, and today ends our proofs of the standard Central Limit Theorem. One can generalize it further by weakening the assumptions (we can allow the random variables to have different distributions, though independence is clearly important, as we do not expect X + X + ... + X to converge to a normal distribution in general). We will discuss another variant of the Central Limit Theorem when we study Benford's law later. Our previous proofs involved either directly working with the moment generating function (if it had a nice closed form expression) or Taylor expanding the moment generating function. Unfortunately the moment generating function need not always exist, which is why it is advantageous to use the Fourier transform approach. In the literature the Fourier transform of a probability density is called the characteristic function of the density, and always exists. If M_X(t) = E[e^tX] is the moment generating function and φ_X(t) is the characteristic function, then φ_X(t) = M_X(-2πit), so the two are related.
- We started out by reviewing why the convolution of two densities is the density of the sum of the corresponding random variables. This property is the reason convolutions play such an important role in the theory. The Fourier transform of a convolution is the product of the Fourier transforms. This converts a very difficult integral into the product of two Fourier transforms, and frequently these integrals can be evaluated. The difficulty is that, at the end of the day, we must then invert, and to prove the Fourier Inversion Theorem is no trivial task. Proving our error estimates for the integrals that converge to the convolution involved either Taylor's theorem with remainder or the Mean Value Theorem.
- Another nice and useful property of the Fourier transform is that the derivative of the Fourier transform is the Fourier transform of the original function multiplied by -2πix; this is very useful in solving differential equations. In particular, if p is our density and FT[p](y) is the Fourier transform at y, then FT[p]'(0) = -2πi E[X] and FT[p]''(0) = -4π² E[X²], so the derivatives at the origin recover the moments up to known constants. One formulation of quantum mechanics replaces position and momentum with differential operators; in this interpretation, the famous uncertainty principle is just a statement about a function and its Fourier transform! (See here for the physics explanation of the uncertainty principle.) Note the Taylor series expansion of FT[p] near the origin depends on the mean and the variance; if we normalize those appropriately, the `shape' of the distribution is not seen until we get to the third order term in the expansion. The absence of these shape parameters in the linear and quadratic terms of the Taylor expansion is what is responsible for the universality.
- It is worth emphasizing that, yet again, we needed to interchange an integration and a differentiation; click here for conditions on when this is permissible.
- We reduced the problem to understanding ( FT[p](y / sqrt(N)) )^N; from one point of view it should be close to 1 (as we are evaluating at almost 0, and FT[p](0) = 1), and from another point it should be large (as we are raising it to the Nth power). We Taylor expanded FT[p] and used the compound interest definition of exp(x).
- The proof was completed by showing that the result was the Fourier transform of the standard normal. It would be nice to see if this can be done by integrating by parts. One way to compute it is to note it equals Int_{-∞ to ∞} (1/sqrt(2π)) Exp(-t²/2) Exp(-2πity) dt. As Exp(-t²/2) is even and Exp(-2πity) = cos(2πty) - i sin(2πty), only the integral against the cosine piece contributes. We can compute the contribution by Taylor expanding cos(2πty) and doing some algebra, using in particular the definition of the factorial and double factorial. There is a slicker proof that avoids algebra by appealing to complex analysis. We know the moment generating function of the standard normal is M_X(t) = E[e^(tX)] = exp(t²/2). But φ_X(t) = M_X(-2πit); as the moment generating function agrees with exp(t²/2) for real t, the two functions must be equal for all complex values by results from complex analysis. Plugging -2πit in, we get M_X(-2πit) = exp(-2π²t²) as claimed.
- We ended the day by starting to discuss Benford's law. We'll talk about this in far greater detail on Tuesday; however, see the paper by Mark Nigrini.
- Finally, in the previous class we mentioned the harmonic sum 1 + 1/2 + 1/3 + 1/4 + .... There are lots of proofs that it diverges; one of the simplest groups the terms as 1 + 1/2 + (1/3 + 1/4) + (1/5 + ... + 1/8) + .... With a little work we see each quantity in parentheses is at least 1/2, and so the sum diverges.
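Returning to the Fourier transform proof of the Central Limit Theorem, one can watch ( FT[p](y / sqrt(N)) )^N converge to the transform of the standard normal. The Python sketch below does this for one specific density that I chose for illustration, the uniform density on [-sqrt(3), sqrt(3)], which has mean 0 and variance 1; the function names are mine.

```python
import math


def ft_uniform(y):
    """Fourier transform (convention: integral of p(x) exp(-2 pi i x y) dx)
    of the uniform density on [-sqrt(3), sqrt(3)], which has mean 0 and
    variance 1.  The transform is real because the density is even."""
    x = 2.0 * math.pi * math.sqrt(3.0) * y
    if x == 0.0:
        return 1.0
    return math.sin(x) / x


def ft_normalized_sum(y, N):
    """Fourier transform at y of the density of (X_1 + ... + X_N) / sqrt(N)."""
    return ft_uniform(y / math.sqrt(N)) ** N


def ft_standard_normal(y):
    """Fourier transform of the standard normal density: exp(-2 pi^2 y^2)."""
    return math.exp(-2.0 * math.pi ** 2 * y * y)


if __name__ == "__main__":
    y = 0.3
    for N in (10, 1000, 10 ** 6):
        print(N, ft_normalized_sum(y, N), ft_standard_normal(y))
```

Each factor is very close to 1, but there are N of them; the balance between these two effects is exactly the compound interest limit used in class.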

Tuesday, November 17. We finally gave a proof of the Central Limit Theorem! Our initial proof was for the special situation of sums of independent Poisson random variables (click here for a handout with the details of this calculation). The proof technique there used many ingredients in typical analysis proofs. Specifically, we Taylor expand, use common functions, and somehow argue that the higher order terms do not matter in the limit with respect to the main term (though they crucially affect the rate of convergence).
- The Central Limit Theorem has a rich history and numerous applications. What makes it so powerful and applicable is that the assumptions are fairly weak: essentially finite mean, finite variance, and something about the higher moments. The natural question is what exactly do we mean by convergence? There are several different notions.
- These types of convergence are explained in detail in Chapter 7 of our book, especially section 7.2. Almost sure convergence and convergence in the rth mean imply convergence in probability which implies weak convergence. The Borel-Cantelli problem from Chapter 1 is quite useful in proving almost sure convergence. For us, we are just showing that the moment generating function converges to the moment generating function of the standard normal, with the rate of convergence depending on the third moment (or fourth moment if the third moment vanishes; note the fourth moment is never zero). As many distributions have zero third moment, the fourth moment frequently controls the speed. This is why instead of looking at the kurtosis (fourth moment) we often look at the excess kurtosis, which is the kurtosis of our random variable minus the kurtosis of the standard normal. This is because it is this difference that frequently controls the speed of convergence.
- A classic result about how rapidly we have convergence to the standard normal is the Berry-Esseen Theorem.
- Taylor series played a key role in our proofs; the idea is that we can locally replace a complicated function by a simpler function, so long as we can control the error estimates.
- We discussed the probabilities of the standard normal taking on values in certain ranges (or outside these ranges). There are many different conventions used; click here for one such table.
- Another key ingredient in our proof was the exponential function, in particular its series expansion.
- We also summified our expression by using the identity P = exp(log P); this is very useful whenever P is a product, as logarithms convert products to sums. This is a great way to do nothing! We saw how well this worked to understand quantities such as P_N = (1 + x / N²)^N as N --> ∞. We took the logarithm, obtaining log P_N = N log(1 + x / N²); we then Taylor expanded the logarithm and found log P_N = x / N + terms of size 1/N³, 1/N⁵, .... Exponentiating gives us P_N = exp(x / N) exp(terms of size 1/N³, 1/N⁵, ...), and we thus obtain information on the speed of convergence.
- The proof for the Poisson random variable was very similar to the proof for arbitrary random variables whose moment generating functions exist in a neighborhood of t = 0. The difference, of course, is that while we always want to summify, it is particularly simple for the Poisson case as its moment generating function is a double exponential, specifically exp( λ (exp(t) - 1) ). This is a particularly nice function to take a logarithm of, and in fact this is why I always do this example.
  - It is worth thinking about why we (I) made a mistake in class about the variance of the Poisson. The mean and the standard deviation are supposed to be in the same units, so if the mean is λ then shouldn't the standard deviation be λ, because if the variance were λ then the standard deviation would be λ^(1/2) and that would have the wrong units, right? Wrong. For an exponential with density f(x) = λ exp(-λx) the mean and standard deviation are both 1/λ, and we can see that this is the correct λ dependence by scale issues: we exponentiate λx, so λx must be unitless, so if x is in meters, say, then λ is in 1/meters, and thus this is the correct λ dependence for the mean and standard deviation. What goes wrong for the Poisson? Remember the density there is f(n) = λⁿ e^(-λ) / n!; here λ is alone in the exponential and is thus unitless! This means we can't use the unit analysis to say that the standard deviation and the mean have the same λ dependence.
- One can prove the CLT directly in the case of Bin(N, 1/2). As a binomial random variable is the sum of Bernoulli random variables, we see that Bin(N,1/2) should become normally distributed as N tends to infinity. This can be proved directly, and uses Stirling's formula to estimate the binomial coefficients.
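The summifying step in the Poisson proof is easy to check numerically. The Python sketch below evaluates the logarithm of the moment generating function of the standardized Poisson, Z = (X - λ)/sqrt(λ), and compares it with t²/2, the logarithm of the moment generating function of the standard normal. The standardization and the function name are my choices for the illustration; the handout may organize the calculation differently.

```python
import math


def log_mgf_standardized_poisson(lam, t):
    """log of E[exp(t Z)] for Z = (X - lam) / sqrt(lam), X ~ Poisson(lam).

    The Poisson moment generating function is exp(lam (exp(t) - 1)), so
    log M_Z(t) = lam (exp(t / sqrt(lam)) - 1) - t sqrt(lam).
    expm1 avoids loss of precision when t / sqrt(lam) is tiny.
    """
    s = t / math.sqrt(lam)
    return lam * math.expm1(s) - t * math.sqrt(lam)


if __name__ == "__main__":
    t = 1.0
    for lam in (1e2, 1e4, 1e6):
        value = log_mgf_standardized_poisson(lam, t)
        print(lam, value, value - t * t / 2)
```

Taylor expanding the exponential gives t²/2 + t³/(6 sqrt(λ)) + ..., so the difference printed in the last column should shrink like 1/sqrt(λ); this is the third moment controlling the rate of convergence.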

Thursday, November 12. Today we finally applied our results from complex analysis to analyze the moment problem, namely how many moments must two distributions share to force them to be the same? We've already seen an example of two distinct densities that have the same integral moments, so more is needed. In fact, those two densities agree for all half-integral moments as well. One answer turns out to involve accumulation points; namely, if our densities are sufficiently nice then if they agree for a sequence of moments that accumulates, then the densities are equal. The proof uses our accumulation theorem from complex analysis, and the fact that there is a unique inverse Fourier transform of a Schwartz function.
- Looking at the two densities with the same integral moments, we find they also have the same half-integral moments, but that's where the agreement ends.
- In general this is called The Moment Problem; there are lots of variants. One of my favorites, possibly due to the name, is the Hamburger Moment Problem, which asks us when is a given sequence of numbers the integral moments of a probability density.
- A key step in our proof was that there is a unique inverse Fourier transform of a Schwartz function. This is similar to the following: if we consider the map f(x) = x² defined on the real numbers, then there are two x's that are mapped to 1, and hence there is no inverse. If instead, however, we restrict the map to be just on the interval [0, ∞) then there is a unique inverse. Restricting our functions to be Schwartz is similar to this.
- Another key step was interchanging differentiation and integration. It is very important to check to make sure we can do this interchange; it is frequently referred to as differentiating under the integral sign. While these theorems are stated for derivatives with respect to real variables, we can modify these to hold for differentiating with respect to a complex variable z by using the Cauchy-Riemann equations (the derivative with respect to z is related to a linear combination of derivatives with respect to x and with respect to y).
- Another key step was seeing that x^z log(x) h(x) was integrable; the difficulty is that log(x) tends to negative infinity as x tends to zero; fortunately the presence of the x^z factor saves the day, as x to any positive power decays faster to zero than log(x) grows to minus infinity (as x tends to 0). One way to see this is to let y = 1/x and use L'Hopital's rule.
- We next talked about standardizing a random variable, sending X to (X - E[X]) / StDev(X). This allows us to compare apples and apples. Note of course not all random variables can be standardized; the Cauchy distribution for instance cannot. We only compute tables of the standard normal; by standardizing we can deduce the probabilities of any normal random variable from a table of probabilities of the standard normal. This is similar to the change of base formula for logarithms. Knowing log_b(x) = log_c(x) / log_c(b), if we know logarithms base c we then know them base b, and thus it suffices to create just one table of logarithms.
- Proving that the average (X_1 + ... + X_N) / N of i.i.d. random variables with finite mean and variance converges to the random variable's mean is not too bad; one can do this by applying Chebyshev's Theorem. If, however, we want to know the rate of convergence, we need more than Chebyshev; this is the content of the Central Limit Theorem. We saw some numerics today on the rates of convergence of standardized uniforms, Laplaces (two-sided exponentials), normals and Millered Cauchy's. We'll discuss rates of convergence in detail later, and we'll see that they are controlled by the third moment (or the fourth moment if the third vanishes). The third moment is called skewness, the fourth is called kurtosis. Actually, when the third moment vanishes it is excess kurtosis that's more useful; we'll see more on this when we look at the Taylor series expansion of the logarithm of the moment generating function.
- We ended today by computing the moment generating function of the standard normal, seeing that it is exp(t²/2). The key step in the proof is completing the square (there are lots of nice examples on the Wikipedia entry). It takes a while to see how to simplify algebra / how to write algebra in a good way. When we have something like -x²/2 + xt and we know we want the argument of the exponential to be negative, it is natural to write it as -(1/2)(x² - 2tx), and this is screaming at us to add 0 via t² - t².
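For reference, the completing-the-square step for the standard normal's moment generating function can be written out as follows. This is a sketch of the standard calculation, with M_X(t) denoting the moment generating function; the last integral equals 1 because the integrand is the density of a normal with mean t and variance 1.

```latex
\begin{align*}
M_X(t) &= \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \\
-\frac{x^2}{2} + tx &= -\frac{1}{2}\left(x^2 - 2tx + t^2 - t^2\right)
  = -\frac{(x-t)^2}{2} + \frac{t^2}{2} \\
M_X(t) &= e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(x-t)^2/2}\, dx = e^{t^2/2}
\end{align*}
```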

Tuesday, November 10. Today we continued our quick tour of complex analysis, and the results we stated today will be used on Thursday to get a better sense of why we can have the ridiculous situation of two probability distributions being unequal yet having the same integral moments.
- We stated one of the truly amazing results from complex analysis, namely that if the zeros of a complex differentiable function defined on a connected open set U have an accumulation point in U, then the function is identically zero on U. This is profoundly different from real analysis. For example, we saw that the function x³ sin(1/x) is differentiable as a function of a real variable and vanishes at 0 and all points 1/(πn) for n a nonzero integer; however, this function is not complex differentiable.
- We tried to compute the complex derivative of z³ sin(1/z), but saw that it was not differentiable as the limit depended on how we approached the origin. In general, it is very hard to show a limit exists without getting something nice like h⁴/h, as we have to investigate all possible paths; however, it frequently isn't too bad to show a limit doesn't exist by taking two cleverly chosen paths. It is a very strong condition to assume a function is complex differentiable; this is why, unlike real analysis, the existence of one complex derivative implies that the function is infinitely differentiable and equals its Taylor series.
- We briefly discussed again the 3x+1 problem (see Lagarias' bibliographies on the subject, part 1 and part 2, for a summary of much of what is known). My paper (with Alex Kontorovich) connecting the 3x+1 problem to Benford's law is available here.
- We discussed two of the most important integral transforms, the Laplace Transform and the Fourier Transform; these two transforms are related to each other and to another one, the Mellin transform (we've seen the Mellin transform when studying the Gamma function, as the Gamma function is the Mellin transform of the exponential function exp(-x)). These are all integral transforms, which are frequently used to solve a variety of problems. The ones we are studying have the wonderful property that they can be expressed as integrating against a fixed function (called the kernel); for many important applications this is true, but not always (see Picard's iteration method to solve first order differential equations). Each of these transforms has its advantages and disadvantages; depending on the problem you are studying, some make the algebra easier and some make it harder. Note it is not always the case that the transform exists; for example, the moment generating function of X is E[e^(tX)] = ∫ e^(tx) f(x) dx, which does not converge in any neighborhood of the origin for a Cauchy random variable (we have many wonderful proofs allowing us to pass from knowledge of moment generating functions to knowledge of the density when the moment generating function converges in a neighborhood of the origin). The Fourier transform of a probability distribution, however, always exists for all values; this is called the characteristic function, and as it always exists, one can see why this would be of use and interest. In general it isn't too bad to compute these integral transforms, but it is hard to invert them. Frequently we must restrict the space of functions we're studying in order to have a nice inversion statement. One space often studied is the Schwartz space. This leads to a nice formula for the Inverse Fourier Transform.
- When talking about the difficulty of inverting a transform, we briefly mentioned how a similar situation is beautifully exploited in cryptography. Many cryptosystems are based on a trap-door algorithm, namely taking some process that is easy one way but hard to invert unless you know a key or trap-door or some extra bit of information not publicly available. The standard, but by no means only, example is that it is easy to multiply two numbers, but currently it is hard to factor numbers. Many of these cryptosystems use just elementary math to state how they work, but very advanced math to discuss their security. Two of my favorites are RSA and elliptic curve systems. See also the homepage for my winter study on cryptography: Math 10: LQWURGXFWLRQ WR FUBSWRJUDSKB.
- One can actually multiply two numbers, or two matrices, much faster than you'd expect. Below is a summary of some very efficient algorithms, which allow us to do some basic operations much faster than you might expect.
  - Horner's algorithm to evaluate a polynomial quickly: 4x^3 + 5x^2 - 3x + 8 = ((4x+5)x - 3)x + 8 (saves a few multiplications!). Saving multiplications is very important; one application of evaluating polynomials quickly is in constructing Mandelbrot sets.
  - Telescoping sums: often re-arranging the algebra leads to a significantly easier computation.
  - Fast matrix multiplication: Naively we expect it takes N^3 multiplications to find all N^2 entries of A^2 or AB when A and B are NxN matrices. The Strassen algorithm (see also the Mathworld entry here, which I think is a bit more readable) does it in about N^(log_2 7); the reason for this savings is that one can multiply two 2x2 matrices with seven rather than eight multiplications (3 = log_2 8). The best known algorithm is the Coppersmith-Winograd algorithm, which is of the order of N^2.376 multiplications. See also this paper for some comparison analysis, or email me if you want to see some of these papers.
    - Some important facts. The Strassen algorithm has some issues with numerical stability.
    - One can ask similar questions about one dimension matrices, ie, how many bit operations does it take to multiply two N digit numbers. It can be done in less than N^2 bit operations (again, very surprising!). One way to do this is with the Karatsuba algorithm (see also the Mathworld entry for the Karatsuba algorithm).
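Two of the algorithms in the list above are short enough to sketch in code. The Python below is only an illustration, not an optimized implementation: the function names `horner` and `karatsuba` are my own, coefficients are assumed to be listed from the highest degree down, and the Karatsuba version splits on decimal digits for readability (a serious implementation would split on bits).

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x; coeffs are listed from highest degree down."""
    result = 0
    for c in coeffs:
        # One multiplication and one addition per coefficient.
        result = result * x + c
    return result


def karatsuba(a, b):
    """Multiply non-negative integers using three recursive products instead of four."""
    if a < 10 or b < 10:
        return a * b  # base case: one factor is a single digit
    half = max(len(str(a)), len(str(b))) // 2
    base = 10 ** half
    a_hi, a_lo = divmod(a, base)
    b_hi, b_lo = divmod(b, base)
    lo = karatsuba(a_lo, b_lo)
    hi = karatsuba(a_hi, b_hi)
    # (a_hi + a_lo)(b_hi + b_lo) - hi - lo equals a_hi*b_lo + a_lo*b_hi,
    # so the cross terms cost one multiplication rather than two.
    cross = karatsuba(a_hi + a_lo, b_hi + b_lo) - hi - lo
    return hi * base * base + cross * base + lo


# The polynomial from the notes: 4x^3 + 5x^2 - 3x + 8 = ((4x+5)x - 3)x + 8.
print(horner([4, 5, -3, 8], 2))
print(karatsuba(1234, 5678) == 1234 * 5678)
```

The saving in Karatsuba comes from the same kind of "add zero" trick used elsewhere in these notes: the cross terms are recovered by subtracting two products we already computed.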

Thursday, November 5. In today's lecture we developed some more of the theory of generating functions, seeing the connections with probability. This is a very rich and powerful theory, and what we've seen is only some of its tremendous applications.
- We proved that G_X+Y(s) = G_X(s) G_Y(s) and M_X+Y(s) = M_X(s) M_Y(s) for independent X and Y, as well as additional properties, such as a formula for G_aX+b(s). These proofs have much in common with Calc I and Calc II. Namely, we spend a lot of time doing the algebra to prove G_X+Y(s) = G_X(s) G_Y(s) once; the advantage is that once we have done it, we can simply use the result in later problems. For example, if asked to differentiate x cos(x) we don't write down the definition of the derivative, but rather we use the product rule. The reason is that it is advantageous to do the calculation once in general, get the result, and then in the future jump directly to that point for the function of interest. It is similar for moment generating functions; we spend the time now doing the calculations so we can just apply these results later.
- Earlier we showed by brute force that the sum of two independent Poissons is a Poisson with parameter equal to the sum of the parameters. We can now provide an alternative, shorter proof with moment generating functions, as the moment generating function of a discrete random variable taking on values in {0, 1, 2, ...} uniquely determines its distribution. The reason the algebra is so much simpler in using the MGF is that we did the hard work in proving M_X+Y(s) = M_X(s) M_Y(s), and are now just reaping the rewards.
- We gave an example of two densities that have the same moments but are not equal; this is the analogue of the pathological function from real analysis. A really good extra credit problem is to compute their integral moments (i.e., their kth moments for positive integer k) and see that these agree. Do you think any of the non-integral moments agree?
- We then introduced much of the terminology in complex analysis, including a complex variable, complex differentiability (which implies that our function satisfies the Cauchy-Riemann equations), open sets and closed sets, and the major theorem that f is a holomorphic function if and only if f is analytic (in other words, if a function has even one complex derivative then it has infinitely many and it equals its Taylor series expansion!). Much of this language (such as open and closed sets) is required for advanced discussions in analysis and topology. One of my favorite applications of all of this is Furstenberg's celebrated proof of the infinitude of primes through a topological argument (ie, through open and closed sets!).
- We ended by looking at a plot of x³sin(1/x). When I traced out the top part of the plot and asked what its shape looked like, many in the class responded that it looked like a parabola. This is a terrific example of how the way a question is framed influences our answer. The correct way to look at the plot is to look at half of the bottom and then half of the top, and you see a cubic. We are frequently not aware of how things around us are being framed and thus how we are being forced / guided to a given answer or world view -- it is worth stopping and thinking about this every now and then. If you are interested in these topics, I recommend the following two videos:
  - Dan Pink on Motivation
  - Malcolm Gladwell on spaghetti sauce
- Speaking of videos and being misled, you might enjoy listening to the song I'm my own grandpa (text is available here). It's a good exercise to work through the lyrics and see that it is correct -- frequently in math we are given theorems where if a condition is removed one of two things happens: (1) the result is now false; (2) the proof is now harder. (For this example, see the Wikipedia page on I'm my own grandpa for an analysis). For example, today we showed that M_X+Y(s) = M_X(s) M_Y(s) if X and Y are independent random variables; it's a good exercise to show that this need not be true if we remove the assumption that X and Y are independent.
  - Occasionally, though, proofs become easier if we remove conditions, as these conditions are getting us to look at the problem in the wrong way. For example, look up the definition of algebraic numbers and transcendental numbers. A wonderful result is that e and π are both transcendental numbers. Further, we can prove that at least one of e+π and e π is transcendental (though we believe both are). Seeing this result, it is natural to think that properties of e and π enter into the proof. In fact, there is nothing special about e and π; if x and y are any two transcendental numbers then at least one of x+y and xy is transcendental! Thus, even though we might think the proof involves special formulas / properties of e and π (such as perhaps the relation exp(πi) = -1), it does not!
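The claim above about sums of independent Poissons can be checked numerically in two ways: by brute force (convolving the two mass functions) and through the moment generating function, which for a Poisson with parameter λ is exp(λ(e^t - 1)). The Python below is a small sketch; the function names and the parameter values 2.0 and 3.5 are arbitrary choices of mine, not from the lecture.

```python
import math


def poisson_pmf(lam, k):
    """Prob(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_mgf(lam, t):
    """Moment generating function E[e^(tX)] of a Poisson(lam)."""
    return math.exp(lam * (math.exp(t) - 1))


def sum_pmf(lam, mu, k):
    """Prob(X + Y = k) for independent Poissons, by summing over all ways to split k."""
    return sum(poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))


lam, mu = 2.0, 3.5
for k in range(6):
    # Brute-force convolution next to the Poisson(lam + mu) mass function.
    print(k, sum_pmf(lam, mu, k), poisson_pmf(lam + mu, k))

# The MGF route: the product of the MGFs is the MGF of a Poisson(lam + mu).
t = 0.3
print(poisson_mgf(lam, t) * poisson_mgf(mu, t), poisson_mgf(lam + mu, t))
```

The two printed columns agree up to floating-point rounding, and the MGF check is a one-line computation, which mirrors how much shorter the MGF proof is.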
Tuesday, November 3. In today's lecture we saw another example of divine inspiration in solving difference equations. We then turned to sums of independent normal random variables, and ended by discussing different types of generating functions.
- We showed earlier in the semester how to solve difference equations using the method of divine inspiration. Today we discussed an application to a random walk problem with two absorbing boundaries (at 0 and N), namely, if we start at k how long do we expect to walk until we hit a boundary? The difference equation that arises is close to, but slightly different than, the one we encountered before for the probability of winning. This complication sadly means our original guess of the solution does not work, nor does the next most natural choice. For more details, including the solution, look at the solution to Wentao's second proposed problem (Section 57, page 49). A nice challenge problem is to derive the solution in the special case that p = 1/2 (obviously the most important solution, which makes it annoying that the method in class fails there!). For more on `guessing' how to be divinely inspired, see here (especially Section 3).
- There is a deep and rich theory of sums of normal random variables (and their squares), which is described in greater detail in a statistics class. Two items from today of special note are the definition of the sample variance and the independence of the sample mean and sample variance.
  - The sample mean is defined by X̄ = (Sum_{i = 1 to N} X_i) / N and the sample variance by S² = (Sum_{i = 1 to N} (X_i - X̄)²) / (N-1). The main theorem is that (N-1) S² / σ² has a chi-square distribution with N-1 degrees of freedom (here σ² is the variance of the identically distributed normal random variables). It is not immediately clear why we divide by N-1 and not N; after all, there are N data points, and we do divide by N for the variance of a finite set of data. There are valid statistical reasons for this (wanting an unbiased estimator; I strongly urge you to read the Wikipedia entry, as there is a nice bit on the proof, using (what else) adding zero; see also Cochran's theorem). I use the following heuristic to explain why it's N-1 and not N; namely, consider the extreme case of N=1. In this case, while one observation can be used to estimate the true mean, it is absurd to think one observation can be used to estimate the true variance! The reason is that we need to look at differences, at fluctuations about the mean, to get a handle on the variance -- how can we do this with just one data point?
  - A major theorem is that the sample mean and sample variance are independent. This is not at all clear from the definition (as the sample variance involves the mean). This leads to studying the statistic t = (X̄ - μ) / (S / sqrt(N)), where X̄ is the sample mean and S is the square root of the sample variance; this is known as the t-statistic and has the t-distribution with N-1 degrees of freedom (here μ is the mean of the identically distributed normal random variables). As N tends to infinity this converges to the standard normal, but is very useful for finite N when we have independent Gaussian random variables with unknown variance.
- We discussed generating functions / moment generating functions / characteristic functions. These functions encode information about problems of interest; for wonderful applications to number theory, see the final section of the course notes (these techniques can be applied to attack Waring's Problem and Goldbach's Problem, among others). One of the biggest uses of these is that they simplify the algebra, as they are significantly easier to work with. In many cases we can find closed form expressions, and the derivatives of these are then related to means, variances, and moments. It is typically very rare to be able to get a nice, closed form expression of something in the real world (for some nice examples of where this is possible, see some of my sabermetrics papers: the Weibull approach to winning percentages and the log5 method; for a more marketing / economics example, see my paper with Eric Bradlow and Kevin Dayaratna; this paper appeared in the journal of Quantitative Marketing and Economics, and you might notice the cookie problem in the appendix!).
- In our analysis of generating functions, we reiterated the warning that analysis is hard. Namely, the function f(x) = exp(-1/x²) if x is not zero and 0 otherwise has all of its derivatives vanish at 0, but its Taylor series agrees with the original function only at x=0 (which is nothing to be proud of!). Complex analysis is quite different; there if a function is complex differentiable once then it is infinitely complex differentiable, and it equals its Taylor series in a neighborhood of the point. This fact is one reason why we frequently use characteristic functions instead of generating or moment generating functions.
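The N-1 versus N question for the sample variance can be checked exactly on a toy example. The sketch below does not use normal random variables; instead it assumes i.i.d. rolls of a fair die (true variance 35/12) and averages the estimator over all equally likely samples using exact fractions. The function names are mine.

```python
from fractions import Fraction
from itertools import product


def average_estimate(population, n, divisor):
    """Average of Sum (x_i - sample mean)^2 / divisor over all samples of size n.

    Samples are drawn with replacement, so all len(population)**n ordered
    samples are equally likely (i.i.d. draws).
    """
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        mean = Fraction(sum(sample), n)
        total += sum((x - mean) ** 2 for x in sample) / divisor
        count += 1
    return total / count


def true_variance(population):
    """Variance of a single uniform draw from the population."""
    m = Fraction(sum(population), len(population))
    return sum((x - m) ** 2 for x in population) / len(population)


die = [1, 2, 3, 4, 5, 6]
n = 3
print("true variance:        ", true_variance(die))
print("average, divide by N-1:", average_estimate(die, n, n - 1))
print("average, divide by N:  ", average_estimate(die, n, n))
```

Dividing by N-1 reproduces the true variance on average, while dividing by N comes out too small by a factor of (N-1)/N.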

Thursday, October 29. Unquestionably one of the gems of probability and statistics is the Central Limit Theorem. The proof and applications involve understanding the sum of independent random variables, often identically distributed. This leads to the following fundamental, natural question: Given random variables Xi with densities fi, is there a nice formula for the density of X1 + ... + Xn in terms of f1 through fn?
- As a first case, we considered X1 + X2 with each Xi ~ Uniform(0,1). To get a feeling for the answer, we looked at rolling two fair dice and the distribution of the resulting sums. We found Prob(R1 + R2 = k) = (6 - |k-7|)/36 for 2 <= k <= 12 and 0 otherwise. This is a triangle, it's symmetric about the mean, the density is largest at the mean, .... It is unlikely that these features depend on the dice having 6 sides, and thus it is reasonable to expect the density of X1 + X2 to be a triangle supported in [0,2] with maximum density at the mean of 1.
- We proved this by using convolutions and then brute force integration. Convolutions are incredibly powerful and useful in probability, and provide a very useful way to explore many problems. The convolution is defined by (f1 * f2)(x) = Integral_{t = -oo to oo} f1(t) f2(x-t)dt. If fi is the density of Xi and X1, X2 are independent, this is the density of X1+X2. We proved this by using the cumulative distribution function of Y = X1+X2 (which was a double integral) and then differentiating. The key step was interchanging the derivative and the integral. In general we cannot interchange orders of operations (sqrt(a+b) is typically not sqrt(a) + sqrt(b)), but sometimes we're fortunate (click here for a nice article on Wikipedia on when this is permissible).
- There is enormous structure behind convolutions of probability distributions. Let f be the density function for the random variable X, and g the density function for the random variable Y. As X+Y = Y+X, we find f * g = g * f (ie, the operation is commutative), and f * (g * h) = (f * g) * h (the operation is associative). Convolution is also closed (if f and g are densities, so is f * g). Note this is beginning to look like a group; namely, we have a collection of objects (in this case, probability densities or maps from the reals to the reals) and a way to combine them (convolution) that is closed, associative, and even commutative. If we just had an identity element and inverses, we would have a group (a commutative group, in fact). Groups occur throughout the sciences and the world; two of my favorites are the Rubik's cube and the Monster group. As there is a lot of structure in groups, it's natural to ask whether or not we can find an identity element and inverses.
  - The identity element is not hard to find. We define the Dirac delta functional δ(x) as follows: for any probability density f(x), Integral_{x = -oo to oo} f(x)δ(x) dx = f(0). One may view δ(x) as the density corresponding to a unit point mass located at 0; similarly we would have Integral_{x = -oo to oo} f(x) δ(x-a) dx = f(a), corresponding to a unit point mass at a. We have actually seen Dirac delta functionals before. For example, let X be Bernoulli(p). This means Prob(X=1) = p, Prob(X=0) = 1-p and any other x has Prob(X=x) = 0. If we let f(x) denote the probability mass function, we have f(x) = p δ(x-1) + (1-p) δ(x). It turns out that the Dirac delta functional (which does integrate to 1, which can be seen by taking f(x) = 1 in Integral_{x = -oo to oo} f(x) δ(x-a) dx) acts as the identity. We now show f * δ = f. We have (f * δ)(x) = Integral_{t = -oo to oo} f(t) δ(x-t) dt = f(x).
  - Thus the only obstacle in whether or not we have a group (with group operation given by convolution) is whether or not there is an inverse. Is there? Perhaps there is an inverse if we restrict the types of probability distributions we study (for example, maybe we only look at densities defined on a compact interval).
- We introduced the Fourier Transform today. Be careful: there are at least three natural definitions; I prefer f^(ξ) = Integral_{x = -oo to oo} f(x) e^(-2πixξ) dx. There are many great properties of the Fourier transform; one of the most important properties is that the Fourier transform of a convolution is the product of the Fourier transforms, or (f *g)^(ξ) = f^(ξ) g^(ξ). The proof required us to use Fubini's theorem to interchange the order of integrations, and some basic facts of complex analysis (which we'll review again below). For those familiar with group theory, what we have looks a lot like a group homomorphism (we have to say a lot like as we haven't proved that there are inverses).
- The reason the Fourier transforms are so useful is the following: imagine there is an inverse Fourier transform for every nice function. If we want to study the sum X1 + ... + Xn, we know its density is f1 * ... * fn; assuming the Xi are independent, identically distributed random variables then the fi are all equal, say f. The Fourier transform converts convolution to multiplication, and thus (f * ... * f)^(ξ) = f^(ξ)^n. If Finv denotes the inverse Fourier transform, then (f * ... * f)(x) = Finv(f^(ξ)^n)(x). Thus, if we can invert the Fourier transform of f^(ξ)^n, then we have a formula for the density of the sum!
- It is not immediately clear that, to understand real functions of real variables, we need to study complex numbers. If i = sqrt(-1) and z = x + i y, then the complex conjugate of z is defined by x - i y. The length of a complex number z is defined by |z| = sqrt((x+iy)(x-iy)) = sqrt(x^2 + y^2). Recall the exponential function exp is defined by e^z = exp(z) = sum_{n = 0 to oo} z^n/n!. This series converges for all z. The notation suggests that e^z e^w = e^(z+w); this is true, but it needs to be proved. (What we have is an equality of three infinite sums; the proof uses the binomial theorem.) Using the Taylor series expansions for cosine and sine, we find e^(iθ) = cos θ + i sin θ. From this we find |e^(iθ)| = 1; in fact, we can use these ideas to prove all trigonometric identities! For example:
  - Inputs: e^(iθ) = cos θ + i sin θ and e^(iθ) e^(iφ) = e^(i (θ+φ))
  - Identity: from e^(iθ) e^(iφ) = e^(i (θ+φ)) we get, upon substituting in the first identity, that (cos θ + i sin θ) (cos φ + i sin φ) = cos(θ+φ) + i sin(θ+φ). Expanding the left hand side gives (cos θ cos φ - sin θ sin φ) + i (sin θ cos φ + cos θ sin φ) = cos(θ+φ) + i sin(θ+φ). Equating the real parts and the imaginary parts gives the identities
    - cos(θ+φ) = cos θ cos φ - sin θ sin φ
    - sin(θ+φ) = sin θ cos φ + cos θ sin φ
  - One can prove other identities along these lines....
- Finally, a common theme in mathematics is the need to simplify tedious algebra. Frequently we have claims that can be proven by long and involved computations, but these often leave us without a real understanding of why the claim is true. If you want, let me know and I'll show you my 40-50 page proof of Morley's theorem; Conway has a beautiful proof which you can read here (it's after the irrationality of sqrt(2)).
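The discrete analogue of the convolution discussed above is easy to compute exactly, and it lets us check the dice formula Prob(R1 + R2 = k) = (6 - |k-7|)/36, commutativity, and the fact that a unit point mass at 0 acts as the identity. The Python below is a minimal sketch; representing a mass function as a dictionary from values to probabilities is my own choice for illustration.

```python
from fractions import Fraction


def convolve(p, q):
    """Mass function of X + Y for independent X, Y with mass functions p and q.

    Each mass function is a dict mapping a value to its probability.
    """
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out


die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)

# Compare with the triangle formula (6 - |k - 7|) / 36.
for k in range(2, 13):
    print(k, two_dice[k], Fraction(6 - abs(k - 7), 36))

# A unit point mass at 0 plays the role of the Dirac delta: it is the identity.
delta = {0: Fraction(1)}
print(convolve(die, delta) == die)
```

Iterating `convolve` gives the mass function of the sum of three, four, ... dice, and plotting those is a nice way to watch the bell shape emerge.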

Tuesday, October 27. Today's lecture was a mix of applications of old material and a sales pitch of things to come.
- The main theme of the first part was the Change of Variable Formula. The key (and most difficult ingredient) is the Jacobian, which tells us how the volume element changes. We did the calculation in great detail for polar coordinates, though of course the argument holds in greater generality. One strange application of our analysis today was a formula for the sum of two independent random variables X₁, X₂ which are Exponential(λ). We let Y₁ = X₁ + X₂ and Y₂ = X₁/X₂ (and Y₃ = X₁ - X₂). We obtained a joint density in each case for Y₁ and Y₂ or Y₃, and by integrating out Y₂ or Y₃ we were left with the density of Y₁! When we study convolutions we'll find better, simpler, more tractable formulas for the density of sums of random variables, but it is fascinating to see what we get here. The general, big picture idea that's floating around all of this is that we're transforming functions to functions, be it through Jacobians, convolutions, or integral transforms (such as the Laplace and Fourier transforms, which we'll meet soon).
- In the proof of the one-dimensional change of variable formula, one of the key ingredients was the Fundamental Theorem of Calculus. We needed this to find the cumulative distribution function of Y = g(X); we then differentiated to get the density. Thus, while at the end of the day we do not need to know F_X, it was important to have it for an intermediate step of the calculations.
- Another ingredient in the proof of the one-dimensional change of variable formula was that if g(h(y)) = y then h'(y) = 1 / g'(h(y)). This is a nice application of the chain rule to inverse functions (as g(h(y)) = y and h(g(x)) = x, we say g and h are inverses of each other). We used this relation to find the derivative of the arctangent function. When we first encounter such functions in Calc I or Calc II, they seem un-natural, primarily chosen to provide tests of how well you have mastered differentiation. These functions, however, do naturally arise in many applications. My favorite examples are in determining the cumulative distribution function (and hence the normalization constant) for a Cauchy random variable (which has density (π(1+x²))^-1). Distributions such as the Cauchy are terrific for testing how general results are in probability and statistics; I have a nice paper using a distribution which is a variant of the Cauchy to show the limitations of the famous Cramer-Rao inequality for determining optimal statistical tests. I've also seen analogues arise in nuclear physics. The second occurrence of arctangent today was in the change of variable formulas from polar to Cartesian.
- One wonderful application of the one-dimensional change of variable formula is to generating random variables given a uniform random number generator. There is a huge industry that tries to construct random number generators from different distributions; it becomes much harder when we have dependent, multivariate joint densities (ie, we have several random variables and the joint density does not factor). Random.org is a nice website collecting various algorithms for different types of randomness, ranging from cards to jazz to numbers (I strongly urge you to check out this website, which generates postmodern papers randomly; if you enjoy that, you should also see the most famous essay in the subject, which is by the physicist Alan Sokal: "Transgressing the Boundaries: Toward a Transformative Hermeneutics of Quantum Gravity" -- to get an idea of how absurd it is, go to the html file and search for "In 1982, when Irigaray's essay").
- We talked a bit about how many shuffles are needed to randomize a deck of cards. The classic paper is by David Bayer and Persi Diaconis (if you cannot read it, let me know and I'll get it for you). If you want, I can also share some illegal bridge bidding conventions that involve encrypting your bid so that only your partner can decode it!
- We talked a bit about the limiting behavior for sums of random variables. A natural thing to do to any random variable is to normalize or standardize it. Thus, instead of studying Y one should study (Y - mean(Y)) / StDev(Y) (provided the mean and standard deviation exist). This new quantity has mean 0 and variance 1, and thus we should be able to compare it to other similar quantities (ie, we're now comparing apples and apples, not apples and oranges).
- We ended with a discussion of Pepys' problem. This is perhaps our first example leading towards the Central Limit Theorem. A terrific challenge problem is to prove, elementarily, that as n tends to infinity we have a 50% chance of winning, significantly less than the approximately 66% chance when n is 1.
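The numbers quoted for Pepys' problem can be computed exactly. The sketch below assumes the usual formulation (my reading, not spelled out above): we roll 6n fair dice and win if at least n of them show a six. The function name `pepys` is mine.

```python
from fractions import Fraction
from math import comb


def pepys(n):
    """Probability of at least n sixes when 6n fair dice are rolled."""
    p = Fraction(1, 6)
    trials = 6 * n
    # Probability of fewer than n sixes, from the binomial mass function.
    fewer = sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k) for k in range(n))
    return 1 - fewer


for n in (1, 2, 3, 10, 50):
    print(n, float(pepys(n)))
```

The printed values start near 66.5% for n = 1 and drift down toward 50%, which is the behavior the challenge problem asks you to prove.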

Thursday, October 22. The multi-dimensional Cauchy-Schwarz inequality is proved in an entirely analogous way to the one-dimensional case. The key idea is that we again get a quadratic polynomial in one unknown variable b, look at its discriminant, and then the inequality pops out. This is just one of many useful inequalities; another very powerful one is the arithmetic mean - geometric mean inequality (for more proofs, see my handout here).
- We proved the correlation coefficient was at most 1 in absolute value by applying the Cauchy-Schwarz inequality. The proof technique (for us) is more important than the result. Namely, we do not believe Integral_{-oo to oo} x² dx should be finite; it needs to be hit with the density of a random variable X. The Cauchy-Schwarz inequality takes two functions, say A and B, as input and relates the integral of AB to integrals of A² and B². What's nice about this is we can write our density f_X,Y(x,y) as (f_X,Y(x,y))^(1/2) (f_X,Y(x,y))^(1/2). We give one factor to each, and as we square we now hit our quantities of interest against a probability density, and therefore there is a chance that the integrals will be finite. Another place where this technique can be used is in proving the Cramer-Rao inequality (see here for my proof). If you are interested in statistics, you should read up on the Cramer-Rao inequality; one application is it can sometimes tell you when you've found a least variance unbiased estimator for a given population parameter. I have a paper on a situation where, unfortunately, the Cramer-Rao provides no useful information, though typically in practice it does provide some information on the system under consideration.
- The three person hat problem we discussed is one of my favorites. It has powerful connections to error correcting codes. See also the slides from M. Bernstein's talk at the SUMS conference at Brown a few years ago. I strongly urge you to read / skim her slides -- there is terrific animation and discussion of what is going on. This is one of the nicest applications of joint mass functions, marginals, and dependence that I know, and the result is quite surprising. This problem is also covered in the optional book for the course, Impossible (Chapter 6: Buckling the Odds, page 50); if you don't have the book and want to read it, let me know and you can borrow my copy. It is well worth the time to carefully study and ponder this problem. Note the expected value of each person's guessing is that they are correct 50% of the time and wrong 50% of the time, exactly as you would predict. The interesting thing is that we are able to congregate the wrong answers and spread out the right ones. What's really going on here is a nice conditional probability. There are 8 possible outcomes for the distribution of the hats: WWW, WWB, WBW, BWW, WBB, BWB, BBW, BBB. Each of these happens 1/8th of the time. Let's assume we see two hats of the same color; without loss of generality, let's say those hats are white. Is the probability of our having a white hat equal to 1/2 (as our hat color is independent) or is it 3/4, as now the only possibilities are WWW, WWB, WBW, BWW? It is important to note that until we open our eyes, we don't know that we will see two hats of the same color. This is the key observation.
- The final item for the day was a discussion of Exercise 3.3.8. The most important part of this problem was going from infinitely many possible strategies to a small, finite number (in this case, five!). For example, one strategy could be take 6 on the first toss, otherwise take a 5 or 6 on the second, otherwise take a 3, 4, 5 or 6 on the third, and so on. It was important to eliminate these possibilities. This is somewhat similar to what happens in the drowning swimmer problem (I have a Mathematica notebook on the problem here). There are three nice additional asides related to this.
  - The first is the Principle of Least Action, or variational approaches to all of modern physics.
  - The second is the natural question: Do dogs know calculus? This leads to another question: Do dogs know bifurcations? (with thanks to Tim Pennings). This is connected to the drowning swimmer problem.
  - The third is from economics and psychology: risk averseness of people and how that affects strategies. For those of you in math/econ (or interested in economics), there is A LOT that can be done with this topic. One instance is modern portfolio theory, where many of the concepts of this course are applied. See also the section on the Arrow-Pratt measure on risk aversion.
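For the three person hat problem above, here is a minimal Python sketch that enumerates all 8 hat assignments. It assumes the standard rules (the team wins if at least one person guesses and nobody guesses wrong) and the strategy "if you see two hats of the same color, guess the opposite color; otherwise pass"; the function name `guess` is just for illustration.

```python
from itertools import product

def guess(seen):
    """Guess the opposite color if both visible hats match; otherwise pass."""
    a, b = seen
    if a == b:
        return "B" if a == "W" else "W"
    return None  # pass

wins = right = wrong = 0
for hats in product("WB", repeat=3):
    # Person i sees every hat except their own.
    guesses = [guess(hats[:i] + hats[i + 1:]) for i in range(3)]
    made = [(g, h) for g, h in zip(guesses, hats) if g is not None]
    right += sum(g == h for g, h in made)
    wrong += sum(g != h for g, h in made)
    # Assumed rule: at least one guess is made and no guess is wrong.
    if made and all(g == h for g, h in made):
        wins += 1

print(f"team wins in {wins} of 8 outcomes")
print(f"individual guesses: {right} right, {wrong} wrong")
```

Each individual guess is right half the time, but the wrong guesses are congregated in WWW and BBB while the right ones are spread over the other six outcomes.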

Tuesday, October 20. Today's lecture was devoted to building some of the background theory and results we will need for later in the semester. Note much of today's class uses a key result from Calculus III, namely that integrals in the plane can be evaluated by iterated integrals. This material is standard and should be in any textbook for Calc III. Wikipedia has a good entry on iterated integrals. The main idea is that we want to convert an integral over a two-dimensional region (which we can evaluate with Riemann sums and upper and lower bounds and limits) into iterated integrals. If our function and region are nice, this can be done. See the entry on order of integration for precise statements of the theorems and conditions. Sometimes one needs to use the multidimensional change of variables formula: see the link on substitution of variables (click on the entry on Jacobians for more information about this important ingredient).
- The first item of the day was to determine the normalization constant for normal distributions. One of the simplest ways to compute the normalization constant is to square the integral and convert to polar coordinates. The main ingredients are: the area element dxdy transforms to rdrdθ, and the integrand is radial (it becomes exp(-r²/2)r).
- We next considered the Gamma function, which generalizes the standard factorial function. We gave a proof of its functional equation, Γ(s+1) = sΓ(s); this allows us to take the Gamma function (initially defined only when the real part of s is positive) and extend it to be well-defined for all s other than the non-positive integers. For more on the Gamma function and another proof of the value of Γ(1/2), see my (sadly handwritten) lecture notes. This approach uses the Beta distribution.
- One nice application of the Gamma function and normalization constants is a proof of Wallis' formula, which says π/2 = (2·2 / 1·3) (4·4 / 3·5) (6·6 / 5·7) ···. I have a proof which is mostly elementary (see my article in the American Mathematical Monthly). Not surprisingly, the proof uses one of my favorite techniques, the theory of normalization constants (caveat: it does have one advanced ingredient from measure theory, namely Lebesgue's Dominated Convergence Theorem).
- Many functions in mathematical physics initially exist only for some values of the parameters but can be continued elsewhere; my favorite is the Riemann zeta function (and the extension uses the Gamma function). What is amazing (and not initially apparent) is that the following frequently occurs. We have some function and we only care about its values at the real numbers (or maybe even just the integers); nevertheless, it is often easier to study it as a function of a complex variable (z = x + iy), as then we have all the tools and techniques of complex analysis at our disposal. A terrific example is the Prime Number Theorem (which says that, to first order, the number of primes at most x is about x/log x). This is a statement about integers, yet the `easiest' and `best' proofs all use the Riemann zeta function at complex arguments (and, as you may reasonably ask, why should we need to use complex numbers to count integers!). What follows is an aside on an aside -- this is clearly not needed for the course!
  - The complex analytic proof of the Prime Number Theorem uses several key facts. We need the functional equation of the Riemann zeta function (which follows from Poisson summation and properties of the Gamma function), the Euler product (namely that zeta(s) is a product over primes), and the important fact that the Riemann zeta function does not have a zero on the line Re(s) = 1! If this happened, then the main term of x from integrating zeta'(s)/zeta(s) * x^s/s arising from the pole of zeta(s) at s=1 would be cancelled by the contribution from this zero! Thus it is essential that there be no zero of zeta(s) on Re(s) = 1. There are many proofs of this result. My favorite proof is based on a wonderful trig identity: 3 + 4 cos(x) + cos(2x) = 2 (1 + cos(x))^2 >= 0 (many people have said that w^2 >= 0 for real w is the most important inequality in mathematics). If people are interested I'm happy to give this proof in class next week (or see Exercise 3.2.19 in our textbook; this would make a terrific aside if anyone is still looking for a problem). There is an elementary proof of the prime number theorem (ie, one without complex analysis). For those interested in history and some controversy, see this article by Goldfeld for a terrific analysis of the history of the discovery of the elementary proof of the prime number theorem and the priority dispute it created in the mathematics community. We mentioned Riemann computed zeros of zeta(s) but didn't mention his achievement; the method only came to light about 70 years later when Siegel was looking at Riemann's papers. Click here for more on the Riemann-Siegel formula for computing zeros of zeta(s). Finally, terrific advice given to all young mathematicians (and this advice applies to many fields) is to read the greats. In particular, you should read Riemann's original paper. In case your mathematical German is poor, you can click here for the English translation of Riemann's paper.
The key passage is on page 4 of the paper: One now finds indeed approximately this number of real roots within these limits, and it is very probable that all roots are real. Certainly one would wish for a stricter proof here; I have meanwhile temporarily put aside the search for this after some fleeting futile attempts, as it appears unnecessary for the next objective of my investigation.
- We then turned to determining when two random variables are independent or dependent. The key lemma is that two random variables X and Y are independent if and only if their joint density f_X,Y(x,y) factors as the product of their marginals, namely f_X(x) f_Y(y). If the density factors the proof is straightforward; if not, our book leaves it as an exercise to the reader. As this is the first serious proof class for many, I wanted to go through the argument. The proof basically follows from the definition of continuity. Let g(x,y) = f_X,Y(x,y) - f_X(x) f_Y(y). We assume g(x₀,y₀) > ε > 0. By continuity, if (x,y) is close to (x₀,y₀) then g(x,y) is close to g(x₀,y₀). Continuity says we can always find a δ such that if the distance from (x,y) to (x₀,y₀) is at most δ then |g(x,y) - g(x₀,y₀)| < ε/2. We have two natural ways to measure the distance: dist((x,y), (x₀,y₀)) = |x - x₀| + |y-y₀| or sqrt((x-x₀)² + (y-y₀)²). Regardless, we choose our δ so that every point of our small square centered at (x₀,y₀) with sides of length 2δ has |g(x,y) - g(x₀,y₀)| < ε/2, which means on this square g(x,y) > ε/2. Integrating g over this square gives at least (ε/2)(2δ)² > 0, while independence would force this integral to vanish; this completes the proof. We then did an example: f_X,Y(x,y) = (e-2)^-1 x exp(xy) for x, y in [0,1] and 0 otherwise (here e is the base of the natural logarithm). The density does not factor, and we see that the random variables are in fact dependent. It is a very good exercise to do the details of this computation.
- We then moved to a proof of the Cauchy-Schwarz inequality. There are many proofs of this important result; see here for some lecture notes I wrote years ago for another class giving another one of the standard proofs. We will discuss the applications of the Cauchy-Schwarz inequality in greater detail on Thursday.
- Finally, we discussed generalizations of the coupon or prize problem from the homework. It is not immediately clear what the right order of magnitude is as to how long you need to wait before you are essentially assured of having two of each prize (or more generally k of each prize). As a nice exercise, prove that as c tends to infinity, with probability tending to 1 you are assured of having at least two of each prize if you wait as long as 2 c H_c, where H_c = 1 + 1/2 + 1/3 + ... + 1/c is the c^th harmonic number. Can you replace the constant 2 with something smaller? (We know it must be at least 1 -- would 1 + e work for any e > 0?)
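If you want to experiment with the two-of-each prize problem before trying to prove anything, here is a small simulation sketch in Python. The choices c = 20, 500 trials and the seed are arbitrary assumptions for illustration, and a simulation of course proves nothing about the limit as c tends to infinity.

```python
import random

def wait_for_k_of_each(c, k, rng):
    """Draw prizes uniformly from c types until every type has appeared k times."""
    counts = [0] * c
    finished_types = 0
    draws = 0
    while finished_types < c:
        i = rng.randrange(c)
        counts[i] += 1
        if counts[i] == k:
            finished_types += 1
        draws += 1
    return draws

def harmonic(c):
    """The c-th harmonic number H_c = 1 + 1/2 + ... + 1/c."""
    return sum(1 / j for j in range(1, c + 1))

rng = random.Random(0)          # fixed seed so runs are repeatable
c, trials = 20, 500             # illustrative values only
threshold = 2 * c * harmonic(c)
waits = [wait_for_k_of_each(c, 2, rng) for _ in range(trials)]
fraction_done = sum(w <= threshold for w in waits) / trials
mean_wait = sum(waits) / trials

print(f"2 c H_c = {threshold:.1f}, average wait = {mean_wait:.1f}")
print(f"fraction of trials finished by 2 c H_c: {fraction_done:.3f}")
```

Replacing the 2 in `threshold` by a smaller constant and increasing c is a quick way to form a conjecture for the last question.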

Thursday, October 15. We covered an enormous amount of theory and applications today, and it's worth reflecting on the advantages and disadvantages of all we did.
- We did a few more examples of the power of binary indicator random variables and linearity. We used it to derive the formulas for the mean and variance of a binomial(n,p) random variable by writing it as a sum of independent Bernoulli(p) random variables. We can of course derive these values by differentiating identities. It is worth remarking that many of the identities in combinatorics are proved by showing that two different ways of counting the same thing are equivalent, and then if we evaluate one we get the other for free. We did another example of using binary indicator random variables and linearity of expectation in modeling how often Fermat numbers are prime. (See the additional comments from Thursday, October 8 for more on Fermat numbers.) One must be careful when using such models to predict properties of prime numbers and numbers, as these models miss arithmetic (for example, if we are too crude we'll predict there are infinitely many triples such that n, n+2 and n+4 are all prime, which is clearly absurd as at least one of these three must be divisible by 3). These models can be improved and some of the arithmetic can be incorporated -- if you want to know more, let me know.
- We proved Chebyshev's theorem, one of the gems of probability. The natural scale to measure fluctuations about the mean is the standard deviation (the square-root of the variance). Chebyshev's theorem gives us bounds on how likely it is to be more than k standard deviations from the mean. The good thing about this result is that it works for any random variable with finite mean and variance; the bad news is that because it works for all such distributions, its results are understandably much weaker than results tailored to a specific distribution (we will see later that its predictions for binomial(n,p) are orders of magnitude worse than what is true). The situation is similar in spirit to the difference between Divide and Conquer and Newton's Method for finding roots of functions; Divide and Conquer is relatively slow (taking about 10 iterations to gain another 3 decimal digits of accuracy), while Newton's Method doubles the number of decimal digits each iteration! Why is there such a pronounced difference? The reason is that Divide and Conquer only assumes continuity, while Newton's Method also requires differentiability. Thus it is not surprising that we can do better with stronger assumptions.
- We ended by discussing Monte Carlo integration; the paper introducing the Monte Carlo method has been hailed by some as one of the (if not the) most influential papers of the 20th century. We only touched the briefest part of the theory here. We showed how it can be combined with Chebyshev's inequality to give really good results on numerically evaluating integrals. Specifically, if N is large and we choose N points uniformly, we can simultaneously assert that with extremely high probability (such as at least 1 - N^{-1/2}) the error is extremely small (at most N^{-1/4}). If you want to know more, please see me -- there are a variety of applications from statistics to mathematics to economics to .... Below are links to two papers on the subject to give you a little more info:
  - Metropolis: The Beginning Of The Monte Carlo Method
  - Metropolis and Ulam: The Monte Carlo Method
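To make the Monte Carlo discussion concrete, here is a minimal Python sketch that estimates the integral of x² over [0,1] (exact value 1/3) by averaging the function at N uniformly chosen points. The integrand, N and the seed are assumptions chosen for illustration. Since the integrand takes values in [0,1], its variance is at most 1, and Chebyshev's inequality then gives error at most N^{-1/4} with probability at least 1 - N^{-1/2}.

```python
import random

def monte_carlo_integral(f, n, rng):
    """Average f at n points chosen uniformly in [0, 1]."""
    return sum(f(rng.random()) for _ in range(n)) / n

rng = random.Random(1)           # fixed seed so runs are repeatable
N = 10_000
estimate = monte_carlo_integral(lambda x: x * x, N, rng)
error = abs(estimate - 1 / 3)

# Chebyshev guarantee for an integrand with variance at most 1.
chebyshev_error = N ** -0.25     # error bound N^(-1/4)
chebyshev_prob = 1 - N ** -0.5   # holds with probability at least 1 - N^(-1/2)

print(f"estimate = {estimate:.5f}, actual error = {error:.5f}")
print(f"Chebyshev: error <= {chebyshev_error:.3f} with probability >= {chebyshev_prob:.3f}")
```

In practice the observed error is usually far smaller than the Chebyshev bound, which is the price paid for a bound that assumes so little.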

Thursday, October 8. As there are a lot of advanced, technical comments related to today's lecture, I TeXed the additional comments so that the formulas will appear nicely. Topics covered include portfolio theory, consequences of independence of random variables, Fubini's theorem, the power of linearity of expectation, the Erdos-Kac theorem (see the comments about it in the additional comments from Tuesday, October 6th's lecture) and a link to the differentiating identities handout.

Tuesday, October 6. Today we saw the power of binary indicator random variables and expected values. We use random variables and probability to model deterministic systems. The reason for this is that frequently it is very hard to compute exactly what happens, but such modeling does a very good job. For more on these methods, see the following handout by Professor Rosen of Brown University (and the references therein).
- Counting the number and distribution of distinct prime factors or prime factors as n varies is a beautiful problem. This is described in great detail in Hardy and Wright's classic `Theory of Numbers'. Many of these elementary functions are briefly described here; Mathworld has a good article on distinct prime divisors. A beautiful result is that the number of distinct prime divisors is, in some sense, normally distributed under an appropriate limit. This is the Erdos-Kac theorem (see also the Wikipedia entry). A key ingredient is that Sum_{p < x} 1/p is about log log x. While this follows from the Prime Number Theorem (which says Sum_{p < x} log p is about x) and partial summation (the discrete version of integration by parts), as discussed in class it also follows from a careful analysis of the sum and product expressions (whose equivalence is basically equivalent to the property of unique factorization or the Fundamental Theorem of Arithmetic) for the Riemann zeta function. (Note: if you want to know why it is natural to count primes with a logarithmic weight, let me know and I can give you a handout from my book.)
- It is believed that there are only finitely many Fermat primes. The Fermat numbers F_n = 2^(2^n) + 1 have many interesting properties. One is that no two Fermat numbers share a common factor, which as a nice exercise gives another proof of the infinitude of primes! Fermat primes also arise in determining which regular n-gons can be constructed with a straightedge and a compass.
- We also used indicator random variables and expectation to model probability problems, such as how many kings we expect to get in 7 cards from a well-shuffled deck. It is incredible how powerful these ideas are -- there are versions of probability theory which have expectation as the fundamental concept (there is a comment along these lines in our book).
- It is worth emphasizing that, when modeling answers with indicator random variables, we do not need the variables to be independent if we are only concerned with calculating the expected value; if we want some idea of the scale of fluctuations, then it's very different.
- We also discussed the similarities between how Taylor coefficients uniquely determine a nice function and how moments uniquely determine a nice probability distribution. It is sadly not the case that a sequence of moments uniquely determines a probability distribution; fortunately in many applications some additional conditions will hold for our function which will ensure uniqueness. For the non-uniqueness of Taylor series, the standard example to use is f(x) = exp(-1/x^2) if x is not zero and 0 otherwise. To compute the derivatives at 0 we use the definition of the derivative and L'Hopital's rule. We find all the derivatives are zero at zero; however, our function is only zero at zero. We will see analogues of this example when we study the proof of the Central Limit Theorem.
- Finally, we mentioned the importance that the integrals and sums in the moments converge absolutely; if they didn't, then our answers would depend on how we tend to infinity. For example, consider the Cauchy distribution, with density 1 / (pi(1+x^2)), and try to compute its mean. Let g be any function such that g(A) is larger than A. Assume A is large, so that for |x| >= A the integrand x / (pi(1+x^2)) is basically 1/(pi x). If we integrate from -A to g(A), the contributions from -A to A cancel by symmetry and we get essentially Integral_{x=A to g(A)} dx / (pi x) = (1/pi) log( g(A) / A). If g(A) = 2A then we would get essentially log(2) / pi, but if g(A) = A^2 then we get log(A) / pi, which tends to infinity with A, so there is no way to have some finite interpretation.
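The dependence on how we tend to infinity in the Cauchy example can be checked directly. The Python sketch below uses the exact antiderivative log(1+x²)/(2 pi) of x/(pi(1+x²)), so no numerical integration is needed; the function name `truncated_mean` is just for illustration.

```python
import math

def truncated_mean(A, B):
    """Exact integral of x / (pi (1 + x^2)) from -A to B."""
    return math.log((1 + B * B) / (1 + A * A)) / (2 * math.pi)

for A in (10.0, 1e3, 1e6):
    symmetric = truncated_mean(A, A)      # symmetric limits: always 0
    doubled = truncated_mean(A, 2 * A)    # tends to log(2)/pi
    squared = truncated_mean(A, A * A)    # grows like log(A)/pi
    print(f"A = {A:g}: {symmetric:.4f}  {doubled:.4f}  {squared:.4f}")

print(f"log(2)/pi = {math.log(2) / math.pi:.4f}")
```

Three different ways of letting the endpoints go to infinity give three different answers, which is exactly why we insist on absolute convergence.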

Thursday, October 1. Today we discussed joint distributions as well as common densities, meromorphic continuation, proof techniques, ....
- The binomial distribution is a special case of the more general multinomial distribution; many of the properties of the multinomial can be obtained by repeated applications of the binomial distribution. For example, say we have the unimaginatively named candidates A, B, C and D running for office. We may initially break them into two groups: A and not A; we then further divide not A into B and not B, then not B is divided into C and not C. The binomial coefficients are replaced with multinomial coefficients: here (n | k1, k2, ..., kj) means n! / (k1! k2! ... kj!), with each ki a non-negative integer such that k1+...+kj = n.
- One application (but by no means the most important!) of multinomials is figuring out how many different words you can make when you rearrange the letters of MISSISSIPPI. If you feel this isn't important, consider instead base pairs from biology -- this tells us how many different strands we can have!
- We proved that the multinomial probabilities do give us a density -- they are clearly non-negative, but do they sum to 1? The proof is quite nice, and it uses one of my favorite techniques, multiplying by 1, MANY times. It is important to get a sense of how these results are proved. The trick is to look for binomial or multinomial coefficients -- this is why we multiplied by (n-t)!/(n-t)!. We then had Sum_{e = 0 to n-t} (n-t choose e); we rewrote this by multiplying by 1^e 1^{n-t-e} and then recognized this as (x+y)^m where x=y=1 and m=n-t. Thus we could evaluate the e sum by using the binomial theorem, and then another application of the binomial theorem completed the job. Remember how important it was to have the sums correct -- t was independent of e and the t! could be brought out of the sum; however, h was not as h = n-t-e. There are many symbolic programs available to prove binomial identities; if you would like a copy of a Mathematica program that does this, just let me know (click here for some of the theory).
- We then discussed the geometric series formula. The standard proof is nice; however, for our course the `basketball' proof is very important, as it illustrates a key concept in probability. Specifically, if we have a memoryless game, then frequently after some number of moves it is as if the game began again. This is how we were able to quickly calculate the probability that the first shooter wins, as after both miss it is as if the game just started.
- The geometric series formula only makes sense when |r| < 1, in which case 1 + r + r^2 + ... = 1/(1-r); however, the right hand side makes sense for all r other than 1. We say the function 1/(1-r) is a (meromorphic) continuation of 1+r+r^2+.... This means that they are equal when both are defined; however, 1/(1-r) makes sense for additional values of r. Interpreting 1+2+4+8+.... as -1 or 1+2+3+4+5+... as -1/12 actually DOES make sense, and arises in modern physics and number theory (the latter is zeta(-1), where zeta(s) is the Riemann zeta function)!
- We have only discussed a few of the myriad distributions that arise in modeling the world: Bernoulli, Binomial, Poisson, Exponential, Uniform. There are many others, such as the Normal, the Cauchy, as well as one of my favorites, the Weibull. The more distributions you know, the more you can model the world. If time and interest permit, we'll talk about how I used the three parameter Weibull to model baseball games.
- We ended the day by introducing the concept of expectation or expected value of a random variable (also called the mean or the average value). This is one of the central concepts in the course, and it is amazing how many problems reduce to understanding expectations of random variables. We will see in Tuesday's class how properties of expectation aid us greatly in applications. For example, consider a Binomial(n,p) random variable X (so X is the number of heads in n tosses of a coin which is heads with probability p). The sum we MUST evaluate for the average is Sum_{k = 0 to n} k (n choose k) p^k (1-p)^{n-k}. While it should be clear that this must be just np (each coin has probability p of landing on heads, and we have n of them), this must be proved. We'll discuss two different techniques to do this on Tuesday (differentiating identities and linearity).
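Two of the counts from today are easy to sanity-check by brute force. The Python sketch below evaluates the binomial mean sum exactly with rational arithmetic and counts the rearrangements of MISSISSIPPI with a multinomial coefficient; the values n = 12 and p = 3/10 are arbitrary choices for illustration, and a check for one n and p is of course not a proof.

```python
from collections import Counter
from fractions import Fraction
from math import comb, factorial

def binomial_mean(n, p):
    """Sum_{k = 0 to n} k (n choose k) p^k (1-p)^(n-k), computed exactly."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

def arrangements(word):
    """Multinomial coefficient n! / (k1! k2! ... kj!) for the letter counts of word."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total

n, p = 12, Fraction(3, 10)
print(binomial_mean(n, p), "should equal n p =", n * p)
print("rearrangements of MISSISSIPPI:", arrangements("MISSISSIPPI"))
```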

Tuesday, September 29. We discussed the definition of cumulative distribution functions (CDFs) and the associated densities, called the probability mass function in the discrete case and the probability density function in the continuous case. We showed that if we know the CDF then we know the mass/density function, and vice-versa. The big theorem is that in the continuous case, the density function is the derivative of the CDF. Our proof used either Taylor series expansions or the Mean Value Theorem; it is possible to prove the claim with significantly less at the cost of more analysis. We see in the proof that we really want our probability density function to be either continuous, piecewise continuous or bounded. We showed how to use the Fundamental Theorem of Calculus to quickly calculate the density of Y = phi(X) given X has CDF F_X with density f_X. Namely, if phi is increasing and differentiable with inverse h (so X = h(Y)), the density is f_Y(y) = f_X(h(y)) h'(y).
- e^x e^y = e^{x+y} is one of the most beautiful and important formulas in math; it is NOT trivial to prove, and requires some real combinatorics. Again, it would be horrible notation if this were false. The purpose of the second clicker question is to illustrate the dangers of generalizing from numbers to matrices -- the lack of commutativity leads to very different behavior. For matrices, e^A e^B in general is NOT e^{A+B} unless A and B commute. We define the commutator by [A,B] = AB - BA; this measures how far A and B are from commuting (note some places write the commutator differently, so my apologies if other people don't use this notation!). The Baker-Campbell-Hausdorff formula describes what e^A e^B equals; see also the Zassenhaus formula for a nice explicit formula.
- We compared sizes of functions. We write f(x) << g(x) to mean there is a constant C such that, for all x sufficiently large, |f(x)| <= C g(x). We showed x^r << e^x for any r > 0, and log(x) << x^r as well (using the previous results with now x = e^(y/r)). To get x log(x) --> 0 as x --> 0 we wrote x as 1/n and then used the previous results. This example illustrates lazy mathematicians at our best, reducing to previous problems.
- We gave a poor mathematician's analysis of the size of n!; the best result is Stirling's formula which gives n! is about n^n e^{-n} sqrt(2 pi n) (1 + error of size 1/12n + ...). We obtained our upper and lower bounds by using the comparison method in calculus (basically the integral test); we could get a better result by using a better summation formula, say Simpson's method or Euler-Maclaurin. We will return to Simpson's method later in the course, as one proof of it involves techniques that lead to the creation of low(er) risk portfolios! Ah, so much that we can do once we learn expectation..... Of course, our analysis above is not for n! but rather log(n!) = log 1 + ... + log n; summifying a problem is a very important technique, and one of the reasons the logarithm shows up so frequently. If you are interested, let me know as this is related to research of mine on Benford's law of digit bias.
- Finally, we mentioned the QWERTY keyboard (see also this article on other common items around us and how they came to be). There are many applications to knowing letter frequencies, especially the probability that given one letter that the next letter takes on each value. These frequencies are used to break simple cryptographic cyphers that involve permuting the 26 letters. See for instance the wikipedia article on frequency analysis, as well as a downloadable program to perform the analysis.
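The poor mathematician's bounds on n! from today can be compared with Stirling's formula numerically. The Python sketch below assumes the integral-comparison bounds n log n - n + 1 <= log(n!) <= (n+1) log(n+1) - n (one way of carrying out the comparison; the bounds from class may have been written slightly differently) and checks how close n! / (n^n e^{-n} sqrt(2 pi n)) is to 1 + 1/(12n).

```python
import math

def log_factorial(n):
    """log(n!) = log 1 + log 2 + ... + log n."""
    return sum(math.log(k) for k in range(1, n + 1))

def integral_bounds(n):
    """Lower and upper bounds on log(n!) from comparing the sum to integrals of log t."""
    lower = n * math.log(n) - n + 1
    upper = (n + 1) * math.log(n + 1) - n
    return lower, upper

def stirling_ratio(n):
    """n! divided by the main term n^n e^(-n) sqrt(2 pi n) of Stirling's formula."""
    return math.factorial(n) / (n**n * math.exp(-n) * math.sqrt(2 * math.pi * n))

for n in (10, 50, 100):
    lower, upper = integral_bounds(n)
    print(f"n = {n}: {lower:.3f} <= {log_factorial(n):.3f} <= {upper:.3f}, "
          f"ratio = {stirling_ratio(n):.6f}, 1 + 1/(12n) = {1 + 1 / (12 * n):.6f}")
```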

Thursday, September 24. Today we finished the definitions from Chapter 2, in particular random vectors, joint distributions and marginal densities. Later in the semester we will spend a lot of time looking at the joint distribution of random variables. The key result is that, if the random variables are independent, the joint density is the product of the individual densities; obviously this is not necessarily the case if the random variables are not independent (we will of course define what it means for random variables to be independent).
- We saw how difficult it can be to code, let alone efficiently code, a problem which is simply stated. There is a lot of trouble if the number of variables is also varying -- it is easy to work with this theoretically, but harder to implement. (If you've taken linguistics classes, this might be similar in spirit to quantifiers on quantifiers). I'll write up code that handles this case with lots of comments -- this won't be the only way to attack the problem, but it will be one. The important fact to note is that we are often only able to observe small values, and thus there is a danger that we may extrapolate incorrectly. Click here for the Mathematica code.
- We will go over distribution functions and finding the distribution function and density of random variables that are functions of other random variables in greater detail on Tuesday. The idea is that if G is a nice function and we know the (cumulative) distribution function of X, then we should know the (cumulative) distribution function of Y = G(X); similarly, if we know the probability density of X then we should know the probability density of Y = G(X). We will do all this again slowly for our exponential example and in general. The key input in the analysis is the Fundamental Theorem of Calculus; for us, the version we need is: Let F(x) = Int_{t = -oo to x} f(t) dt; then F'(x) = f(x). While we have talked about how the anti-derivative is not unique, there is a `natural' choice of a continuous density f.
- The card `trick' we did today is explained in great detail in the optional book for the course, Impossible. If you don't have that book but want to see the details, let me know and I'll provide it. There is a lot of good math in this problem, plus of course it's a fun trick! For more on the Amazing James Randi, click here.
- We also discussed Buffon's needle. We'll analyze this problem in greater detail later; if one wants to see a truly elegant proof, let me know and I'll provide a copy of the proof from THE Book (if you haven't heard of THE Book, click this link!). We didn't solve it today, but instead used it as a way to discuss joint random variables. Our partial solution is a nice application of dimensional analysis, which allows us to see how the solution must depend on the parameters without actually solving it! This is a hard but worthwhile skill to cultivate.
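As a preview of passing from the distribution function of X to that of Y = G(X), here is a small Python sketch. It assumes X is exponential with parameter 1, so F_X(x) = 1 - e^{-x}, and takes G(x) = x² (a choice made only for illustration, not necessarily the example from class). Then F_Y(y) = F_X(sqrt(y)) for y > 0, and the Fundamental Theorem of Calculus with the chain rule gives f_Y(y) = e^{-sqrt(y)} / (2 sqrt(y)); the code compares this formula with a numerical derivative of F_Y.

```python
import math

def cdf_y(y):
    """F_Y(y) = P(X^2 <= y) = P(X <= sqrt(y)) = 1 - exp(-sqrt(y)) for y > 0."""
    return 1 - math.exp(-math.sqrt(y)) if y > 0 else 0.0

def density_y(y):
    """Claimed density f_Y(y) = exp(-sqrt(y)) / (2 sqrt(y)) for y > 0."""
    return math.exp(-math.sqrt(y)) / (2 * math.sqrt(y))

def numerical_derivative(F, y, h=1e-6):
    """Central difference approximation to F'(y)."""
    return (F(y + h) - F(y - h)) / (2 * h)

for y in (0.5, 1.0, 4.0):
    print(f"y = {y}: F_Y'(y) ~ {numerical_derivative(cdf_y, y):.6f}, "
          f"formula gives {density_y(y):.6f}")
```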

Tuesday, September 22. In today's lecture we continued learning the language (random variables, continuous and discrete, probability mass functions and densities). The key fact is that random variables must be real valued. This is so that we can add them or take averages et cetera. Thus we never have X_i(omega) be

binary indicator variable

For our probability spaces (Ω, F, P), we typically take the σ-field F to be 2^Ω if Ω is either finite or countable; recall that 2^Ω means the set of all subsets of Ω. This is not the only σ-field we may look at, but it is the most useful for these problems. For example the following is always a σ-field: {Ø, Ω}. Another possibility is to take, for any set A, F to be {Ø, A, A^c, Ω}. The point is we want our σ-field to be as large as possible (i.e., we want to define the probability of as many subsets of Ω as we can). If Ω is uncountable, say [0,1] or the real line (-∞, ∞), we take the σ-field to be what is generated by open intervals (a,b). In other words, we start with all open intervals and see what sets we can form by going through the definitions of a σ-field. For example, closure under countable intersections means [a, b] is in the σ-field because it equals the intersection over n of the open intervals (a - 1/n, b + 1/n). Click here to get a sense of what kind of sets we can form by these processes. For our purposes, we will only be assigning probabilities to finite sets, countable sets, or intervals, squares and similar figures; however, it is good to be aware of the advanced analysis.
The cumulative distribution function is one of the key tools of the subject, and gives a sense of why continuous random variables are easier to analyze than discrete; namely, for continuous we have the Fundamental Theorem of Calculus at our disposal to pass from a cumulative distribution function to a density; we do not have differentiation available in the discrete case. Note that a cumulative distribution function does not determine a unique density; however, it almost does so, as any two densities must integrate to the same value on any interval. (The technical jargon is to say that the density is determined up to a function which is zero almost everywhere.) If there is interest, let me know and I'll talk a bit about the basics of measure theory (and show that almost no numbers are rational in the sense of measure).
Gambler's ruin: We solved the problem using difference equations. If there is a repeated root, however, our method breaks down and we need to be divinely inspired again. You are not responsible for knowing how to solve these problems, but if you are interested here are some facts. For the general relation, say a_{n+1} = 3 a_n + 10 a_{n-1}, we guess a_n = r^n. We find that this is a solution if r^{n+1} - 3 r^n - 10 r^{n-1} = 0 or r^2 - 3r - 10 = 0, which holds if (r-5)(r+2) = 0, ie, r = 5 or -2. Simple algebra shows that c_1 r_1^n + c_2 r_2^n is a solution for any c_1, c_2. If we specify two boundary conditions, that determines the c_i's, and we're done. If the two roots happen to be equal, we need to be a bit more clever (or divinely inspired); see the final page of my handout from Math 209. I prefer the solution we discussed in class, using symmetry to solve it when we start at $k (0 < k < N) with N = 2^m for some integer m. As a good challenge problem, see if you can come up with an elementary proof when N is not a power of 2. I can do this for some (but as of right now not all) N. See here for an elementary proof of the prime number theorem.
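Here is a short Python sketch of the recipe for a_{n+1} = 3 a_n + 10 a_{n-1}: solve for c_1 and c_2 from two initial values, then compare c_1 5^n + c_2 (-2)^n with the values produced by the recurrence itself. The initial values a_0 = 2, a_1 = 3 are an arbitrary choice for illustration.

```python
from fractions import Fraction

def closed_form_coefficients(a0, a1, r1=5, r2=-2):
    """Solve c1 + c2 = a0 and c1 r1 + c2 r2 = a1 for distinct roots r1, r2."""
    c1 = Fraction(a1 - r2 * a0, r1 - r2)
    c2 = a0 - c1
    return c1, c2

def by_recurrence(a0, a1, n):
    """Compute a_n directly from a_{n+1} = 3 a_n + 10 a_{n-1}."""
    if n == 0:
        return a0
    prev, cur = a0, a1
    for _ in range(n - 1):
        prev, cur = cur, 3 * cur + 10 * prev
    return cur

a0, a1 = 2, 3                      # illustrative initial conditions
c1, c2 = closed_form_coefficients(a0, a1)
for n in range(8):
    closed = c1 * 5**n + c2 * (-2)**n
    print(n, by_recurrence(a0, a1, n), closed)
```

Note the formula for c_1 divides by r_1 - r_2, which is one way to see why the repeated-root case needs a different idea.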
Finally, we mentioned the Riemann zeta function briefly: ζ(s) = sum_{n = 1 to ∞} 1/n^s = prod_{p prime} (1 - 1/p^s)^{-1}. This is intimately tied to the distribution of the primes (which isn't surprising as it relates something we want to know about (the primes) to something very well understood (the integers)). Key in the analysis is the distribution of zeros of ζ(s); the famous Riemann Hypothesis (about to turn 150; there will be festivities on campus, and it is one of the most casual asides you'll ever see!) asserts all the non-trivial zeros have real part 1/2. The Riemann zeta function arose earlier: the probability that a generically chosen integer is square-free is 1/ζ(2) = 6/π². (See also the wikipedia entry and the references at the end for a proof of the value of this sum / product.) This is the answer to our problem as we may interpret it as the probability that our number isn't divisible by 4, by 9, by 25.... The formula I mentioned is the Riemann-Siegel formula.
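The value 6/π² for the proportion of square-free numbers is easy to test empirically. The Python sketch below sieves out the multiples of d² for each d >= 2 and compares the proportion of survivors up to a cutoff with 6/π²; the cutoff 100,000 is an arbitrary choice.

```python
import math

def squarefree_count(limit):
    """Count the integers in [1, limit] not divisible by any square d^2 with d >= 2."""
    is_squarefree = [True] * (limit + 1)
    d = 2
    while d * d <= limit:
        for multiple in range(d * d, limit + 1, d * d):
            is_squarefree[multiple] = False
        d += 1
    return sum(is_squarefree[1:])

limit = 100_000
density = squarefree_count(limit) / limit
print(f"proportion square-free up to {limit}: {density:.5f}")
print(f"6 / pi^2 = {6 / math.pi**2:.5f}")
```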

Thursday, September 17.
- We started with computing the probability of being dealt a poker hand with at least two aces. The danger in problems like this is double counting. Note that ncr[4, 2] ncr[50, 3] / ncr[52, 5] is very close to the correct answer of (ncr[4, 2] ncr[48, 3] + ncr[4, 3] ncr[48, 2] + ncr[4, 4] ncr[48, 1]) / ncr[52, 5] (.0452 vs .0417); here ncr[x,y] is x! / (y! (x-y)!), i.e., x choose y. The double counting is a lower order term, but it is enough to lead to a noticeable difference. It's natural to think the number of such hands is ncr[4,2] ncr[50,3], as this means choose two of the four aces, and then choose any three of the remaining 50 cards. The problem is that if we choose an ace among the last three cards, that hand is counted more than once. Thus the correct answer should be, and is, slightly lower. In each term of the correct answer the tops sum to 52 and the bottoms sum to 5; this is a good, quick rule to help make sure you are looking at the problem the right way.
- The study of independence is one of the central themes in probability. While many real world or mathematical processes are not independent, frequently one can build a good model by assuming independence. Later in the semester we'll see how we can use this to model iterates of the 3x+1 map or to predict the answers to many problems in number theory (such as the number of distinct prime factors certain special numbers have). Other examples include the probability a number is square-free. For independence it is essential that all combinations be independent; as we saw in class, pairwise independence does not imply independence. We did a very good job as a class in terms of choosing numbers randomly from 1 to 9; the second part, where we were shooting for half the class average, is different. This belongs to social science and game theory. I would say the random variables are still independent; however, your answer is governed by a different rule depending on who is in the class.
- The answer to Nick's question, as correctly pointed out by a classmate, is that the definition of independence states that events {A_i}_{i in I} are independent if Prob( intersection_{j in J} A_j) = prod_{j in J} Prob(A_j) for any J a subset of I. For example, if I = {1,2,3} then J could be {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, or {1,2,3}. We can rephrase the question to: assume we have events such that Prob(A intersect B intersect C) = Prob(A) Prob(B) Prob(C), and all these events have positive probability. Must A and B be independent?
- We calculated the solution to the roulette problem by using difference equations. The largest root of the characteristic polynomial for 5 consecutive blacks (with red and black equally likely) is about .982974. For more on solving difference equations, see pages 2 and 16 of my lecture notes from Math 209 (Differential Equations), as well as the Wikipedia entry. While solving a problem such as this is hard in general (we have to compute the roots of the characteristic polynomial), it is possible to get some sense of the properties of the solution. The trick we discussed of marching down in blocks is similar to the Murphy's law problem in the homework.
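The double counting in the two-aces computation from the first item of this day's notes can be checked directly. The sketch below uses Python's math.comb in place of ncr; the variable names are chosen for illustration only.

```python
from math import comb

total = comb(52, 5)

# Tempting but wrong: pick two aces, then any three of the other 50 cards.
# Hands with three or four aces get counted more than once.
naive = comb(4, 2) * comb(50, 3)

# Correct: split by the exact number of aces in the hand (2, 3, or 4).
correct = sum(comb(4, k) * comb(48, 5 - k) for k in range(2, 5))

# Cross-check by complement: all hands minus those with zero or one ace.
by_complement = total - comb(48, 5) - comb(4, 1) * comb(48, 4)

print(round(naive / total, 4), round(correct / total, 4))
```

The complement count is a second, independent way to reach the same number, which is a useful habit whenever double counting is a risk.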

Tuesday, September 15. We started by reviewing some of the definitions (σ-field (many books use the word algebra instead of field), probability measure and probability space). The point is that not every subset is an admissible event (in other words, not all subsets are assigned a probability). For the most part this is no problem, as points, intervals, squares et cetera provide a rich theory. The general case requires advanced analysis, in particular measure theory / Lebesgue integration. These technicalities are important in avoiding the Banach-Tarski paradox, which is due to the Axiom of Choice (which allows us to construct non-measurable sets); it is for this reason that I only believe in the Countable Axiom of Choice. For the specific points of today's class, here are some additional comments / readings.
- Limit exchange: one of the hardest parts of mathematics is justifying interchanging two operations; today we looked at when the probability of a limit is the limit of the probabilities. To give some sense that we must sometimes be careful, we considered non-negative functions f_n(x) converging to zero pointwise but always integrating to 1 (let f_n(x) be the triangle function from 1/n to 3/n, taking on the value n at 2/n). It is not always permissible to interchange a limit and an integral (see the Dominated Convergence Theorem or the Monotone Convergence Theorem from analysis for some situations where this may be done); similarly it is not always possible to interchange orders of integration (see Fubini's Theorem for when this may be done), and we can only sometimes interchange a derivative and a multidimensional integral (see here for some conditions on when we may). The main take-away is that we must be careful interchanging probabilities and limits, but this shouldn't be surprising. For example, we do not expect to be able to interchange most operations: sqrt(a+b) in general is not sqrt(a) + sqrt(b).
- We talked a bit about what it means to choose an element uniformly at random on a circular or square dart board. We cannot deal with uncountable unions (see the wikipedia entries on countable and uncountable sets). If you want to learn even more about countable and uncountable, see Chapter 5 of my book (An Invitation to Modern Number Theory). For the purposes of our class, we really only need to worry about finite and countable. We have good intuition on what a finite set is; the quick definition of countable is that it can be placed in a one-to-one correspondence with the positive integers. In other words, we have a first element, a second element, and so on. It turns out that almost every real number is irrational; further, almost no numbers are algebraic (that is, roots of a nonzero polynomial with integer coefficients). The standard proof that the real numbers are uncountable is Cantor's diagonalization argument (this and many other items are included in Chapter 5 of my book).
- We discussed the inclusion / exclusion principle, one of my favorite methods in general and especially important in probability as it is very easy to accidentally double count events. We used this to show that the probability a number is square-free converges to 6/π²; more generally, the probability that it is k-power free for k at least 2 is 1/zeta(k), where zeta(s) = Sum_{n = 1 to oo} 1 / n^s = Product_{p prime} (1 - 1/p^s)^{-1} (if Re(s) > 1) is the Riemann zeta function. If you complete the inclusion-exclusion calculation we did, you find that it can be written as the product above (with s=2 and the product truncated); talk to me if you want more details. Sadly these arguments cannot be used to prove results about how many primes there are (it comes down to dealing with the error terms in dropping the floor function, though this has not stopped lots of amateurs from using this to `prove' some of the big open problems in number theory). One of the more interesting uses of this principle is in Brun's sieve, where he uses inclusion-exclusion to show that there cannot be too many twin primes. Perhaps the strangest application of this is that this is how the famous Pentium Bug was discovered! The homework problem asks you to find the probability that when we reorder n people, at least one is correct. The textbook also handles the more general case, namely when we reorder and have at least r correct.
- We also talked about conditional probability and the surprising problem about how likely it is for you to have a rare disease if you test positive. If you have taken statistics before, this is similar to Type I and Type II errors. What you are most concerned about (false positives or false negatives) determines which error rate you want to improve.
- The last part of the class dealt with combinatorics. Our solution to the cookie problem is quite elegant, and in some respects reminiscent of geometry class (remember all those proofs where the teacher cleverly adds auxiliary lines; the difference here is we just add more cookies). While it is possible to solve many combinatorial problems by brute force in principle, in practice this is not a good way to go -- it is time consuming, and quite likely that one makes a mistake. Typically one finds a way to interpret a given quantity two ways; we can compute one of them and thus we obtain a formula for the other. For example, we showed the number of ways of dividing C cookies among P people is (C + P - 1 choose P-1); here all the identical cookies are divided. What if we don't assume all the cookies are divided -- what is the answer now? It is just Sum_{c = 0 to C} (c + P - 1 choose P - 1); this is because we are just going through all the cases (we give out no cookies, 1 cookie, ...). What does this sum equal? Imagine now we have another person, say the Cookie Monster (this is one of Cameron's favorite clips), who gets all the remaining cookies. Then dividing at most C cookies among P people is the same as dividing exactly C cookies among P+1 people, and hence our sum equals (C + P+1 - 1 choose P+1 - 1).
- Finally, we ended with the lottery problem. If we cannot use any of the 50 numbers more than once, there are (50 choose 6) = 15,890,700 ways. What if we can use the same number multiple times -- how many combinations are there now? Writing the answer cleanly would give it away, so I'll just say that if we have to choose 6 numbers from {1,...,50} and we can use each number up to 6 times, and if order doesn't matter, then the number of combinations is 28,989,675, which is less than a factor of two more! For comparison, note that (300 choose 6) is the significantly larger 962,822,846,700, which is over 60,000 times larger than (50 choose 6)! If you want to see the solution, let me know.
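The cookie identities above can be verified by brute force for small cases. This is a minimal sketch; the values C = 6 and P = 3 and the function name are arbitrary choices for illustration, and the enumeration is only practical for small inputs.

```python
from itertools import product
from math import comb

def brute_force(cookies, people, exact=True):
    """Count ways to hand out identical cookies by listing every allocation.

    With exact=True all cookies must be given out; otherwise any number
    from 0 up to `cookies` may be given out.
    """
    count = 0
    for alloc in product(range(cookies + 1), repeat=people):
        given = sum(alloc)
        if given == cookies or (not exact and given < cookies):
            count += 1
    return count

C, P = 6, 3  # small illustrative values
exactly_c = brute_force(C, P)
at_most_c = brute_force(C, P, exact=False)
case_sum = sum(comb(c + P - 1, P - 1) for c in range(C + 1))

print(exactly_c, comb(C + P - 1, P - 1))        # all cookies divided
print(at_most_c, case_sum, comb(C + P, P))      # at most C cookies divided
```

Here comb(C + P, P) is the same quantity as (C + P+1 - 1 choose P+1 - 1), the count with the extra person who takes the leftover cookies.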
Thursday, September 10. First off, click here for additional comments about the objectives for the course, including some entertaining and educational videos about the times we live in and the importance of asking the right questions. We mentioned just some of the many places where probability is applicable.
- Click here if you want to know more about the log5 method, namely which of the (p +/- pq) / (p + q +/- 2pq) models the probability that team A beats team B. The `derivation' is a nice exercise in elementary probability theory, if you buy the modeling assumptions. As you'll see throughout the course and beyond, one of the most difficult issues in the real world is deciding what are the important and irrelevant factors.
- A terrific example of this is the Clinton - Obama tie in Syracuse; click here for the article from the `By the numbers' guy from the Wall Street Journal (a great site to read). There are several ways to do the calculation. The way that was reported in the press assumed that the statewide percentages (57% Clinton, 40% Obama) should also successfully model the distribution in Syracuse. How significantly different are these from 50-50? Note that 50-50 led to a probability of 1/137 while the other is one in a million. The `By the numbers' article also mentions another way to try and solve this: Say there are 12002 voters, then there are 12003 possibilities, with each candidate ranging from 0 to 12002 votes, and thus the probability of a tie is 1/12003. The flaw in this argument is that not all outcomes are equally likely. For example, if we roll a pair of fair dice there is only one way to roll a 2, but six ways to roll a 7. With 2n voters, the number of ways for exactly n to choose Clinton is (2n choose n), while the number of ways for all of them to choose Clinton is (2n choose 2n) = 1.
- The 3x+1 problem is one of my favorites in mathematics (Jeff Lagarias has excellent annotated bibliographies: see here and here). It has a lot of the features you'd like a problem to have: you can state it easily, a high school or junior high school student can understand it, yet to make progress requires real mathematical sophistication and machinery. If anyone is interested in research on the 3x+1 problem, I have a work in progress that is very accessible and should be doable with what you know.
- Benford's law of digit bias is one of my favorite research topics (if anyone is interested, I might also have accessible projects here). If time and interest permit, I'll show you how you can prove this digit bias in a variety of interesting systems. I was interviewed by the Wall Street Journal about applying Benford's law to detect fraud in the Iranian elections (click here for articles on the Iranian elections).
- If you want to see details about the paper for the movie industry, click here, while for my sabermetrics paper (which we may discuss in the class), click here.
- We discussed the Birthday Problem (Wikipedia gives the Taylor expansion argument from taking logarithms) and its generalization to Pluto. This is but one of many possible generalizations. What if we ask for how many people we need to have at least a 50% chance that at least three will share a birthday? Or that there will be at least two pairs of people sharing birthdays? Questions like these are great extra credit / challenge problems: if you're interested, just let me know.
- The double-plus-one strategy is but one of many overlaps between probability and gambling. Other famous recent examples include card counting in blackjack. There are many references; see Thorp's original article as well as his book. Another fun read is Bringing Down the House.
- Combinatorics: we discussed (n choose r). Most of the combinatorics we'll do involves this and n!. One nice application from today is proving the Binomial Theorem (I must admit to remembering its mention in a Holmes story).
- Again, click here for additional comments about the objectives for the course, including some entertaining and educational videos about the times we live in and the importance of asking the right questions.
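The 1/137 figure quoted above for the Syracuse tie under a 50-50 split can be reproduced from (2n choose n). The sketch below assumes 12002 voters, each independently choosing either candidate with probability 1/2; the function name is hypothetical, and the Stirling estimate 1/sqrt(πn) is a standard approximation added here for comparison, not something taken from the article.

```python
from math import comb, pi, sqrt

def tie_probability(voters):
    """Probability of an exact tie when each voter independently picks
    one of two candidates with probability 1/2."""
    if voters % 2 != 0:
        raise ValueError("a tie needs an even number of voters")
    half = voters // 2
    # (2n choose n) favorable outcomes out of 2^(2n) equally likely ones.
    return comb(voters, half) / 2**voters

p_tie = tie_probability(12002)
print(1 / p_tie)              # roughly 137
print(sqrt(pi * 12002 / 2))   # Stirling estimate sqrt(pi * n) with n = 6001
```

This only covers the 50-50 model; the uneven statewide split is a different calculation and is not attempted here.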