General takeaways (all classes)
MATH 341: Additional comments related to material from the class. If anyone wants to convert this to a blog, let me know. These additional remarks are for your enjoyment, and will not be on homeworks or exams. These are just meant to suggest additional topics worth considering, and I am happy to discuss any of these further.
Here are the slides from today's talk: Theory and applications of Benford's law to fraud detection, or: Why the IRS should care about number theory! (video of a version of the talk that I gave at Brown is available here)
Here is a Mathematica program for sums of standardized Poisson random variables. The Manipulate feature is very nice, and allows you to see how the answers depend on the parameters.
We proved the CLT in the special case of sums of independent Poisson random variables (click here for a handout with the details of this calculation, or see our textbook). The proof used many of the ingredients of a typical analysis proof: we Taylor expand, use known expansions of common functions, and argue that the higher order terms do not matter in the limit relative to the main term (though they crucially affect the rate of convergence). We also got to take the logarithm of a product.
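If you want to play with this numerically (a rough stand-in for the Mathematica notebook above), here is a short Python sketch; the sample size, seed, and choice of \(n\) are mine, not anything from class.

```python
# Compare the distribution of a standardized Poisson(n) random variable
# (equivalently, a standardized sum of n independent Poisson(1)'s)
# with the standard normal, as a numerical illustration of the CLT.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 50                                       # number of Poisson(1) summands
sums = rng.poisson(lam=1.0, size=(100_000, n)).sum(axis=1)
standardized = (sums - n) / np.sqrt(n)       # Poisson(n) has mean n, variance n

# Compare a few empirical tail probabilities with the normal prediction.
for t in (0.5, 1.0, 2.0):
    emp = np.mean(standardized > t)
    print(f"P(Z > {t}):  empirical {emp:.4f}   normal {1 - norm.cdf(t):.4f}")
```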
Here is a nice video on the Fibonacci numbers in nature: http://www.youtube.com/watch?v=J7VOA8NxhWY
There are many ways to prove Binet's formula, the explicit, closed form expression for the n-th Fibonacci number. One is divine inspiration; another is generating functions and partial fractions. Generating functions occur in a variety of problems; there are many applications near and dear to me in number theory (such as attacking the Goldbach or Twin Prime Problem via the Circle Method). The great utility of Binet's formula is that we can jump to any Fibonacci number without having to compute all the intermediate ones. Even though it might be hard to work with such large numbers, we can jump to the trillionth (and if we take logarithms then we can specify it quite well).
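Here is a minimal Python sketch checking Binet's formula against the recursion, and using logarithms to describe a gigantic Fibonacci number without ever computing it; the indexing convention \(F_0 = 0, F_1 = 1\) and the parameters are my choices.

```python
# Binet's formula F_n = (phi^n - psi^n)/sqrt(5) versus the recursion,
# plus "jumping" to a huge index via logarithms.
import math

phi = (1 + math.sqrt(5)) / 2
psi = (1 - math.sqrt(5)) / 2

def fib_binet(n):
    return round((phi**n - psi**n) / math.sqrt(5))

def fib_recursive(n):
    a, b = 0, 1                 # F_0 = 0, F_1 = 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_binet(n) == fib_recursive(n) for n in range(40))

# Floating point can't hold F_n itself for n = 10^12, but since psi^n -> 0,
# log10(F_n) is essentially n*log10(phi) - log10(sqrt(5)), which tells us its size.
n = 10**12
log10_Fn = n * math.log10(phi) - math.log10(math.sqrt(5))
print(f"F_n for n = 10^12 has roughly {int(log10_Fn) + 1} digits")
```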
We will do a lot more with generating functions. It's amazing how well they allow us to pass from local information (the \(a_n\)'s) to global information (the \(G_a\)'s) and then back to local information (the \(a_n\)'s)! The trick, of course, is to be able to work with \(G_a\) and extract information about the \(a_n\)'s. Fortunately, there are lots of techniques for this. In fact, we can see why this is so useful. When we create a function from our sequence, all of a sudden the power and methods of calculus and real analysis are available. This is similar to the gain in extrapolating the factorial function to the Gamma function. Later we'll see the benefit of going one step further, into the complex plane!
Today we saw more properties of generating functions. The miracle continues -- they provide a powerful way to handle the algebra. For example, we could prove the sum of two independent Poisson random variables is Poisson by looking at the generating function and using our uniqueness result; we sadly don't have something similar in the continuous case (complex analysis is needed). We saw how to get a closed form expression for the Fibonacci numbers, and next class will do \(\sum_{m=0}^n \left({n \atop m}\right)^2 = \left({2n \atop n}\right)\). We compared probability generating functions and moment generating functions, and talked about where the algebra is easier.
The main item to discuss is that if \(X\) is a random variable taking on non-negative integer values then if we let \(a_n = {\rm Prob}(X = n)\) we can interpret \(G_a(s) = E[s^X]\). This is a great definition, and allowed us to easily reprove many of our results.
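As a quick illustration of why this is such a convenient way to package things, here is the (standard) Poisson computation written in the notation above:
\[
G_a(s) \;=\; E[s^X] \;=\; \sum_{n=0}^\infty s^n \frac{\lambda^n e^{-\lambda}}{n!} \;=\; e^{-\lambda} \sum_{n=0}^\infty \frac{(\lambda s)^n}{n!} \;=\; e^{\lambda(s-1)};
\]
if \(X_1\) and \(X_2\) are independent Poisson random variables with parameters \(\lambda_1\) and \(\lambda_2\), then \(E[s^{X_1+X_2}] = E[s^{X_1}]E[s^{X_2}] = e^{(\lambda_1+\lambda_2)(s-1)}\), which is the generating function of a Poisson with parameter \(\lambda_1+\lambda_2\), and uniqueness finishes the proof.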
The idea of noticing that a given expression can be rewritten in an equivalent way for some values of the parameter, while the new expression makes sense for other values as well, is related to the important concept of analytic or meromorphic continuation, one of the big results / techniques in complex analysis. The geometric series formula only makes sense when |r| < 1, in which case 1 + r + r^2 + ... = 1/(1-r); however, the right hand side makes sense for all r other than 1. We say the function 1/(1-r) is a (meromorphic) continuation of 1+r+r^2+...; this means that they are equal when both are defined, but 1/(1-r) makes sense for additional values of r. Interpreting 1+2+4+8+... as -1, or 1+2+3+4+5+... as -1/12, actually DOES make sense, and arises in modern physics and number theory (the latter is zeta(-1), where zeta(s) is the Riemann zeta function)!
For analytic continuation we need some ingredient to let us get another expression. It's thus worth asking what the source of the analytic continuation is. For the geometric series, it's the geometric series formula. For the Gamma function, it's integration by parts; this led us to the formula Gamma(s+1) = s Gamma(s). For the Riemann zeta function, it's the Poisson summation formula, which relates sums of a nice function at integer arguments to sums of its Fourier transform at integer arguments. There are many proofs of this result. In my book on number theory, I prove it by considering the periodic function \(F(x) = \sum_{n = -\infty}^\infty f(x+n)\). This function is clearly periodic with period 1 (if f decays nicely). Assuming f, f' and f'' have reasonable decay, the result now follows from facts about pointwise convergence of Fourier series.
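For reference, here is the statement with one common normalization of the Fourier transform, \(\hat{f}(\xi) = \int_{-\infty}^\infty f(x) e^{-2\pi i x \xi}\, dx\) (other normalizations move factors of \(2\pi\) around):
\[
\sum_{n=-\infty}^\infty f(n) \;=\; \sum_{m=-\infty}^\infty \hat{f}(m).
\]
The idea of the proof sketched above: the \(m\)-th Fourier coefficient of \(F(x) = \sum_n f(x+n)\) is \(\int_0^1 F(x) e^{-2\pi i m x}\,dx = \hat{f}(m)\), and evaluating the Fourier series of \(F\) at \(x = 0\) gives the displayed formula.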
Briefly, the reason generating functions are so useful is that they build up a nice function from data we can control, and we can extract the information we need without too much trouble. There are lots of different formulations, but the most important is that they are well-behaved with respect to convolution (the generating function of a convolution is the product of the generating functions).
The Weak Law of Large Numbers is a nice application of Chebyshev's inequality. It says the sample mean converges to the random variable's mean. More explicitly, the probability of the sample mean being at least a fixed amount epsilon from the mean tends to zero at a rate \(\sigma^2 / (\epsilon^2 n)\). This gives us some freedom to let \(\epsilon\) depend on \(n\); if \(\epsilon = n^{-1/4}\) then we get a very tight interval for the sample mean with probability \(1\) minus something of the order \(1/\sqrt{n}\). We had a nice instance of multiplying by \(1\) here; instead of looking at the event as \(|Y - \mu| \ge \epsilon\) we write it as \(|Y - \mu| \ge (\epsilon/\sigma_Y) \cdot \sigma_Y\).
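Here is a quick numerical illustration of how conservative Chebyshev's bound in the Weak Law is; the exponential distribution, sample sizes, seed, and \(\epsilon\) below are just my choices for the experiment.

```python
# Weak Law via Chebyshev: compare the actual probability that the sample
# mean is at least epsilon from mu with the bound sigma^2/(epsilon^2 n).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 1.0            # exponential(1): mean 1, variance 1
epsilon = 0.1

for n in (10, 100, 1000):
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    actual = np.mean(np.abs(means - mu) >= epsilon)
    bound = sigma2 / (epsilon**2 * n)
    print(f"n={n:5d}:  actual {actual:.4f}   Chebyshev bound {min(bound, 1):.4f}")
```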
A big part of the Weak Law is the sense in which we have convergence. There are four main types: almost sure convergence, convergence in probability (which is what the Weak Law gives), convergence in \(r\)-th mean (for example, mean square), and convergence in distribution.
It's often worth reading about the famous mathematicians and their theorems: Markov, Chebyshev, the Weak Law.
Poisson Random Variables
The Poisson random variable often models the number of events in a window of time. Also, frequently normalized spacings between events converge to Poissonian (a great example is to look at the primes). Another is the spacings between the ordered fractional parts of \(n^k \alpha\) (click here for more).
General advice: to differentiate an identity, you need an identity. It seems silly to state, but it's essential. Often the hardest part of these problems is figuring out how to do the algebra in a clean way. For us, we saw that frequently we want to move the normalization constant over to the other side; it allows us to avoid a product or quotient rule. We also saw it's sometimes easier to compute \(E[X(X-1)]\) than \(E[X^2]\), and then do algebra. It all comes down to whether or not it's easier to apply \(d/dx\) or \(x\, d/dx\). For the Poisson distribution, it helps to move the exponential to the other side and write \(e^\lambda = \sum_{n=0}^\infty \lambda^n/n!\).
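To make the advice concrete for the Poisson: differentiating the identity \(e^\lambda = \sum_{n=0}^\infty \lambda^n/n!\) twice with respect to \(\lambda\) and multiplying back by \(\lambda^2 e^{-\lambda}\) gives the second factorial moment,
\[
E[X(X-1)] \;=\; \sum_{n=0}^\infty n(n-1)\,\frac{\lambda^n e^{-\lambda}}{n!} \;=\; \lambda^2 e^{-\lambda} \sum_{n=2}^\infty \frac{\lambda^{n-2}}{(n-2)!} \;=\; \lambda^2,
\]
so \({\rm Var}(X) = E[X(X-1)] + E[X] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda\).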
We proved the sum of independent Poisson random variables is a Poisson random variable, and the parameter is the only thing it can be (as expectation is linear, it must be the sum of the parameters). We were able to see this by using a convolution to get the probability, and then doing some algebraic gymnastics to see that our expression was the probability of a Poisson random variable. If you have an idea of what the answer is, that can often be helpful and suggest a method of proof. We'll see proofs of this result again when we reach generating functions; in addition to our textbook you can also find this result online; see for example http://www.stat.wisc.edu/courses/st311-rich/convol.pdf.
We ended with how the CLT gives Stirling's formula: If \(X_i \sim {\rm Poiss}(1)\) and these random variables are independent, then \(Y_n = X_1 + \cdots + X_n \sim {\rm Poiss}(n)\), which by the CLT is approximately \(N(n,n)\) for large \(n\). Thus \({\rm Prob}(Y_n = m) = n^m e^{-n}/m! \approx \int_{m-1/2}^{m+1/2} (2\pi n)^{-1/2} \exp(-(x-n)^2/2n)\,dx \approx \exp(-(m-n)^2/2n) / \sqrt{2\pi n}\). Taking \(m=n\) and cross multiplying gives Stirling's formula. Note the issue of continuous versus discrete; we solve this by associating to \(m\) on the discrete side the area under the continuous density from \(m-1/2\) to \(m+1/2\).
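You can check the quality of this approximation numerically; the sketch below just compares the two sides of \(n^n e^{-n}/n! \approx 1/\sqrt{2\pi n}\) (the \(m = n\) case), with the values of \(n\) my own choices.

```python
# Taking m = n in Prob(Y_n = n) = n^n e^{-n}/n! ~ 1/sqrt(2*pi*n) and cross
# multiplying gives Stirling; here we check the ratio of the two sides.
import math

for n in (5, 10, 50, 100):
    lhs = math.exp(n * math.log(n) - n - math.lgamma(n + 1))   # n^n e^{-n} / n!
    rhs = 1 / math.sqrt(2 * math.pi * n)
    print(f"n={n:4d}:  n^n e^-n/n! = {lhs:.6f}   1/sqrt(2 pi n) = {rhs:.6f}   "
          f"ratio = {lhs/rhs:.5f}")
```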
We proved Chebyshev's theorem, one of the gems of probability. The natural scale on which to measure fluctuations about the mean is the standard deviation (the square root of the variance). Chebyshev's theorem bounds how likely it is to be more than k standard deviations from the mean. The good news is that the result works for any random variable with finite mean and variance; the bad news is that, precisely because it works for all such distributions, its bounds are understandably much weaker than results tailored to a specific distribution (we will see later that its predictions for a binomial(n,p) are orders of magnitude worse than the truth).
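To see how much we give up by asking for a universal bound, compare Chebyshev for a binomial with the exact tail probability; the parameters \(n = 1000\), \(p = 1/2\) below are just an example.

```python
# Chebyshev: Prob(|X - mu| >= k*sigma) <= 1/k^2 for any X with finite variance.
# For Binomial(1000, 1/2) the true tail is far smaller (roughly Gaussian decay).
from scipy.stats import binom

n, p = 1000, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

for k in (2, 3, 4, 5):
    # exact P(|X - mu| >= k*sigma), using that k*sigma is not an integer here
    exact = binom.cdf(mu - k * sigma, n, p) + binom.sf(mu + k * sigma, n, p)
    print(f"k={k}:  Chebyshev bound {1/k**2:.4f}   exact {exact:.2e}")
```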
Stirling's Formula
We gave a poor mathematician's analysis of the size of n!; the best result is Stirling's formula, which says \(n!\) is about \(n^n e^{-n} \sqrt{2 \pi n}\, (1 +\) an error of size \(1/(12n) + \cdots)\). The standard way to get upper and lower bounds is the comparison method from calculus (basically the integral test); we could get a better result by using a better summation formula, say Simpson's method or Euler-Maclaurin; we'll do all this on Wednesday. We might return to Simpson's method later in the course, as one proof of it involves techniques that lead to the creation of low(er) risk portfolios! Ah, so much that we can do once we learn expectation..... Of course, our analysis above is not of \(n!\) but rather of \(\log(n!) = \log 1 + \cdots + \log n\); summifying a problem is a very important technique, and one of the reasons the logarithm shows up so frequently. If you are interested, let me know, as this is related to research of mine on Benford's law of digit bias.
It wasn't too hard to get a good upper bound; the lower bound required work. We first just had \(n < n!\), which is quite poor. We then improved that to \(2^{n-1} < n!\), and more generally \(c^n < n!\) eventually for any fixed \(c\). This starts to give a sense of how rapidly \(n!\) grows. We then had a major advance when we split the numbers \(1, \dots, n\) into two halves and got \(2^{n/2-1} (n/2)^{n/2 - 1}\), which gives a lower bound of essentially \(n^{n/2} = (\sqrt{n})^n\). While we want \(n/e\) in place of \(\sqrt{n}\), \(\sqrt{n}\) isn't horrible, and with more work this can be improved.
Instead of approximating all the numbers in \(n/2, \dots, n\) by \(n\), we saw we could do much better by using the `Farmer Brown' problem: if we pair the numbers so that the sums within each pair are constant, the largest product comes from the middle, so each pair's product is at most \((3n/4)^2\) and the product over the \(n/4\) pairs is dominated by \(((3n/4)^2)^{n/4}\). By splitting into four intervals we got an upper bound of approximately \(n^n 2.499^{-n}\), pretty close to \(n^n e^{-n}\).
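If you want to see how much each successive idea buys us, here is a quick comparison of the bounds on a logarithmic scale (a sketch; I use the log-Gamma function so nothing overflows, and \(n = 100\) is my choice).

```python
# Compare log(n!) with the bounds from class, all on the log scale:
# the lower bounds 2^(n-1) and (sqrt n)^n, the upper bound n^n 2.499^(-n),
# and Stirling's n^n e^(-n) sqrt(2 pi n).
import math

n = 100
print(f"log(n!) = {math.lgamma(n + 1):.2f} for n = {n}")
bounds = {
    "2^(n-1)":               (n - 1) * math.log(2),
    "(sqrt n)^n":            0.5 * n * math.log(n),
    "n^n / 2.499^n":         n * math.log(n) - n * math.log(2.499),
    "n^n e^-n sqrt(2 pi n)": n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n),
}
for name, val in bounds.items():
    print(f"  log of {name:24s} = {val:.2f}")
```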
There are other approaches to proving Stirling; the fact that \(\Gamma(n+1) = n!\) allows us to use techniques from real analysis / complex analysis to get Stirling by analyzing the integral. This is the Method of Stationary Phase (or the Method of Steepest Descent), very powerful and popular in mathematical physics. See Mathworld for this approach, or page 29 of my handout here.
Chi-square distribution: We did a lot of calculations; all the details are in the book. The goal is not to be able to crank these out line by line, but to understand the logic behind how I chose to attack them. There are tricks to make our lives easier, ways to arrange the algebra. That's the goal: seeing this. We saw this again today.
Our proof of Markov's inequality started by looking at special cases. Not surprisingly, the formula is useless if \(a \le E[X]\). It's always good to play with a statement to get a feel of what it gives.
A huge part of this class is trying to give you a sense of how to prove results. We organically flowed to the proof today. It seemed reasonable to write down the desired probability, and since Markov's inequality involves \(E[X]\), it makes sense to write that down. We then had to be a little clever in the algebra to manipulate it to the desired expression. I want you to walk away from this class with some comfort in proving results, and in figuring out what one should try to prove and investigate. This is why I felt it was so valuable to slowly work up to Markov's inequality.
There is a nice combinatorial interpretation of the double factorial: \((2m-1)!! = (2m-1) (2m-3) \cdots 3 \cdot 1\) is the number of ways to split \(2m\) people into \(m\) pairs of 2. It's very important not to add extra order; while we can assault the problem by adding labels, we must remove them at the end. Remember, while we can define whatever we want in mathematics, it's important to define useful expressions. In particular, the factorial of a factorial, \((n!)!\), doesn't occur that often, but as we saw combinatorially, products of every other integer do -- which is why the notation \(n!!\) is reserved for the latter.
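One clean way to see this count (label-then-unlabel, as described above): order the \(2m\) people and pair them off, then divide out the orderings we do not care about,
\[
\#\{\text{pairings of } 2m \text{ people}\} \;=\; \frac{(2m)!}{2^m \, m!} \;=\; (2m-1)(2m-3)\cdots 3 \cdot 1 \;=\; (2m-1)!!,
\]
where the \(m!\) removes the order of the pairs and each factor of \(2\) removes the order within a pair.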
The high point of the day was using a convolution to calculate the density of the sum of two independent chi-square random variables with parameters \(\nu_1, \nu_2\), and seeing that it was chi-square with parameter \(\nu_1+\nu_2\). We again did integration without integrating: \(Y = X_1 + X_2\) led to \(f_Y(y) = \int_0^\infty f_{X_1}(t) f_{X_2}(y-t) dt\); we then pulled out a lot of constants and noticed that if we changed variables to \(t = uy\) we got \(f_Y(y) = c_1 c_2 e^{-y/2} y^{(\nu_1+\nu_2)/2 - 1} \int_{u=0}^1 g(u) du\) for some function \(g\). What's nice is that we don't need to know exactly what \(g\) is or what its integral is. The integral is some constant \(c\), and thus \(f_Y(y) = c\, c_1 c_2 e^{-y/2} y^{(\nu_1+\nu_2)/2 - 1}\); this has the functional form of a chi-square random variable with parameter \(\nu_1 + \nu_2\), and thus \(c\, c_1 c_2\) must be the corresponding normalization constant!
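If you want a sanity check without redoing the algebra, here is a quick Monte Carlo comparison; the degrees of freedom \(\nu_1 = 3\), \(\nu_2 = 5\), the seed, and the sample size are just my choices.

```python
# Monte Carlo check that the sum of independent chi-square random variables
# with nu1 and nu2 degrees of freedom is chi-square with nu1 + nu2.
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(2)
nu1, nu2 = 3, 5
samples = rng.chisquare(nu1, 100_000) + rng.chisquare(nu2, 100_000)

# Kolmogorov-Smirnov comparison against the chi-square(nu1 + nu2) cdf
stat, pvalue = kstest(samples, chi2(nu1 + nu2).cdf)
print(f"KS statistic {stat:.4f}, p-value {pvalue:.3f}")
```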
Chi-square distribution
Yesterday was `onslaught of calculus' day. We did a lot of calculations; all the details are in the book. The goal is not to be able to crank these out line by line, but to understand the logic behind how I chose to attack them. There are tricks to make our lives easier, ways to arrange the algebra. That's the goal: seeing this. We saw this again today.
There's a really nice wikipedia article on the secretary / marriage problem. Somewhat related to this is the German tank problem. What I like most about this problem is that it's related to a lot of concepts we've studied (conditional probabilities, breaking a complicated problem into a lot of simpler ones), as well as the need to understand what a formula says and re-express it in a more meaningful way. We saw the harmonic numbers hiding in our expression, which we can approximate using the integral test. In a better approximation one meets the Euler-Mascheroni constant.
It is absolutely shocking that we can do so well in the marriage / secretary problem (ok, mathematically, not necessarily in practice). While assuming we know the number of applicants might sound overly restrictive, in some situations it's actually not so unreasonable. For example, the Math/Stats department is hiring a mathematician this year. Based on previous hiring searches in the past few years, I expect there'll be around 700 applications, and almost surely between 600 and 800. It's shocking that our final winning percentage is positive and not decaying to zero with n. As a nice exercise, try to compute the probability we end up with one of the top two candidates. Try to come up with a strategy that will have you `settle' as you get older (ie, start running out of candidates!).
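If you want to experiment, here is a quick simulation of the classic `reject the first \(n/e\), then take the first candidate better than everyone seen so far' rule; the parameters (700 applicants, the seed, the number of trials) are my choices, and it is easy to modify the code to track landing in the top two instead.

```python
# Simulate the secretary problem: skip the first n/e candidates, then hire the
# first one better than everyone seen so far.  The success probability is ~1/e.
import math
import random

def secretary_trial(n, rng):
    ranks = list(range(n))              # 0 is the best candidate
    rng.shuffle(ranks)
    cutoff = int(n / math.e)
    best_seen = min(ranks[:cutoff]) if cutoff > 0 else n
    for r in ranks[cutoff:]:
        if r < best_seen:
            return r == 0               # hired this candidate; is it the best?
    return ranks[-1] == 0               # nobody beat the benchmark: take the last

rng = random.Random(3)
n, trials = 700, 20_000                 # ~700 applicants, as in the hiring example
wins = sum(secretary_trial(n, rng) for _ in range(trials))
print(f"Hired the best candidate in {wins/trials:.3f} of trials (1/e = {1/math.e:.3f})")
```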
Key input in the analysis was the sum of the harmonic series: http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)
See here for the growth of the partial sums: http://en.wikipedia.org/wiki/Harmonic_number
India bride walks out when groom doesn't know math: http://bigstory.ap.org/article/3267cd38925e46828ddb0b623fad9ead/groom-fails-math-test-indian-bride-walks-out-wedding
The toy prize problem is great, highlighting so many nice parts of math. We can get a rough sense of the answer (if there are N prizes, the answer should lie between N and N!, and we then improved that to between N and N^2). We use linearity of expectation to write the random variable we want, X (the total wait time), as a sum of random variables we know (the waiting times for success of geometric random variables). We end with the harmonic sum.
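Here is a small simulation of the toy prize (coupon collector) problem against the formula \(N H_N = N \sum_{k=1}^N 1/k\); the choice \(N = 20\), the seed, and the number of trials are mine.

```python
# Toy prize / coupon collector: the expected number of purchases to see all N
# prizes is N * H_N (sum of the expected waits of the geometric random variables).
import random

def collect_all(N, rng):
    seen, buys = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))      # each purchase is a uniformly random prize
        buys += 1
    return buys

rng = random.Random(4)
N, trials = 20, 10_000
average = sum(collect_all(N, rng) for _ in range(trials)) / trials
harmonic = sum(1 / k for k in range(1, N + 1))
print(f"simulated {average:.2f}   N*H_N = {N * harmonic:.2f}")
```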
Finally, some interesting applications of probability: is there really a need to punt with 10 seconds left? (Video is here, go to 1:01.) ESPN is listing Michigan State's chance of winning at .2% (a little surprised it was that high, but of course this leads to a great question: how likely are unlikely events -- can we estimate their probabilities well?). Or, if you want even more bad sports calls: http://www.bostonglobe.com/sports/2015/10/19/patriots-should-expect-more-what-colts-tried/jCXNZK3cJTdtVNf3DYXVFL/story.html?p1=Article_Recommended_ArticleText
If X and Y are random variables, \(f(x,y)\) (or perhaps \(f_{X,Y}(x,y)\)) is the joint density function. If the random variables are independent, \(f(x,y) = f_X(x) f_Y(y)\), which greatly simplifies the analysis. If we integrate out one of the variables we are left with what's called the marginal. We use this in our proof of linearity of expectation. As you'll find out, I love linearity of expectation (this is a link to notes I've written on the subject, which we'll get to later).
We have to be very careful in interchanging orders of operations. We concentrated on interchanging two integrals, but one can interchange a derivative and an integral (click here for conditions on when this is permissible; this is called differentiating under the integral sign). In general we cannot interchange orders of operations (\(\sqrt{a+b}\) is typically not \(\sqrt{a} + \sqrt{b}\)), but sometimes we're fortunate (click here for a nice article on Wikipedia on when this is permissible).
It is not always possible to interchange orders of integration (see Fubini's Theorem for when this may be done). The main take-away is that we must be careful interchanging.
We introduced convolutions formally, though we had seen them earlier. We saw why the convolution of two densities is the density of the sum of the corresponding random variables. This property is the reason convolutions play such an important role in the theory. We informally remarked that if \(Z = X+Y\) then \(f_Z(z) = \int_{x=-\infty}^\infty f_X(x)f_Y(z-x)dx\), but saw a longer justification of that from \(F_Z(z) = {\rm Prob}(Z \le z) = \int_{x=-\infty}^\infty \int_{y=-\infty}^{z-x} f_X(x) f_Y(y) dy dx\). We then take the derivative with respect to \(z\) of both sides, and note \(\frac{d}{dz}F_Z(z) = f_Z(z)\); this is the advantage of good notation -- using capital letters for cdfs reminds us that their derivatives equal the pdfs. We pass the derivative past the \(x\)-integration, and get \(\frac{d}{dz}\int_{y=-\infty}^{z-x} f_Y(y)dy = \frac{d}{dz}\left[F_Y(z-x) - F_Y(-\infty)\right] = f_Y(z-x) \cdot 1\) by the Chain Rule. What's nice is we never need to know what \(F_Y\) is explicitly, as we immediately take its derivative!
The Fourier transform of a convolution is the product of the Fourier transforms. This converts a very difficult integral into the product of two Fourier transforms, and frequently these integrals can be evaluated. The difficulty is that, at the end of the day, we must then invert, and to prove the Fourier Inversion Theorem is no trivial task.
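The discrete analogue is easy to see on a computer: the discrete Fourier transform of a convolution is the pointwise product of the transforms. Here is a small numpy sketch (zero-padding to the full convolution length so that circular and linear convolution agree); the array lengths and seed are arbitrary.

```python
# Discrete analogue of "the Fourier transform of a convolution is the product
# of the Fourier transforms": FFT(a * b) = FFT(a) . FFT(b), with zero padding.
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.random(8), rng.random(5)
L = len(a) + len(b) - 1                 # length of the full linear convolution

conv = np.convolve(a, b)                # direct convolution
via_fft = np.fft.irfft(np.fft.rfft(a, L) * np.fft.rfft(b, L), L)

print(np.allclose(conv, via_fft))       # True
```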
The first step in any investigation is to figure out what questions to ask. Here are the two standard ones: (1) does the Taylor series exist (or for what x does it converge and equal the original function), and (2) is the Taylor series unique? The answers were surprising; a Taylor series must converge at the expansion point, but it's possible to only converge there; it's also possible for two different, infinitely differentiable functions to have the same Taylor series!
Analysis is hard. The standard example is \(f(x) = \exp(-1/x^2)\) for \(x \neq 0\), with \(f(0) = 0\): computing the derivatives at 0 from the definition of the derivative and L'Hopital's rule, we find all of them vanish, yet the function itself is zero only at zero; thus its Taylor series agrees with the original function only at x=0 (which is nothing to be proud of!). Complex analysis is quite different: there, if a function is complex differentiable once then it is infinitely complex differentiable, and it equals its Taylor series in a neighborhood of the point. This fact is one reason why we frequently use characteristic functions instead of generating or moment generating functions (which we'll cover later in the semester). We also discussed the similarities between how Taylor coefficients uniquely determine a nice function and how moments uniquely determine a nice probability distribution. It is sadly not the case that a sequence of moments always uniquely determines a probability distribution; fortunately, in many applications some additional conditions will hold which ensure uniqueness. We will see analogues of this example when we study the proof of the Central Limit Theorem.
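For the curious, here is the first step of that derivative computation (the higher derivatives go the same way, since away from \(0\) every derivative of \(f\) is a polynomial in \(1/x\) times \(e^{-1/x^2}\), and exponentials beat polynomials):
\[
f'(0) \;=\; \lim_{h \to 0} \frac{f(h) - f(0)}{h} \;=\; \lim_{h \to 0} \frac{e^{-1/h^2}}{h} \;=\; \lim_{u \to \pm\infty} u\, e^{-u^2} \;=\; 0.
\]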
Winner takes it all: song: https://www.youtube.com/watch?v=92cwKCU8Z5c Wikipedia article: https://en.wikipedia.org/wiki/The_Winner_Takes_It_All
We talked about analyzing perfect deals (fully perfect versus partially perfect). Be skeptical when rare events are reported; a great test is to see if less rare events are reported as well. Here's a nice article about such deals.
Lecture online here: Counting and probability (tic-tac-toe, poker hands, bridge), binomial theorem: https://youtu.be/pJ7RXimgBBo
Binomial Coefficients
We talked about tic-tac-toe today as a counting problem: how many `distinct' games are there? We consider two games that differ only by a rotation or reflection to be the same; see http://www.btinternet.com/~se16/hgb/tictactoe.htm for a nice analysis, or see the image here for the optimal strategy.
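If you'd like to count for yourself, here is a small brute-force enumeration of complete games ignoring symmetry (the commonly quoted total is 255,168); folding in rotations and reflections, as in the analysis linked above, takes a bit more bookkeeping.

```python
# Count all complete tic-tac-toe games (move sequences ending in a win or a
# full board), ignoring rotations and reflections.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board, player):
    total = 0
    for cell in range(9):
        if board[cell] is None:
            board[cell] = player
            if winner(board) or all(board):     # game over: win or full board
                total += 1
            else:
                total += count_games(board, 'O' if player == 'X' else 'X')
            board[cell] = None                  # undo the move and keep counting
    return total

print(count_games([None] * 9, 'X'))             # 255168
```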
Probably the most famous movie occurrence of tic-tac-toe is from Wargames; the clip is here (the entire movie is online here, start around 1:44:17; this was a classic movie from my childhood).
A math conundrum from 2012 involves a fun generalization of tic-tac-toe: `Russian Doll' tic-tac-toe. Each person has two large, two medium and two small pieces; a large piece can swallow any medium or small one, and a medium can swallow any small. If someone gets 3 in a row they win, else it's a tie. If blue goes first, do they have a winning strategy (can they make sure that they win, no matter how orange responds)? If not, can blue at least ensure that they do no worse than tie? Feel free to come to my office (Bronfman 202) to `test' your theories on a board.
There are several other interesting variants of tic-tac-toe. See Develin and Payne: bidding tic-tac-toe analysis for a great one.
In analyzing games like tic-tac-toe, it is imperative that we exhaust all possibilities. Certain games have been `solved': checkers has been solved, while chess and go are still open (though see the Deep Blue versus Kasparov matches).