General takeaways (all classes)
MATH 341: Additional comments related to material from the class. If anyone wants to convert this to a blog, let me know. These additional remarks are for your enjoyment, and will not be on homeworks or exams. These are just meant to suggest additional topics worth considering, and I am happy to discuss any of these further.
Here are the slides from today's talk: Theory and applications of Benford's law to fraud detection, or: Why the IRS should care about number theory! (video of a version of the talk that I gave at Brown is available here)
Here is a Mathematica program for sums of standardized Poisson random variables. The Manipulate feature is very nice, and allows you to see how the answers depend on the parameters.
We proved the CLT in the special case of sums of independent Poisson random variables (click here for a handout with the details of this calculation, or see our textbook). The proof used many of the ingredients of a typical analysis proof: we Taylor expand, use known expansions of common functions, and argue that the higher order terms do not matter in the limit relative to the main term (though they crucially affect the rate of convergence). We also got to take the logarithm of a product.
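If you want to play with this numerically (a rough stand-in for the Mathematica notebook above), here is a short Python sketch; the sample size, seed, and choice of \(n\) are mine, not anything from class.

```python
# Compare the distribution of a standardized Poisson(n) random variable
# (equivalently, a standardized sum of n independent Poisson(1)'s)
# with the standard normal, as a numerical illustration of the CLT.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 50                                       # number of Poisson(1) summands
sums = rng.poisson(lam=1.0, size=(100_000, n)).sum(axis=1)
standardized = (sums - n) / np.sqrt(n)       # Poisson(n) has mean n, variance n

# Compare a few empirical tail probabilities with the normal prediction.
for t in (0.5, 1.0, 2.0):
    emp = np.mean(standardized > t)
    print(f"P(Z > {t}):  empirical {emp:.4f}   normal {1 - norm.cdf(t):.4f}")
```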
Here is a nice video on the Fibonacci numbers in nature: http://www.youtube.com/watch?v=J7VOA8NxhWY
There are many ways to prove Binet's formula, the explicit, closed form expression for the n-th Fibonacci number. One is divine inspiration; another is generating functions and partial fractions. Generating functions occur in a variety of problems; there are many applications near and dear to me in number theory (such as attacking the Goldbach or Twin Prime Problem via the Circle Method). The great utility of Binet's formula is that we can jump to any Fibonacci number without having to compute all the intermediate ones. Even though it might be hard to work with such large numbers, we can jump to the trillionth (and if we take logarithms then we can specify it quite well).
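Here is a minimal Python sketch checking Binet's formula against the recursion, and using logarithms to describe a gigantic Fibonacci number without ever computing it; the indexing convention \(F_0 = 0, F_1 = 1\) and the parameters are my choices.

```python
# Binet's formula F_n = (phi^n - psi^n)/sqrt(5) versus the recursion,
# plus "jumping" to a huge index via logarithms.
import math

phi = (1 + math.sqrt(5)) / 2
psi = (1 - math.sqrt(5)) / 2

def fib_binet(n):
    return round((phi**n - psi**n) / math.sqrt(5))

def fib_recursive(n):
    a, b = 0, 1                 # F_0 = 0, F_1 = 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_binet(n) == fib_recursive(n) for n in range(40))

# Floating point can't hold F_n itself for n = 10^12, but since psi^n -> 0,
# log10(F_n) is essentially n*log10(phi) - log10(sqrt(5)), which tells us its size.
n = 10**12
log10_Fn = n * math.log10(phi) - math.log10(math.sqrt(5))
print(f"F_n for n = 10^12 has roughly {int(log10_Fn) + 1} digits")
```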
We will do a lot more with generating functions. It's amazing how well they allow us to pass from local information (the \(a_n\)'s) to global information (the \(G_a\)'s) and then back to local information (the \(a_n\)'s)! The trick, of course, is to be able to work with \(G_a\) and extract information about the \(a_n\)'s. Fortunately, there are lots of techniques for this. In fact, we can see why this is so useful. When we create a function from our sequence, all of a sudden the power and methods of calculus and real analysis are available. This is similar to the gain in extrapolating the factorial function to the Gamma function. Later we'll see the benefit of going one step further, into the complex plane!
Today we saw more properties of generating functions. The miracle continues -- they provide a powerful way to handle the algebra. For example, we could prove the sum of two independent Poisson random variables is Poisson by looking at the generating function and using our uniqueness result; we sadly don't have something similar in the continuous case (complex analysis is needed). We saw how to get a closed form expression for the Fibonacci numbers, and next class will do \(\sum_{m=0}^n \left({n \atop m}\right)^2 = \left({2n \atop n}\right)\). We compared probability generating functions and moment generating functions, and talked about where the algebra is easier.
The main item to discuss is that if \(X\) is a random variable taking on non-negative integer values then if we let \(a_n = {\rm Prob}(X = n)\) we can interpret \(G_a(s) = E[s^X]\). This is a great definition, and allowed us to easily reprove many of our results.
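As a quick illustration of why this is such a convenient way to package things, here is the (standard) Poisson computation written in the notation above:
\[
G_a(s) \;=\; E[s^X] \;=\; \sum_{n=0}^\infty s^n \frac{\lambda^n e^{-\lambda}}{n!} \;=\; e^{-\lambda} \sum_{n=0}^\infty \frac{(\lambda s)^n}{n!} \;=\; e^{\lambda(s-1)};
\]
if \(X_1\) and \(X_2\) are independent Poisson random variables with parameters \(\lambda_1\) and \(\lambda_2\), then \(E[s^{X_1+X_2}] = E[s^{X_1}]E[s^{X_2}] = e^{(\lambda_1+\lambda_2)(s-1)}\), which is the generating function of a Poisson with parameter \(\lambda_1+\lambda_2\), and uniqueness finishes the proof.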
The idea of noticing that a given expression can be rewritten in an equivalent way for some values of the parameter, while the new expression makes sense for other values as well, is related to the important concept of analytic or meromorphic continuation, one of the big results / techniques in complex analysis. The geometric series formula only makes sense when |r| < 1, in which case 1 + r + r^2 + ... = 1/(1-r); however, the right hand side makes sense for all r other than 1. We say the function 1/(1-r) is a (meromorphic) continuation of 1+r+r^2+...; this means that they are equal when both are defined, but 1/(1-r) makes sense for additional values of r. Interpreting 1+2+4+8+... as -1, or 1+2+3+4+5+... as -1/12, actually DOES make sense, and arises in modern physics and number theory (the latter is zeta(-1), where zeta(s) is the Riemann zeta function)!
For analytic continuation we need some ingredient to let us get another expression. It's thus worth asking what the source of the analytic continuation is. For the geometric series, it's the geometric series formula. For the Gamma function, it's integration by parts; this led us to the formula Gamma(s+1) = s Gamma(s). For the Riemann zeta function, it's the Poisson summation formula, which relates sums of a nice function at integer arguments to sums of its Fourier transform at integer arguments. There are many proofs of this result. In my book on number theory, I prove it by considering the periodic function \(F(x) = \sum_{n = -\infty}^\infty f(x+n)\). This function is clearly periodic with period 1 (if f decays nicely). Assuming f, f' and f'' have reasonable decay, the result now follows from facts about pointwise convergence of Fourier series.
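For reference, here is the statement with one common normalization of the Fourier transform, \(\hat{f}(\xi) = \int_{-\infty}^\infty f(x) e^{-2\pi i x \xi}\, dx\) (other normalizations move factors of \(2\pi\) around):
\[
\sum_{n=-\infty}^\infty f(n) \;=\; \sum_{m=-\infty}^\infty \hat{f}(m).
\]
The idea of the proof sketched above: the \(m\)-th Fourier coefficient of \(F(x) = \sum_n f(x+n)\) is \(\int_0^1 F(x) e^{-2\pi i m x}\,dx = \hat{f}(m)\), and evaluating the Fourier series of \(F\) at \(x = 0\) gives the displayed formula.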
Briefly, the reason generating functions are so useful is that they build up a nice function from data we can control, and we can extract the information we need without too much trouble. There are lots of different formulations, but the most important is that they are well-behaved with respect to convolution (the generating function of a convolution is the product of the generating functions).
The Weak Law of Large Numbers is a nice application of Chebyshev's inequality. It says the sample mean converges to the random variable's mean. More explicitly, the probability of the sample mean being at least a fixed amount epsilon from the mean tends to zero at a rate \(\sigma^2 / (\epsilon^2 n)\). This gives us some freedom to let \(\epsilon\) depend on \(n\); if \(\epsilon = n^{-1/4}\) then we get a very tight interval for the sample mean with probability \(1\) minus something of the order \(1/\sqrt{n}\). We had a nice instance of multiplying by \(1\) here; instead of looking at the event as \(|Y - \mu| \ge \epsilon\) we write it as \(|Y - \mu| \ge (\epsilon/\sigma_Y) \cdot \sigma_Y\).
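Here is a quick numerical illustration of how conservative Chebyshev's bound in the Weak Law is; the exponential distribution, sample sizes, seed, and \(\epsilon\) below are just my choices for the experiment.

```python
# Weak Law via Chebyshev: compare the actual probability that the sample
# mean is at least epsilon from mu with the bound sigma^2/(epsilon^2 n).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 1.0            # exponential(1): mean 1, variance 1
epsilon = 0.1

for n in (10, 100, 1000):
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    actual = np.mean(np.abs(means - mu) >= epsilon)
    bound = sigma2 / (epsilon**2 * n)
    print(f"n={n:5d}:  actual {actual:.4f}   Chebyshev bound {min(bound, 1):.4f}")
```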
A big part of the Weak Law is the sense in which we have convergence. There are four main types: almost sure convergence, convergence in probability (which is what the Weak Law gives), convergence in \(r\)-th mean (for example, mean square), and convergence in distribution.
It's often worth reading about the famous mathematicians and their theorems: Markov, Chebyshev, the Weak Law.
Poisson Random Variables
The Poisson random variable often models the number of events in a window of time. Also, frequently normalized spacings between events converge to Poissonian (a great example is to look at the primes). Another is the spacings between the ordered fractional parts of \(n^k \alpha\) (click here for more).
General advice: to differentiate an identity, you need an identity. It seems silly to state, but it's essential. Often the hardest part of these problems is figuring out how to do the algebra in a clean way. For us, we saw that frequently we want to move the normalization constant over to the other side; it allows us to avoid a product or quotient rule. We also saw it's sometimes easier to compute \(E[X(X-1)]\) than \(E[X^2]\), and then do algebra. It all comes down to whether or not it's easier to apply \(d/dx\) or \(x\, d/dx\). For the Poisson distribution, it helps to move the exponential to the other side and write \(e^\lambda = \sum_{n=0}^\infty \lambda^n/n!\).
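To make the advice concrete for the Poisson: differentiating the identity \(e^\lambda = \sum_{n=0}^\infty \lambda^n/n!\) twice with respect to \(\lambda\) and multiplying back by \(\lambda^2 e^{-\lambda}\) gives the second factorial moment,
\[
E[X(X-1)] \;=\; \sum_{n=0}^\infty n(n-1)\,\frac{\lambda^n e^{-\lambda}}{n!} \;=\; \lambda^2 e^{-\lambda} \sum_{n=2}^\infty \frac{\lambda^{n-2}}{(n-2)!} \;=\; \lambda^2,
\]
so \({\rm Var}(X) = E[X(X-1)] + E[X] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda\).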
We proved the sum of independent Poisson random variables is a Poisson random variable, and the parameter is the only thing it can be (as expectation is linear, it must be the sum of the parameters). We were able to see this by using a convolution to get the probability, and then doing some algebraic gymnastics to see that our expression was the probability of a Poisson random variable. If you have an idea of what the answer is, that can often be helpful and suggest a method of proof. We'll see proofs of this result again when we reach generating functions; in addition to our textbook you can also find this result online; see for example http://www.stat.wisc.edu/courses/st311-rich/convol.pdf.
We ended with how the CLT gives Stirling's formula: If \(X_i \sim {\rm Poiss}(1)\) and these random variables are independent, then \(Y_n = X_1 + \cdots + X_n \sim {\rm Poiss}(n)\), which by the CLT is approximately \(N(n,n)\) for large \(n\). Thus \({\rm Prob}(Y_n = m) = n^m e^{-n}/m! \approx \int_{m-1/2}^{m+1/2} (2\pi n)^{-1/2} \exp(-(x-n)^2/2n)\,dx \approx \exp(-(m-n)^2/2n) / \sqrt{2\pi n}\). Taking \(m=n\) and cross multiplying gives Stirling's formula. Note the issue of continuous versus discrete; we solve this by associating to \(m\) on the discrete side the area under the continuous density from \(m-1/2\) to \(m+1/2\).
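You can check the quality of this approximation numerically; the sketch below just compares the two sides of \(n^n e^{-n}/n! \approx 1/\sqrt{2\pi n}\) (the \(m = n\) case), with the values of \(n\) my own choices.

```python
# Taking m = n in Prob(Y_n = n) = n^n e^{-n}/n! ~ 1/sqrt(2*pi*n) and cross
# multiplying gives Stirling; here we check the ratio of the two sides.
import math

for n in (5, 10, 50, 100):
    lhs = math.exp(n * math.log(n) - n - math.lgamma(n + 1))   # n^n e^{-n} / n!
    rhs = 1 / math.sqrt(2 * math.pi * n)
    print(f"n={n:4d}:  n^n e^-n/n! = {lhs:.6f}   1/sqrt(2 pi n) = {rhs:.6f}   "
          f"ratio = {lhs/rhs:.5f}")
```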
We proved Chebyshev's theorem, one of the gems of probability. The natural scale on which to measure fluctuations about the mean is the standard deviation (the square root of the variance). Chebyshev's theorem bounds how likely it is to be more than k standard deviations from the mean. The good news is that the result works for any random variable with finite mean and variance; the bad news is that, precisely because it works for all such distributions, its bounds are understandably much weaker than results tailored to a specific distribution (we will see later that its predictions for a binomial(n,p) are orders of magnitude worse than the truth).
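To see how much we give up by asking for a universal bound, compare Chebyshev for a binomial with the exact tail probability; the parameters \(n = 1000\), \(p = 1/2\) below are just an example.

```python
# Chebyshev: Prob(|X - mu| >= k*sigma) <= 1/k^2 for any X with finite variance.
# For Binomial(1000, 1/2) the true tail is far smaller (roughly Gaussian decay).
from scipy.stats import binom

n, p = 1000, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

for k in (2, 3, 4, 5):
    # exact P(|X - mu| >= k*sigma), using that k*sigma is not an integer here
    exact = binom.cdf(mu - k * sigma, n, p) + binom.sf(mu + k * sigma, n, p)
    print(f"k={k}:  Chebyshev bound {1/k**2:.4f}   exact {exact:.2e}")
```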
Stirling's Formula
We gave a poor mathematician's analysis of the size of n!; the best result is Stirling's formula, which says \(n!\) is about \(n^n e^{-n} \sqrt{2 \pi n}\, (1 +\) an error of size \(1/(12n) + \cdots)\). The standard way to get upper and lower bounds is the comparison method from calculus (basically the integral test); we could get a better result by using a better summation formula, say Simpson's method or Euler-Maclaurin; we'll do all this on Wednesday. We might return to Simpson's method later in the course, as one proof of it involves techniques that lead to the creation of low(er) risk portfolios! Ah, so much that we can do once we learn expectation..... Of course, our analysis above is not of \(n!\) but rather of \(\log(n!) = \log 1 + \cdots + \log n\); summifying a problem is a very important technique, and one of the reasons the logarithm shows up so frequently. If you are interested, let me know, as this is related to research of mine on Benford's law of digit bias.
It wasn't too hard to get a good upper bound; the lower bound required work. We first just had \(n < n!\), which is quite poor. We then improved that to \(2^{n-1} < n!\), and more generally \(c^n < n!\) eventually for any fixed \(c\). This starts to give a sense of how rapidly \(n!\) grows. We then had a major advance when we split the numbers \(1, \dots, n\) into two halves and got \(2^{n/2-1} (n/2)^{n/2 - 1}\), which gives a lower bound of essentially \(n^{n/2} = (\sqrt{n})^n\). While we want \(n/e\) in place of \(\sqrt{n}\), \(\sqrt{n}\) isn't horrible, and with more work this can be improved.
Instead of approximating all the numbers in \(n/2, \dots, n\) by \(n\), we saw we could do much better by using the `Farmer Brown' problem: if we pair the numbers so that the sums within each pair are constant, the largest product comes from the middle, so each pair's product is at most \((3n/4)^2\) and the product over the \(n/4\) pairs is dominated by \(((3n/4)^2)^{n/4}\). By splitting into four intervals we got an upper bound of approximately \(n^n 2.499^{-n}\), pretty close to \(n^n e^{-n}\).
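If you want to see how much each successive idea buys us, here is a quick comparison of the bounds on a logarithmic scale (a sketch; I use the log-Gamma function so nothing overflows, and \(n = 100\) is my choice).

```python
# Compare log(n!) with the bounds from class, all on the log scale:
# the lower bounds 2^(n-1) and (sqrt n)^n, the upper bound n^n 2.499^(-n),
# and Stirling's n^n e^(-n) sqrt(2 pi n).
import math

n = 100
print(f"log(n!) = {math.lgamma(n + 1):.2f} for n = {n}")
bounds = {
    "2^(n-1)":               (n - 1) * math.log(2),
    "(sqrt n)^n":            0.5 * n * math.log(n),
    "n^n / 2.499^n":         n * math.log(n) - n * math.log(2.499),
    "n^n e^-n sqrt(2 pi n)": n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n),
}
for name, val in bounds.items():
    print(f"  log of {name:24s} = {val:.2f}")
```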
There are other approaches to proving Stirling; the fact that \(\Gamma(n+1) = n!\) allows us to use techniques from real analysis / complex analysis to get Stirling by analyzing the integral. This is the Method of Stationary Phase (or the Method of Steepest Descent), very powerful and popular in mathematical physics. See Mathworld for this approach, or page 29 of my handout here.
Chi-square distribution: We did a lot of calculations; all the details are in the book. The goal is not to be able to crank these out line by line, but to understand the logic behind how I chose to attack them. There are tricks to make our lives easier, ways to arrange the algebra. That's the goal: seeing this. We saw this again today.
Our proof of Markov's inequality started by looking at special cases. Not surprisingly, the formula is useless if \(a \le E[X]\). It's always good to play with a statement to get a feel of what it gives.
A huge part of this class is trying to give you a sense of how to prove results. We organically flowed to the proof today. It seemed reasonable to write down the desired probability, and since Markov's inequality involves \(E[X]\), it makes sense to write that down. We then had to be a little clever in the algebra to manipulate it to the desired expression. I want you to walk away from this class with some comfort in proving results, and in figuring out what one should try to prove and investigate. This is why I felt it was so valuable to slowly work up to Markov's inequality.
There is a nice combinatorial interpretation of the double factorial: \((2m-1)!! = (2m-1) (2m-3) \cdots 3 \cdot 1\) is the number of ways to split \(2m\) people into \(m\) pairs of 2. It's very important not to add extra order; while we can assault the problem by adding labels, we must remove them at the end. Remember, while we can define whatever we want in mathematics, it's important to define useful expressions. In particular, the factorial of a factorial, \((n!)!\), doesn't occur that often, but as we saw combinatorially, products of every other integer do -- which is why the notation \(n!!\) is reserved for the latter.
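One clean way to see this count (label-then-unlabel, as described above): order the \(2m\) people and pair them off, then divide out the orderings we do not care about,
\[
\#\{\text{pairings of } 2m \text{ people}\} \;=\; \frac{(2m)!}{2^m \, m!} \;=\; (2m-1)(2m-3)\cdots 3 \cdot 1 \;=\; (2m-1)!!,
\]
where the \(m!\) removes the order of the pairs and each factor of \(2\) removes the order within a pair.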
The high point of the day was using a convolution to calculate the density of the sum of two independent chi-square random variables with parameters \(\nu_1, \nu_2\), and seeing that it was chi-square with parameter \(\nu_1+\nu_2\). We again did integration without integrating: \(Y = X_1 + X_2\) led to \(f_Y(y) = \int_0^\infty f_{X_1}(t) f_{X_2}(y-t) dt\); we then pulled out a lot of constants and noticed that if we changed variables to \(t = uy\) we got \(f_Y(y) = c_1 c_2 e^{-y/2} y^{(\nu_1+\nu_2)/2 - 1} \int_{u=0}^1 g(u) du\) for some function \(g\). What's nice is that we don't need to know exactly what \(g\) is or what its integral is. The integral is some constant \(c\), and thus \(f_Y(y) = c\, c_1 c_2 e^{-y/2} y^{(\nu_1+\nu_2)/2 - 1}\); this has the functional form of a chi-square random variable with parameter \(\nu_1 + \nu_2\), and thus \(c\, c_1 c_2\) must be the corresponding normalization constant!
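If you want a sanity check without redoing the algebra, here is a quick Monte Carlo comparison; the degrees of freedom \(\nu_1 = 3\), \(\nu_2 = 5\), the seed, and the sample size are just my choices.

```python
# Monte Carlo check that the sum of independent chi-square random variables
# with nu1 and nu2 degrees of freedom is chi-square with nu1 + nu2.
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(2)
nu1, nu2 = 3, 5
samples = rng.chisquare(nu1, 100_000) + rng.chisquare(nu2, 100_000)

# Kolmogorov-Smirnov comparison against the chi-square(nu1 + nu2) cdf
stat, pvalue = kstest(samples, chi2(nu1 + nu2).cdf)
print(f"KS statistic {stat:.4f}, p-value {pvalue:.3f}")
```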
Chi-square distribution
Yesterday was `onslaught of calculus' day. We did a lot of calculations; all the details are in the book. The goal is not to be able to crank these out line by line, but to understand the logic behind how I chose to attack them. There are tricks to make our lives easier, ways to arrange the algebra. That's the goal: seeing this. We saw this again today.
There's a really nice wikipedia article on the secretary / marriage problem. Somewhat related to this is the German tank problem. What I like most about this problem is that it's related to a lot of concepts we've studied (conditional probabilities, breaking a complicated problem into a lot of simpler ones), as well as the need to understand what a formula says and re-express it in a more meaningful way. We saw the harmonic numbers hiding in our expression, which we can approximate using the integral test. In a better approximation one meets the Euler-Mascheroni constant.
It is absolutely shocking that we can do so well in the marriage / secretary problem (ok, mathematically, not necessarily in practice). While assuming we know the number of applicants might sound overly restrictive, in some situations it's actually not so unreasonable. For example, the Math/Stats department is hiring a mathematician this year. Based on previous hiring searches in the past few years, I expect there'll be around 700 applications, and almost surely between 600 and 800. It's shocking that our final winning percentage is positive and not decaying to zero with n. As a nice exercise, try to compute the probability we end up with one of the top two candidates. Try to come up with a strategy that will have you `settle' as you get older (ie, start running out of candidates!).
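If you want to experiment, here is a quick simulation of the classic `reject the first \(n/e\), then take the first candidate better than everyone seen so far' rule; the parameters (700 applicants, the seed, the number of trials) are my choices, and it is easy to modify the code to track landing in the top two instead.

```python
# Simulate the secretary problem: skip the first n/e candidates, then hire the
# first one better than everyone seen so far.  The success probability is ~1/e.
import math
import random

def secretary_trial(n, rng):
    ranks = list(range(n))              # 0 is the best candidate
    rng.shuffle(ranks)
    cutoff = int(n / math.e)
    best_seen = min(ranks[:cutoff]) if cutoff > 0 else n
    for r in ranks[cutoff:]:
        if r < best_seen:
            return r == 0               # hired this candidate; is it the best?
    return ranks[-1] == 0               # nobody beat the benchmark: take the last

rng = random.Random(3)
n, trials = 700, 20_000                 # ~700 applicants, as in the hiring example
wins = sum(secretary_trial(n, rng) for _ in range(trials))
print(f"Hired the best candidate in {wins/trials:.3f} of trials (1/e = {1/math.e:.3f})")
```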
Key input in the analysis was the sum of the harmonic series: http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)
See here for the growth of the partial sums: http://en.wikipedia.org/wiki/Harmonic_number
India bride walks out when groom doesn't know math: http://bigstory.ap.org/article/3267cd38925e46828ddb0b623fad9ead/groom-fails-math-test-indian-bride-walks-out-wedding
The toy prize problem is great, highlighting so many nice parts of math. We can get a rough sense of the answer (if there are N prizes, the answer should lie between N and N!, and we then improved that to between N and N^2). We use linearity of expectation to write the random variable we want, X (the total wait time), as a sum of random variables we know (the waiting times for success of geometric random variables). We end with the harmonic sum.
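Here is a small simulation of the toy prize (coupon collector) problem against the formula \(N H_N = N \sum_{k=1}^N 1/k\); the choice \(N = 20\), the seed, and the number of trials are mine.

```python
# Toy prize / coupon collector: the expected number of purchases to see all N
# prizes is N * H_N (sum of the expected waits of the geometric random variables).
import random

def collect_all(N, rng):
    seen, buys = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))      # each purchase is a uniformly random prize
        buys += 1
    return buys

rng = random.Random(4)
N, trials = 20, 10_000
average = sum(collect_all(N, rng) for _ in range(trials)) / trials
harmonic = sum(1 / k for k in range(1, N + 1))
print(f"simulated {average:.2f}   N*H_N = {N * harmonic:.2f}")
```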
Finally, some interesting applications of probability: is there really a need to punt with 10 seconds left? (Video is here, go to 1:01.) ESPN is listing Michigan State's chance of winning at .2% (a little surprised it was that high, but of course this leads to a great question: how likely are unlikely events -- can we estimate their probabilities well?). Or, if you want even more bad sports calls: http://www.bostonglobe.com/sports/2015/10/19/patriots-should-expect-more-what-colts-tried/jCXNZK3cJTdtVNf3DYXVFL/story.html?p1=Article_Recommended_ArticleText
If X and Y are random variables, \(f(x,y)\) (or perhaps \(f_{X,Y}(x,y)\)) is the joint density function. If the random variables are independent, \(f(x,y) = f_X(x) f_Y(y)\), which greatly simplifies the analysis. If we integrate out one of the variables we are left with what's called the marginal. We use this in our proof of linearity of expectation. As you'll find out, I love linearity of expectation (this is a link to notes I've written on the subject, which we'll get to later).
We have to be very careful in interchanging orders of operations. We concentrated on interchanging two integrals, but one can interchange a derivative and an integral (click here for conditions on when this is permissible; this is called differentiating under the integral sign). In general we cannot interchange orders of operations (\(\sqrt{a+b}\) is typically not \(\sqrt{a} + \sqrt{b}\)), but sometimes we're fortunate (click here for a nice article on Wikipedia on when this is permissible).
It is not always possible to interchange orders of integration (see Fubini's Theorem for when this may be done). The main take-away is that we must be careful interchanging.
We introduced convolutions formally, though we had seen them earlier. We saw why the convolution of two densities is the density of the sum of the corresponding random variables. This property is the reason convolutions play such an important role in the theory. We informally remarked that if \(Z = X+Y\) then \(f_Z(z) = \int_{x=-\infty}^\infty f_X(x)f_Y(z-x)dx\), but saw a longer justification of that from \(F_Z(z) = {\rm Prob}(Z \le z) = \int_{x=-\infty}^\infty \int_{y=-\infty}^{z-x} f_X(x) f_Y(y) dy dx\). We then take the derivative with respect to \(z\) of both sides, and note \(\frac{d}{dz}F_Z(z) = f_Z(z)\); this is the advantage of good notation -- using capital letters for cdfs reminds us that their derivatives equal the pdfs. We pass the derivative past the \(x\)-integration, and get \(\frac{d}{dz}\int_{y=-\infty}^{z-x} f_Y(y)dy = \frac{d}{dz}\left[F_Y(z-x) - F_Y(-\infty)\right] = f_Y(z-x) \cdot 1\) by the Chain Rule. What's nice is we never need to know what \(F_Y\) is explicitly, as we immediately take its derivative!
The Fourier transform of a convolution is the product of the Fourier transforms. This converts a very difficult integral into the product of two Fourier transforms, and frequently these integrals can be evaluated. The difficulty is that, at the end of the day, we must then invert, and to prove the Fourier Inversion Theorem is no trivial task.
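The discrete analogue is easy to see on a computer: the discrete Fourier transform of a convolution is the pointwise product of the transforms. Here is a small numpy sketch (zero-padding to the full convolution length so that circular and linear convolution agree); the array lengths and seed are arbitrary.

```python
# Discrete analogue of "the Fourier transform of a convolution is the product
# of the Fourier transforms": FFT(a * b) = FFT(a) . FFT(b), with zero padding.
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.random(8), rng.random(5)
L = len(a) + len(b) - 1                 # length of the full linear convolution

conv = np.convolve(a, b)                # direct convolution
via_fft = np.fft.irfft(np.fft.rfft(a, L) * np.fft.rfft(b, L), L)

print(np.allclose(conv, via_fft))       # True
```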
The first step in any investigation is to figure out what questions to ask. Here are the two standard ones: (1) does the Taylor series exist (or for what x does it converge and equal the original function), and (2) is the Taylor series unique? The answers were surprising; a Taylor series must converge at the expansion point, but it's possible to only converge there; it's also possible for two different, infinitely differentiable functions to have the same Taylor series!
Analysis is hard. The standard example is \(f(x) = \exp(-1/x^2)\) for \(x \neq 0\), with \(f(0) = 0\): computing the derivatives at 0 from the definition of the derivative and L'Hopital's rule, we find all of them vanish, yet the function itself is zero only at zero; thus its Taylor series agrees with the original function only at x=0 (which is nothing to be proud of!). Complex analysis is quite different: there, if a function is complex differentiable once then it is infinitely complex differentiable, and it equals its Taylor series in a neighborhood of the point. This fact is one reason why we frequently use characteristic functions instead of generating or moment generating functions (which we'll cover later in the semester). We also discussed the similarities between how Taylor coefficients uniquely determine a nice function and how moments uniquely determine a nice probability distribution. It is sadly not the case that a sequence of moments always uniquely determines a probability distribution; fortunately, in many applications some additional conditions will hold which ensure uniqueness. We will see analogues of this example when we study the proof of the Central Limit Theorem.
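For the curious, here is the first step of that derivative computation (the higher derivatives go the same way, since away from \(0\) every derivative of \(f\) is a polynomial in \(1/x\) times \(e^{-1/x^2}\), and exponentials beat polynomials):
\[
f'(0) \;=\; \lim_{h \to 0} \frac{f(h) - f(0)}{h} \;=\; \lim_{h \to 0} \frac{e^{-1/h^2}}{h} \;=\; \lim_{u \to \pm\infty} u\, e^{-u^2} \;=\; 0.
\]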
Winner takes it all: song: https://www.youtube.com/watch?v=92cwKCU8Z5c Wikipedia article: https://en.wikipedia.org/wiki/The_Winner_Takes_It_All
We talked about analyzing perfect deals (fully perfect versus partially perfect). Be skeptical when rare events are reported; a great test is to see if less rare events are reported as well. Here's a nice article about such deals.
Lecture online here: Counting and probability (tic-tac-toe, poker hands, bridge), binomial theorem: https://youtu.be/pJ7RXimgBBo
Binomial Coefficients
We talked about tic-tac-toe today as a counting problem: how many `distinct' games are there? We consider two games that differ only by a rotation or reflection to be the same; see http://www.btinternet.com/~se16/hgb/tictactoe.htm for a nice analysis, or see the image here for the optimal strategy.
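If you'd like to count for yourself, here is a small brute-force enumeration of complete games ignoring symmetry (the commonly quoted total is 255,168); folding in rotations and reflections, as in the analysis linked above, takes a bit more bookkeeping.

```python
# Count all complete tic-tac-toe games (move sequences ending in a win or a
# full board), ignoring rotations and reflections.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board, player):
    total = 0
    for cell in range(9):
        if board[cell] is None:
            board[cell] = player
            if winner(board) or all(board):     # game over: win or full board
                total += 1
            else:
                total += count_games(board, 'O' if player == 'X' else 'X')
            board[cell] = None                  # undo the move and keep counting
    return total

print(count_games([None] * 9, 'X'))             # 255168
```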
Probably the most famous movie occurrence of tic-tac-toe is from Wargames; the clip is here (the entire movie is online here, start around 1:44:17; this was a classic movie from my childhood).
A math conundrum from 2012 involves a fun generalization of tic-tac-toe: `Russian Doll' tic-tac-toe. Each person has two large, two medium and two small pieces; a large piece can swallow any medium or small one, and a medium can swallow any small. If someone gets 3 in a row they win, else it's a tie. If blue goes first, do they have a winning strategy (can they make sure that they win, no matter how orange responds)? If not, can blue at least ensure that they do no worse than tie? Feel free to come to my office (Bronfman 202) to `test' your theories on a board.
There are several other interesting variants of tic-tac-toe. See Develin and Payne: bidding tic-tac-toe analysis for a great one.
In analyzing games like tic-tac-toe, it is imperative that we exhaust all possibilities. Certain games have been `solved': checkers has been solved, while chess and go are still open (though see the Deep Blue versus Kasparov matches).