Additional comments related to material from the
class. If anyone wants to convert this to a blog, let me know. These additional
remarks are for your enjoyment, and will not be on homeworks or exams. These are
just meant to suggest additional topics worth considering, and I am happy to
discuss any of these further.
 Thursday, December 10.
 Tuesday, December 8. We finished using
geometric random
variables to model
baseball games. This is the first example of a rich field,
sabermetrics (the
art/science of applying math to baseball). It is quite difficult to obtain
closed form expressions; such luck is quite rare, and one should celebrate one's good
fortune whenever it occurs. If we don't have closed form expressions, we
are forced to resort to simulations.
 We used a geometric random variable to model runs scored and runs allowed. There
are some problems with this model, but amazingly it does lead to a clean
answer at the end of the day. The model is defined in terms of a decay
parameter: the probability of scoring n+1 runs is always a factor of p less
than the probability of scoring n runs. We end up with a complicated answer in
terms of p and q (the decay parameters of the two teams). Amazingly, after some
algebra we get a very clean formula in terms of the means of the two geometric
random variables, RS for X (runs scored) and RA for Y (i.e., how many runs team
X allows, which is how many runs team Y scores). There are many ways to try
to simplify the algebra; the key observation is that RS = p/(1-p) and
RA = q/(1-q). Noting this, we multiply our answer by 1 in the form
(1-p)(1-q) divided by (1-p)(1-q), as this allows us to obtain expressions
involving RS and RA. This is one of the most
important things to learn, namely how to multiply by 1 or add 0 to clean up
the algebra. While Mathematica can simplify the `ugly' expression obtained by
replacing all p's with RS/(1+RS) and all q's with RA/(1+RA), it is much faster,
and I feel more illuminating, to multiply by 1 as stated above.
 It is also worth noting that the final formula does have many properties
we would like: it's between 0 and 1, as RS increases so does the winning
percentage, there is no chance of winning if RS = 0 or RA goes to infinity, if
RS=RA the winning percentage is 50%, .... You should ALWAYS try to do these
simple heuristics to get a sense of a formula's reasonableness.
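One can check (a nice algebra exercise) that with geometric run distributions P(n runs) = (1-p)p^n and ties discarded, the clean formula works out to RS/(RS+RA). A minimal Monte Carlo sketch of this claim (the function names and parameter values below are mine, chosen for illustration):

```python
import random

def geometric_runs(p, rng):
    """Sample runs scored: P(n runs) = (1 - p) * p**n, with mean p / (1 - p)."""
    n = 0
    while rng.random() < p:
        n += 1
    return n

def sim_win_prob(p, q, games=100_000, seed=0):
    """Estimate P(X > Y given X != Y) when X and Y are independent
    geometrics with decay parameters p and q (ties are replayed/discarded)."""
    rng = random.Random(seed)
    wins = decided = 0
    for _ in range(games):
        x, y = geometric_runs(p, rng), geometric_runs(q, rng)
        if x != y:
            decided += 1
            wins += x > y
    return wins / decided
```

With p = 0.6 (so RS = 1.5) and q = 0.5 (so RA = 1), the simulated winning percentage should hover near RS/(RS+RA) = 0.6.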
 A better estimator of a team's winning percentage is RS^{γ} / (RS^{γ}
+ RA^{γ}). Originally γ was taken to be 2 (and sadly ESPN still uses
that!); however, recent research shows that taking γ to be about 1.8 does a better
job (this is what MLB.com uses). This is called the
Pythagorean
Won-Loss Formula, and is very useful for predicting future
performance. (For example, if a team is playing below its predicted ability,
it might not write off the season, instead trading prospects for a veteran to help
with a playoff push; if it is not underperforming, it may write off the
season and save its rookies for next year.)
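As a formula this is a one-liner; the exponent γ is just a tunable parameter (the team numbers in the checks below are made up):

```python
def pythag_win_pct(rs, ra, gamma=1.8):
    """Pythagorean expectation: predicted winning percentage from
    runs scored (rs) and runs allowed (ra)."""
    return rs ** gamma / (rs ** gamma + ra ** gamma)
```

It passes the sanity checks listed above: it lies in [0, 1], equals 0.5 when rs = ra, vanishes when rs = 0, and increases in rs.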
 For more on the subject, see the
links below:

Economics:
the standard
random walk hypothesis seems to have lost most of its supporters, though
there are variants (and I'm not familiar with all); see also the
efficient
market hypothesis and
technical analysis,
and all the links there. (There are also many good links on the wikipedia page
on Eugene Fama). Two
famous books (with different conclusions) are Malkiel's
A Random Walk Down Wall Street and Mandelbrot and Hudson's
The (Mis)Behavior of Markets (a fractal view of risk, ruin and reward).
Some interesting papers if you want to read more:

For more on
randomness, check out The Black Swan by Taleb (amazon.com
page here,
wikipedia
page here). Several members of the class have recommended this book
highly, and from reading excerpts on the web I understand why.

For more on
fractal geometry,
click here. We did the
Koch snowflake;
another popular one is the
Cantor set. See
here for fractal dimensions. To actually compute pictures of items like
the Mandelbrot set,
one needs to iterate polynomials. This leads to the fascinating subject of
efficient algorithms; when I wrote such programs years ago on what would now
be considered a `slow' computer, I had to use
Horner's algorithm
to get things to run in a reasonable time.
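In case it's unfamiliar, Horner's rule rewrites a_n x^n + ... + a_0 as (...((a_n x + a_{n-1})x + a_{n-2})x + ...) + a_0, using n multiplications instead of roughly n^2/2; a quick sketch (function name mine):

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's rule.

    coeffs = [a_n, ..., a_1, a_0] represents a_n x^n + ... + a_1 x + a_0;
    each loop iteration folds in one coefficient with a single multiply.
    """
    result = 0
    for c in coeffs:
        result = result * x + c
    return result
```

Since this works for complex x as well, the same routine can drive general polynomial iterations of the kind used to draw fractal sets.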
 Thursday, December 3. One reason I
enjoy additive
number theory so much is that many of its problems are simply stated
(though frequently the techniques to analyze them are quite involved). Today's
problem on comparing the size of sumsets to difference sets is typical of many
of the problems in the field. There are certain regions where the analysis can
be handled with standard techniques encountered in courses early in
undergraduate study. It's worth reviewing the techniques used to study this
problem, as it provides a great summary of what we've done and how those tools can be used.
 We consider a binomial model, where each integer in {1,...,N} is in A with
probability p(N). We assumed in class that p(N) = N^{-δ} with
δ in (1/2, 1); we'll see later why this assumption was needed. To see the
phase transition
we need to study all choices of δ in (0,1) and not just in (1/2, 1);
unfortunately those other regions require recent advanced bounds towards
strong concentration
(this is a link to a great recent paper on the subject by Van Vu), and
thus cannot be covered in a first course on probability. (For more on this
subject, see the Wikipedia entry on
Chernoff bounds.)
 The first step of the proof was to estimate the size of A. We used
binary indicator
random variables to study the size of a randomly chosen A. We have X_i = 1
if i is in A (which happens with probability N^{-δ})
and X_i = 0 otherwise. By
linearity of expectation
(this is a link to notes I've written on the subject),
if X = X_1 + ... + X_N then E[X] = N E[X_i] for any i, or E[X] = N^{1-δ}.
The variance is about N^{1-δ}, so the standard deviation is about Sqrt(N^{1-δ}).
 Thus a typical A has about N^{1-δ} elements. We need to quantify `how close'.
We could use the Central Limit Theorem, as we have a binomial with large N; however,
Chebyshev's
inequality more than suffices. We have Prob(|X - N^{1-δ}| > .5 N^{1-δ})
<= 4 / N^{1-δ}. To see how small this is, note that the standard deviation is sqrt(N^{1-δ}),
and thus we are a HUGE number of standard deviations
away, on the order of sqrt(N^{1-δ}). For example, if N = 10^100 and δ = 4/5,
then we are about 10,000,000,000 standard deviations away, and thus the probability
is quite negligible.
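A quick empirical check of this concentration (with N and δ scaled down so it runs fast; the parameter values are illustrative):

```python
import random

def sample_A_size(N, delta, rng):
    """Draw |A| when each of 1,...,N lands in A independently with
    probability N^(-delta)."""
    p = N ** (-delta)
    return sum(rng.random() < p for _ in range(N))

def deviation_freq(N=10_000, delta=0.5, trials=200, seed=1):
    """Fraction of trials where |A| misses its mean N^(1-delta) by
    more than 50 percent."""
    rng = random.Random(seed)
    mean = N ** (1 - delta)
    bad = sum(abs(sample_A_size(N, delta, rng) - mean) > 0.5 * mean
              for _ in range(trials))
    return bad / trials
```

Even at this modest N the deviation event essentially never occurs, consistent with the Chebyshev bound.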
 The next step was to compute how many
candidates we have for new sums and new differences; this is counting the
number of pairs (m,n) with m < n. We exclude the diagonal case of pairs (m,m)
as there are few of these.
 The final step is showing that very few of
the pairs give the same sum or difference. This required some way to count how
many quadruples m, n, m', n' there are such that all are in A and n-m = n'-m', say. We
proceeded using binary indicator random variables again, and this time we had
to use covariance as the
variables
were dependent. A nice exercise is to prove the claim Var(U+V) <= 2 Var(U)
+ 2Var(V).
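For the exercise, one route is a two-line estimate (a sketch):

```latex
\mathrm{Var}(U+V) = \mathrm{Var}(U) + \mathrm{Var}(V) + 2\,\mathrm{Cov}(U,V)
\le \mathrm{Var}(U) + \mathrm{Var}(V) + 2\sqrt{\mathrm{Var}(U)\,\mathrm{Var}(V)}
\le 2\,\mathrm{Var}(U) + 2\,\mathrm{Var}(V),
```

where the first inequality is Cauchy-Schwarz applied to the covariance, and the second is 2ab <= a^2 + b^2 with a = sqrt(Var(U)) and b = sqrt(Var(V)).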
 For more details on these questions, see the
following papers:
When almost all sets are difference dominated
Constructing MSTD sets

We ended
today by seeing an application of
geometric
random variables to model run production in
baseball. I personally don't believe that this is the right model, but it
is mathematically tractable and leads to a nice prediction (which we'll see on
Tuesday). I think using
Weibulls is
better, as I do in
the following paper.

Finally, as
an aside I mentioned
fast primality testing. A
deterministic, fast
primality test was developed a few years ago by a computer scientist and his
two undergraduates; this is one of the only examples I know of low-hanging fruit
being missed for so long. See the references at the end of the link above for
more information.
The original paper is available here; I believe
this link is to the version published in the Annals. If anyone wants to
know some interesting stories about the paper, its publication and its impact,
let me know.
 Tuesday, December 1. The theme of
today and Thursday's lecture is going to be
approximation
theory. The goal is to replace complex expressions with simpler ones which
are readily evaluated; in order to have a result and not just a heuristic,
though, we must be able to control the error terms. You've seen examples along
these lines before; we use
Taylor series to
replace complicated functions with simple polynomials (usually constant,
linear or quadratic) (a special version of Taylor's theorem is the
Mean Value Theorem).
This only works, of course, if we can control the error. In our analysis
today, there were several places where we said the terms were so small, even when
summed, that they could be ignored. In proving the
Modulo 1 Central Limit Theorem (this is a link to the paper), a
key role was played by
Poisson
Summation, which allowed us to replace a slowly converging sum with a
rapidly converging one. We went from summands of size exp(-πn^{2}/N)
to summands of size exp(-πNn^{2}).
Note that the latter summands are quite small once n is not zero, and lead
to an error that can be dominated by a geometric series.
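Numerically, the two sides of Poisson Summation for the Gaussian (the Jacobi theta relation, Sum_n exp(-πn²/N) = sqrt(N) Sum_n exp(-πNn²)) can be compared directly; note how few terms the right side needs:

```python
import math

def slow_sum(N, terms=2000):
    """Left side: sum of exp(-pi n^2 / N); needs on the order of
    sqrt(N) terms before the summands die off."""
    return sum(math.exp(-math.pi * n * n / N) for n in range(-terms, terms + 1))

def fast_sum(N, terms=10):
    """Right side after Poisson Summation: sqrt(N) times the sum of
    exp(-pi N n^2); every term with n != 0 is already negligible."""
    return math.sqrt(N) * sum(math.exp(-math.pi * N * n * n)
                              for n in range(-terms, terms + 1))
```

The two agree to machine precision while the right-hand sum is dominated by its n = 0 term.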
 The details of the error estimates can be
found in two places. See Chapter 9 of my book
(An Invitation to Modern Number Theory) (it's page 36 of the handout,
which is page 232) for the calculation of the probability that X > σ^{1+δ}
when X ~ N(0, σ^{2}).
Chebyshev's inequality says that since this is σ^{δ}
standard deviations away from the mean, the probability is at most 1/σ^{2δ}.
The actual probability is significantly smaller. It isn't surprising that the
probability is so much smaller than what Chebyshev gives; the normal has
extremely rapid decay, while Chebyshev must hold for any distribution with finite variance.
In our proof, we had two changes of variables. The first was to let u = x/σ.
This converted the problem to finding the area under the standard normal to
the right of σ^{δ}. The second was to let w = x - σ^{δ}, or x =
w + σ^{δ}. This allowed us to exploit the fact that we are integrating
over large x. We could have used the
Cauchy-Schwarz inequality to do a little better, but we already have a
good estimate which suffices for many applications.
 Notice that in the argument we used Taylor's
theorem to replace the complicated exp(-π(x+n)^{2}/N) with the simpler
exp(-πn^{2}/N); we then showed the error term had a
minuscule contribution, and then used Poisson Summation to finish the
argument.
 The additive number theory topic is a
fascinating, accessible subject. I particularly enjoy the fact that there are
two different heuristics one can use to try to decide if there should be more
sum-dominated or difference-dominated sets. One argument is that x+x and y+y
give different sums but x-x and y-y both give 0; this suggests that
sets should be sum-dominated. On the other hand, addition is commutative and
subtraction is not, so x+y = y+x while x-y and y-x are distinct. The
question becomes: for a randomly chosen A, are we more likely to be affected by diagonal
terms like x+x (there are n choose 1, or n, of these if A has n elements) or
non-diagonal terms such as x-y (there are n choose 2, or about n^{2}/2,
of these); clearly it is the latter that should win. I will discuss an open
problem related to difference sets and constructing explicit examples of
sum-dominated sets on Thursday.
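The heuristic is easy to probe by simulation; here is a sketch that draws random subsets of {1,...,N} with each element kept with fixed probability 1/2 (unlike the decaying p(N) of the binomial model above) and compares |A+A| with |A-A|:

```python
import random

def sumset_vs_diffset(N=50, p=0.5, trials=500, seed=2):
    """Fraction of random subsets A of {1,...,N} with |A+A| > |A-A|."""
    rng = random.Random(seed)
    sum_dominant = 0
    for _ in range(trials):
        A = [k for k in range(1, N + 1) if rng.random() < p]
        sums = {a + b for a in A for b in A}   # the sumset A + A
        diffs = {a - b for a in A for b in A}  # the difference set, both signs
        sum_dominant += len(sums) > len(diffs)
    return sum_dominant / trials
```

Sum-dominated sets do exist, but they are rare: for most draws the difference set is at least as large.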

My paper with Hegarty explores other models for these questions, where the
probability of choosing a k in {1,...,N} is independent of k but depends on N.
Depending on how fast the probability decays with N, we see different
behavior, and there is a critical threshold (or perhaps a
phase transition
is a better phrasing) where fascinating behavior happens (the
wikipedia article has several examples of these).
 Phase transitions are frequently hard to
study, but they are where the action is and are extremely important. Examples
range from population dynamics to the solid-liquid-gas charts we grew up on to
the birth of the
large
component in graph theory (the paper linked here is one of the most
important in the field; see also
this paper by Erdos and
Renyi). If you are interested in seeing wonderful applications of
probabilistic methods, read or skim these papers! If you want to write
a paper with me on this, you can have an
Erdos number of 4
(which should be lowerable to 3 when I get a moment to finish a project with a
senior colleague).

Erdos numbers are lots
of fun to compute (MathSciNet
(choose collaboration distance under free tools) will do this), and lead to
fascinating questions about how to search complex spaces for answers. It's
similar to the Kevin Bacon
number (both are based on the
small world
phenomenon /
six degrees
of separation) (there's also the
Erdos-Bacon
number; very few people have this number finite).
An interesting paper is here; you can play the Kevin Bacon game at the
Oracle of Bacon.
 Tuesday, November 24. Today was a
payoff day. After developing a lot of the general theory of probability, we
were able to use it to solve and analyze problems of practical import,
specifically, Benford's
law of digit bias.
 Several good papers:
Hill's The first digit phenomenon;
Nigrini's I've got your number.
 We saw that small data sets can be misleading. For example, there were
fewer 9s than predicted for the first 60 terms in the sequence {2^n}, but we
saw that this was due to the fact that 2^10 is approximately 10^3, and thus
the set {leading digit of 2^n base 10} is almost, but not quite, periodic with
period 10. We saw similar almost-periodic behavior in powers of
π, due to the fact that π^{175} is almost a power of 10. The
convergence to Benford's law is controlled by how well approximated an
irrational number is by rationals; this is a fascinating topic, and worthy of
further study and thought. We measure how well approximated irrationals are by
rationals by seeing how large of a denominator we need to get a given order of
accuracy. This leads to
irrationality exponents or measures; in fact, this idea is used to prove
that Liouville
numbers are
transcendental numbers. If you would like to know more about these, let me
know and I'll provide Chapter 5 of my book.
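The leading digits of 2^n are easy to tabulate with exact integer arithmetic and compare against the Benford probabilities log10(1 + 1/d):

```python
import math

def benford_freqs(count=5000):
    """Empirical frequencies of the leading digits of 2^1, ..., 2^count,
    computed with exact integer arithmetic."""
    tallies = [0] * 10
    value = 1
    for _ in range(count):
        value *= 2
        tallies[int(str(value)[0])] += 1
    return [tallies[d] / count for d in range(1, 10)]

# Benford's law predicts leading digit d with probability log10(1 + 1/d).
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
```

By 5000 terms the small-sample wobbles (such as the deficit of 9s among the first 60 terms) have washed out, and the empirical frequencies sit close to the Benford values.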
 The key ingredient in proving many systems
are Benford is to show that if x_n is the original data set, then y_n = log_10
x_n is
equidistributed modulo 1. How do we prove this? If x_n = a^n for some
fixed a, then y_n = n log_10 a. A
theorem of
Kronecker (generalized by Weyl) states that n alpha mod 1 is
equidistributed if and only if alpha is irrational (in addition to the
analysis and number theory proofs, there is also an
ergodic proof). For
some problems, it isn't enough to know that it becomes equidistributed, but we
also need to know how rapidly it becomes equidistributed; in many instances
this is answered by the theory of
linear forms
of logarithms. This is frequently related to how well certain irrationals
are approximated by rationals. In my paper with Alex Kontorovich on the
3x+1 problem, the key
step in proving Benford behavior was showing that log_10 2 had finite
irrationality exponent (we bounded it by about 10^{602}, a very large
but also a very finite number!).
 To determine if the observed data is well
described by our prediction, it is common to use a
chi-square test (click
here for a nice online chi-square calculator). There is a lot of beautiful
theory on such tests; my favorite involves structural zeros (what happens when
certain events cannot be observed, such as a tie in a non-Selig sanctioned
baseball game). If you are interested, let me know and I can send you some
papers which discuss the theory;
it is briefly mentioned in my baseball paper.
 The proof of
denseness of n alpha mod
1 for alpha irrational is significantly easier than equidistribution,
involving
Dirichlet's Pigeonhole Principle (the proof is sketched in the
accompanying slides for today).
 We showed
linear recurrence
relations are Benford (or we mostly showed this) so long as the largest
root of the characteristic polynomial exceeds 1. A nice exercise is to do this
calculation rigorously; this is done in Chapter
9 of my book.
 For more on the hydrology data and Benford's
law, see my
paper with Mark Nigrini (and see the references there for Mark Nigrini's
papers on tax fraud). Our newest paper with a
new Benford test just appeared (the
mathematics is proved in a separate paper, available here).
 Finally, we ended with a discussion of what
the Central Limit Theorem modulo 1 looks like.
I prove this in detail in this paper. We will discuss the proof of
Poisson
Summation on Tuesday, but will not prove it. (If you want to see a proof,
let me know and I'll give you the relevant sections from my book on Fourier
analysis). The proof we'll give of the CLT modulo 1 is not the most general
result possible, as we will assume the Y_i's have finite variances;
this is not needed, as is shown in our paper! The proof is a bit harder
(not surprisingly), but our friend the Cauchy distribution is not forbidden!
 There are other generalizations of the
central limit theorem. One particularly nice version involves
Haar measure. Consider
the set of N x N
unitary matrices U(N), or its subgroups the
orthogonal matrices
and the symplectic
matrices. It turns out there is a way to define a probability measure on
these spaces (this is the Haar measure), and there are generalizations of the
central limit theorem in these contexts: the n-fold convolution of a regular
probability measure on a compact Hausdorff group G converges to normalized
Haar measure in the weak-star topology if and only if the support of the
distribution is not contained in a coset of a proper normal closed subgroup of G.
 For convenience, the following is a
collection of the papers I've written on Benford's law. As you can tell, I
love the subject. There are many problems that are very amenable to
undergraduate investigations; if you want to try your hand at research, let me
know.
 Benford's law, values of Lfunctions and the 3x+1 problem (with Alex
Kontorovich),
Acta
Arithmetica. (120 (2005), no. 3, 269–297).
pdf.
 Benford's Law applied to hydrology data: results and relevance to other
geophysical data (with Mark Nigrini),
Mathematical Geology (39 (2007), no. 5, 469-490).
pdf
 The Modulo 1 Central Limit Theorem and Benford's Law for Products (with
Mark Nigrini),
International
Journal of Algebra. (2 (2008), no. 3, 119-130).
pdf
 Order statistics and Benford's law (with Mark Nigrini),
International Journal of Mathematics and Mathematical Sciences (Volume
2008 (2008), Article ID 382948, 19 pages, doi:10.1155/2008/382948)
pdf
 Chains of distributions, hierarchical Bayesian models and Benford's Law
(with D. Jang, J. U. Kang, A. Kruckman and J. Kudo),
Journal of Algebra, Number Theory: Advances and Applications. (volume 1,
number 1 (March 2009), 37-60)
pdf
 Data diagnostics using second order tests of Benford's Law (with Mark
Nigrini),
Auditing: A Journal of Practice and Theory. (28 (2009), no. 2,
305-324. doi: 10.2308/aud.2009.28.2.305)
MSWord file
 Thursday, November 19.
All good things
must come to an end, and today ends our proofs of the standard Central Limit
Theorem. One can generalize it further by weakening the assumptions (we can
allow the random variables to have different distributions, though
independence is clearly important, as we do not expect X + X + ... + X to
converge to a normal distribution in general). We will discuss another variant
of the Central Limit Theorem when we study Benford's law later. Our previous
proofs involved either directly working with the moment generating function
(if it had a nice closed form expression) or Taylor expanding the moment
generating function. Unfortunately the moment generating function need not
always exist, which is why it is advantageous to use the Fourier transform
approach. In the literature the
Fourier transform
of a probability density is called the
characteristic function of the density, and always exists. If M_{X}(t)
= E[e^{tX}] is the moment generating function and
φ_{X}(t) is the characteristic function,
then φ_{X}(t) = M_{X}(2πit),
so the two are related.
 We started out by reviewing why the
convolution of two
densities is the density of the sum of the corresponding random variables.
This property is the reason convolutions play such an important role in the
theory. The Fourier
transform of a convolution is the product of the Fourier transforms. This
converts a very difficult integral into the product of two Fourier transforms,
and frequently these integrals can be evaluated. The difficulty is that, at
the end of the day, we must then invert, and to prove the
Fourier
Inversion Theorem is no trivial task. Proving our error estimates for the
integrals that converge to the convolution involved either
Taylor's theorem with remainder or the
Mean Value Theorem.
 Additional nice and useful properties of the
Fourier transform: the
derivative of the Fourier transform is the Fourier transform of the original
function multiplied by -2πix; this is very useful in solving
differential
equations. In particular, if p is our density and FT[p](y) is the Fourier
transform at y, then FT[p]'(0) = -2πi E[X] and FT[p]''(0) = -4π^{2} E[X^{2}]. One
formulation of quantum mechanics replaces position and momentum with
differential operators; in this interpretation, the famous
uncertainty principle is just a statement about a function and its Fourier
transform! (See
here for the physics explanation of the uncertainty principle.) Note the
Taylor series expansion of FT[p] near the origin depends on the mean and the
variance; if we normalize those appropriately, the `shape' of the distribution
is not seen until we get to the third order term in the expansion. The absence
of these shape parameters in the linear and quadratic terms of the Taylor
expansion is what is responsible for the universality.
 It is worth emphasizing that, yet again, we
needed to interchange an integration and a differentiation;
click here for conditions on when this is permissible.
 We reduced the problem to understanding (
FT[p](y / sqrt(N)) )^{N}; from one point of view it should be close to
1 (as we are evaluating at almost 0, and FT[p](0) = 1), and from another point
it should be large (as we are raising it to the Nth power). We Taylor expanded
FT[p] and used the
compound interest definition of exp(x).
 The proof was completed by showing that the
result was the Fourier transform of the standard normal. It would be nice to
see if this can be done by integrating by parts. One way to compute it is to
note it equals Int_{-∞ to ∞} (1/sqrt(2π)) Exp(-t^{2}/2) Exp(-2πity) dt. As
Exp(-t^{2}/2) is even and Exp(-2πity) = cos(2πty) - i sin(2πty), only
the integral against the cosine piece contributes. We can compute the
contribution by Taylor expanding cos(2πty) and doing some algebra, using in
particular the definitions of the
factorial and
double factorial. There is a slicker proof that avoids the algebra by
appealing to complex analysis. We know the moment generating function of the
standard normal is M_{X}(t) = E[e^{tX}] = exp(t^{2}/2).
But φ_{X}(t) = M_{X}(2πit);
as the moment generating function agrees with exp(t^{2}/2) for
real t, the functions must agree for all values by results from complex
analysis. Plugging in 2πit, we get M_{X}(2πit)
= exp(-2π^{2}t^{2}) as claimed.
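One can also check the claim numerically: the integral above, reduced to its cosine piece, should match exp(-2π²y²). A midpoint-rule sketch (the grid width and step count here are ad hoc choices of mine):

```python
import math

def normal_char_fn(y, half_width=12.0, steps=200_000):
    """Midpoint rule for Int (1/sqrt(2 pi)) exp(-t^2/2) cos(2 pi t y) dt
    over [-half_width, half_width]; the sine piece vanishes by symmetry,
    and the tails beyond |t| = 12 are utterly negligible."""
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        t = -half_width + (k + 0.5) * h
        total += math.exp(-t * t / 2) * math.cos(2 * math.pi * t * y)
    return total * h / math.sqrt(2 * math.pi)
```

At y = 0 this recovers the total mass 1, and at other y it agrees with exp(-2π²y²) to high accuracy.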
 We ended the day by starting to discuss
Benford's law. We'll talk about this in far greater detail on Tuesday;
however, see
the paper by Mark Nigrini.
 Finally, in the previous class we mentioned
the harmonic sum 1 +
1/2 + 1/3 + 1/4 + .... There are lots of proofs that it diverges; one groups
the terms as 1 + 1/2 + (1/3 + 1/4) + (1/5 + ... + 1/8) + .... With a little work we see each
quantity in parentheses is at least 1/2, and so the sum diverges.
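The grouping argument is easy to verify with exact rational arithmetic: the block of terms from 1/(2^j + 1) through 1/2^{j+1} has 2^j terms, each at least 1/2^{j+1}, so it sums to at least 1/2:

```python
from fractions import Fraction

def dyadic_block_sum(j):
    """Exact sum of 1/k for k from 2^j + 1 up to 2^(j+1)."""
    return sum(Fraction(1, k) for k in range(2 ** j + 1, 2 ** (j + 1) + 1))
```

Since there are infinitely many such blocks, the partial sums of the harmonic series exceed any bound.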
 Tuesday, November 17. We finally gave
a proof of the Central Limit Theorem! Our initial proof was for the special
situation of sums of independent
Poisson random
variables (click here for a handout with
the details of this calculation). The proof technique there used many ingredients in typical
analysis proofs. Specifically, we Taylor expand, use common functions, and
somehow argue that the higher order terms do not matter in the limit with
respect to the main term (though they crucially affect the rate of
convergence).
 The Central
Limit Theorem has a rich history and numerous applications. What makes it
so powerful and applicable is that the assumptions are fairly weak:
essentially finite mean, finite variance, and some control on the higher
moments. The natural question is: what exactly do we mean by convergence? There
are several different notions.
 These types of convergence are explained in detail in Chapter 7 of our
book, especially section 7.2. Almost sure convergence and convergence in the
rth mean imply convergence in probability which implies weak convergence. The
Borel-Cantelli
problem from Chapter 1 is quite useful in proving almost sure convergence. For
us, we are just showing that the moment generating function converges to the
moment generating function of the standard normal, with the rate of
convergence depending on the third moment (or fourth moment if the third
moment vanishes; note the fourth moment is never zero). As many distributions
have zero third moment, the fourth moment frequently controls the speed. This
is why instead of looking at the
kurtosis (fourth moment)
we often look at the excess kurtosis, which is the kurtosis of our random
variable minus the kurtosis of the standard normal. This is because it is this
difference that frequently controls the speed of convergence.
 A classic result about how rapidly we have convergence to the standard
normal is the
BerryEsseen Theorem.
 Taylor series
played a key role in our proofs; the idea is that we can locally replace a
complicated function by a simpler function, so long as we can control the
error estimates.
 We discussed the probabilities of the standard normal taking on values in
certain ranges (or outside these ranges).
There are many
different conventions used;
click here
for one such table.
 Another key ingredient in our proof was the
exponential
function, in particular its series expansion.
 We also summified our expression by using the identity P = exp(log P);
this is very useful whenever P is a product as logarithms convert products to
sums. This is a great way to do nothing! We saw how well this worked to
understand quantities such as P = lim_{N -> ∞} (1 + x/N^{2})^{N}. We took the
logarithm, log P_N = N log(1 + x/N^{2}); we then
Taylor expanded
the logarithm and found log P_N = x/N + terms of
size N^{-2}, N^{-3}, .... Exponentiating gives us P_N
= exp(x/N) exp(terms of size N^{-2}, N^{-3}, ...), and
we thus obtain information on the speed of convergence.
 The proof for the Poisson random variable was very similar to the proof
for arbitrary random variables whose
moment
generating functions exist in a neighborhood of t = 0. The difference, of
course, is that while we always want to summify, it is particularly simple for
the Poisson case as its moment generating function is a double exponential,
specifically exp(λ(exp(t) - 1)). This is a
particularly nice function to take a logarithm of, and in fact this is why I
always do this example.
 It is worth thinking about why we (I) made a
mistake in class about the variance of the Poisson. The mean and the standard
deviation are supposed to be in the same units, so if the mean is λ then
shouldn't the standard deviation be λ, because if the variance were λ then the
standard deviation would be λ^{1/2} and that would have the wrong
units, right? Wrong. For an exponential with density f(x) = λ exp(-λx) the
mean and standard deviation are both 1/λ, and we can see that this is the
correct λ dependence by scaling: we exponentiate -λx, so λx must be
unitless; if x is in meters, say, then λ is in 1/meters, and thus this is the
correct λ dependence for the mean and standard deviation. What goes wrong for
the Poisson? Remember the density there is f(n) = λ^{n} e^{-λ} / n!;
here λ stands alone in the exponential and is thus unitless! This
means we can't use unit analysis to say that the standard deviation and
the mean have the same λ dependence.
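The unit-analysis claim for the exponential is easy to confirm by sampling; both the sample mean and sample standard deviation should come out near 1/λ:

```python
import random
import statistics

def exp_mean_std(lam, n=100_000, seed=3):
    """Sample mean and standard deviation of an Exponential(lam);
    both should be close to 1/lam."""
    rng = random.Random(seed)
    xs = [rng.expovariate(lam) for _ in range(n)]
    return statistics.mean(xs), statistics.stdev(xs)
```

With λ = 2, both statistics land near 0.5, confirming that mean and standard deviation share the same λ dependence.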
 One can prove the CLT directly in the case of
Bin(N, 1/2). As a
binomial
random variable is the sum of
Bernoulli random
variables, we see that Bin(N,1/2) should become normally distributed as N
tends to infinity. This can be proved directly, and uses
Stirling's formula
to estimate the
binomial coefficients.
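Rather than carrying out the Stirling computation, one can compare the exact Bin(N, 1/2) probabilities with the matching normal density (mean N/2, variance N/4) numerically; a sketch:

```python
import math

def binom_pmf(N, k):
    """Exact P(Bin(N, 1/2) = k), via exact integer arithmetic."""
    return math.comb(N, k) / 2 ** N

def normal_approx(N, k):
    """Normal density with the matching mean N/2 and variance N/4."""
    mu, var = N / 2, N / 4
    return math.exp(-(k - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

Already at N = 1000 the two agree to within a fraction of a percent near the center of the distribution.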
 Thursday, November 12. Today we
finally applied our results from complex analysis to analyze the moment
problem, namely how many moments must two distributions share to force them to
be the same? We've already seen an example of two distinct densities that have
the same integral moments, so more is needed. In fact, those two densities
agree for all half-integral moments as well. One answer turns out to involve
accumulation points; namely, if our densities are sufficiently nice then if
they agree for a sequence of moments that accumulates, then the densities are
equal. The proof uses our accumulation theorem from complex analysis, and the
fact that there is a unique inverse Fourier transform of a Schwartz function.
 Looking at the two densities with the same integral moments, we find they
also have the same half-integral moments, but that's where the agreement ends.
 In general this is called
The Moment Problem;
there are lots of variants. One of my favorites, possibly due to the name, is
the Hamburger
Moment Problem, which asks us when is a given sequence of numbers the
integral moments of a probability density.
 A key step in our proof was that there is a unique inverse Fourier
transform of a Schwartz function. This is similar to the following: if we
consider the map f(x) = x^{2} defined on the real numbers, then there
are two x's that are mapped to 1, and hence there is no inverse. If instead,
however, we restrict the map to be just on the interval [0, ∞) then there is a
unique inverse. Restricting our functions to be Schwartz is similar to this.
 Another key step was interchanging differentiation and integration. It is
very important to check to make sure we can do this interchange; it is
frequently referred to as
differentiating under the integral sign. While these theorems are stated
for derivatives with respect to real variables, we can modify these to hold
for differentiating with respect to a complex variable z by using the
CauchyRiemann
equations (the derivative with respect to z is related to a linear
combination of derivatives with respect to x and with respect to y).
 Another key step was seeing that x^{z} log(x) h(x) was integrable;
the difficulty is that log(x) tends to negative infinity as x tends to zero;
fortunately the presence of the x^{z} factor saves the day, as x to
any positive power decays faster to zero than log(x) grows to minus infinity
(as x tends to 0). One way to see this is to let y = 1/x and use
L'Hopital's rule.
 We next talked about standardizing a random variable, sending X to (X -
E[X]) / StDev(X). This allows us to compare apples with apples. Note of course
not all random variables can be standardized; the Cauchy distribution for
instance cannot. We only compute tables of the standard normal; by
standardizing we can deduce the probabilities of any normal random variable
from a table of probabilities of the standard normal. This is similar to the
change of
basis formula for logarithms. Knowing log_{b}(x) = log_{c}(x)
/ log_{c}(b), if we know
logarithms base c we then
know them base b, and thus it suffices to create just one table of logarithms.
 To prove the average (X1 + ... + X_{N}) / N of iidrv with finite
mean and variance converges to the random variable's mean is not too bad; one
can do this by applying
Chebyshev's
Theorem. If, however, we want to know the rate of convergence, we
need more than Chebyshev; this is the content of the Central Limit Theorem. We
saw some numerics today for the rates of convergence of standardized
uniforms, Laplaces (two-sided exponentials), normals and Millered Cauchys.
We'll discuss rates of convergence in detail later, and we'll see that they
are controlled by the third moment (or the fourth moment if the third
vanishes). The third moment is called
skewness, the fourth is
called kurtosis. Actually,
when the third moment vanishes it is
excess kurtosis
that's more useful; we'll see more on this when we look at the Taylor series
expansion of the logarithm of the moment generating function.
 We ended today by computing the
moment generating function of the standard normal, seeing that it is exp(t^{2}/2).
The key step in the proof is
completing the
square (there are lots of nice examples on the Wikipedia entry). It takes
awhile to see how to simplify algebra / how to write algebra in a good way.
When we have something like −x^{2}/2 + xt and we know we want the
argument of the exponential to be negative, it is natural to write it as
−(1/2)(x^{2} − 2tx), and this is screaming at us to add 0 via t^{2}
− t^{2}.
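One can spot-check the completing-the-square computation numerically; here is a minimal sketch of mine that approximates E[e^{tX}] for standard normal X by a midpoint Riemann sum (the cutoffs and grid size are arbitrary choices) and compares with exp(t^{2}/2):

```python
import math

def mgf_std_normal(t, lo=-12.0, hi=12.0, n=100000):
    """Midpoint-rule approximation of E[e^{tX}] for X standard normal,
    i.e. the integral of e^{tx} exp(-x^2/2)/sqrt(2*pi)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += math.exp(t * x) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

for t in (0.5, 1.0, 2.0):
    print(t, mgf_std_normal(t), math.exp(t * t / 2.0))
```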
 Tuesday, November 10. Today we
continued our quick tour of complex analysis, and the results we stated today
will be used on Thursday to get a better sense of why we can have the
ridiculous situation of two probability distributions being unequal yet having
the same integral moments.
 We stated one of the truly amazing results from complex analysis, namely
that if the zeros of a complex function defined on an open set U have an
accumulation point
in U, then the function is identically zero on U. This is profoundly different
than real analysis. For example, we saw that the function x^{3}
sin(1/x) is differentiable as a function of a real variable and vanishes at 0
and all points 1/(πn) for n a nonzero integer;
however, this function is not complex differentiable.
 We tried to compute the complex derivative of z^{3} sin(1/z), but
saw that it was not differentiable as the limit depended on how we approached
the origin. In general, it is very hard to show a limit exists without getting
something nice like h^{4}/h, as we have to investigate all
possible paths; however, it frequently isn't too bad to show a limit doesn't
exist by taking two cleverly chosen paths. It is a very strong condition to
assume a function is complex differentiable; this is why, unlike real
analysis, the existence of one complex derivative implies that the function is
infinitely differentiable and equals its Taylor series.
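The path dependence is easy to see numerically; the following sketch of mine evaluates the difference quotient of z^{3} sin(1/z) at 0 along the real and imaginary axes (the sample points are arbitrary, kept large enough on the imaginary axis to avoid overflow):

```python
import cmath

def g(h):
    """Difference quotient (f(h) - f(0))/h for f(z) = z^3 sin(1/z), f(0) = 0."""
    return h * h * cmath.sin(1.0 / h)

# Along the real axis the quotient tends to 0 ...
for t in (1e-1, 1e-2, 1e-3):
    print("real axis:", abs(g(t)))

# ... but along the imaginary axis sin(1/h) grows like e^{1/|h|}/2,
# so the quotient blows up: no complex derivative exists at 0.
for t in (2e-1, 1e-1):
    print("imaginary axis:", abs(g(1j * t)))
```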
 We briefly discussed again the
3x+1 problem (see Lagarias' bibliographies on the subject,
part 1
and part 2, for a summary of
much of what is known). My paper (with Alex Kontorovich) connecting the 3x+1
problem to Benford's law
is available here.
 We discussed two of the most important integral transforms, the
Laplace Transform
and the Fourier
Transform; these two transforms are related to each other and to another
one, the Mellin
transform (we've seen the Mellin transform when studying the
Gamma function, as
the Gamma function is the Mellin transform of the exponential function). These
are all integral
transforms, which are frequently used to solve a variety of problems. The
ones we are studying have the wonderful property that they can be expressed as
integrating against a fixed function (called the
kernel); for
many important applications this is true, but not always (see
Picard's iteration method to solve first order differential equations).
Each of these transforms has its advantages and disadvantages; depending on
the problem you are studying, some make the algebra easier and some make it
harder. Note it is not always the case that the transform exists; for example,
the moment generating function of X is E[e^{tX}] =
∫ e^{tx} f(x) dx, which does not make
sense in a neighborhood of the origin for a
Cauchy random
variable (we have many wonderful proofs allowing us to pass from knowledge
of moment generating functions to knowledge of the density when the moment
generating function converges in a neighborhood of the origin). The Fourier
transform of a probability distribution, however, always exists for all
values; this is called the
characteristic function, and as it always exists, one can see why this
would be of use and interest. In general it isn't too bad to compute these
integral transforms, but it is hard to invert them. Frequently we must
restrict the space of functions we're studying in order to have a nice
inversion statement. One space often studied is the
Schwartz space. This
leads to a nice formula for the
Inverse
Fourier Transform.
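A quick numerical illustration (my own sketch, using a crude midpoint rule; cutoffs and grid sizes are arbitrary) of why the characteristic function is preferred for the Cauchy: the truncated MGF integral blows up as the cutoff grows, while the characteristic-function integral settles down to e^{−|t|}.

```python
import math

def cauchy_integral(f, L, n=100000):
    """Midpoint-rule approximation of the integral of f(x)/(pi(1+x^2))
    over [-L, L], i.e. E[f(X)] truncated at L for X Cauchy."""
    h = 2.0 * L / n
    s = 0.0
    for i in range(n):
        x = -L + (i + 0.5) * h
        s += f(x) / (math.pi * (1.0 + x * x))
    return s * h

# Would-be MGF at t = 1: the truncated integral grows without bound.
for L in (5.0, 10.0, 15.0):
    print("MGF cutoff", L, cauchy_integral(math.exp, L))

# Characteristic function at t = 1: exists and equals e^{-1}.
print(cauchy_integral(math.cos, 200.0, 400000), math.exp(-1.0))
```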
 When talking about the difficulty of
inverting a transform, we briefly mentioned how a similar situation is
beautifully exploited in
cryptography. Many
cryptosystems are based on a
trapdoor algorithm,
namely taking some process that is easy one way but hard to invert unless you
know a key or trapdoor or some extra bit of information not publicly
available. The standard, but by no means only, example is that it is easy
to multiply two numbers, but currently it is hard to
factor numbers. Many
of these cryptosystems use just elementary math to state how they work, but
very advanced math to discuss their security. Two of my favorites are
RSA and
elliptic
curve systems. See also the homepage for my winter study on cryptography:
Math 10: LQWURGXFWLRQ WR FUBSWRJUDSKB.
 One can actually multiply two numbers, or two
matrices, much faster than you'd expect. Below is a summary of some
very efficient algorithms, which allow us to do some basic operations much
faster than you might expect.
 Thursday, November 5. In today's
lecture we developed some more of the theory of generating functions, seeing
the connections with probability. This is a very rich and powerful theory, and
what we've seen is only some of its tremendous applications.
 We proved that G_{X+Y}(s) = G_{X}(s) G_{Y}(s) and
M_{X+Y}(s) = M_{X}(s) M_{Y}(s), as well as additional
properties, such as a formula for G_{aX+b}(s). These proofs have much
in common with Calc I and Calc II. Namely, we spend a lot of time doing the
algebra to prove G_{X+Y}(s) = G_{X}(s) G_{Y}(s) once;
the advantage is that once we have done it, we can simply use the result in
later problems. For example, if asked to differentiate x cos(x) we don't write
down the definition of the derivative, but rather we use the product rule. The
reason is that it is advantageous to do the calculation once in general, get
the result, and then in the future jump directly to that point for the
function of interest. It is similar for moment generating functions; we spend
the time now doing the calculations so we can just apply these results later.
 Earlier we showed by brute force that the sum of two independent Poissons
is a Poisson with parameter equal to the sum of the parameters. We can now
provide an alternative, shorter proof with moment generating functions, as the
moment generating function of a discrete random variable taking on values in
{0, 1, 2, ...} is unique. The reason the algebra is so much simpler in using
the MGF is that we did the hard work in proving M_{X+Y}(s) = M_{X}(s)
M_{Y}(s), and are now just reaping the rewards.
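For comparison, here is the brute-force check as a sketch of mine (the parameters 2.0 and 3.5 are arbitrary) that the discrete convolution of two Poisson mass functions is again Poisson:

```python
import math

def poisson_pmf(lam, k):
    """Prob(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2 = 2.0, 3.5
# P(X+Y = n) by conditioning on X = j (a discrete convolution) ...
for n in range(8):
    conv = sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, n - j)
               for j in range(n + 1))
    # ... matches the Poisson(lam1 + lam2) mass at n (binomial theorem).
    print(n, conv, poisson_pmf(lam1 + lam2, n))
```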
 We gave an example of two densities that have the same moments but are not
equal; this is the analogue of the pathological function from real analysis. A
really good extra credit problem is to compute their integral moments (i.e.,
their kth moments for positive integer k) and see that these agree. Do you
think any of the nonintegral moments agree?
 We then introduced much of the terminology in complex analysis, including a
complex variable,
complex
differentiability (which implies that our function satisfies the
Cauchy-Riemann
equations), open sets
and closed sets, and
the
major theorem that f is a holomorphic function if and only if f is analytic
(in other words, if a function has even one complex derivative then it has
infinitely many and it equals its Taylor series expansion!). Much of this
language (such as open and closed sets) is required for advanced discussions
in analysis
and topology. One of my
favorite applications of all of this is
Furstenberg's celebrated proof of the infinitude of primes through a
topological argument (ie, through open and closed sets!).
 We ended by looking at a plot of x^{3} sin(1/x). When I traced out
the top part of the plot and asked what its shape looked like, many in the
class responded that it looked like a parabola. This is a terrific example of
how the way a question is framed influences our answer. The correct way to
look at the plot is to look at half of the bottom and then half of the top,
and you see a cubic. We are frequently not aware of how things around us are
being framed and thus how we are being forced / guided to a given answer or
world view; it is worth stopping and thinking about this every now and then.
If you are interested in these topics, I recommend the following two videos:

Speaking of videos and being misled, you might enjoy listening to the song
I'm my own grandpa (text
is available here). It's a good exercise to work through the lyrics and
see that it is correct; frequently in math we are given theorems where if a
condition is removed one of two things happens: (1) the result is now false;
(2) the proof is now harder. (For this example,
see the Wikipedia
page on I'm my own grandpa for an analysis). For example, today we showed
that M_{X+Y}(s) = M_{X}(s) M_{Y}(s) if X and Y
are independent random variables; it's a good exercise to show that this need
not be true if we remove the assumption that X and Y are independent.
 Occasionally, though, proofs become easier if we remove conditions,
as these conditions are getting us to look at the problem in the wrong way.
For example, look up the definition of
algebraic numbers
and
transcendental numbers. A wonderful result is that
e and
π
are both transcendental numbers. Further, we can prove that at least one of
e+π and e·π is transcendental (though we believe both are). Seeing this
result, it is natural to think that properties of e and π enter into the
proof. In fact, there is nothing special about e and π; if x and y are any two
transcendental numbers then at least one of x+y and x·y is transcendental!
Thus, even though we might think the proof involves special formulas /
properties of e and π, such as perhaps the relation exp(πi) = −1, it does
not!
 Tuesday, November 3. In today's
lecture we saw another example of divine inspiration in solving difference
equations. We then turned to sums of independent normal random variables, and
ended by discussing different types of generating functions.
 We showed earlier in the semester how to solve difference equations using
the method of divine inspiration. Today we discussed an application to a
random walk problem with two absorbing boundaries (at 0 and N), namely, if we
start at k how long do we expect to walk until we hit a boundary? The
difference equation that arises is close to, but slightly different than, the
one we encountered before for the probability of winning. This complication
sadly means our original guess of the solution does not work, nor does the
next most natural choice. For
more details, including the solution, look at the solution to Wentao's second
proposed problem (Section 57, page 49). A nice challenge problem is to
derive the solution in the special case that p = 1/2 (obviously the most
important solution, which makes it annoying that the method in class fails
there!). For more on
`guessing' how to be divinely inspired, see here (especially Section 3).
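As a sanity check on the difference-equation solution, a small Monte Carlo sketch of mine (the trial count, seed, and the choice N = 10, k = 3 are all arbitrary) recovers the closed form k(N − k) for the expected absorption time when p = 1/2:

```python
import random

def absorption_time(k, N, p=0.5, trials=50000, seed=1):
    """Monte Carlo estimate of the expected number of steps for a +1/-1
    random walk started at k (prob p of stepping +1) to hit 0 or N."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos, steps = k, 0
        while 0 < pos < N:
            pos += 1 if rng.random() < p else -1
            steps += 1
        total += steps
    return total / trials

# For p = 1/2 the closed form is k(N - k); N = 10, k = 3 gives 21.
print(absorption_time(3, 10))
```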
 There is a deep and rich theory of sums of normal random variables (and
their squares), which is described in greater detail in a statistics class.
Two items from today of special note are the definition of the sample variance
and the independence of the sample mean and sample variance.
 The sample mean is defined by X̄ = Sum_{i = 1 to N} X_{i} / N and the
sample variance by S^{2} = Sum_{i = 1 to N} (X_{i} − X̄)^{2} / (N−1). The
main theorem is that (N−1) S^{2} / σ^{2} has a chi-square
distribution with N−1 degrees of freedom. It is not immediately clear why we
divide by N−1 and not N; after all, there are N data points, and we do divide
by N for the variance of a finite set of data. There are valid statistical
reasons for this (wanting an
unbiased estimator;
I strongly urge you to read the Wikipedia entry, as there is a nice bit on the
proof, using (what else) adding zero; see also
Cochran's theorem).
I use the following heuristic to explain why it's N−1 and not N; namely,
consider the extreme case of N=1. In this case, while one observation can
be used to estimate the true mean, it is absurd to think one observation can
be used to estimate the true variance! The reason is that we need to look
at differences, at fluctuations about the mean, to get a handle on the
variance; how can we do this with just one data point?
 A major theorem is that the sample mean and sample variance are
independent. This is not at all clear from the definition (as the sample
variance involves the mean). This leads to studying the statistic t =
(X̄ − μ) / (S / sqrt(N)); this is
known as the t-statistic and has the
t-distribution with N−1
degrees of freedom (here μ is the mean of the
identically distributed normal random variables). As N tends to
infinity this converges to the standard normal, but is very useful for finite
N when we have independent Gaussian random variables with unknown variance.
 We discussed
generating functions /
moment
generating functions /
characteristic functions. These functions encode information about
problems of interest; for wonderful
applications to number theory, see the final section of the course notes
(these techniques can be applied to attack
Waring's Problem
and Goldbach's Problem,
among others). One of the biggest uses of these is that they simplify the
application of algebra, as they are significantly easier to work with. In many
cases we can find closed form expressions, and the derivatives of these are
then related to means, variances, and moments. It is typically very rare to be
able to get a nice, closed form expression of something in the real world (for
some nice examples of where this is possible, see some of my sabermetrics
papers: the Weibull approach
to winning percentages and the
log5 method (for a more
marketing / economics example,
see my paper with Eric Bradlow and Kevin Dayaratna; this paper appeared in
the journal of Quantitative Marketing and Economics, and you might notice the
cookie problem in the appendix!).
 In our analysis of generating functions, we reiterated the warning that
analysis is hard. Namely, the function f(x) = exp(−1/x^{2}) if x is
not zero and 0 otherwise has all of its derivatives vanish at 0, but its
Taylor series agrees with the original function only at x=0 (which is nothing
to be proud of!). Complex analysis is quite different; there
if a
function is complex differentiable once then it is infinitely complex
differentiable, and it equals its Taylor series in a neighborhood of the point.
This fact is one reason why we frequently use
characteristic functions instead of generating or moment generating
functions.
 Thursday, October 29. Unquestionably
one of the gems of probability and statistics is the
Central Limit
Theorem. The proof and applications involve understanding the sum of
independent random variables, often identically distributed. This leads to the
following fundamental, natural question: Given random variables Xi with
densities fi, is there a nice formula for the density of X1 + ... + Xn in
terms of f1 through fn?
 As a first case, we considered X1 + X2 with each Xi ~ Uniform(0,1). To get
a feeling for the answer, we looked at rolling two fair die and the
distribution of the resulting sums. We found Prob(R1 + R2 = k) = (6 
k6)/36 for 2 <= k <= 12 and 0 otherwise. This is a triangle, it's symmetric
about the mean, the density is largest at the mean, .... It is unlikely that
these features depend on the die having 6 sides, and thus it is reasonable to
expect X1 + X2 to be a triangle supported in [0,2] with maximum density at the
mean of 1.
 We proved this by using
convolutions and then brute force integration. Convolutions are incredibly
powerful and useful in probability, and provide a very useful way to explore
many problems. The convolution is defined by (f1 * f2)(x) = Integral_{t = −∞
to ∞} f1(t) f2(x−t) dt. If fi is the density of
Xi, this is the density of X1+X2. We proved this by using the cumulative
distribution function of Y = X1+X2 (which was a double integral) and then
differentiating. The key step was interchanging the derivative and the
integral. In general we cannot interchange orders of operations (sqrt(a+b) is
typically not sqrt(a) + sqrt(b)), but sometimes we're fortunate (click
here for a nice article on Wikipedia on when this is permissible).
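The die-rolling example can be redone as a discrete convolution; this short sketch of mine recovers the triangle (6 − |k − 7|)/36:

```python
# Discrete convolution of two fair-die mass functions: the distribution
# of R1 + R2 is the triangle (6 - |k - 7|)/36 for 2 <= k <= 12.
die = {face: 1.0 / 6.0 for face in range(1, 7)}
sums = {}
for a, pa in die.items():
    for b, pb in die.items():
        sums[a + b] = sums.get(a + b, 0.0) + pa * pb
for k in range(2, 13):
    print(k, sums[k], (6 - abs(k - 7)) / 36.0)
```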
 There is enormous structure behind
convolutions of probability distributions. Let f be the density function for
the random variable X, and g the density function for the random variable Y.
As X+Y = Y+X, we find f * g = g * f (ie, the operation is commutative), and
f * (g * h) = (f * g) * h (the operation is associative). Convolution is also
closed (if f and g are densities, so is f * g). Note this is beginning to
look like a group; namely, we have a
collection of objects (in this case, probability densities or maps from the
reals to the reals) and a way to combine them (convolution) that is closed,
associative, and even commutative. If we just had an identity element and
inverse, we would have a
group (a
commutative group, in fact). Groups occur throughout the sciences and the
world, two of my favorite are the
Rubik's cube and the
Monster group. As
there is a lot of structure in groups, it's natural to ask whether or not we
can find an identity element and inverses.
 The identity element is not hard to find. We
define the Dirac delta
functional δ(x) as follows: for any probability density f(x), Integral_{x =
−∞ to ∞} f(x) δ(x) dx = f(0). One may view δ(x) as the density
corresponding to a unit point mass located at 0; similarly we would have
Integral_{x = −∞ to ∞} f(x) δ(x−a) dx = f(a), corresponding to a unit point
mass at a. We have actually seen Dirac delta functionals before. For example,
let X be Bernoulli(p). This means Prob(X=1) = p, Prob(X=0) = 1−p and any other
x has Prob(X=x) = 0. If we let f(x) denote the probability mass function, we
have f(x) = p δ(x−1) + (1−p) δ(x). It turns out that the Dirac delta
functional (which does integrate to 1, which can be seen by taking f(x) = 1 in
Integral_{x = −∞ to ∞} f(x) δ(x−a) dx) acts as the identity. We now show f *
δ = f. We have (f * δ)(x) = Integral_{t = −∞ to ∞} f(t) δ(x−t) dt = f(x).
 Thus the only obstacle in whether or not we
have a group (with group operation given by convolution) is whether or not
there is an inverse. Is there? Perhaps there is an inverse if we restrict the
types of probability distributions we study (for example, maybe we only look
at densities defined on a compact interval).
 We introduced the
Fourier Transform
today. Be careful: there are at least three natural definitions; I prefer
f^(ξ) = Integral_{x = −∞ to ∞} f(x) e^(−2πixξ) dx. There are many great
properties of the Fourier transform; one of the most important properties is
that the Fourier
transform of a convolution is the product of the Fourier transforms, or
(f * g)^(ξ) = f^(ξ) g^(ξ). The proof required us to use Fubini's theorem to
interchange the order of integrations, and some basic facts of complex
analysis (which we'll review again below). For those familiar with group
theory, what we have looks a lot like a
group homomorphism
(we have to say a lot like as we haven't proved that there are inverses).
 The reason the Fourier transforms are so
useful is the following: imagine there is an inverse Fourier transform for
every nice function. If we want to study the sum X1 + ... + Xn, we know its
density is f1 * ... * fn; assuming the Xi are independent, identically
distributed random variables then the fi are all equal, say f. The Fourier
transform converts convolution to multiplication, and thus
(f * ... * f)^(ξ) = f^(ξ)^n. If Finv denotes the inverse Fourier transform,
then (f * ... * f)(x) = Finv(f^(ξ)^n)(x). Thus, if we can invert the Fourier
transform of f^(ξ)^n, then we have a formula for the density of the sum!
 It is not immediately clear that to
understand real functions of real variables that we need to study
complex numbers. If
i = sqrt(−1) and z = x + i y, then the complex conjugate of z is defined by x
− i y. The length of a complex number z = a + ib is defined by |z| =
sqrt((a+ib)(a−ib)) = sqrt(a^2 + b^2). Recall the
exponential
function exp is defined by e^z = exp(z) = sum_{n = 0 to ∞} z^n/n!. This
series converges for all z. The notation suggests that e^z e^w = e^(z+w); this
is true, but it needs to be proved. (What we have is an equality of three
infinite sums; the proof uses the binomial theorem.) Using the Taylor series
expansions for cosine and
sine, we find e^(iθ) = cos θ + i sin θ. From this we find |e^(iθ)| = 1; in
fact, we can use these ideas to prove all trigonometric identities! For
example:
 Inputs: e^(iθ) = cos θ + i sin θ and e^(iθ)
e^(iφ) = e^(i (θ+φ))
 Identity: from e^(iθ) e^(iφ) = e^(i (θ+φ)) we
get, upon substituting in the first identity, that (cos θ + i sin θ) (cos φ +
i sin φ) = cos(θ+φ) + i sin(θ+φ). Expanding the left hand side gives (cos θ
cos φ  sin θ sin φ) + i (sin θ cos φ + cos θ sin φ) = cos(θ+φ) + i sin(θ+φ).
Equating the real parts and the imaginary parts gives the identities
 cos(θ+φ) = cos θ cos φ  sin θ sin φ
 sin(θ+φ) = sin θ cos φ + cos θ sin φ
 One can prove other identities along these
lines....
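For instance, here is a two-line numerical check of mine of the angle-addition identities at a couple of arbitrary sample angles:

```python
import cmath
import math

# Check cos(θ+φ) and sin(θ+φ) against the real and imaginary parts of
# e^{iθ} e^{iφ} = e^{i(θ+φ)} at a few arbitrary angles.
for theta, phi in [(0.3, 1.1), (2.0, -0.7)]:
    lhs = cmath.exp(1j * theta) * cmath.exp(1j * phi)
    print(lhs.real - math.cos(theta + phi), lhs.imag - math.sin(theta + phi))
```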
 Finally, a common theme in mathematics is the
need to simplify tedious algebra. Frequently we have claims that can be proven
by long and involved computations, but these often leave us without a real
understanding of why the claim is true. If you want, let me know and I'll show
you my 40-50 page proof of
Morley's
theorem; Conway has
a beautiful proof which you can read here (it's after the irrationality of
sqrt(2)).
 Tuesday, October 27. Today's lecture
was a mix of applications of old material and a sales pitch of things to come.
 The main theme of the first part was the
Change of
Variable Formula. The key (and most difficult ingredient) is the
Jacobian, which tells us how the volume element changes. We did the
calculation in great detail for
polar coordinates,
though of course the argument holds in greater generality. One strange
application of our analysis today was a formula for the sum of two independent
random variables X_{1}, X_{2} which are Exponential(λ).
We let Y_{1} = X_{1} + X_{2} and Y_{2} = X_{1}/X_{2}
(and Y_{3} = X_{1}  X_{2}). We obtained a joint
density in each case for Y_{1} and Y_{2} or Y_{3}, and
by integrating out Y_{2} or Y_{3} we were left with the
density of Y_{1}! When we study
convolutions we'll find
better, simpler, more tractable formulas for the density of sums of random
variables, but it is fascinating to see what we get here. The general, big
picture idea that's floating around all of this is that we're transforming
functions to functions, be it through Jacobians, convolutions, or
integral transforms
(such as the Laplace
and Fourier
transforms, which we'll meet soon).
 In the proof of the onedimensional change of variable formula, one of the
key ingredients was the
Fundamental Theorem of Calculus. We needed this to find the cumulative
distribution function of Y = g(X); we then differentiated to get the density.
Thus, while at the end of the day we do not need to know F_{X}, it was
important to have it for an intermediate step of the calculations.
 Another ingredient in the proof of the onedimensional change of variable
formula was that if g(h(y)) = y then h'(y) = 1 / g'(h(y)). This is a nice
application of the
chain rule to inverse functions (as g(h(y)) = y and h(g(x)) = x, we say g
and h are inverses of each other). We used this relation to find the derivative
of the arctangent
function. When we first encounter such functions in Calc I or Calc II,
they seem unnatural, primarily chosen to provide tests of how well you have
mastered differentiation. These functions, however, do naturally arise in many
applications. My favorite examples are in determining the cumulative
distribution function (and hence the normalization constant) for a
Cauchy random
variable (which has density (π(1+x^{2}))^{−1}).
Distributions such as the Cauchy are terrific for testing how general results
are in probability and statistics;
I have a nice paper using a distribution which is a variant of the Cauchy
to show the limitations of the famous
CramerRao
inequality for determining optimal statistical tests. I've also seen
analogues arise in nuclear physics. The second occurrence of arctangent
today was in the change of variable formulas from polar to Cartesian.
 One wonderful application of the onedimensional change of variable
formula is to generating random variables given a uniform random number
generator. There is a huge industry that tries to construct random number
generators from different distributions; it becomes much harder when we have
dependent, multivariate joint densities (ie, we have several random variables
and the joint density does not factor).
Random.org is a nice website collecting various algorithms for different
types of randomness, ranging from cards to jazz to numbers (I
strongly urge you to check out this website, which generates postmodern papers
randomly; if you enjoy that,
you should also see the
most famous essay in the subject, which is by the physicist Alan Sokal:
"Transgressing the Boundaries: Toward a Transformative Hermeneutics of Quantum
Gravity"  to get an idea of how absurd it is,
go to the html file and search for "In 1982, when Irigaray's
essay").
 We talked a bit about how many shuffles are needed to randomize a deck of
cards.
The classic paper is by David Bayer and Persi Diaconis (if you cannot read it, let me know
and I'll get it for you). If you want, I can also share some illegal bridge
bidding conventions that involve encrypting your bid so that only your partner
can decode it!
 We talked a bit about the limiting behavior for sums of random variables.
A natural thing to do to any random variable is to normalize or standardize
it. Thus, instead of studying Y one should study (Y  mean(Y)) / StDev(Y)
(provided the mean and standard deviation exist). This new quantity has mean 0
and variance 1, and thus we should be able to compare it to other similar
quantities (ie, we're now comparing apples and apples, not apples and
oranges).
 We ended with a discussion of
Pepys' problem.
This is perhaps our first example leading towards the
Central Limit
Theorem. A terrific challenge problem is to prove, elementarily, that as n
tends to infinity we have a 50% chance of winning, significantly less than the
approximately 66% chance when n is 1.
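For the curious, the exact probabilities in Pepys' problem are a one-line binomial sum; this sketch of mine shows the decrease from about 66% toward 50%:

```python
from math import comb

def pepys(n):
    """Probability of at least n sixes in 6n rolls of a fair die."""
    return sum(comb(6 * n, k) * (1 / 6) ** k * (5 / 6) ** (6 * n - k)
               for k in range(n, 6 * n + 1))

# n = 1 gives about 0.665; the probabilities decrease toward 1/2 as n grows.
for n in (1, 2, 3, 10):
    print(n, pepys(n))
```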
 Thursday, October 22. The
multidimensional Cauchy-Schwartz inequality is proved in an entirely
analogous way as the onedimensional case. The key idea is that we again get a
quadratic polynomial in one unknown variable b, look at its discriminant, and
then the inequality pops out. This is just one of many useful inequalities;
another very powerful one is the
arithmetic mean - geometric mean inequality (for more proofs, see my
handout here).
 We proved the
correlation
coefficient was at most 1 in absolute value by applying the
Cauchy-Schwartz inequality. The proof technique (for us) is more important
than the result. Namely, we do not believe Integral_{−∞ to ∞} x^{2}
dx should be finite; it needs to be hit with the density of a random variable
X. The Cauchy-Schwartz inequality takes two functions, say A and B, as input and
relates the integral of AB to integrals of A^{2} and B^{2}.
What's nice about this is we can write our density f_{X,Y}(x,y) as (f_{X,Y}(x,y))^{1/2}
(f_{X,Y}(x,y))^{1/2}. We give one factor to each, and as we
square we now hit our quantities of interest against a probability density,
and therefore there is a chance that the integrals will be finite. Another
place where this technique can be used is in proving the Cramer-Rao inequality
(see
here for my proof). If you are interested in statistics,
you should read up on the
Cramer-Rao inequality; one application is that it can sometimes tell you when
you've found a least variance unbiased estimator for a given population
parameter.
I have a paper on a situation where, unfortunately, the Cramer-Rao inequality provides no
useful information, though typically in practice it does provide some
information on the system under consideration.
 The three person hat problem we discussed is one of my favorites. It has
powerful connections to
error correcting
codes. See also the
slides from M. Bernstein's talk at the SUMS conference at Brown a few years
ago. I strongly urge you to read / skim her slides; there is
terrific animation and discussion of what is going on. This is one of the
nicest applications of joint mass functions, marginals, and dependence that I
know, and the result is quite surprising. This problem is also covered in the
optional book for the course,
Impossible (Chapter 6: Buckling the Odds, page 50); if you don't have the
book and want to read it, let me know and you can borrow my copy. It is well
worth the time to carefully study and ponder this problem. Note that each
person's guess is correct 50% of the time and wrong 50% of the time, exactly
as you would predict. The interesting thing is
that we are able to congregate the wrong answers and spread out the right
ones. What's really going on here is a nice
conditional
probability. There are 8 possible outcomes for the distribution of the
hats: WWW, WWB, WBW, BWW, WBB, BWB, BBW, BBB. Each of these happens 1/8th of
the time. Let's assume we see two hats of the same color; without loss of
generality, let's say those hats are white. Is the probability of our having a
white hat equal to 1/2 (as our hat color is independent) or is it 3/4, as now
the only possibilities are WWW, WWB, WBW, BWW? It is important to note that
until we open our eyes, we don't know that we will see two hats of the
same color. This is the key observation.
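Here is a short simulation of my own of the standard strategy: guess the opposite color if the two visible hats match, otherwise pass. The team wins 3/4 of the time even though each individual guess, when made, is right only half the time overall.

```python
import random

def hat_game(trials=100000, seed=7):
    """Simulate the 3-player hat strategy: guess the opposite color if the
    two visible hats match, else pass.  Team wins if at least one guess is
    made and no guess is wrong.  Returns the winning fraction."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        hats = [rng.randint(0, 1) for _ in range(3)]
        guesses = []
        for i in range(3):
            others = [hats[j] for j in range(3) if j != i]
            if others[0] == others[1]:
                guesses.append((i, 1 - others[0]))
        if guesses and all(g == hats[i] for i, g in guesses):
            wins += 1
    return wins / trials

print(hat_game())  # close to 6/8 = 0.75
```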
 The final item for the day was a discussion of Exercise 3.3.8. The most
important part of this problem was going from infinitely many possible
strategies to a small, finite number (in this case, five!). For example, one
strategy could be take a 6 on the first toss, otherwise take a 5 or 6 on
the second, otherwise take a 3, 4, 5 or 6 on the third, and so on. It was
important to eliminate these possibilities. This is somewhat similar to what
happens in the
drowning swimmer problem (I have a
Mathematica notebook on the problem here). There are three nice additional
asides related to this.
 Tuesday, October 20. Today's lecture
was devoted to building some of the background theory and results we will need
for later in the semester. Note much of today's class uses a key result from
Calculus III, namely integrals in the plane can be evaluated by iterated
integrals. This material is standard and should be in any textbook for Calc
III. Wikipedia has a good entry on
iterated integrals.
The main idea is that we want to convert an integral over a twodimensional
region (which we can evaluate with Riemann sums and upper and lower bounds and
limits) into iterated integrals. If our function and region is nice, this can
be done. See the entry on
order
of integration for precise statements of the theorems and conditions.
Sometimes one needs to use the multidimensional change of variables formula:
see the link on
substitution of variables (click on the
entry
on Jacobians for more information about this important ingredient).
 The first item of the day was to determine the normalization constant
for normal
distributions. One of the simplest ways to compute the normalization
constant is to square the integral and convert to
polar coordinates.
The main ingredients are: the area element dx dy transforms to r dr dθ,
and the integrand is radial (it becomes exp(-r^{2}/2) r).
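If you'd like to see this numerically (my quick sketch, not something from class), a simple midpoint-rule approximation of the one-dimensional integral recovers the normalization constant sqrt(2π):

```python
import math

def gaussian_integral(a=-10.0, b=10.0, n=100_000):
    """Midpoint-rule approximation of the integral of exp(-x^2/2);
    the tails beyond |x| = 10 are negligible."""
    h = (b - a) / n
    return sum(math.exp(-((a + (i + 0.5) * h) ** 2) / 2) for i in range(n)) * h

# Should agree with sqrt(2*pi) to many decimal places.
print(gaussian_integral(), math.sqrt(2 * math.pi))
```

Squaring this value gives (approximately) 2π, which is exactly what the polar-coordinates trick computes in closed form.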
 We next considered the
Gamma function,
which generalizes the standard factorial function. We gave a proof of its
functional equation, Γ(s+1) = sΓ(s); this allows us to take the Gamma
function (initially defined only when the real part of s is positive) and
extend it to be well-defined for all s other than the non-positive integers.
For more on the Gamma function and another proof of the value of
Γ(1/2), see my (sadly
handwritten) lecture notes. This approach uses the
Beta distribution.
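The functional equation and the special value Γ(1/2) = sqrt(π) are easy to spot-check with Python's built-in gamma function (a sanity check of mine, not part of the notes):

```python
import math

# Check Gamma(s+1) = s * Gamma(s) for a few sample values of s.
for s in (0.5, 1.3, 2.0, 4.7):
    assert abs(math.gamma(s + 1) - s * math.gamma(s)) < 1e-12 * math.gamma(s + 1)

# Gamma(1/2) = sqrt(pi).
print(math.gamma(0.5), math.sqrt(math.pi))
```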
 One nice application of the Gamma function
and normalization constants is a proof of
Wallis' formula, which
says π/2 = (2·2 / 1·3) (4·4 / 3·5) (6·6 / 5·7) ···. I have a proof which is
mostly elementary (see
my article in the American Mathematical Monthly). Not surprisingly, the
proof uses one of my favorite techniques, the theory of normalization
constants (caveat: it does have one advanced ingredient from
measure theory,
namely
Lebesgue's Dominated Convergence Theorem).
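To get a feel for the (slow) convergence of Wallis' product, here is a small sketch of mine computing partial products:

```python
import math

def wallis_partial(n):
    """Partial product of (2k * 2k) / ((2k-1)(2k+1)) for k = 1..n; tends to pi/2."""
    prod = 1.0
    for k in range(1, n + 1):
        prod *= (2 * k) * (2 * k) / ((2 * k - 1) * (2 * k + 1))
    return prod

# Convergence is slow: the relative error after n factors is roughly 1/(4n).
print(wallis_partial(10), wallis_partial(100_000), math.pi / 2)
```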
 Many functions in mathematical physics
initially exist only for some values of the parameters but can be continued
elsewhere; my favorite is the
Riemann zeta
function (and the extension uses the Gamma function). What is amazing
(and not initially apparent) is that the following frequently occurs. We
have some function and we only care about its values at the real numbers (or
maybe even just the integers); nevertheless, it is often easier to study it
as a function of a complex variable (z = x +
iy), as then we
have all the tools and techniques of
complex analysis
at our disposal. A terrific example is the
Prime Number
Theorem (which says that, to first order, the number of primes at most x
is about x/log x). This is a statement about integers, yet the `easiest' and
`best' proofs all use the Riemann zeta function at complex arguments (and,
as you may reasonably ask, why should we need to use complex numbers to
count integers!). What follows is an aside on an aside; this is clearly
not needed for the course!

The complex analytic proof of the Prime
Number Theorem uses several key
facts. We need the functional equation of the Riemann zeta function (which
follows from
Poisson summation and properties of the Gamma function), the
Euler product (namely
that zeta(s) is a product over primes), and the important fact that the
Riemann zeta function does not have a zero on the line Re(s) = 1! If this
happened, then the main term of x from integrating -zeta'(s)/zeta(s) * x^s/s
arising from the pole of zeta(s) at s=1 would be cancelled by the contribution
from this zero! Thus it is essential that there be no zero of zeta(s) on Re(s)
= 1. There are many proofs of this result. My
favorite proof is based on a
wonderful trig identity: 3 + 4 cos(x) + cos(2x) = 2 (1 + cos(x))^2 >= 0 (many
people have said that w^2 >= 0 for real w is the most important inequality in
mathematics). If people are interested I'm happy to give this proof in class
next week (or see Exercise 3.2.19 in our textbook; this would make a terrific
aside if anyone is still looking for a problem). There is an elementary proof
of the prime number theorem (ie, one without complex analysis). For those
interested in history and some controversy, see
this article by Goldfeld for a terrific analysis of the history of the
discovery of the elementary proof of the prime number theorem and the priority
dispute it created in the mathematics community. We mentioned Riemann
computed zeros of zeta(s) but didn't mention his achievement; the method only
came to light about 70 years later when Siegel was looking at Riemann's
papers. Click
here for more on the Riemann-Siegel formula for computing zeros of zeta(s).
Finally, terrific advice given to all young mathematicians (and this advice
applies to many fields) is to read the greats. In particular, you should read Riemann's
original paper. In case your mathematical German is poor, you can click
here for the English translation of Riemann's paper. The key passage is on
page 4 of the paper: One now finds indeed approximately this number of real
roots within these limits, and it is very probable that all roots are real.
Certainly one would wish for a stricter proof here; I have meanwhile
temporarily put aside the search for this after some fleeting futile attempts,
as it appears unnecessary for the next objective of my investigation.
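The trig identity mentioned in the aside, 3 + 4 cos(x) + cos(2x) = 2 (1 + cos(x))^2 >= 0, follows from the double angle formula cos(2x) = 2 cos^2(x) - 1, and is easy to check numerically (my sketch):

```python
import math

# Verify 3 + 4cos(x) + cos(2x) = 2(1 + cos(x))^2 (and hence >= 0)
# at a grid of sample points.
for i in range(1000):
    x = -10 + 0.02 * i
    lhs = 3 + 4 * math.cos(x) + math.cos(2 * x)
    rhs = 2 * (1 + math.cos(x)) ** 2
    assert abs(lhs - rhs) < 1e-12 and lhs >= -1e-12
print("identity verified on the grid")
```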
 We
then turned to determining when two random variables are independent or
dependent. The key lemma is that two random variables X and Y are
independent if and only if their joint density f_{X,Y}(x,y) factors
as the product of their marginals, namely f_{X}(x) f_{Y}(y).
If the density factors the proof is straightforward; if not, our book leaves
it as an exercise to the reader. As this is the first serious proof class
for many, I wanted to go through the argument. The proof basically follows
from the
definition of continuity. Let g(x,y) = f_{X,Y}(x,y) - f_{X}(x)
f_{Y}(y). We assume g(x_{0},y_{0}) > e > 0. By
continuity, if (x,y) is close to (x_{0},y_{0}) then g(x,y)
is close to g(x_{0},y_{0}). Continuity says we can always
find a δ such that if the distance from (x,y) to (x_{0},y_{0})
is at most δ then |g(x,y) - g(x_{0},y_{0})| < e/2.
We have two natural ways to measure the distance: dist((x,y), (x_{0},y_{0}))
= |x - x_{0}| + |y - y_{0}| or sqrt((x-x_{0})^{2} + (y-y_{0})^{2}).
Regardless, we choose our δ so that on the small square centered at
(x_{0},y_{0}) with sides of length 2δ we have
|g(x,y) - g(x_{0},y_{0})| < e/2,
which means on this square g(x,y) > g(x_{0},y_{0}) - e/2 > e/2.
Integrating g over this small square then gives something at least
(e/2)(2δ)^{2} > 0, while independence forces the integral of g over any
rectangle to be zero; this contradiction completes the proof. We then did an
example: f_{X,Y}(x,y) = (e-2)^{-1} x exp(xy) for x, y in
[0,1] and 0 otherwise. The density does not factor, and we see that the
random variables are in fact dependent. It is a very good exercise to do the
details of this computation.
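As part of that exercise, one should check that (e-2)^{-1} really is the right normalization constant; the inner y-integral is exact (int_0^1 x e^{xy} dy = e^x - 1), and a quick numerical sketch of mine confirms the total mass is 1:

```python
import math

# Confirm that f(x,y) = x*exp(x*y)/(e-2) on [0,1]^2 integrates to 1.
# The y-integral is done exactly: int_0^1 x e^{xy} dy = e^x - 1.
def total_mass(n=200_000):
    h = 1.0 / n
    # midpoint rule in x applied to e^x - 1
    s = sum(math.exp((i + 0.5) * h) - 1 for i in range(n)) * h
    return s / (math.e - 2)

print(total_mass())  # very close to 1
```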
 We then moved to a proof of the
Cauchy-Schwarz inequality. There are many proofs of this important
result;
see here for some lecture notes I wrote years ago for another class giving
another one of the standard proofs. We will discuss the applications of
the Cauchy-Schwarz inequality in greater detail on Thursday.
 Finally, we discussed generalizations of
the coupon or prize problem from the homework. It is not immediately clear
what the right order of magnitude is as to how long you need to wait before
you are essentially assured of having two of each prize (or more generally k
of each prize). As a nice exercise, prove that as c tends to infinity, with
probability tending to 1 you are assured of having at least two of each
prize if you wait as long as 2 c H_{c}, where H_{c} = 1 +
1/2 + 1/3 + ... + 1/c is the c^{th}
harmonic number.
Can you replace the constant 2 with something smaller? (We know it must be
at least 1; would 1 + e work for any e > 0?)
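Before attempting the proof, it can help to see the claim in action; here is a small seeded simulation of mine (with c = 50 prizes, two copies each, waiting 2 c H_c draws):

```python
import random

def time_to_collect(c, copies=2, rng=random):
    """Draw uniform prizes from {0,...,c-1} until each appears `copies` times."""
    counts = [0] * c
    need = c  # prizes still short of `copies` appearances
    draws = 0
    while need > 0:
        draws += 1
        i = rng.randrange(c)
        counts[i] += 1
        if counts[i] == copies:
            need -= 1
    return draws

random.seed(0)
c = 50
H_c = sum(1.0 / k for k in range(1, c + 1))
budget = 2 * c * H_c
trials = 500
success = sum(time_to_collect(c) <= budget for _ in range(trials)) / trials
print(success)  # should be close to 1, consistent with the exercise
```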
 Thursday, October 15. We covered an
enormous amount of theory and applications today, and it's worth reflecting on
the advantages and disadvantages of all we did.
 We did a few more examples of the power of binary indicator random
variables and linearity. We used it to derive the formulas for the mean and
variance of a binomial(n,p) random variable by writing it as a sum of
independent Bernoulli(p) random variables. We can of course derive these
values by differentiating identities. It is worth remarking that many of the
identities in combinatorics are proved by showing that two different ways of
counting the same thing are equivalent, and then if we evaluate one we get the
other for free. We did another example of using binary indicator random
variables and linearity of expectation in modeling how often Fermat numbers
are prime. (See the additional comments from Thursday, October 8 for more on
Fermat numbers.) One must be careful when using such models to predict
properties of prime numbers and numbers, as these models miss arithmetic (for
example, if we are too crude we'll predict there are infinitely many triples
such that n, n+2 and n+4 are all prime, which is clearly absurd as at least
one of these three must be divisible by 3). These models can be improved and
some of the arithmetic can be incorporated  if you want to know more, let me
know.
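The binomial mean and variance formulas we derived via indicator variables can be checked directly against the probability mass function; a short sketch of mine:

```python
from math import comb

def binomial_mean_var(n, p):
    """Mean and variance computed directly from the binomial pmf."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var

n, p = 20, 0.3
mean, var = binomial_mean_var(n, p)
print(mean, var)  # numerically n*p = 6 and n*p*(1-p) = 4.2
```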
 We proved
Chebyshev's theorem, one of the gems of probability. The natural scale to
measure fluctuations about the mean is the standard deviation (the square root
of the variance). Chebyshev's theorem gives us bounds on how likely it is to
be more than k standard deviations from the mean. The good thing about this
result is that it works for any random variable with finite mean and variance;
the bad news is that because it works for all such distributions, its results
are understandably much weaker than results tailored to a specific
distribution (we will see later that its predictions for binomial(n,p) are
magnitudes worse than what is true). It is somewhat similar in spirit to the
difference between
Divide and
Conquer and
Newton's Method for finding roots of functions; Divide and Conquer is
relatively slow (taking about 10 iterations to gain another 3 decimal digits
of accuracy), while Newton's Method doubles the number of decimal digits each
iteration! Why is there such a pronounced difference? The reason is that
Divide and Conquer only assumes continuity, while Newton's Method also
requires differentiability. Thus it is not surprising that we can do better
with stronger assumptions.
 We ended by discussing
Monte Carlo
integration, which has been hailed by some as one of the (if not the) most
influential papers in the 20th century. We only touched the briefest part of
the theory here. We showed how it can be combined with Chebyshev's inequality
to give really good results on numerically evaluating integrals. Specifically,
if N is large and we choose N points uniformly, we can simultaneously assert
that with extremely high probability (at least 1 - N^{-1/2}) the error
is extremely small (at most N^{-1/4}). If you want to know more, please see me;
there are a variety of applications from statistics to mathematics to
economics to .... Below are links to two papers on the subject to give you a
little more info:
 Tuesday, October 6. Today we saw the
power of binary indicator random variables and expected values. We use random
variables and probability to model deterministic systems. The reason for this
is that frequently it is very hard to compute exactly what happens, but such
modeling does a very good job. For more on these methods, see the following
handout by Professor Rosen of Brown
University (and the references therein).
 Counting the number and distribution of distinct prime factors or prime
factors as n varies is a beautiful problem. This is described in great detail
in Hardy and Wright's classic `Theory of Numbers'. Many of these elementary
functions are
briefly described here; Mathworld has a good article on
distinct
prime divisors. A beautiful result is that the number of distinct prime
divisors is, in some sense, normally distributed under an appropriate limit.
This is the
ErdosKac theorem (see
also the Wikipedia entry). A key ingredient is that
Sum_{p < x} 1/p is about log log x. While this follows from the
Prime Number
Theorem (which says Sum_{p < x} log p is about x) and
partial summation
(the discrete version of integration by parts), as discussed in class it also
follows from a careful analysis of the sum and product expressions (whose
equivalence is basically equivalent to the property of
unique factorization or the Fundamental Theorem of Arithmetic) for the
Riemann zeta
function. (Note: if you want to know why it is natural to count primes
with a logarithmic weight, let me know and I can give you a handout from
my book.)
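The double logarithm grows remarkably slowly, and the claim is fun to check directly; a sketch of mine with a sieve (the gap between the sum and log log x is Mertens' constant, roughly 0.2615):

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(n + 1) if sieve[i]]

x = 10 ** 6
s = sum(1.0 / p for p in primes_up_to(x))
# s is close to log(log(x)) plus the Mertens constant (about 0.2615).
print(s, math.log(math.log(x)))
```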
 It is believed that there are only finitely many
Fermat primes. The
Fermat numbers F_{n} = 2^(2^n) + 1 have many interesting properties.
One is that no two Fermat numbers share a common factor, which as a nice
exercise gives another proof of the infinitude of primes! Fermat primes also
arise in determining
which regular n-gons can be constructed with a straightedge and a compass.
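The pairwise-coprimality claim for Fermat numbers is easy to check for small n (this snippet is my illustration; the last line also confirms Euler's famous factor 641 of F_5):

```python
from math import gcd

# Fermat numbers F_n = 2^(2^n) + 1; distinct Fermat numbers are coprime.
F = [2 ** (2 ** n) + 1 for n in range(7)]
for i in range(len(F)):
    for j in range(i + 1, len(F)):
        assert gcd(F[i], F[j]) == 1

# F_5 is not prime: Euler found the factor 641.
assert F[5] % 641 == 0
print("pairwise coprime; 641 divides F_5")
```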
 We also used indicator random variables and expectation to model
probability problems, such as how many kings we expect to get in 7 cards from
a well-shuffled deck. It is incredible how powerful these ideas are; there
are versions of probability theory which have expectation as the fundamental
concept (there is a comment along these lines in our book).
 It is worth emphasizing that, when modeling answers with indicator random
variables, we do not need the variables to be independent if we are
only concerned with calculating the expected value; if we want some idea of
the scale of fluctuations, then it's very different.
 We also discussed the similarities between how
Taylor coefficients
uniquely determine a nice function and how
moments
uniquely determine a nice probability distribution. It is sadly not the case
that a sequence of moments uniquely determines a probability distribution;
fortunately in many applications some additional conditions will hold for our
function which will ensure uniqueness. For the nonuniqueness of Taylor
series, the standard example to use is f(x) = exp(-1/x^2) if x is not zero and
0 otherwise. To compute the derivatives at 0 we use the
definition of the derivative
and L'Hopital's rule.
We find all the derivatives are zero at zero; however, our function is only
zero at zero. We will see analogues of this example when we study the proof of
the Central Limit
Theorem.
 Finally, we mentioned the importance that the integrals and sums in the
moments converge absolutely; if they didn't, then our answers would depend on
how we tend to infinity. For example, consider the
Cauchy distribution
1 / (pi(1+x^2)). Let g be any function such that g(A) is larger than A. Assume
A is large so the integrand is basically 1/(pi x). If we integrate from A to
g(A) we get essentially Integral_{x=A to g(A)} dx / (pi x) = (1/pi) log( g(A) /
A). If g(A) = 2A then we would get essentially log(2) / pi, but if g(A) = A^2
then we find there is no way to have some finite interpretation.
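Here is a small numerical sketch of mine showing exactly this phenomenon for the first-moment integrand x/(pi(1+x^2)): the tail contribution depends on how fast the upper limit grows:

```python
import math

def cauchy_tail(a, b, n=100_000):
    """Midpoint-rule approximation of int_a^b x/(pi(1+x^2)) dx."""
    h = (b - a) / n
    return sum(
        (a + (i + 0.5) * h) / (math.pi * (1 + (a + (i + 0.5) * h) ** 2))
        for i in range(n)
    ) * h

A = 1000.0
print(cauchy_tail(A, 2 * A), math.log(2) / math.pi)  # essentially log(2)/pi
print(cauchy_tail(A, A * A))  # much larger: grows like log(A)/pi
```

The two tails both run "out to infinity", yet give different answers; that is precisely why absolute convergence is required before talking about the mean.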
 Thursday, October 1. Today we
discussed joint distributions as well as common densities, meromorphic
continuation, proof techniques, ....
 The binomial
distribution is a special case of the more general
multinomial
distribution; many of the properties of the multinomial can be obtained by
repeated applications of the binomial distribution. For example, say we have
the unimaginatively named candidates A, B, C and D running for office. We may
initially break them into two groups: A and not A; we then further divide not
A into B and not B, then not B is divided into C and not C. The binomial
coefficients are replaced with
multinomial
coefficients: here (n choose k1, k2, ..., kj) means n! / (k1! k2! ··· kj!),
with each ki a nonnegative integer such that k1+...+kj = n.
 One application (but by no means the most important!) of multinomials is
figuring out how many different words you can make when you rearrange the
letters of MISSISSIPPI. If you feel this isn't important, consider instead
base pairs from biology
 this tells us how many different strands we can have!
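For the MISSISSIPPI example, the count is the multinomial coefficient 11! / (1! 4! 4! 2!); a two-line sketch of mine:

```python
from math import factorial

def multinomial(n, *ks):
    """n! / (k1! k2! ... kj!), assuming the ks sum to n."""
    assert sum(ks) == n
    out = factorial(n)
    for k in ks:
        out //= factorial(k)
    return out

# MISSISSIPPI: 11 letters with 1 M, 4 I's, 4 S's, 2 P's.
print(multinomial(11, 1, 4, 4, 2))  # 34650 distinct rearrangements
```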
We proved that the multinomial probabilities do give us a density: they
are clearly nonnegative, but do they sum to 1? The proof is quite nice, and
it uses one of my favorite techniques, multiplying by 1, MANY times. It is
important to get a sense of how these results are proved. The trick is to look
for binomial or multinomial coefficients; this is why we multiplied by
(n-t)!/(n-t)!. We then had Sum_{e = 0 to n-t} (n-t choose e); we rewrote this
by multiplying by 1^e 1^{n-t-e} and then recognized this as (x+y)^m where
x=y=1 and m = n-t. Thus we could evaluate the e sum by using the binomial
theorem, and then another application of the binomial theorem completed the
job. Remember how important it was to have the sums correct: t was
independent of e and the t! could be brought out of the sum; however, h was
not as h = n-t-e. There are many symbolic programs available to prove binomial
identities; if you would like a copy of a Mathematica program that does this,
just let me know (click
here for some of the theory).
 We then discussed the
geometric series
formula. The standard proof is nice; however, for our course the
`basketball' proof is very important, as it illustrates a key concept in
probability. Specifically, if we have a
memoryless game, then
frequently after some number of moves it is as if the game began again. This
is how we were able to quickly calculate the probability that the first
shooter wins, as after both miss it is as if the game just started.
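The memoryless "game restarts" argument gives P(first shooter wins) = p / (1 - (1-p)(1-q)) when the shooters make baskets with probabilities p and q; a quick simulation sketch of mine agrees:

```python
import random

def first_shooter_wins(p, q, trials=200_000, seed=1):
    """Simulate the alternating shooting game: first player to make a basket wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        while True:
            if rng.random() < p:   # shooter 1 makes it
                wins += 1
                break
            if rng.random() < q:   # shooter 2 makes it
                break
    return wins / trials

p, q = 0.3, 0.5
exact = p / (1 - (1 - p) * (1 - q))  # from the memoryless/restart argument
print(first_shooter_wins(p, q), exact)
```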
The geometric series formula only makes sense when |r| < 1, in which case
1 + r + r^2 + ... = 1/(1-r); however, the right hand side makes sense for all
r other than 1. We say the function 1/(1-r) is a
(meromorphic)
continuation of 1+r+r^2+.... This means that they are equal when both are
defined; however, 1/(1-r) makes sense for additional values of r. Interpreting
1+2+4+8+... as -1 or 1+2+3+4+5+... as -1/12 actually DOES make sense, and
arises in modern physics and number theory (the latter is zeta(-1), where
zeta(s) is the
Riemann zeta function)!
 We have only discussed a few of the myriad distributions that arise in
modeling the world:
Bernoulli,
Binomial,
Poisson,
Exponential,
Uniform. There
are many others, such as the
Normal, the
Cauchy, as well
as one of my favorites, the
Weibull. The
more distributions you know, the more you can model the world. If time and
interest permit, we'll talk about how I used the
three parameter Weibull to
model baseball games.
 We ended the day by introducing the concept of
expectation or expected
value of a random variable (also called the mean or the average value).
This is one of the central concepts in the course, and it is amazing how many
problems reduce to understanding expectations of random variables. We will see
in Tuesday's class how properties of expectation aid us greatly in
applications. For example, consider a Binomial(n,p) random variable X (so X is
the number of heads in n tosses of a coin which is heads with probability p).
The sum we MUST evaluate for the average is Sum_{k = 0 to n} k (n choose k)
p^k (1-p)^{n-k}. While it should be clear that this must be just np (each coin
has probability p of landing on heads, and we have n of them), this must be
proved. We'll discuss two different techniques to do this on Tuesday
(differentiating identities and linearity).
 Tuesday, September 29. We discussed
the definition of cumulative distribution functions (CDFs) and the associated
densities, called the probability mass function in the discrete case and the
probability density function in the continuous case. We showed that if we know
the CDF then we know the mass/density function, and vice versa. The big
theorem is that in the continuous case, the density function is the derivative of
the CDF. Our proof used either
Taylor series expansions or the Mean Value Theorem; it is possible to
prove the claim with significantly less at the cost of more analysis. We see
in the proof that we really want our probability density function to be either
continuous, piecewise continuous or bounded. We showed how to use the
Fundamental
Theorem of Calculus to quickly calculate the density of Y = phi(X) given X
has CDF F_X with density f_X. Namely, letting X = h(Y) we find the density is
f_Y(y) = f_X(h(y)) |h'(y)|.
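To make the change-of-variables recipe concrete, here is a small example of mine (not from class): take X exponential with density f_X(x) = e^{-x} for x >= 0 and Y = X^2, so h(y) = sqrt(y) and f_Y(y) = f_X(h(y)) |h'(y)|. Integrating f_Y should reproduce the CDF P(Y <= b) = P(X <= sqrt(b)):

```python
import math

# Change of variables: X ~ Exp(1), Y = X^2, h(y) = sqrt(y).
def f_Y(y):
    h = math.sqrt(y)
    return math.exp(-h) * (1 / (2 * math.sqrt(y)))  # f_X(h(y)) * |h'(y)|

def int_f_Y(b, n=100_000):
    """Midpoint-rule integral of f_Y from 0 to b (the 1/sqrt(y) blowup at 0
    is integrable, so the crude rule is good enough for a sanity check)."""
    hstep = b / n
    return sum(f_Y((i + 0.5) * hstep) for i in range(n)) * hstep

b = 4.0
print(int_f_Y(b), 1 - math.exp(-math.sqrt(b)))  # the two agree closely
```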
 e^x e^y = e^{x+y} is one of the most beautiful and important formulas in
math; it is NOT trivial to prove, and requires some real combinatorics. Again,
it would be horrible notation if this were false. The purpose of the second
clicker question is to illustrate the dangers of generalizing from numbers to
matrices  the lack of commutativity leads to very different behavior. For
matrices, e^A e^B in general is NOT e^{A+B} unless A and B commute. We define
the commutator by [A,B] = AB  BA; this measures how far A and B are from
commuting (note some places write the commutator differently, so my apologies
if other people don't use this notation!). The
BakerCampbellHausdorff
formula describes what e^A e^B is; see
also the Zassenhaus
formula for a nice explicit
formula.
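Seeing e^A e^B differ from e^{A+B} takes only a 2x2 example; here is a sketch of mine using nilpotent matrices, where the exponentials can be computed exactly by hand (A^2 = B^2 = 0, so e^A = I + A and e^B = I + B, while (A+B)^2 = I gives e^{A+B} = cosh(1) I + sinh(1)(A+B)):

```python
import math

def matmul(X, Y):
    """2x2 matrix product, no libraries needed."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[0, 1], [0, 0]]
B = [[0, 0], [1, 0]]
expA = [[1, 1], [0, 1]]  # I + A, exact since A^2 = 0
expB = [[1, 0], [1, 1]]  # I + B, exact since B^2 = 0
lhs = matmul(expA, expB)

# A + B squares to the identity, so e^{A+B} = cosh(1) I + sinh(1) (A+B).
rhs = [[math.cosh(1), math.sinh(1)], [math.sinh(1), math.cosh(1)]]

print(lhs)  # [[2, 1], [1, 1]]
print(rhs)  # roughly [[1.543, 1.175], [1.175, 1.543]]: not equal!
```

Note [A,B] = AB - BA is nonzero here, which is exactly why the two disagree.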

We compared
sizes of functions. We write f(x) << g(x) to mean there is a constant C such
that, for all x sufficiently large, f(x) <= C g(x). We showed x^r << e^x for
any r > 0, and log(x) << x^r as well (using the previous results with now x =
e^(y/r)). To get x log(x) -> 0 as x -> 0 we wrote x as 1/n and then used the
previous results. This example illustrates lazy mathematicians at our best,
reducing to previous problems.

We gave a
poor mathematician's analysis of the size of n!; the best result is
Stirling's formula
which gives n! is about n^n e^{-n} sqrt(2 pi n) (1 + error of size 1/(12n) +
...). We obtained our upper and lower bounds by using the comparison method in
calculus (basically
the integral test); we could get a better result by using a better
summation formula, say
Simpson's method or
Euler-Maclaurin.
We will return to Simpson's method later in the course, as one proof of it
involves techniques that lead to the creation of low(er) risk portfolios! Ah,
so much that we can do once we learn expectation..... Of course, our analysis
above is not for n! but rather log(n!) = log 1 + ... + log n; summifying a
problem is a very important technique, and one of the reasons the logarithm
shows up so frequently. If you are interested, let me know as this is related
to research of mine on Benford's law of digit bias.
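Stirling's formula is impressively accurate even for small n; a quick check of mine, including the 1/(12n) correction term:

```python
import math

def stirling(n):
    """Leading-order Stirling approximation to n!."""
    return n ** n * math.exp(-n) * math.sqrt(2 * math.pi * n)

for n in (5, 10, 20):
    ratio = math.factorial(n) / stirling(n)
    # The ratio is very close to the first correction factor 1 + 1/(12n).
    print(n, ratio, 1 + 1 / (12 * n))
```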

Finally, we
mentioned the QWERTY keyboard
(see also
this article on other common items around us and how they came to be).
There are many applications to knowing letter frequencies, especially the
probability that, given one letter, the next letter takes on each value.
These frequencies are used to break simple cryptographic ciphers that involve
permuting the 26 letters. See for instance the
wikipedia article on
frequency analysis, as well as a
downloadable program to
perform the analysis.
 Thursday, September 24. Today we
finished the definitions from Chapter 2, in particular random vectors, joint
distributions and marginal densities. Later in the semester we will spend a
lot of time looking at the joint distribution of random variables. The key
result is that, if the random variables are independent, the joint density is
the product of the individual densities; obviously this is not necessarily the
case if the random variables are not independent (we will of course define
what it means for random variables to be independent).
 We saw how difficult it can be to code, let alone efficiently code, a
problem which is simply stated. There is a lot of trouble if the number of
variables is also varying; it is easy to work with this theoretically, but
harder to implement. (If you've taken linguistics classes, this might be
similar in spirit to quantifiers on quantifiers.) I'll write up code that
handles this case with lots of comments; this won't be the only way to
attack the problem, but it will be one. The important fact to note is that we
are often only able to observe small values, and thus there is a danger that
we may extrapolate incorrectly. Click here for the Mathematica code.
 We will go over distribution functions and finding the distribution
function and density of random variables that are functions of other random
variables in greater detail on Tuesday. The idea is that if G is a nice
function and we know the (cumulative) distribution function of X, then we
should know the (cumulative) distribution function of Y = G(X); similarly, if
we know the probability density of X then we should know the probability
density of Y = G(X). We will do all this again slowly for our exponential
example and in general. The key input in the analysis is the
Fundamental
Theorem of Calculus; for us, the version we need is: Let F(x) = Int_{t = -oo
to x} f(t) dt; then F'(x) = f(x). While we have talked about how the
antiderivative is not unique, there is a `natural' choice of a continuous
density f.
 The card `trick' we did today is explained in great detail in the optional
book for the course,
Impossible. If you don't have that book but want to see the details, let
me know and I'll provide it. There is a lot of good math in this problem, plus
of course it's a fun trick! For more on
the Amazing James Randi, click here.
 We also discussed
Buffon's needle. We'll analyze this problem in greater detail later; if
one wants to see a truly elegant proof, let me know and I'll provide a copy of
the proof from THE Book (if
you haven't heard of THE Book, click this link!). We didn't solve it
today, but instead used it as a way to discuss joint random variables. Our
partial solution is a nice application of
dimensional
analysis, which allows us to see how the solution must depend on the
parameters without actually solving it! This is a hard but worthwhile skill to
cultivate.
 Tuesday, September 22. In today's
lecture we continued learning the language (random variables, continuous and
discrete, probability mass functions and densities). The key fact is that
random variables must be real valued. This is so that we can add them or take
averages et cetera. Thus we never have X_i(omega) be
H if the ith toss is
a head and T if the ith toss is a tail, but rather 1 if the ith toss is a head
or 0 otherwise. In this case X_i is a
binary indicator
variable, and we can add such random variables together (if you can tell
me what a head plus a tail is, I'd love to know!).
 For our probability spaces (Ω, F, P),
we typically take the σ-field F to be 2^{Ω} if Ω is either
finite or countable; recall that 2^{Ω} means the set of all subsets of
Ω. This is not the only σ-field we may look at, but it is the most useful for
these problems. For example the following is always a σ-field: {Ø, Ω}. Another
possibility is to take, for any set A, F to be {Ø, A, A^{c},
Ω}. The point is we want our σ-field to be as large as possible (i.e., we want
to define the probability of as many subsets of Ω as we can). If Ω is
infinite, say [0,1] or the real line (-∞, ∞), we take the σ-field to be what
is generated by open intervals (a,b). In other words, we start with all open
intervals and see what sets we can form by going through the definitions of a
σ-field. For example, closure under countable intersections means [a, b] is in
the σ-field because it equals the intersection of the intervals (a - 1/n, b + 1/n).
Click here to get a sense of
what kind of sets we can form by these processes. For our purposes, we
will only be assigning probabilities to finite sets, countable sets, or
intervals, squares and similar figures; however, it is good to be aware of the
advanced analysis.
 The cumulative distribution function is one
of the key tools of the subject, and gives a sense of why continuous random
variables are easier to analyze than discrete; namely, for continuous we have
the Fundamental Theorem of Calculus at our disposal to pass from a cumulative
distribution function to a density; we do not have differentiation available
in the discrete case. Note that a cumulative distribution function does not
determine a unique density; however, it almost does so, as any two densities
must integrate to the same value on any interval. (The
technical jargon is to say that the density is determined up to a function
which is zero almost everywhere.) If there is interest, let me know and
I'll talk a bit about the basics of measure theory (and show that almost no
numbers are rational in the sense of measure).

Gambler's ruin: We
solved the problem using difference equations. If there is a repeated root,
however, our method breaks down and we need to be divinely inspired again. You
are not responsible for knowing how to solve these problems, but if you are
interested here are some facts. For the general relation, say a_{n+1} = 3 a_n
+ 10 a_{n-1}, we guess a_n = r^n. We find that this is a solution if r^{n+1} -
3 r^n - 10 r^{n-1} = 0 or r^2 - 3r - 10 = 0, which holds if (r-5)(r+2) = 0, ie,
r = 5 or -2. Simple algebra shows that c_1 r_1^n + c_2 r_2^n is a solution for
any c_1, c_2. If we specify two boundary conditions that determines the c_i's,
and we're done. If the two roots happen to be equal, we need to be a bit more
clever (or divinely inspired);
see the final page of my handout from Math 209. I prefer the solution we
discussed in class, using symmetry to solve it when we start at $k (0 < k < N)
with N = 2^{m} for some integer m. As a good challenge problem, see if
you can come up with an
elementary proof
when N is not a power of 2. I can do this for some (but as of right now not
all) N.
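The guess-and-check recipe for difference equations is easy to verify mechanically; here is a tiny sketch of mine for the recurrence a_{n+1} = 3 a_n + 10 a_{n-1} with roots 5 and -2 (the boundary values a_0 = 0, a_1 = 7 are my choice, giving c_1 = 1, c_2 = -1):

```python
# a_n = c1*5^n + c2*(-2)^n should solve a_{n+1} = 3 a_n + 10 a_{n-1}.
c1, c2 = 1, -1
a = [c1 * 5 ** n + c2 * (-2) ** n for n in range(12)]
for n in range(1, 11):
    assert a[n + 1] == 3 * a[n] + 10 * a[n - 1]
print(a[:5])  # [0, 7, 21, 133, 609]
```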
See
here for an elementary proof of the prime number theorem.
Finally, we mentioned the Riemann zeta
function briefly: ζ(s) = sum_{n = 1 to ∞} 1/n^{s} = prod_p (1 - 1/p^{s})^{-1}.
This is intimately tied to the distribution of the primes (which isn't
surprising, as it relates something we want to know about (the primes) to
something very well understood (the integers)). Key in the analysis is the
distribution of zeros of ζ(s); the famous
Riemann Hypothesis
(about to turn 150; there will be festivities on campus, and one of the
most casual asides you'll ever see!) asserts all the nontrivial zeros have
real part 1/2. The Riemann zeta function arose earlier in the probability that a
generically chosen number is squarefree, which is 1/ζ(2) = 6/π^{2}.
(See also the wikipedia
entry and the references at the end for a proof of the value of this sum /
product.) This is the answer to our problem as we may interpret it as the
probability that our number isn't divisible by 4, by 9, by 25.... The
formula I mentioned is the
Riemann-Siegel
formula.
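The equality of the sum and the Euler product, and the value ζ(2) = π²/6 (whose reciprocal is the squarefree density), can be seen numerically; a sketch of mine truncating both expressions:

```python
import math

# Compare the sum and Euler-product expressions for zeta(2) = pi^2/6.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
zeta_sum = sum(1 / n ** 2 for n in range(1, 200_001))
zeta_prod = 1.0
for p in primes:
    zeta_prod /= 1 - p ** -2

print(zeta_sum, zeta_prod, math.pi ** 2 / 6)  # all three are close
print(6 / math.pi ** 2)  # the squarefree density, 1/zeta(2)
```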
 Thursday, September 17.
 We started with computing the number of poker hands with at least two
aces. The danger in problems like this is double counting. Note that ncr[4, 2]
ncr[50, 3] / ncr[52, 5] is very close to the correct answer of (ncr[4,
2] ncr[48, 3] + ncr[4, 3] ncr[48, 2] + ncr[4, 4] ncr[48, 1]) / ncr[52, 5]
(.0452 vs .0417); here ncr[x,y] is x! / (y! (x-y)!), i.e., x choose y. The double
counting is a lower order term, but it is enough to lead to a noticeable
difference. It's natural to think the answer is ncr[4,2] ncr[50,3], as this
means choose two of the four aces, and then choose any three of the remaining
50 cards. The problem is that if we choose an ace in the last three cards, we
have double counted it. Thus the correct answer should be, and is, slightly
lower. The tops sum to 52 and the bottoms sum to 5; this is a good, quick rule
to help make sure you are looking at the problem the right way.
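Both computations above are quick to reproduce exactly (a check of mine using Python's built-in binomial coefficient):

```python
from math import comb

# Exact probability of at least two aces in a 5-card hand.
correct = (comb(4, 2) * comb(48, 3) + comb(4, 3) * comb(48, 2)
           + comb(4, 4) * comb(48, 1)) / comb(52, 5)

# The naive (double-counting) computation.
overcount = comb(4, 2) * comb(50, 3) / comb(52, 5)

print(round(correct, 4), round(overcount, 4))  # 0.0417 vs 0.0452
```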
 The study of
Independence is one of the central themes in probability. While many real
world or mathematical processes are not independent, frequently one can build
a good model by assuming independence. Later in the semester we'll see how we
can use this to model iterates of the
3x+1 map or to predict the
answers to many problems in number theory (such as the number of distinct
prime factors certain special numbers have). Other examples include the
probability a number is squarefree. For independence it is essential that all
combinations be independent; as we saw in class, pairwise independence does
not imply independence. We did a very good job as a class in terms of choosing
numbers randomly from 1 to 9; the second part, where we were shooting for half
the class average, is different. This belongs to social science and
game theory. I would
say the random variables are still independent; however, your answer is
governed by a different rule depending on who is in the class.
 The answer to Nick's question, as correctly pointed out by a classmate, is
that the definition of independence states that events {A_i}_{i in I} are
independent if Prob( intersection_{j in J} A_j) = prod_{j in J} Prob(A_j) for
any J a subset of I. For example, if I = {1,2,3} then J could be {1}, {2},
{3}, {1,2}, {1,3}, {2,3}, or {1,2,3}. We can rephrase the question to: assume
we have events such that Prob(A intersect B intersect C) = Prob(A) Prob(B)
Prob(C), and all these events have positive probability. Must A and B be
independent?
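One way to settle this is to search a small sample space. The example below is my own toy construction (not one from class): on the uniform space {1,...,8} the triple-product condition holds even though A and B are certainly not independent (here A = B).

```python
from fractions import Fraction

# Uniform probability on {1,...,8}; events are subsets.
omega = set(range(1, 9))
A = {1, 2, 3, 4}
B = {1, 2, 3, 4}   # identical to A, so as dependent as can be
C = {1, 5, 6, 7}

def prob(event):
    return Fraction(len(event), len(omega))

# Triple product holds: both sides are 1/8.
assert prob(A & B & C) == prob(A) * prob(B) * prob(C)
# Yet A and B fail pairwise independence: 1/2 versus 1/4.
assert prob(A & B) != prob(A) * prob(B)
print("triple product holds, but A and B are dependent")
```

So the answer to the question is no: the full product condition on I = {1,2,3} alone does not force the pairwise conditions.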
 We calculated the solution to the roulette problem by using difference
equations. The largest root of the characteristic polynomial for 5 consecutive
blacks (with red and black equally likely) is about .982974. For more on
solving difference equations, see pages 2 and 16 of my
lecture notes from Math 209 (Differential Equations), as well as
the Wikipedia entry.
While solving a problem such as this is hard in general (we have to compute
the roots of the characteristic polynomial), it is possible to get some sense
of the properties of the solution. The trick we discussed of marching down in
blocks is similar to the Murphy's law problem in the homework.
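For concreteness, here is one standard way to set the problem up (a sketch; the recurrence below, conditioning on when the first red appears, is my reconstruction and is consistent with the root quoted above). If p_n is the probability of not yet seeing 5 consecutive blacks in n spins, then p_n = p_{n-1}/2 + p_{n-2}/4 + p_{n-3}/8 + p_{n-4}/16 + p_{n-5}/32, and we want the largest root of the characteristic polynomial.

```python
# Characteristic polynomial of p_n = p_{n-1}/2 + ... + p_{n-5}/32:
def q(x):
    return x**5 - x**4/2 - x**3/4 - x**2/8 - x/16 - 1/32

# q(0.9) < 0 and q(1) > 0, and the dominant root lies in between;
# find it by bisection.
lo, hi = 0.9, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if q(mid) < 0:
        lo = mid
    else:
        hi = mid
print(round(lo, 6))  # 0.982974
```

The long-run survival probability thus decays geometrically at this rate, which is why five consecutive blacks takes so long to appear.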
 Tuesday, September 15. We started by
reviewing some of the definitions (σ-field
(many books use the word algebra instead of field),
probability measure
and probability space).
The point is that not every subset is an admissible event (in other words, not
all subsets are assigned a probability). For the most part this is no problem,
as points, intervals, squares et cetera provide a rich theory. The general
case requires advanced analysis, in particular measure theory /
Lebesgue
integration. These technicalities are important in avoiding the
Banach-Tarski
paradox, which is due to the
Axiom of Choice
(which allows us to construct
nonmeasurable sets);
it is for this reason that I only believe in the
Countable
Axiom of Choice. For the specific points of today's class, here are some
additional comments / readings.
 Limit exchange: one of the hardest parts of
mathematics is justifying interchanging two operations; today we looked at
when the probability of a limit is the limit of the probabilities. To give
some sense that we must sometimes be careful, we considered nonnegative
functions f_n(x) converging to zero pointwise but always integrating to 1 (let
f_n(x) be the triangle function from 1/n to 3/n, taking on the value n at
2/n). It is not always permissible to interchange a limit and an integral (see
the Dominated
Convergence Theorem or the
Monotone Convergence Theorem from analysis for some situations where this
may be done); similarly it is not always possible to interchange orders of
integration (see
Fubini's Theorem for when this may be done), and we can only sometimes
interchange a derivative and a multidimensional integral (see
here for some conditions on when we may). The main takeaway is that we
must be careful interchanging probabilities and limits, but this shouldn't be
surprising. For example, we do not expect to be able to interchange most
operations: sqrt(a+b) in general is not sqrt(a) + sqrt(b).
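The triangle example is easy to see numerically (a rough sketch; the exact integral of each f_n is (1/2)(base)(height) = (1/2)(2/n)(n) = 1):

```python
# Triangle bump f_n: zero outside [1/n, 3/n], rising to height n at 2/n.
def f(n, x):
    if 1/n <= x <= 2/n:
        return n * n * (x - 1/n)   # rising edge, slope n^2
    if 2/n < x <= 3/n:
        return n * n * (3/n - x)   # falling edge, slope -n^2
    return 0.0

# Rough Riemann-sum check of the integrals: each is about 1.
h = 1e-5
for n in (1, 10, 100):
    total = sum(f(n, k * h) * h for k in range(int(4 / h)))
    print(n, round(total, 2))

# Yet at any fixed point, say x = 0.2, the bump slides past and the
# values head back to 0: here f_1(0.2) = 0, f_10(0.2) = 10, f_100(0.2) = 0.
print([f(n, 0.2) for n in (1, 10, 100)])
```

So the pointwise limit is the zero function, whose integral is 0, while the limit of the integrals is 1: the interchange fails.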
 We talked a bit about what it means to choose
an element uniformly at random from a circular or square dart board. We cannot
deal with uncountable unions (see the wikipedia entries on
countable and
uncountable sets). If
you want to learn even more about countable and uncountable, see
Chapter 5 of my book (An Invitation
to Modern Number Theory). For the purposes of our class, we really only need
to worry about finite and countable. We have good intuition on what a finite
set is; the quick definition of countable is that it can be placed in a
one-to-one correspondence with the positive integers. In other words, we have
a first element, a second element, and so on. It turns out that almost every
real number is irrational; further, almost no numbers are
algebraic (a root of
a polynomial of finite degree with integer coefficients). The standard proof is
Cantor's
diagonalization argument (this and many other items are included in
Chapter 5 of my book).
 We discussed the
inclusion
/ exclusion principle, one of my favorite methods in general and
especially important in probability as it is very easy to accidentally double
count events. We used this to show that the probability a number is
squarefree converges to 6/π^{2}; more generally, the probability that
it is k-power free for k at least 2 is 1/zeta(k), where zeta(s) = Sum_{n = 1
to oo} 1 / n^s = Product_{p prime} (1 - 1/p^s)^{-1} (if Re(s) > 1) is the
Riemann zeta
function. If you complete the inclusion-exclusion calculation we did, you
find that it can be written as the product above (with s=2 and the product
truncated); talk to me if you want more details. Sadly these arguments cannot
be used to prove results about how many primes there are (it comes down to
dealing with the error terms in dropping the
floor function,
though this has not stopped lots of amateurs from using this to `prove' some
of the big open problems in number theory). One of the more interesting uses
of this principle is in
Brun's sieve, where he uses inclusion-exclusion to show that there cannot
be too many twin primes.
Perhaps the strangest application of this is that this is how the famous
Pentium Bug was discovered! The homework problem asks you to find the
probability that, when we randomly reorder n people, at least one ends up in
their original position. The textbook also handles the more general case,
namely the probability that at least r end up in their original positions.
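A quick numeric sanity check of the squarefree density (a sieve sketch, not the proof): count how many integers up to N survive after crossing out multiples of each perfect square, and compare with 6/π².

```python
from math import isqrt, pi

# Sieve: cross out every multiple of d^2 for d = 2, 3, ..., sqrt(N).
N = 100_000
is_squarefree = [True] * (N + 1)
for d in range(2, isqrt(N) + 1):
    for multiple in range(d * d, N + 1, d * d):
        is_squarefree[multiple] = False

proportion = sum(is_squarefree[1:]) / N
print(round(proportion, 4), round(6 / pi**2, 4))  # both about 0.6079
```

The agreement is already excellent at N = 100,000, which is typical: the error in dropping the floor functions is small here, even though (as noted above) controlling such errors well enough to count primes is another matter entirely.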
 We also talked about conditional probability and the surprising problem
about how likely it is for you to have a rare disease if you test positive. If
you have taken statistics before, this is similar to
Type I and Type II
errors. Which type of error concerns you more will determine what you want
to improve.
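To make the surprise concrete, here is the standard Bayes' rule computation with hypothetical numbers of my own choosing (not the ones from class): a disease with prevalence 1 in 1000, a test that catches 99% of true cases but also flags 5% of healthy people.

```python
# All three inputs below are illustrative assumptions, not class data.
prevalence = 0.001          # P(disease)
sensitivity = 0.99          # P(positive | disease)
false_positive = 0.05       # P(positive | no disease)

# Total probability of testing positive, then Bayes' rule.
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))  # about 0.019
```

Even with a quite accurate test, a positive result here means under a 2% chance of actually having the disease, because the false positives from the huge healthy population swamp the true positives.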
 The last part of the class dealt with
combinatorics. Our solution to the cookie problem is quite elegant, and in
some respects reminiscent of geometry class (remember all those proofs where
the teacher cleverly adds auxiliary lines; the difference here is we just add
more cookies). While it is possible to solve many combinatorial problems by
brute force in principle, in practice this is not a good way to go: it is
time consuming, and quite likely that one makes a mistake. Typically one finds
a way to interpret a given quantity two ways; we can compute one of them and
thus we obtain a formula for the other. For example, we showed the number of
ways of dividing C cookies among P people is (C + P - 1 choose P - 1); here all
the identical cookies are divided. What if we don't assume all the cookies are
divided; what is the answer now? It is just Sum_{c = 0 to C} (c + P - 1
choose P - 1); this is because we are just going through all the cases (we
give out no cookies, 1 cookie, ...). What does this sum equal? Imagine now we
have another person, say the
Cookie Monster (this
is one of Cameron's favorite clips), who gets all the remaining cookies. Then
dividing at most C cookies among P people is the same as dividing exactly C
cookies among P+1 people, and hence our sum equals (C + P + 1 - 1 choose
P + 1 - 1) = (C + P choose P).
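The Cookie Monster identity is easy to spot-check by computer (a sketch using Python's math.comb):

```python
from math import comb

# Check: Sum_{c=0}^{C} (c + P - 1 choose P - 1)
#        == (C + (P+1) - 1 choose (P+1) - 1)  for small C and P.
for C in range(12):
    for P in range(1, 7):
        case_by_case = sum(comb(c + P - 1, P - 1) for c in range(C + 1))
        cookie_monster = comb(C + (P + 1) - 1, (P + 1) - 1)
        assert case_by_case == cookie_monster
print("sum over cases matches the Cookie Monster count")
```

Of course a computer check of finitely many cases is no substitute for the combinatorial argument, but it is a good way to catch an algebra slip before you trust a formula.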
 Finally, we ended with the lottery problem.
If we cannot use any of the 50 numbers more than once, there are (50 choose 6)
= 15,890,700 ways. What if we can use the same number multiple times; how
many combinations are there now? Writing the answer cleanly would give it
away, so I'll just say that if we have to choose 6 numbers from {1,...,50} and
we can use each number up to 6 times, and if order doesn't matter, then the
number of combinations is 28,989,675, which is less than a factor of two more!
For comparison, note that (300 choose 6) is the significantly larger
962,822,846,700, which is over 60,000 times larger than (50 choose 6)! If you
want to see the solution, let me know.
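If you would rather just verify the numbers quoted (fair warning: the middle line below essentially gives away the trick, so look away if you want to find it yourself):

```python
from math import comb

print(comb(50, 6))    # 15,890,700: six distinct numbers from 50
# With repeats allowed and order ignored, we are choosing a multiset of
# size 6 from 50 symbols -- stars and bars again: (50 + 6 - 1 choose 6).
print(comb(55, 6))    # 28,989,675: less than a factor of two more
print(comb(300, 6))   # 962,822,846,700
```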
 Thursday, September 10. First off, click
here for additional comments about the objectives for the course, including some entertaining and educational
videos about the times we live in and the importance of asking the right
questions. We mentioned
just some of the many places where probability is applicable.
 Click here if
you want to know more about the log5 method, namely which
of the (p +/- pq) / (p + q +/- 2pq) models the probability that team A beats
team B. The `derivation' is a nice exercise in elementary probability theory,
if you buy the modeling assumptions. As you'll see throughout the course and
beyond, one of the most difficult issues in the real world is deciding what
are the important and irrelevant factors.
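For the curious, here is a sketch with the minus signs filled in (spoiler for the sign question above); the two sanity checks below are one way to see that this choice is forced.

```python
# log5 prediction: P(A beats B) = (p - p*q) / (p + q - 2*p*q),
# where p and q are the two teams' individual winning percentages.
def log5(p, q):
    return (p - p * q) / (p + q - 2 * p * q)

# Evenly matched teams should split: probability 1/2.
print(round(log5(0.6, 0.6), 6))  # 0.5
# Against a .500 team you should win at your own rate p.
print(round(log5(0.7, 0.5), 6))  # 0.7
```

Both checks fail for the plus-sign version, which is a quick way to rule it out even before doing the derivation.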
 A terrific example of this is the
Clinton-Obama tie in Syracuse;
click here for the article from the `By the numbers' guy from the Wall
Street Journal (a great site to read). There are several ways to do the
calculation. The way that was reported in the press assumed that the statewide
percentages (57% Clinton, 40% Obama) should also successfully model the
distribution in Syracuse. How significantly different are these from 50-50?
Note that 50-50 led to a probability of
1/137 while
the other is one in a million. The `By the numbers' article also mentions
another way to try and solve this: Say there are 12002 voters, then there are
12003 possibilities, with each candidate ranging from 0 to 12002 votes, and
thus the probability of a tie is 1/12002. The flaw in this argument is that
not all outcomes are equally likely. For example, if we roll a pair of fair
dice there is only one way to roll a 2, but six ways to roll a 7. The number of
ways with 2n people for n to choose Clinton is (2n choose n); the number of
ways for them all to choose Clinton is (2n choose 2n) = 1.
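The exact 50-50 computation is a one-liner once you count this way (a sketch: 12002 voters, each an independent fair coin, tying 6001-6001):

```python
from math import comb
from fractions import Fraction

# P(tie) = (2n choose n) / 2^(2n), computed exactly with big integers.
n = 6001
tie_prob = Fraction(comb(2 * n, n), 2 ** (2 * n))
print(round(1 / float(tie_prob)))  # 137: the "1/137" quoted above
```

For comparison, by Stirling's formula (2n choose n)/2^(2n) is about 1/sqrt(πn), and 1/sqrt(6001π) is indeed about 1/137.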

The 3x+1 problem is one of my
favorites in mathematics (Jeff Lagarias has excellent annotated
bibliographies: see here and
here). It has a lot of the features you'd like a problem to
have: you can state it easily, a high school or junior high school student can
understand it, yet to make progress requires real mathematical sophistication
and machinery. If anyone is interested in research on the 3x+1 problem, I have
a work in progress that is very accessible and should be doable with what you
know.
 Benford's law of
digit bias is one of my favorite research topics (if anyone is interested,
I might also have accessible projects here). If time and interest permit, I'll
show you how you can prove this digit bias in a variety of interesting
systems. I was interviewed by the Wall Street Journal about applying Benford's
law to detect fraud in the Iranian elections (click
here for articles on the Iranian elections).
 If you want to see details about the paper for the movie industry,
click here, while for my sabermetrics paper (which we may discuss in the
class),
click here.
 We discussed the
Birthday Problem (Wikipedia gives the Taylor expansion argument from
taking logarithms) and its generalization to Pluto. This is but one of many
possible generalizations. What if we ask for how many people we need to have
at least a 50% chance that at least three will share a birthday? Or that there
will be at least two pairs of people sharing birthdays? Questions like these
are great extra credit / challenge problems: if you're interested, just let me
know.
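The standard computation behind the Birthday Problem, as a short sketch (365 equally likely birthdays, no leap day):

```python
# Multiply out P(all n birthdays distinct) person by person, and stop
# as soon as the complement (a shared birthday) reaches 50%.
prob_all_distinct = 1.0
n = 0
while 1 - prob_all_distinct < 0.5:
    n += 1
    prob_all_distinct *= (365 - (n - 1)) / 365
print(n)  # 23
```

Changing 365 to the length of a Plutonian year (or changing the 0.5 threshold) handles the generalizations mentioned above.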
 The
double-plus-one strategy is but one of many overlaps between probability
and gambling. Other famous ones (recently) include
card counting in
blackjack. There are many references; see Thorp's
original article as well as his
book. Another fun read is
Bringing Down The House.
 Combinatorics: we discussed (n
choose r); most of the combinatorics we'll do involves this and n!. One nice
application from today is proving the
Binomial Theorem
(I must admit to remembering its
mention in a Holmes story).
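A quick spot-check of the Binomial Theorem with math.comb (small integer cases, so both sides are exact):

```python
from math import comb

# (a + b)^n should equal Sum_k (n choose k) a^k b^(n-k).
a, b = 3, 5
for n in range(8):
    lhs = (a + b) ** n
    rhs = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
    assert lhs == rhs
print("binomial theorem checked for n < 8")
```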
 Again, click here for additional comments
about the objectives for the course, including some entertaining and educational
videos about the times we live in and the importance of asking the right
questions.