Sampling from Univariate Distributions

Introduction to Data Science Random Number Generation Sampling from Univariate Distributions

So far, we have learnt that ITS and ARS are generic methods that extract random numbers from all probability distributions. However, it is inefficient due to it indirectness and calculations required.

Here are some faster methods to generate RNGs directly from “popular” known distributions.

NmDev’s S2 can generate RNGs from many distributions: Gaussian/normal, beta, gamma, exponential, poisson, binomial, rayleigh etc.

Distributions to describe yes/no (boolean) events

(Beta and Binomial Distributions)

Binomial Distribution

Ever been faced with a yes/no question? Binomial Distributions are used to find out the number of “yes”-es when these question is asked on many “people”/scenarios!

Unlike other distributions such as the exponential distribution, you need to know the total number of times you conduct the yes/no trials.

A common way to show this is the probability mass function!

$\Pr(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$

where:

n is the number of trials (occurrences)
k is the number of successful trials
p is probability of success in a single trial
Pr(X = k) Probability Mass function

To understand this ~~nasty~~ looking equation, let us imagine a scenario:

It is your school’s 50th anniversary and you are in charge of creating a new school badge 🏫 You have to conduct a survey on foot to get an accurate depiction of everyone’s opinion of the new badge as no one reads their emails!

It is almost impossible to survey everyone in the school. Hence, you want to model your school based on a sample of 100 students. This is possible because you are certain that the opinion of each student is independent of the previous one! Each student either 👍 or 👎 the logo.

💬 “I think this school logo is not good. It’s too messy in design”
💬 ” This school logo has a refreshing design! I like it!”
💬 “I think this school logo looks pretty bad! The colours don’t contrast well.

You are dispirited that 2/3 students so far did not like the badge. You wonder if whether meeting students whom dislike the badge first will affect your sample results!

We need to account for the fact that we could have just has easily met the satisfied participant first, leading to ${3\choose1}=3$ possible ways you could have met them and ended up with only 1 success.
Refer to here for a more detailed explanation.

Thus, you realise that how you meet the students does not matter as they have been accounted for in the calculations (for binomial distribution). Hence, you continue your survey, in hope that more will approve your design.

⏰ 1 month later…

You now have the results, 62 out of 100 students are agreeable to the proposal. This translates to a 0.62 probability of success!

Your teacher remarks that he would be willing to accept the new logo design if there is a 10% probability that less than 3000 (number of successful trials) of at least 5000 (total number of trials) of students in your school are likely to be agreeable. You plug the numbers into the formula above and ….

Wait! You can’t use this formula! This is a PMF, meant only for a specific value. Ah, so this is why we need to decide on the appropriate functions with the specific distribution that we choose to use. In this case, we will use other methods such as the Z-Test (which we will not go into detail on). We also cannot use the quantile function as it does not take into account confidence intervals of a population deduced from a sample, and hence Z-Test would be preferred.

Nevertheless, let’s explore what a quantile function does! Given a certain probability, we can work out the percentile whereby people are agreeable to the new proposal “success”

Note: The quantile function in NMDev actually divides a probability distribution into infinite continuous intervals with equal probabilities based on the inverse cumulative distribution function. This is not to be confused with quartile, which splits the distribution into four equal sections.

				
					%use s2
var dist = BinomialDistribution(100,0.62)
// 90th percentile where 90% of data have values less than this variable
// in other words, 10th percentile is where 90% of data have values more than this variable
dist.quantile(0.1)

				
					56.0
// 90% probability that 56 of 100 people agreeable to this proposal in this sample

Beta Distribution

$\alpha$ parameter – refers to the number of successes
$\beta$ parameter – refers to the number of failures

The probability density function for beta distribution is given below

$\frac{1}{\beta(\alpha,\beta)}\cdot{x^{\alpha-1}(1-x)^{\beta-1}}$

where

$\frac{1}{\beta(\alpha,\beta)} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$

– $\Gamma$ represents the gamma function

Notice that the parameters ${x^{\alpha-1}(1-x)^{\beta-1}}$ looks suspiciously similar to that of the binomial distribution $\binom{n}{k}p^k(1-p)^{n-k}$. While binomial distributions aims to model the number of successes given a certain probability $p$, the beta distribution models probability of random variable being successful.

Here in the beta distribution, the independent variable (or probability in binomial distribution) is the random variable

The following image reflects the changes as the number of successes $\alpha$ or failures $\beta$ are changed

We can further see how this translates our understanding of the distribution by showing it through a box plot. A box plot is useful to showing the spread and mean as the values change.

(α,β)
👈 A: (1.0, 1.0)
👈 B: (2.0, 2.0)
👈 C: (3.0, 3.0)
👈 D: (4.0, 4.0)

Our spread in this box plot *slowly* reduces as the α,β values increases. Why is that? Meanwhile, the mean stays relatively the same

Next, we look at another example of a decreasing beta.

👈 A: (1.0, 1.0)
👈 B: (1.0, 2.0)
👈 C: (1.0, 3.0)
👈 D: (1.0, 4.0)

Observe how the mean of of the box plot gradually decreases, indicating a decrease in “success”.

Now, hold for a moment. What would the box plot look like if alpha is increased while beta is kept constant? Remember that alpha measures the success of the distribution.

Try it out in kotlin (the code is below)! 😉 You can hover over the graph to get the specific stats.

Code for Creation of Box-plot as shown above

Code is run one after another – in the jupyter cells

Side Note: We will be using the double format, a 64-bit storage type, as they are a closest for storing an approximation of a real number, which helps in improving statistical accuracy.

We can create the box plot above by first generating separate samples.

%use s2, lets-plot

// initialise number of times to insert numbers
val N:Int = 500

// initiate different values of beta and alpha
var rng = Cheng1978(1.0, 1.0, UniformRNG())
var rng2 = Cheng1978(1.0, 2.0, UniformRNG())
var rng3 = Cheng1978(1.0, 3.0, UniformRNG())
var rng4 = Cheng1978(1.0, 4.0, UniformRNG())

rng.seed(1234567890L)

//initialise array to contain the generated doubles
var a = DoubleArray(N)
var b = DoubleArray(N)
var c = DoubleArray(N)
var d = DoubleArray(N)

//insert random numbers into array
for (i in (0..N-1)){
    a[i] = rng.nextDouble()
    b[i] = rng2.nextDouble()
    c[i] = rng3.nextDouble()
    d[i] = rng4.nextDouble()
}

I know it is a little hardcoded, but its to keep it simple:)

Now, we can generate the box plot, by combining the List of samples together while mapping the corresponding values to a separate distribution name (“a”, b , “c ” or “d” ).

// map data directly to Listtype required for boxplot
val data = mapOf<String, Any>(
    "cond" to List(N) { "A" } + List(N) { "B" } + List(N) { "C" } + List(N) { "D" },
    "success" to a.toList() + b.toList() + c.toList() + d.toList()
)

// find the mean of each group
val means = (data["cond"] as List zip data["success"] as List<Double>)
        .groupBy(keySelector = { it.first }, valueTransform = { it.second })
        .mapValues { it.value.average() }
val cdat = mapOf(
    "cond" to means.keys,
    "success" to means.values
)
println(cdat)

// construct a box plot
val plot = ggplot(data) {x="cond"; y="success"} + // supply data; set legends
    ggsize(300, 200) + // plot size
    geom_boxplot() // draw a basic box plot
// display the plot
plot

Code to create a beta distribution in NmDev S2

We chose the Cheng 1978 algorithm as it is generally faster. A more general application of the beta distribution can also be found here.

				
					// change the alpha and beta values accordingly
val alpha:Double = 0.1
val beta:Double = 0.2
var rng = Cheng1978(alpha, beta, new UniformRNG())

We can also generate a sequence of random numbers by sampling from the beta distribution, and compare the mean, variance, skew and kurtosis with the “theoretical” distribution.

%use s2

val size:Int = 10000000
val alpha:Double = 0.1
val beta:Double = 0.2
var rng = Cheng1978(alpha, beta, UniformRNG())

rng.seed(1234567890L)

var x = DoubleArray(size)

for (i in (0..size-1)){
    x[i] = rng.nextDouble()
}
val mean = Mean(x)
val varianc = Variance(x)
val skew = Skewness(x)
val kurtosis = Kurtosis(x)

val betadist = BetaDistribution(alpha,beta)

println("theoretical mean = %f, sample %s".format(betadist.mean(), mean))
println("theoretical variance = %f, sample %s".format(betadist.variance(), varianc))
println("theoretical skew = %f, sample %s".format(betadist.skew(), skew))
println("theoretical kurtosis = %f, sample %s".format(betadist.kurtosis(), kurtosis))

				
					theoretical mean = 0.333333, sample mean: 0.333094; N: 10000000
theoretical variance = 0.170940, sample var: 0.170869, stdev: 0.413363, N: 10000000
theoretical skew = 0.701066, sample skew: 0.702149; N: 10000000
theoretical kurtosis = -1.304348, sample kurtosis: -1.302781; N: 10000000

Why use a beta distribution? The parameters of number of successes or failures can easily be adjusted.

This is useful if you only have a limited test run, and are only able to get a certain number of data. As time goes on, the successes or failures can be updated, whilst still allowing random numbers generated to reflect the changes.

Distributions relating to the Poisson Process

What is a poisson process? It basically helps us to describe events which occur independently in the future with a given frequency or rate set by us!

Exponential Distribution

Trains in Singapore travelled about 26.7 million km in year 2020, about 73150km per day.

For simplicity sake, we take the mean distance traveled between breakdowns as their average of the train lines 4.322 million km between failure. This translates to a rate of 59 days between a failure longer than 5 minutes.

Note: This is not a good approximation as there are flaws. One of them is assuming a constant train ridership –> which affects the amount of distance covered!!

A Poisson distribution like the one we did earlier are good for monitoring number of events in a given time period (eg. day, year). If the number of events per unit time follows a Poisson distribution, then the amount of time between events follows the exponential distribution. Exponential Distributions are thus good for predicting success, failure or arrival times.

The probability density function for the exponential distribution is described like this:

$f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases}$

$\lambda \ge 0 $ is the rate parameter
Exponential‘s parameter λ is the same as that of Poisson process (λ).

Using the example above, we can plot the exponential distribution describing the scenario of train breakdowns, where the failure rate is $\frac{1}{59}=0.0169$

%use s2

// the number of data points to generate
val N = 300

// the rate of the exponential distribution
val lambda = 0.0169
// construct a probability distribution
val F: ProbabilityDistribution = ExponentialDistribution(lambda)
// construct a random number generator to do inverse transform sampling
val rng = InverseTransformSampling(F)
// generate the random numbers
val values = DoubleArray(N)

val mean = Mean()

for (i in 0 until N) {
    val x: Double = rng.nextDouble()
    mean.addData(x)
    values[i] = x
}

%use s2, lets-plot

val valuess = mapOf<String, List<Double>>(
    "V1" to values.toList(),
)

val plot = lets_plot(valuess) { x = "V1" } + // supply data; set legends
    ggsize(500, 250) + // plot size
    geom_histogram(binWidth=5, bins=350)   // draw a basic histogram
	
println("Mean of the distribution: ${mean.value()}")

plot

Note: I specify the number of bins in the histogram, even though it is not entirely necessary. You can play around with this value to see how the shape of the histogram changes

The mean stays fairly close to the occurence of failure that we gave earlier. Yet the distribution is fairly skewed towards the left.

Assumption: One assumption made in this distribution is that the “failure rate” does not take into account previous events. That is, if the train had broken down recently, the distribution will not be able to take that into account. You cant “backdate” as occurrence of events in this distribution are “independent”.

Gamma Distribution

Previously, we tried to “predict” or model the disruptions taking more than 5 minutes of trains on the MRT system. Let’s say that after the 5th disruption within a year, we will need to hire a expert from overseas to investigate the causes of the disruptions. Hence, we will want to find out when the 5th disruption will occur.

One can think of this distribution as a continuation of the previous one. Similar to the exponential distrbution, it takes into account a parameter λ which is the rate of events following the Poisson process. However, a new parameter called k is the number of events we are waiting for.

The gamma distribution is described by the gamma function Γ (duh), where if arrivals of events follow a Poisson process with a rate λ, the wait time until k arrivals follows Γ(k, λ).

Actually, the gamma function is used for many other distributions (eg. Beta distribution, Chi-squared distribution as well.

At the heart of it is the idea of improving the ability to generate factorials.
Recall, how do you come up with say the factorial for 5? You likely would have thought that 5! = 5 × 4 × 3 × 2 ×1. But what if we needed the factorial for 5.31?

The gamma function is a formula to allow us to generate the factorial from even numbers with decimal points. This is especially important since many of the random numbers that we can generate are not simply integers, but Double Types (number with decimal points) as well.

Let’s plot the distribution for this gamma distribution. Notice that the value of λ (in the code below) is 59.0 instead of 0.0169. The reason for this is that the event rate parameter of the gamma distribution can be characterised by either θ (the mean wait time) or λ (the frequency of the events in a given time period).

Thus to make it more intuitive to understand, we take the reciprocal of the event rate, which gives us the time till the occurrence of the 5th event!

%use s2,lets-plot

// the number of data points to generate
val N = 300

// set parmameters for gamma distribution
// the rate and # of ev
val theta = 59.0
val k = 5.0

// construct a probability distribution
val F: ProbabilityDistribution = GammaDistribution(k, theta)
// construct a random number generator to do inverse transform sampling
val rng = InverseTransformSampling(F)
// generate the random numbers
val values = DoubleArray(N)

//initialize stats
val mean = Mean()

//insert values into dataset
for (i in 0 until N) {
    val x: Double = rng.nextDouble()
    mean.addData(x)
    values[i] = x
}


val xvalues = mapOf<String, List<Double>>(
    "T (Time)" to values.toList(),
)


val plot = lets_plot(xvalues) { x = "T (Time)" } + // supply data; set legends
    ggsize(500, 250) + // plot size
    geom_histogram(binWidth=10, bins=300)   // draw a basic histogram
	
println("Mean of the distribution: ${mean.value()}")

plot

This allows us to visualise how likely we are to exceed 5x breakdowns within a given time period.

Normal (aka Gaussian) Distribution

All the previous distributions we learn have histograms which have a certain skew (left or right)

But in real-life…

Most independent random things that happen (eg. results of the most recent Science Test!) will tend towards a mean $\mu$, after a certain number of events have happened. This is possible even if the random event (eg. how hard I prepared for the test) itself does not happen in a normal distribution.

The spread or “width” of the bell curve is defined by its standard deviation $σ$
The variance of a distribution is represented as $\sigma^2$

For the standard normal distribution with mean of 0 and standard deviation of 1, almost 68% of the data falls within one standard deviation from mean and 95 % within 2 std. deviations.

In Kotlin, we can generate standard normal samples and normal samples based on a specified mean $μ$ and variance $\sigma^2$, as shown below

				
					%use s2

val uniform = MersenneTwister()
uniform.seed(1234567890L)

val rng1 = Zignor2005(uniform)
val N:Int = 1000

/* Create an array of size N which will 
soon be used to store doubles*/
var arr1 = DoubleArray(N)

// kotlin range(s) are inclusive of both starting and ending boundaries
for (i in 0..(N-1)) {
    arr1[i] = rng1.nextDouble()
}

//check the statistics of the random samples
val var1 = Variance(arr1)

println("mean = %f, stdev = %f".format(var1.mean(), var1.standardDeviation()))

/* Repeat for the normal distribution */ 
val rng2 = NormalRNG(1.0,2.0,rng1) // mean = 1, std deviation = 2

/* Create an array of size N which will 
soon be used to store doubles*/
var arr2 = DoubleArray(N)

// kotlin range(s) are inclusive of both starting and ending boundaries
for (i in 0..(N-1)) {
    arr2[i] = rng2.nextDouble()
}

//check the statistics of the random samples
val var2 = Variance(arr2)

println("mean = %f, stdev = %f".format(var2.mean(), var2.standardDeviation()))

				
					mean = 0.011830, stdev = 1.009379
mean = 0.940225, stdev = 2.037228

Thinking time 🤔: Notice that the values are close to their respective means and variance. What happens if you increase $N$, the number of samples?

Previous Topic

Back to Lesson

Next Topic