Imagine we have a dataset X with ‘n’ features, and let’s assume we have ‘k’ classes, each represented as Ck; if k = 2, it would be considered binary classification. Now we have a datapoint x, right, and we want to find out the probability of its class label given x: P(Ck | x). Reading it aloud, it states: the probability of the class label given datapoint x, where x is (x1, x2, x3, …, xn) since it has ‘n’ features. So using Bayes’ theorem we can write it as

P(Ck | x) = P(Ck) * P(x | Ck) / P(x)
Now let’s unravel the numerator and denominator for a better understanding. Intuitively, what we will be doing is: the class with the highest probability value will be our class label. The denominator P(x) is the same for every possible k, so it won’t be a factor in deciding which class has the highest probability score. So let’s ignore the denominator and focus on the numerator. From what we have learned up till now about conditional probability, we can say P(Ck) * P(x | Ck) = P(Ck ∩ x), often written as P(Ck, x); this distribution is called the joint probability model.
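To make this concrete, here is a minimal sketch in Python. The priors and likelihoods are made-up toy numbers purely for illustration; the point is that since P(x) is identical for every class, we only need to compare the numerators:

```python
# Toy example: pick the class with the largest numerator P(Ck) * P(x | Ck).
# The denominator P(x) is the same for every class, so we can skip it.
priors = {"c1": 0.6, "c2": 0.4}          # P(Ck) -- made-up numbers
likelihoods = {"c1": 0.02, "c2": 0.05}   # P(x | Ck) for one datapoint x -- made-up

scores = {c: priors[c] * likelihoods[c] for c in priors}
print(scores)                            # {'c1': 0.012, 'c2': 0.02}
print(max(scores, key=scores.get))       # 'c2' -- highest numerator wins
```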
There is something called the chain rule for conditional probability.
As we know (A ⋂ B) = (B ⋂ A), so in the same way
P(Ck, x1, x2, …, xn) = P(x1, x2, …, xn, Ck)
Now if we think of x1 as A and the rest of the terms (x2, x3, …, xn, Ck) as our B,
then the probability of A ⋂ B is P(A | B) * P(B) by conditional probability, right!
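As a quick sanity check, here is that product rule verified on a made-up joint distribution over two binary variables (the numbers are arbitrary; they just have to sum to 1):

```python
# Joint distribution over binary A and B (arbitrary toy numbers summing to 1).
joint = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.40}

p_b1 = sum(p for (a, b), p in joint.items() if b == 1)  # P(B=1) = 0.7
p_a1_given_b1 = joint[(1, 1)] / p_b1                    # P(A=1 | B=1)
print(p_a1_given_b1 * p_b1, joint[(1, 1)])              # both ≈ 0.4
```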
P(x1, x2, …, xn, Ck) = P(x1 | x2, …, xn, Ck) * P(x2, …, xn, Ck) … (1)
Now we will keep our first term as is and focus on the second term, which is
P(x2, x3, …, xn, Ck)
Now if we use x2 as A and the rest of the terms (x3, x4, …, xn, Ck) as our B, what we get is
P(x2, x3, …, xn, Ck) = P(x2 | x3, x4, …, xn, Ck) * P(x3, …, xn, Ck)
Substituting this back into equation (1), we get
P(x1, x2, …, xn, Ck) = P(x1 | x2, …, xn, Ck) * P(x2 | x3, x4, …, xn, Ck) * P(x3, …, xn, Ck)
So when we follow this same step all the way down to the last term, what we get is
P(x1, …, xn, Ck) = P(x1 | x2, …, xn, Ck) * P(x2 | x3, …, xn, Ck) * P(x3 | x4, …, xn, Ck) * … * P(xn-1 | xn, Ck) * P(xn | Ck) * P(Ck)
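Because each factor in the expansion telescopes against the next, the chain rule is an identity, and we can verify it numerically on a tiny made-up joint distribution over (x1, x2, Ck) (all probabilities below are arbitrary and sum to 1):

```python
import itertools

# Toy joint distribution P(x1, x2, c) over three binary variables (made-up numbers).
probs = [0.05, 0.10, 0.15, 0.20, 0.05, 0.15, 0.10, 0.20]
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def marginal(**fixed):
    """Sum the joint over all entries that match the fixed values of x1, x2, c."""
    return sum(p for (x1, x2, c), p in joint.items()
               if all({"x1": x1, "x2": x2, "c": c}[k] == v for k, v in fixed.items()))

# Chain rule: P(x1, x2, c) = P(x1 | x2, c) * P(x2 | c) * P(c), checked at (1, 0, 1).
lhs = joint[(1, 0, 1)]
rhs = (joint[(1, 0, 1)] / marginal(x2=0, c=1)) \
    * (marginal(x2=0, c=1) / marginal(c=1)) \
    * marginal(c=1)
print(lhs, rhs)  # both ≈ 0.15
```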
What Naive Bayes says is: let’s make a conditional independence assumption.
Let’s take a concrete example: if P(A | B, C) = P(A | C),
it means A is conditionally independent of B given C.
So P(xi | xi+1, xi+2, …, xn, Ck) = P(xi | Ck); it means xi is independent of xi+1, xi+2, …, xn given the class label Ck.
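Here is what that assumption looks like in code: we build a toy joint distribution that satisfies the naive assumption by construction (all factor tables are made up), and check that once the class is known, x2 tells us nothing extra about x1:

```python
# Joint built as P(x1 | c) * P(x2 | c) * P(c), so the naive assumption holds exactly.
p_c = {0: 0.4, 1: 0.6}
p_x1_given_c = {0: 0.3, 1: 0.8}  # P(x1=1 | c) -- made-up numbers
p_x2_given_c = {0: 0.5, 1: 0.2}  # P(x2=1 | c) -- made-up numbers

def joint(x1, x2, c):
    px1 = p_x1_given_c[c] if x1 else 1 - p_x1_given_c[c]
    px2 = p_x2_given_c[c] if x2 else 1 - p_x2_given_c[c]
    return px1 * px2 * p_c[c]

# P(x1=1 | x2=1, c=1) should equal P(x1=1 | c=1) = 0.8.
num = joint(1, 1, 1)
den = joint(0, 1, 1) + joint(1, 1, 1)
print(num / den)  # ≈ 0.8 -- x2 carries no extra information once c is known
```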
Hence, applying this to every term in the expansion and reordering, we get
P(Ck | x1, x2, …, xn) ∝ P(Ck) * P(x1 | Ck) * P(x2 | Ck) * … * P(xn | Ck)
(it is proportional rather than equal because we dropped the constant denominator P(x) earlier).
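Putting the whole derivation together, here is a minimal Naive Bayes classifier sketched from scratch. The tiny binary dataset and the add-one smoothing choice are my own assumptions purely for illustration; in practice you would reach for something like scikit-learn’s naive_bayes module:

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset: each row is (x1, x2) with a binary class label.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0), (0, 1)]
y = [1, 1, 0, 0, 1, 0]

# Estimate P(Ck) and P(xi | Ck) from counts.
class_counts = Counter(y)
feat_counts = defaultdict(Counter)  # (feature index, class) -> counts of values
for xs, c in zip(X, y):
    for i, v in enumerate(xs):
        feat_counts[(i, c)][v] += 1

def predict(xs):
    scores = {}
    for c, nc in class_counts.items():
        score = nc / len(y)  # P(Ck)
        for i, v in enumerate(xs):
            # P(xi | Ck) with add-one (Laplace) smoothing over the 2 binary values
            score *= (feat_counts[(i, c)][v] + 1) / (nc + 2)
        scores[c] = score    # proportional to P(Ck | x)
    return max(scores, key=scores.get)

print(predict((1, 0)))  # 1
print(predict((0, 1)))  # 0
```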
To get a better understanding, you can refer to
https://en.wikipedia.org/wiki/Naive_Bayes_classifier