Quasi-Newton methods are a family of algorithms built on the Newton-Raphson method. The line search in these methods is based on an \( n\times n \) matrix \( S \), which serves the same purpose as the inverse Hessian matrix in the Newton-Raphson method. Quasi-Newton methods are thus another family of algorithms that do not require the Hessian. The Quasi-Newton update can be written as

\( x_{k+1}=x_k-\alpha_kS_kg_k \)

When \( S_k=I \), the algorithm reduces to the steepest-descent method. When \( S_k=H^{-1} \), the algorithm is the Newton-Raphson method. The iteration for \( x_{k+1} \) converges fastest when \( S_k=H^{-1} \). Quasi-Newton methods combine the advantages of the steepest-descent and conjugate-direction methods. They are among the most efficient optimization methods and are used extensively in many applications.

Let the difference between successive iterates be \( \delta_k=x_{k+1}-x_k \) and the difference between successive gradients \( g_k \) be \( \gamma_k=g_{k+1}-g_k \); then \( \gamma_k=H\delta_k \). The direction vector is determined as \( d_k=-S_kg_k \), and the step size \( \alpha_k \) is determined by a line search that minimizes the function \( f(x_k+\alpha d_k) \). We want to avoid inverting a matrix, that is, computing \( S_k=H^{-1} \) and checking that \( S_k \) is positive definite in each iteration; instead we simply want an updating rule for \( S_{k+1} \) such as

\( S_{k+1}=S_k+C_k \).

Here \( C_k \) is an \( n\times n \) correction matrix that can be computed from available data. For any definition of \( C_k \), \( S_{k+1} \) must satisfy these properties:

1. \( S_{k+1}\gamma_k=\delta_k \).

2. Vectors \( \{\delta_0,\delta_1,…,\delta_{n-1}\} \) must be conjugate direction vectors.

3. A positive definite \( S_k \) must produce a positive definite \( S_{k+1} \).

The second property states that the advantages of the conjugate-direction methods carry over to the Quasi-Newton methods. The third property ensures that the direction vector is a descent direction in each iteration. There are many ways to define \( C_k \), and they give rise to different variants of the Quasi-Newton method.
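To make the framework concrete, the sketch below shows a generic Quasi-Newton iteration in plain Kotlin. It is a minimal illustration rather than the library implementation used later in this section; the names quasiNewton, dot, matVec and updateS are invented for this example, the step size is chosen by a simple backtracking (Armijo) line search, and the correction rule \( C_k \) is left as a pluggable function.

// A minimal Quasi-Newton skeleton (illustration only).
// f is the objective, grad its gradient, and updateS the chosen correction rule S_{k+1} = S_k + C_k.
fun dot(a: DoubleArray, b: DoubleArray): Double = a.indices.sumOf { a[it] * b[it] }

fun matVec(m: Array<DoubleArray>, v: DoubleArray): DoubleArray = DoubleArray(m.size) { i -> dot(m[i], v) }

fun quasiNewton(
    f: (DoubleArray) -> Double,
    grad: (DoubleArray) -> DoubleArray,
    x0: DoubleArray,
    updateS: (Array<DoubleArray>, DoubleArray, DoubleArray) -> Array<DoubleArray>,
    maxIter: Int = 100
): DoubleArray {
    val n = x0.size
    var x = x0.copyOf()
    var S = Array(n) { i -> DoubleArray(n) { j -> if (i == j) 1.0 else 0.0 } } // S_0 = I
    var g = grad(x)
    repeat(maxIter) {
        val Sg = matVec(S, g)
        val d = DoubleArray(n) { i -> -Sg[i] }                  // d_k = -S_k * g_k
        // backtracking (Armijo) line search on f(x_k + alpha * d_k)
        var alpha = 1.0
        val fx = f(x)
        val slope = dot(g, d)
        while (alpha > 1e-12 && f(DoubleArray(n) { i -> x[i] + alpha * d[i] }) > fx + 1e-4 * alpha * slope) {
            alpha *= 0.5
        }
        val xNext = DoubleArray(n) { i -> x[i] + alpha * d[i] } // x_{k+1} = x_k + alpha_k * d_k
        val gNext = grad(xNext)
        val delta = DoubleArray(n) { i -> xNext[i] - x[i] }     // delta_k = x_{k+1} - x_k
        val gamma = DoubleArray(n) { i -> gNext[i] - g[i] }     // gamma_k = g_{k+1} - g_k
        S = updateS(S, delta, gamma)                            // S_{k+1} = S_k + C_k
        x = xNext
        g = gNext
    }
    return x
}

Any of the correction rules defined in the following subsections can be plugged in as updateS.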

Rank-One Method

The Rank-One method is characterized by a correction matrix \( C_k \) that has unit rank. Assume that

\( S_{k+1}\gamma_k=\delta_k \).

Let the correction have the rank-one form \( S_{k+1}=S_k+\beta_k\xi_k\xi_k^T \), where \( \beta_k \) is a scalar and \( \xi_k \) is a vector. Then

\( \delta_k=S_{k+1}\gamma_k=S_k\gamma_k+\beta_k\xi_k\xi_k^T\gamma_k \)

And 

\( \gamma^T_k(\delta_k-S_k\gamma_k)=\beta_k\gamma^T_k\xi_k\xi_k^T\gamma_k=\beta_k(\xi^T_k\gamma_k)^2 \)

\( (\delta_k-S_k\gamma_k)(\delta_k-S_k\gamma_k)^T=\beta_k^2(\xi_k^T\gamma_k)^2\,\xi_k\xi_k^T=\left[\beta_k(\xi_k^T\gamma_k)^2\right]\beta_k\xi_k\xi_k^T \)

Then we have

\(\displaystyle \beta_k\xi_k\xi_k^T=\frac{(\delta_k-S_k\gamma_k)(\delta_k-S_k\gamma_k)^T}{\gamma_k^T(\delta_k-S_k\gamma_k)} \)

Therefore,

\(\displaystyle S_{k+1}=S_k+\beta_k\xi_k\xi_k^T=S_k+\frac{(\delta_k-S_k\gamma_k)(\delta_k-S_k\gamma_k)^T}{\gamma_k^T(\delta_k-S_k\gamma_k)} \).
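It is easy to verify that this update satisfies the first property \( S_{k+1}\gamma_k=\delta_k \), since \( (\delta_k-S_k\gamma_k)^T\gamma_k \) is a scalar:

\( \displaystyle S_{k+1}\gamma_k=S_k\gamma_k+\frac{(\delta_k-S_k\gamma_k)(\delta_k-S_k\gamma_k)^T\gamma_k}{\gamma_k^T(\delta_k-S_k\gamma_k)}=S_k\gamma_k+(\delta_k-S_k\gamma_k)=\delta_k \)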

There are two problems with this method. First, a positive definite \( S_k \) may not produce a positive definite \( S_{k+1} \); in that case, the next direction is not a descent direction. Second, the denominator in the correction formula may approach or become equal to zero. If that happens, the method breaks down because \( S_{k+1} \) is undefined.
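The Rank-One correction can be sketched in plain Kotlin as follows (rankOneUpdate is an illustrative name, not a library function); the guard on the denominator reflects the breakdown case just described.

// Rank-one update (sketch): S_{k+1} = S_k + (delta - S*gamma)(delta - S*gamma)^T / (gamma^T (delta - S*gamma))
fun rankOneUpdate(S: Array<DoubleArray>, delta: DoubleArray, gamma: DoubleArray): Array<DoubleArray> {
    val n = delta.size
    val Sg = DoubleArray(n) { i -> (0 until n).sumOf { j -> S[i][j] * gamma[j] } } // S_k * gamma_k
    val u = DoubleArray(n) { i -> delta[i] - Sg[i] }                               // delta_k - S_k * gamma_k
    val denom = (0 until n).sumOf { i -> gamma[i] * u[i] }                         // gamma_k^T (delta_k - S_k * gamma_k)
    if (Math.abs(denom) < 1e-12) return S // denominator near zero: the update is undefined, so skip it
    return Array(n) { i -> DoubleArray(n) { j -> S[i][j] + u[i] * u[j] / denom } }
}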

Davidon-Fletcher-Powell Method

The Davidon-Fletcher-Powell (DFP) method is similar to the Rank-One method, but it has one important advantage: if the initial matrix \( S_0 \) is positive definite, then all subsequent matrices are positive definite. Unlike the Rank-One method, every new direction is therefore a descent direction.

The DPF formula is

\(\displaystyle S_{k+1}=S_k+\frac{\delta_k\delta_k^T}{\delta_k^T\gamma_k}-\frac{S_k\gamma_k\gamma_k^TS_k}{\gamma_k^TS_k\gamma_k} \)

The correction is an \( n\times n \) symmetric matrix of rank two.
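The DFP correction can be sketched in the same plain-Kotlin style (dfpUpdate is an illustrative name, not a library function); because \( S_k \) is symmetric, \( S_k\gamma_k\gamma_k^TS_k \) is simply the outer product of \( S_k\gamma_k \) with itself.

// DFP update (sketch): S_{k+1} = S_k + delta*delta^T / (delta^T gamma) - (S*gamma)(S*gamma)^T / (gamma^T S gamma)
fun dfpUpdate(S: Array<DoubleArray>, delta: DoubleArray, gamma: DoubleArray): Array<DoubleArray> {
    val n = delta.size
    val Sg = DoubleArray(n) { i -> (0 until n).sumOf { j -> S[i][j] * gamma[j] } } // S_k * gamma_k
    val dg = (0 until n).sumOf { i -> delta[i] * gamma[i] }                        // delta_k^T gamma_k
    val gSg = (0 until n).sumOf { i -> gamma[i] * Sg[i] }                          // gamma_k^T S_k gamma_k
    return Array(n) { i ->
        DoubleArray(n) { j -> S[i][j] + delta[i] * delta[j] / dg - Sg[i] * Sg[j] / gSg }
    }
}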

Broyden-Fletcher-Goldfarb-Shanno Method

The BFGS method formula is

\( \displaystyle S_{k+1}=S_k+\left(1+\frac{\gamma^T_kS_k\gamma_k}{\gamma_k^T\delta_k}\right)\frac{\delta_k\delta_k^T}{\gamma^T_k\delta_k}-\frac{\delta_k\gamma_k^TS_k+S_k\gamma_k\delta_k^T}{\gamma_k^T\delta_k} \)

The BFGS method has the following properties:

1. \( S_{k+1} \) becomes identical to \( H^{-1} \) for \( k=n-1 \).

2. Direction vectors \( \{\delta_i\}_{i=0,1,\ldots,n-1} \) form a conjugate set.

3. \( S_{k+1} \) is positive definite if \( S_k \) is positive definite.

4. \( \delta_k^T\gamma_k=\delta^T_kg_{k+1}-\delta^T_kg_k>0 \).

The BFGS method is generally regarded as the most effective of these variants.
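The BFGS correction can be sketched in the same plain-Kotlin style (bfgsUpdate is an illustrative name, not a library function):

// BFGS update (sketch):
// S_{k+1} = S_k + (1 + gamma^T S gamma / gamma^T delta) * delta*delta^T / (gamma^T delta)
//               - (delta (S*gamma)^T + (S*gamma) delta^T) / (gamma^T delta)
fun bfgsUpdate(S: Array<DoubleArray>, delta: DoubleArray, gamma: DoubleArray): Array<DoubleArray> {
    val n = delta.size
    val Sg = DoubleArray(n) { i -> (0 until n).sumOf { j -> S[i][j] * gamma[j] } } // S_k * gamma_k
    val gd = (0 until n).sumOf { i -> gamma[i] * delta[i] }                        // gamma_k^T delta_k
    val gSg = (0 until n).sumOf { i -> gamma[i] * Sg[i] }                          // gamma_k^T S_k gamma_k
    val c = (1.0 + gSg / gd) / gd
    return Array(n) { i ->
        DoubleArray(n) { j ->
            S[i][j] + c * delta[i] * delta[j] - (delta[i] * Sg[j] + Sg[i] * delta[j]) / gd
        }
    }
}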

Huang Family Method

The Huang family is a general formula that includes the Rank-One, DFP, BFGS, Pearson, and McCormick methods as special cases. It has the following form:

\( \displaystyle S_{k+1}=S_k+\frac{\delta_k(\theta\delta_k+\phi S^T_k\gamma_k)^T}{(\theta\delta_k+\phi S^T_k\gamma_k)^T\gamma_k}-\frac{S_k\gamma_k(\psi\delta_k+\omega S^T_k\gamma_k)^T}{(\psi\delta_k+\omega S^T_k\gamma_k)^T\gamma_k} \)

where \( \theta \), \( \phi \), \( \psi \) and \( \omega \) are independent parameters.

By letting \( \theta=1,\, \phi=-1,\, \psi=1,\, \omega=-1 \), we have the Rank-One formula.

By letting \( \theta=1,\, \phi=0,\, \psi=0,\, \omega=1 \), we have the DFP formula.

By letting \( \theta=1,\, \phi=0,\, \psi=1,\, \omega=0 \), we have the McCormick formula:

\(\displaystyle S_{k+1}=S_k+\frac{(\delta_k-S_k\gamma_k)\delta^T_k}{\delta_k^T\gamma_k} \)

By letting \( \theta=0,\, \phi=1,\, \psi=0,\, \omega=1 \), we have the Pearson formula:

\( \displaystyle S_{k+1}=S_k+\frac{(\delta_k-S_k\gamma_k)\gamma_k^TS_k}{\gamma_k^TS_k\gamma_k} \)

By letting \( \displaystyle \frac{\phi}{\theta}=\frac{-\gamma_k^T\delta_k}{\gamma_k^T\delta_k+\gamma^T_kS_k\gamma_k},\,\psi=1,\, \omega=0 \), we have the BFGS formula.
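The whole family can be sketched as one function parameterized by \( \theta \), \( \phi \), \( \psi \) and \( \omega \) (huangUpdate is an illustrative name, not a library function; \( S_k \) is assumed symmetric so that \( S_k^T\gamma_k=S_k\gamma_k \)).

// Huang family update (sketch):
// S_{k+1} = S_k + delta*u^T / (u^T gamma) - (S*gamma)*v^T / (v^T gamma),
// where u = theta*delta + phi*S^T*gamma and v = psi*delta + omega*S^T*gamma.
fun huangUpdate(
    S: Array<DoubleArray>, delta: DoubleArray, gamma: DoubleArray,
    theta: Double, phi: Double, psi: Double, omega: Double
): Array<DoubleArray> {
    val n = delta.size
    val Sg = DoubleArray(n) { i -> (0 until n).sumOf { j -> S[i][j] * gamma[j] } } // S_k^T gamma_k (= S_k gamma_k for symmetric S_k)
    val u = DoubleArray(n) { i -> theta * delta[i] + phi * Sg[i] }
    val v = DoubleArray(n) { i -> psi * delta[i] + omega * Sg[i] }
    val ug = (0 until n).sumOf { i -> u[i] * gamma[i] } // u^T gamma_k
    val vg = (0 until n).sumOf { i -> v[i] * gamma[i] } // v^T gamma_k
    return Array(n) { i ->
        DoubleArray(n) { j -> S[i][j] + delta[i] * u[j] / ug - Sg[i] * v[j] / vg }
    }
}

For example, huangUpdate(S, delta, gamma, 1.0, 0.0, 0.0, 1.0) reproduces the DFP correction, (1, 0, 1, 0) the McCormick correction, and (0, 1, 0, 1) the Pearson correction above.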

The following code minimizes the Himmelblau function \( f(x_1,x_2)=(x_1^2+x_2-11)^2+(x_1+x_2^2-7)^2 \) using different Quasi-Newton methods.

 

				
// The Himmelblau function:
// f(x) = (x1^2 + x2 - 11)^2 + (x1 + x2^2 - 7)^2
val f: RealScalarFunction = object : RealScalarFunction {
    override fun evaluate(x: Vector): Double {
        val x1 = x.get(1)
        val x2 = x.get(2)

        var result = Math.pow(x1 * x1 + x2 - 11, 2.0)
        result += Math.pow(x1 + x2 * x2 - 7, 2.0)
        return result
    }

    override fun dimensionOfDomain(): Int {
        return 2
    }

    override fun dimensionOfRange(): Int {
        return 1
    }
}

// The gradient of the Himmelblau function
val g: RealVectorFunction = object : RealVectorFunction {
    override fun evaluate(x: Vector): Vector {
        val x1 = x.get(1)
        val x2 = x.get(2)

        val w1 = x1 * x1 + x2 - 11
        val w2 = x1 + x2 * x2 - 7

        val result = DoubleArray(2)
        result[0] = 4 * w1 * x1 + 2 * w2 // df/dx1
        result[1] = 2 * w1 + 4 * w2 * x2 // df/dx2
        return DenseVector(result)
    }

    override fun dimensionOfDomain(): Int {
        return 2
    }

    override fun dimensionOfRange(): Int {
        return 2
    }
}

val problem: C2OptimProblem = C2OptimProblemImpl(f, g)

// the rank-one method
val solver1: QuasiNewtonMinimizer = RankOneMinimizer(
    1e-16, // precision
    15)    // maximum number of iterations
val soln1 = solver1.solve(problem)
val xmin1 = soln1.search(DenseVector(doubleArrayOf(6.0, 6.0)))
val fmin1 = f.evaluate(xmin1)
println(String.format("f(%s) = %.16f", xmin1.toString(), fmin1))

// the DFP method
val solver2: QuasiNewtonMinimizer = DFPMinimizer(
    1e-16,
    15)
val soln2 = solver2.solve(problem)
val xmin2 = soln2.search(DenseVector(doubleArrayOf(6.0, 6.0)))
val fmin2 = f.evaluate(xmin2)
println(String.format("f(%s) = %.16f", xmin2.toString(), fmin2))

// the BFGS method
val solver3: QuasiNewtonMinimizer = BFGSMinimizer(
    false,
    1e-16,
    15)
val soln3 = solver3.solve(problem)
val xmin3 = soln3.search(DenseVector(doubleArrayOf(6.0, 6.0)))
val fmin3 = f.evaluate(xmin3)
println(String.format("f(%s) = %.16f", xmin3.toString(), fmin3))

// the Huang family method with parameters (0, 1, 0, 1)
val solver4: QuasiNewtonMinimizer = HuangMinimizer(
    0.0,
    1.0,
    0.0,
    1.0,
    1e-16,
    15)
val soln4 = solver4.solve(problem)
val xmin4 = soln4.search(DenseVector(doubleArrayOf(6.0, 6.0)))
val fmin4 = f.evaluate(xmin4)
println(String.format("f(%s) = %.16f", xmin4.toString(), fmin4))

// the Pearson method
val solver5: QuasiNewtonMinimizer = PearsonMinimizer(
    1e-16,
    15)
val soln5 = solver5.solve(problem)
val xmin5 = soln5.search(DenseVector(doubleArrayOf(6.0, 6.0)))
val fmin5 = f.evaluate(xmin5)
println(String.format("f(%s) = %.16f", xmin5.toString(), fmin5))

The output is:

				
f([3.000000, 2.000000]) = 0.000000000000000
f([3.000000, 2.000000]) = 0.000000000000000
f([3.000000, 2.000000]) = 0.000000000000000
f([3.000000, 2.000000]) = 0.000000000000000
f([3.000000, 2.000000]) = 0.000000000000000

Quasi-Newton methods are generally more efficient than the conjugate-direction methods: they typically take fewer iterations and give better accuracy and precision.