While most biomedical modalities produce images directly, in MRI the images are only obtained after a reconstruction process. In this chapter, we present the current approaches to MRI reconstruction. We recall in Section 3.1 the continuous model of the data-collection process in MRI and derive its discretized version. In Section 3.2, we review the different approaches that lead to linear reconstructions and introduce the key concepts of the inverse-problem formalism. This formalism maps image reconstruction into an optimization problem, with the possibility of imposing a priori constraints to distinguish the solution from other possible candidates and improve reconstruction quality. We justify in Section 3.3 the use of sparsity-promoting priors in MRI and explain how they can be imposed via a proper regularization term. Finally, we review in Section 3.4 the algorithmic procedures that are theoretically capable of achieving the desired reconstructions while being suited to the practical constraints encountered in MRI.
A radio-frequency pulse is emitted to initiate nuclear magnetic resonance (NMR). It excites the spins in a 2-D plane or a 3-D volume, depending on the type of acquisition. After excitation, the excited spins behave as radio-frequency emitters whose precessing frequency and phase are modified depending on their positions. This is achieved thanks to the time-varying magnetic gradient fields that are applied during the relaxation, defining a trajectory k in the k-space domain. The modulated part of the signal received by a coil of sensitivity Si(r) is given by
mi(k) = ∫ Si(r) ρ(r) exp(−2jπ k·r) dr. (1)
The signal ρ is referred to as the object. It is proportional to the spin density but might also depend upon other local characteristics. More details on the derivation of relation (1) are provided in Chapter 2.
For an array of R receiving coils with sensitivities denoted by S1⋯ SR and a k-space trajectory sampled at N points kn, we represent the measurements concatenated in a global RN× 1 vector
m = ((m1,1, …, mN,1), …, (m1,i, …, mN,i), …, (m1,R, …, mN,R))T. (2)
From here on, we consider that the Fourier domain and, in particular, the sampling points kn, are scaled to make the Nyquist sampling interval unity. This can be done without any loss of generality if the space domain is scaled accordingly. Therefore, we model the object as a linear combination of pixel-domain basis functions ϕp that are shifted replicates of some generating function ϕ, so that
ρ = ∑p c[p] ϕp, with ϕp(r) = ϕ(r − p). (3)
Given a sampled version of the coil sensitivity si[p], the sensitivity-weighted object is modeled by
Si ρ = ∑p si[p] c[p] ϕp. (4)
The standard implicit choice for ϕ is Dirac’s delta, even though it is hardly justified from an approximation-theoretic point of view. Different discretizations have been proposed, for example by Sutton et al. [38] with ϕ as a boxcar function or, later, by Delattre et al. [39] with B-splines. It is only recently [40] that the details have been worked out to recover the image for a general non-interpolating ϕ, which is the case, for instance, of B-splines of degree greater than 1. The image to be reconstructed—i.e., the sampled version of the object ρ(p)—is obtained by filtering the coefficients c[p] with the discrete filter
P(ejω) = ∑h ϕ̂(ω + 2πh), (5)
where ϕ̂ denotes the Fourier transform of ϕ.
Since a finite field of view (FOV) determines sets of coefficients c and si with a finite number M of elements, we handle them as vectors c and si, keeping the discrete coordinates p as implicit indexing.
Due to sparsity properties that are discussed later in this chapter, it might be preferable to represent the object in terms of wavelet coefficients. In the wavelet formalism, some constraints apply to ϕ. It must be a scaling function that satisfies the properties for a multiresolution [41]. In that case, the wavelets can be defined as linear combinations of the ϕp and the object is equivalently characterized by its coefficients in the orthonormal wavelet basis. We refer to Mallat’s reference book [42] for a full review on wavelets. There exists a discrete wavelet transform (DWT) that bijectively maps the coefficients c to the wavelet coefficients w that represent the same object ρ in a continuous wavelet basis. In the rest of the chapter, we represent this DWT by the synthesis matrix W. Note that the matrix-vector multiplications c = Ww and w = W−1c have efficient filterbank implementations.
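For illustration, the sketch below realizes the mappings c = Ww and w = W−1c with the PyWavelets package; the orthonormal wavelet (‘db2’), the number of decomposition levels, and the periodized boundary handling are assumptions made only for this example.

```python
# Minimal sketch of the filterbank implementations of c = Ww and w = W^{-1}c.
# Assumptions: 2-D array of coefficients c, orthonormal wavelet 'db2', periodized boundaries.
import numpy as np
import pywt

def dwt_analysis(c, wavelet="db2", levels=3):
    """w = W^{-1}c: image-domain coefficients -> wavelet coefficients (flattened)."""
    coeffs = pywt.wavedec2(c, wavelet=wavelet, mode="periodization", level=levels)
    w, slices = pywt.coeffs_to_array(coeffs)
    return w, slices

def dwt_synthesis(w, slices, wavelet="db2"):
    """c = Ww: wavelet coefficients -> image-domain coefficients."""
    coeffs = pywt.array_to_coeffs(w, slices, output_format="wavedec2")
    return pywt.waverec2(coeffs, wavelet=wavelet, mode="periodization")

# Round-trip check: the two transforms are exact inverses of each other.
c = np.random.randn(64, 64)
w, sl = dwt_analysis(c)
assert np.allclose(dwt_synthesis(w, sl), c)
```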
The data-formation model (1) and the object parameterization (4) are combined to model the measurement corresponding to every point kn sampled in k-space. Accordingly, the measurement vector m is related to the coefficients c through the linear relation
m = Ec, (6)
where the MRI encoding matrix E is formed as
E = (IR ⊗ E0) [diag(s1), …, diag(sR)]T, (7)
with the symbol ⊗ standing for the Kronecker product, IR representing the R×R identity matrix, diag(si) denoting the M×M diagonal matrix built from the sampled sensitivity si, and E0 being the encoding matrix for the same MRI scan with a single homogeneous receiving coil
E0 = diag([ϕ̂(2πk1), …, ϕ̂(2πkN)]) [v1, …, vN]T. (8)
There, vn are vectors indexed like c such that vn[p] = exp(−2jπkn·p).
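For intuition only, a small 1-D sketch that assembles E0 and E explicitly according to (6)–(8) is given below; the degree-0 B-spline for ϕ (so that ϕ̂ is a sinc), the random sensitivities, and the toy dimensions are assumptions of the example, and in practice E is never formed explicitly.

```python
# Toy illustration of Eqs. (6)-(8); E is formed explicitly only because the problem is tiny.
import numpy as np

M, N, R = 16, 12, 2                              # pixels, k-space samples, coils (toy sizes)
p = np.arange(M)                                 # pixel grid with unit spacing
k = np.random.uniform(-0.5, 0.5, N)              # non-Cartesian, Nyquist-normalized k-space locations

phi_hat = np.sinc(k)                             # Fourier factor of a boxcar phi at 2*pi*k (assumption)
V = np.exp(-2j * np.pi * np.outer(k, p))         # rows are the vectors v_n
E0 = np.diag(phi_hat) @ V                        # single-coil encoding matrix, Eq. (8)

s = np.random.randn(R, M) + 1j * np.random.randn(R, M)   # sampled coil sensitivities (toy)
E = np.vstack([E0 @ np.diag(s[i]) for i in range(R)])     # Eq. (7): sensitivity-weighted copies of E0

c = np.random.randn(M)                           # object coefficients
m = E @ c                                        # noiseless measurements, Eq. (6)
```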
Due to the presence of noise and other scanner inaccuracies, the introduction of a new term b, accounting for an additive perturbation, makes the data-formation model
m = Ec + b (9)
more realistic. Equivalently, if the parameters of interest are the wavelet coefficients w, the model writes
m = Mw + b (10)
with M = EW.
In MRI, the major source of noise is a radio-frequency signal originating from the thermal motion in the object under investigation. When observed with a receiving array of coils, this noise presents non-negligible correlations across channels. In other words, the R×R channel cross-correlation matrix Θ has non-null off-diagonal entries. Accordingly, the additive perturbation is generally modeled as the realization of a centered multivariate Gaussian process b ∼ N(0, Ψ) with covariance matrix Ψ = Θ ⊗ IN.
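For simulation purposes, the following sketch draws such channel-correlated noise, assuming a known (toy) Θ and circular complex Gaussian statistics.

```python
# Sketch: drawing b ~ N(0, Theta ⊗ I_N), i.e., white along k-space but correlated across channels.
import numpy as np

N, R = 256, 4
A = np.random.randn(R, R) + 1j * np.random.randn(R, R)
Theta = A @ A.conj().T / R                       # toy Hermitian positive-definite channel covariance
L = np.linalg.cholesky(Theta)                    # Theta = L L^H

white = (np.random.randn(R, N) + 1j * np.random.randn(R, N)) / np.sqrt(2)
b = (L @ white).reshape(-1)                      # stacked per-channel noise with covariance Theta ⊗ I_N
```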
The problem of imaging is to recover the M coefficients c (or equivalently w) from the N corrupted measurements m. In this section, we review the popular approaches that lead to reconstructions that depend linearly upon the observations. We show that they are functionally equivalent. Two of these approaches rely on a stochastic interpretation of the problem, where the matrices Ψ and Υ are the known covariance matrices of the noise b and the object c, respectively. The corresponding global variances are given by vn = Tr(Ψ)/N=Tr(Θ) and vs = Tr(Υ)/M. We define the normalized covariance matrices as Ψ0=Ψ/vn and Υ0=Υ/vs. Most linear solutions involve a balancing parameter λ which is necessarily positive and can be interpreted in terms of the signal-to-noise ratio λ−1=Tr(Υ)/Tr(Ψ)=M vs/(N vn).
Depending on the scanner settings, the encoding matrix E is generally neither square nor invertible. In such cases, the Moore-Penrose pseudoinverse offers a solution to the reconstruction problem. The reconstruction matrix is then defined as
E† = (EHE)−1EH. (11)
The Hermitian transpose, denoted by the superscript H, is used because matrices in MRI have complex-valued entries. The problem of inverting a non-square matrix is tackled by considering the backprojected problem
EHm = EHEc, (12)
because the matrix EHE is square.
Considering the singular value decomposition E=UΣVH, where Σ is an RN× M matrix whose diagonal entries are the singular values σn, one gets E†=VΣ†UH, with singular values
σn† = 1/σn if σn ≠ 0, and σn† = 0 otherwise.
The major concern with pseudoinverse reconstruction resides in the propagation of noise. Indeed, very small but non-null singular values lead to a drastic amplification of the corresponding noise components. This effect is quantified by the condition number, defined as κ(E) = maxn σn / minn σn. This number, which is greater than or equal to 1, is also representative of the numerical challenge faced when inverting E. A linear inverse problem is termed “ill-conditioned” when the corresponding condition number is large. When the null space of E is not limited to {0}, the problem is said to be ill-posed.
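A minimal sketch of the pseudoinverse reconstruction (11) computed through the SVD is given below; the toy matrix E and the relative tolerance used to discard negligible singular values are assumptions of the example.

```python
# Sketch: pseudoinverse reconstruction via the SVD and its condition number.
import numpy as np

def pinv_reconstruct(E, m, tol=1e-10):
    U, s, Vh = np.linalg.svd(E, full_matrices=False)
    s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)    # invert only the non-negligible singular values
    return (Vh.conj().T * s_inv) @ (U.conj().T @ m)

E = np.random.randn(40, 32) + 1j * np.random.randn(40, 32)   # toy encoding matrix
kappa = np.linalg.cond(E)                                     # kappa(E) = max sigma / min sigma
m = E @ np.random.randn(32)
c_star = pinv_reconstruct(E, m)
```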
The aim of regularized reconstruction schemes is to improve reconstruction with respect to the pseudoinverse approach by limiting the propagation of noise in the images.
It is remarkable that (12) rewrites as ∇c(||m − Ec||22) = 0, with ∇c standing for the gradient operator with respect to c. The Moore-Penrose pseudoinverse provides a least-squares solution c⋆ = E†m to the reconstruction problem because it ensures EHm = EHEc⋆. This least-squares solution makes sense when the noise term b is independent and identically distributed.
Instead, when the noise correlation matrix Ψ0 is available, this knowledge can be exploited using the weighted pseudoinverse
EX† = (EHXE)−1EHX (13)
with the weighting matrix X = Ψ0†. The interest of this type of solution is that it takes noise correlations into account and relies less on the noisier samples. Thanks to the relation EHXE EX†m = EHXm, the weighted pseudoinverse provides a (weighted) least-squares solution.
The approach proposed by Phillips [43] and Twomey [44], for finite dimensional problems, and by Tikhonov [45], for infinite dimensional problems, defines the reconstruction as the minimization of the functional
||m − Ec||X2 + λ||Rc||2, (14)
where the notation ||· ||X with X positive-definite stands for a weighted norm such that ||v ||X2=vHXv. The functional is a trade-off between a fidelity term, which enforces consistency with the measurements, and a regularization term, which penalizes non-regular solutions with respect to the regularization matrix R. The tuning parameter λ balances the influence of these two terms. The role of the regularization term is to limit the amplification of noise that can be dramatic for ill-conditioned problems (in MRI, see for instance [46]). In practice, it is often designed with a derivation operator to favor smooth solutions. Similar to the weighted pseudoinverse solution, the weighting matrix can be chosen as X=Ψ0† yielding a reconstruction matrix that gives importance to the samples in inverse proportion to their level of noise. Another common choice is to take X diagonal such as to compensate for an inhomogeneous k-space sampling density [37]. This choice facilitates the reconstruction.
The minimization of a quadratic functional yields a linear solution. Indeed, by taking the gradient of the functional and setting it to zero, we find that the reconstruction matrix writes
FQUAD = (EHXE + λRHR)−1EHX. (15)
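A minimal dense sketch of the reconstruction (15) is shown below, assuming a small toy E, the choice X = I, and first-order finite differences for R.

```python
# Sketch: quadratically regularized reconstruction, Eq. (15), for a small dense problem.
import numpy as np

def quad_reconstruct(E, m, lam, R, X=None):
    X = np.eye(E.shape[0]) if X is None else X
    A = E.conj().T @ X @ E + lam * R.conj().T @ R
    return np.linalg.solve(A, E.conj().T @ X @ m)        # applies F_QUAD without forming the inverse

M = 32
E = np.random.randn(24, M) + 1j * np.random.randn(24, M)     # underdetermined toy encoding
R = np.eye(M) - np.roll(np.eye(M), 1, axis=1)                 # first-order finite differences
m = E @ np.random.randn(M) + 0.01 * np.random.randn(24)
c_quad = quad_reconstruct(E, m, lam=0.1, R=R)
```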
Here, the reconstruction problem is tackled within a stochastic framework. The unknowns c and b are modeled as realizations of centered multivariate Gaussian distributions: c∼ N(0, Υ) and b∼ N(0, Ψ).
According to the numerical model (9), the measurements also follow a multivariate Gaussian distribution m∼ N(0, EΥEH+Ψ).
The maximum a posteriori (MAP) solution is the vector c that maximizes the posterior distribution given the measurements m. Using Bayes’ theorem, the probability density function of the posterior distribution of c writes
p(c ∣ m) ∝ p(m ∣ c) p(c).
In the present stochastic setting, the probability density function can be expanded in
p(c ∣ m) ∝ exp(−||m − Ec||Ψ†2) exp(−||c||Υ†2). (16)
Finally, the MAP solution is the vector c that minimizes the functional
||m − Ec||Ψ0†2 + λ||c||Υ0†2. (17)
We introduced the normalized covariance matrices in the latter expression so that the parameter λ, which is the inverse of the signal-to-noise ratio, appears explicitly.
Similarly to the previous approaches, the functional to be minimized is composed of quadratic terms. As a consequence, the solution is linear, characterized by the reconstruction matrix
FMAP = (EHΨ0†E + λΥ0†)−1EHΨ0†. (18)
The Gaussian model used in the MAP approach is hardly justified for true MRI images c. This assumption can be substituted by the constraint that the reconstruction is affine with respect to the measurements. Accordingly, we write the reconstructed image Fm+g.
To determine adequate parameters F and g, one can rely on the first- and second-order statistics of the unknown data, namely the expectation vectors c̄ and b̄ and the covariance matrices Υ and Ψ. According to the data-formation model (9), the expectation and covariance of the reconstruction error e = Fm + g − c are given by
E[e] = F(Ec̄ + b̄) + g − c̄ (19)
and
E[eeH] = (FE − I)Υ(FE − I)H + FΨFH + E[e]E[e]H. (20)
An unbiased reconstruction is obtained when g = c̄ − F(Ec̄ + b̄). For the choice of F, one would reasonably like to minimize the variance of the reconstruction error. Given that the estimator is unbiased, this variance also corresponds to the expectation of the mean-square error. It is given by the trace of the covariance matrix
Var[e] = Tr((FE − I)Υ(FE − I)H) + Tr(FΨFH). (21)
Interestingly, this relation reveals two distinct contributions to the error.
The matrix F that minimizes the error variance, also referred to as mean-square error, can be computed using matrix calculus. Using the normalized covariance matrices, it writes
FMMSE = Υ0EH(EΥ0EH + λΨ0)−1. (22)
First, Equations (15) and (18) show that the quadratic-regularization and MAP approaches are equivalent, provided that X = Ψ0† and Υ0† = RHR.
Second, the following chain of equalities reveals the connection between the MAP and LMMSE solutions, (18) and (22), in the case where both matrices Υ0 and Ψ0 are invertible:
FMAP = (EHΨ0−1E + λΥ0−1)−1EHΨ0−1
= (EHΨ0−1E + λΥ0−1)−1EHΨ0−1(EΥ0EH + λΨ0)(EΥ0EH + λΨ0)−1
= (EHΨ0−1E + λΥ0−1)−1(EHΨ0−1E + λΥ0−1)Υ0EH(EΥ0EH + λΨ0)−1
= Υ0EH(EΥ0EH + λΨ0)−1 = FMMSE.
Last, the weighted pseudoinverse solution with X=Ψ0† corresponds to the other solutions in the limiting case where λ tends to 0. This is also the case for the regular Moore-Penrose pseudoinverse when the noise is independent and identically distributed; that is to say Ψ0=IRN/R. As already mentioned, the pseudoinverse solutions are only valid when noise propagation is negligible. This situation occurs with well-conditioned (κ(E)≈ 1) reconstruction problems that are largely overdetermined (M≪ RN) and/or subject to very little noise (Tr(Υ)≫Tr(Ψ)).
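The equivalence between (18) and (22) is easily verified numerically; the sketch below assumes toy dimensions and randomly generated invertible covariance matrices.

```python
# Numerical check that F_MAP (Eq. 18) and F_MMSE (Eq. 22) coincide when Psi0 and Upsilon0 are invertible.
import numpy as np

rng = np.random.default_rng(0)
RN, M, lam = 20, 12, 0.3
E = rng.standard_normal((RN, M)) + 1j * rng.standard_normal((RN, M))

def hpd(n):                                       # random Hermitian positive-definite matrix
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return A @ A.conj().T + n * np.eye(n)

Psi0, Ups0 = hpd(RN), hpd(M)
Psi0_inv, Ups0_inv = np.linalg.inv(Psi0), np.linalg.inv(Ups0)

F_map = np.linalg.solve(E.conj().T @ Psi0_inv @ E + lam * Ups0_inv, E.conj().T @ Psi0_inv)
F_mmse = Ups0 @ E.conj().T @ np.linalg.inv(E @ Ups0 @ E.conj().T + lam * Psi0)
assert np.allclose(F_map, F_mmse)
```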
There is no particular reason for Ψ0 to be singular. Most of the time, the correlations between pixels of the image are not modeled; this translates into a matrix Υ0 that is diagonal. When no signal is expected from some pixels of the image (for instance, outside a predetermined ROI), it could be tempting to set the corresponding entries of Υ0 to 0, resulting in a singular matrix. However, a reasonable problem setting would exclude such entries from the unknown vector c, restoring the invertibility of Υ0.
We just saw that the linear approaches to reconstruction can be derived from the solution of some optimization problems. The corresponding functionals were quadratic, yielding closed-form solutions. In this section, we consider other approaches that are popular in MRI and which involve non-quadratic regularization terms.
The solution c⋆ is defined as the minimizer of a cost function that involves two terms: the data fidelity F(b) and the regularization R(c) that penalizes undesirable solutions. This is summarized as
c⋆ = arg minc F(m − Ec) + λ R(c), (23)
where the regularization parameter λ≥ 0 balances the two constraints. In MRI, the noise term b=m − Ec is usually assumed to be the realization of a Gaussian process with normalized covariance matrix Ψ0. From a Bayesian point of view, this justifies the choice F(b)=||b ||Ψ0†2=bHΨ0†b as a proper log-likelihood term. A more practical motivation for this choice is that a quadratic fidelity term yields a simple closed-form gradient that greatly facilitates the design and performance of reconstruction algorithms.
When the k-space sampling is dense enough and the signal-to-noise ratio is high, the quadratic regularization terms presented in the previous section yield satisfying reconstructions. However, the constraints imposed to reduce the scan duration favor setups with reduced SNR and k-space trajectories that present regions of low sampling density. In these situations, where the reconstruction problem is more challenging, the reconstructed image can often be enhanced by the use of a more suitable regularization term R(c).
Total Variation (TV) was introduced as an edge-preserving denoising method by Rudin et al. [47]. It is now a very popular approach to tackle image enhancement problems.
The TV regularization term corresponds to the sum of the Euclidean norms of the gradient of the object. In practice, it is defined as R(c) = ||∇c||ℓ1, where the operator ∇ returns, pixel-wise, the ℓ2-norm of the finite differences. The use of TV regularization is particularly appropriate for piecewise-constant objects such as the Shepp-Logan (SL) phantom used for simulations in tomography and MRI. Textured and noisy images exhibit a much larger total variation.
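A minimal sketch of this isotropic TV measure follows, assuming periodic finite differences; it illustrates that a piecewise-constant image has a much smaller TV than its noisy counterpart.

```python
# Sketch: isotropic total variation, i.e., the l1-norm of the pixel-wise l2-norm of the gradient.
import numpy as np

def total_variation(x):
    dx = np.roll(x, -1, axis=0) - x               # vertical finite differences (periodic)
    dy = np.roll(x, -1, axis=1) - x               # horizontal finite differences (periodic)
    return np.sum(np.sqrt(np.abs(dx) ** 2 + np.abs(dy) ** 2))

piecewise_constant = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.ones((32, 32)))
noisy = piecewise_constant + 0.1 * np.random.randn(64, 64)
print(total_variation(piecewise_constant), total_variation(noisy))   # the noisy image has a much larger TV
```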
Another popular idea is to exploit the fact that the object can be well represented by few non-zero coefficients (sparse representation) in an orthonormal basis of M functions φp. Formally, we write that
ρ = ∑p w[p] φp, with only K ≪ M non-zero coefficients w[p].
It is well-documented that typical MRI images admit sparse representation in bases such as wavelets or block DCT [6]. We illustrate this property in Figure 3.1.
The ℓ1-norm is a good measure of sparsity with interesting mathematical properties (e.g., convexity). Thus, among the candidates that are consistent with the measurements, we favor a solution whose wavelet coefficients have a small ℓ1-norm. Specifically, the solution is formulated as
w⋆ = arg minw C(w), (24)
with
C(w) = ||m − Mw||ℓ22 + λ||w||ℓ1. (25)
This is the general solution for wavelet-regularized inverse problems considered by [19] as well as by many other authors.
MRI gives rise to a large-scale inverse problem in the sense that the number of degrees of freedom—that is to say, the unknown pixel values—is large. Consequently, the matrices are generally too large to be stored in memory, not to mention the fact that direct matrix multiplication involves too many operations. We summarize in this section the strategies that make the reconstruction in MRI feasible with reasonable computer requirements and acceptable computation times.
The matrix-vector multiplications y=E0x and y=E0Hx are two basic operations in MRI reconstruction. They can be implemented efficiently using the FFT algorithm. For non-Cartesian samples kn, the gridding method, based on FFT and interpolation, can provide accurate computations (see [48] for instance). Algorithms 1 and 2 describe the implementation of the operations y=E0x and y=E0Hx, respectively.
Algorithm 1 (y = E0x): y ← Gridding(x, p1, …, pM, k1, …, kN) (going to the k-space domain).
Algorithm 2 (y = E0Hx): y ← Gridding(x, −k1, …, −kN, p1, …, pM) (going to the spatial domain).
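For reference, the two operations can also be computed as direct (slow) non-uniform DFT sums; the sketch below assumes Dirac’s delta for ϕ (so that ϕ̂ = 1), whereas Algorithms 1 and 2 replace these O(MN) sums by gridding.

```python
# Reference (slow) implementations of y = E0 x and y = E0^H x as plain non-uniform DFT sums.
import numpy as np

def E0_forward(x, pixels, kpts):
    """y[n] = sum_p x[p] exp(-2j*pi*k_n.p), for every k-space location k_n."""
    phase = np.exp(-2j * np.pi * kpts @ pixels.T)        # N x M matrix of complex exponentials
    return phase @ x

def E0_adjoint(y, pixels, kpts):
    """x[p] = sum_n y[n] exp(+2j*pi*k_n.p), for every pixel p."""
    phase = np.exp(2j * np.pi * pixels @ kpts.T)         # M x N matrix of complex exponentials
    return phase @ y

# Toy 2-D example: 16x16 pixel grid, 200 non-Cartesian k-space locations.
grid = np.stack(np.meshgrid(np.arange(16), np.arange(16), indexing="ij"), -1).reshape(-1, 2).astype(float)
kpts = np.random.uniform(-0.5, 0.5, (200, 2))
x = np.random.randn(grid.shape[0])
y = E0_forward(x, grid, kpts)
x_back = E0_adjoint(y, grid, kpts)
```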
An interesting work by Wajer [49] identifies E0HE0 as a convolution matrix associated to the kernel
G[p] = ∑n |ϕ̂(2πkn)|2 exp(2jπ kn·p). (26)
When the kernel is precomputed for the lattice points belonging to the set S={ p−q ∣ p∈FOV, q∈FOV}, one can avoid the use of Algorithms 1 and 2. An efficient implementation of the operation y=E0HE0x, which uses zero-padded multidimensional FFTs, is described in Algorithm 3.
Algorithm 3 (y = E0HE0x):
Ĝ ← FFT(G) (DFT coefficients of the kernel, precomputed)
x ← ZPAD(x, S) (zero-padding x to the dimensions of G)
x̂ ← FFT(x) (computing DFT coefficients)
y ← IFFT(Ĝ x̂) (pointwise multiplication followed by the inverse DFT and cropping to the FOV)
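A 1-D sketch of Algorithm 3 follows, assuming ϕ̂ = 1 so that the kernel (26) reduces to a sum of complex exponentials; the DFT of the kernel is precomputed once per trajectory.

```python
# Sketch of Algorithm 3: y = E0^H E0 x computed as a zero-padded FFT convolution with the kernel G.
import numpy as np

def make_kernel(kpts, M):
    lags = np.arange(-(M - 1), M)                        # lattice points p - q of the set S
    return np.exp(2j * np.pi * np.outer(lags, kpts)).sum(axis=1)

def E0H_E0_apply(x, G_hat, M):
    x_pad = np.zeros(G_hat.shape[0], dtype=complex)
    x_pad[:M] = x                                        # ZPAD: zero-pad x to the dimensions of G
    y = np.fft.ifft(np.fft.fft(x_pad) * G_hat)           # pointwise multiplication in the DFT domain
    return y[M - 1:2 * M - 1]                            # crop the circular convolution back to the FOV

M, N = 64, 200
kpts = np.random.uniform(-0.5, 0.5, N)
G_hat = np.fft.fft(make_kernel(kpts, M))                 # precomputed once for a given trajectory
x = np.random.randn(M) + 1j * np.random.randn(M)
y_fast = E0H_E0_apply(x, G_hat, M)
```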
Most of the time, in parallel MRI, the covariance matrices are block diagonal. In that case, they are sparse matrices and one can benefit from the related efficient memory storage and matrix operations. As already mentioned, Ψ0 is fully characterized by the channel cross-correlation matrix Θ0=Θ/vn such that Ψ0=Θ0⊗IN. Its pseudoinverse or inverse is then given by Ψ0†=Θ0†⊗IN. The matrix-vector multiplications with EHΨ0† and EHΨ0†E are implemented as described in Algorithms 4 and 5, respectively.
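The structure of these products can be sketched as follows, assuming the factorization (7): per-channel decorrelation with Θ0†, back-projection with E0H, and weighting by the conjugate sensitivities. This is only an illustration of the principle, not the listings of Algorithms 4 and 5.

```python
# Sketch of a = E^H Psi0^† m, exploiting Psi0 = Theta0 ⊗ I_N and E built from E0 and diag(s_i).
import numpy as np

def EH_Psi0inv_apply(m_stacked, Theta0, E0H_apply, sens):
    """m_stacked: (R, N) per-channel k-space data; sens: (R, M) sampled sensitivities."""
    decorrelated = np.linalg.pinv(Theta0) @ m_stacked    # mix the channels with Theta0^† (same for all samples)
    a = np.zeros(sens.shape[1], dtype=complex)
    for i in range(sens.shape[0]):
        a += np.conj(sens[i]) * E0H_apply(decorrelated[i])   # diag(s_i)^H E0^H applied channel by channel
    return a
```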
The conjugate gradient method (CG) [50] is an iterative algorithm that is among the most efficient in solving large-scale linear problems Ac=b, characterized by symmetric and positive-definite matrices A. The only operations involving the matrix A are matrix-vector multiplications Ax. In parallel MRI, it is the method of reference [37] to perform linear reconstructions. The quadratic-regularized solution characterized by the reconstruction matrix in (15) is computed with CG solving the linear problem defined by the matrix A=EHXE+λRHR and vector b=EHXm.
The idea of the method is to decompose the solution in a basis of mutually conjugate vectors; that is to say c=∑i αipi, with piHApj=0 for i≠ j. At iteration i, the estimate is ci=∑j≤ iαjpj and the corresponding residue writes ri=b−Aci. For the next direction, the choice pi+1=ri−∑j≤ i(pjHAri)pj/||pj ||A ensures the conjugacy constraint. In this direction, the coefficient αi+1=Re(pi+1HAri)/||pi+1 ||A is optimal with respect to the cost C(c)=cHAc−cHb−bHc. An efficient implementation of the method is described in Algorithm 6.
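A minimal matrix-free sketch of the iteration (not necessarily identical to Algorithm 6) is given below; apply_A stands for any routine that computes Ax, for instance c ↦ EHXEc + λRHRc together with b = EHXm for the solution (15).

```python
# Sketch: conjugate-gradient iterations for A c = b with A Hermitian positive-definite.
import numpy as np

def conjugate_gradient(apply_A, b, c0=None, n_iter=50, tol=1e-9):
    c = np.zeros_like(b) if c0 is None else c0.copy()
    r = b - apply_A(c)                                   # residual
    p = r.copy()                                         # first search direction
    rr = np.vdot(r, r).real
    for _ in range(n_iter):
        Ap = apply_A(p)
        alpha = rr / np.vdot(p, Ap).real                 # optimal step length along p
        c += alpha * p
        r -= alpha * Ap
        rr_new = np.vdot(r, r).real
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p                        # next A-conjugate direction
        rr = rr_new
    return c
```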
The CG algorithm theoretically converges within a finite number of iterations. In practice, this result is compromised by the propagation of round-off errors. In the context of MRI, the property of practical interest is the linear convergence rate achieved by CG. Indeed, the distance to the desired solution decreases as a power of the iteration number, with the convergence rate
0 ≤ r(A) = (√κ(A) − 1)/(√κ(A) + 1) < 1.
When the condition number κ(A) is large, the rate r(A) gets close to unity, which characterizes a slower convergence. Using the weighted norm ||x||A = √(xHAx), the distance is upper-bounded by
||ci − c⋆||A ≤ 2||c0 − c⋆||A r(A)i. (27)
With the regular Euclidean distance, the bound is looser
||ci − c⋆||2 ≤ 2κ(A)||c0 − c⋆||2 r(A)i. (28)
The Iteratively Reweighted Least-Squares algorithm (IRLS), which is also known as the positive form of half-quadratic minimization [51], can be used to compute the solutions defined as
c⋆ = arg minc ||m − Ec||X2 + λ||Rc||ℓpp. (29)
In this context, the functional is strictly convex for p > 1. This condition ensures the uniqueness of the minimizer.
The principle of IRLS is to design an upper-bounding quadratic proxy for the regularization term, tailored to the neighborhood of ci. In practice, one chooses the functional
Zi(c) = (p/2)||Rc||Di2 + (1 − p/2)||Rci||ℓpp, (30)
where Di is a diagonal matrix with entries |(Rci)n|p−2. It has the following desirable properties: Zi(ci) = ||Rci||ℓpp, and Zi(c) ≥ ||Rc||ℓpp for every c, so that Zi coincides with the regularization term at ci and upper-bounds it everywhere else.
An implementation of the IRLS is described in Algorithm 7.
Algorithm 7 (IRLS): at each iteration, ci+1 ← CG(Ai, a, ci) (using Algorithm 6).
Let us remember that, for p ≤ 1, the minimization problem might not admit a unique solution. When the minimizer c⋆ is unique, it is also the unique fixed point of the algorithm. As long as p < 2, the sequence of functional values C(ci) = ||m − Eci||X2 + λ||Rci||ℓpp generated by the IRLS is monotonically decreasing. This guarantees the convergence of the cost sequence, since it is lower-bounded by the finite quantity C⋆ = minc ||m − Ec||X2 + λ||Rc||ℓpp.
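A dense toy sketch of the IRLS iteration for (29)–(30) follows, assuming X = I and reusing the conjugate_gradient routine sketched above; the small eps is only there to avoid the singularity of |·|p−2 at zero.

```python
# Sketch of IRLS: each iteration minimizes the quadratic surrogate (30) with an inner CG solve.
import numpy as np

def irls(E, m, R, lam, p=1.2, n_iter=20, n_cg=30, eps=1e-8):
    a = E.conj().T @ m
    c = np.linalg.lstsq(E, m, rcond=None)[0]             # warm start: unregularized least squares
    for _ in range(n_iter):
        d = (np.abs(R @ c) + eps) ** (p - 2.0)           # diagonal entries of D_i
        apply_A = lambda x: E.conj().T @ (E @ x) + lam * (p / 2.0) * (R.conj().T @ (d * (R @ x)))
        c = conjugate_gradient(apply_A, a, c0=c, n_iter=n_cg)   # solves A_i c = a (role of Algorithm 6)
    return c
```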
The IRLS algorithm can be simply adapted in order to solve the minimization with mixed-norm regularization terms. A particular case is the total variation penalty which corresponds to the ℓ1-norm of the pixel-wise ℓ2-norm of the spatial gradient [52]. The IRLS algorithm for TV regularization was first proposed by Wohlberg and Rodríguez [53]. It is described in Algorithm 8.
Algorithm 8 (IRLS for TV regularization): at each iteration, ci+1 ← CG(Ai, a, ci) (using Algorithm 6).
Duality-based algorithms proved to be an efficient alternative to achieve TV regularization [54,55].
The Iterative Shrinkage/Thresholding Algorithm (ISTA)[18,17,19], also known as thresholded Landweber (TL), aims at minimizing the functional
C(w) = ||m − Mw||X2 + λ||w||ℓ1. (31)
Here, we use the notation w because ISTA is often applied on wavelet coefficients.
An important observation for understanding ISTA is that the nonlinear shrinkage operation, sometimes called soft-thresholding, solves a minimization problem [56], with
Tλ(u) = arg minw∈ℂ |u − w|2 + λ|w| = u max(0, 1 − λ/(2|u|)). (32)
By separability of norms, this applies component-wise to vectors of ℂN:
Tλ(u) = arg minw ||u − w||ℓ22 + λ||w||ℓ1.
This means that the ℓ1-regularized denoising problem (i.e., when M and X are identity matrices) is precisely solved by a shrinkage operation.
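A minimal sketch of this component-wise shrinkage for complex-valued coefficients follows; it implements the closed form given above (magnitude reduced by λ/2 and clipped at zero, phase preserved), with a small guard against division by zero.

```python
# Sketch: component-wise complex soft-thresholding (shrinkage) operator T_lambda.
import numpy as np

def soft_threshold(u, lam):
    mag = np.maximum(np.abs(u), 1e-30)                   # guard against division by zero
    return u * np.maximum(0.0, 1.0 - lam / (2.0 * mag))  # shrink the magnitude by lam/2, keep the phase
```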
The ISTA generates a sequence of estimates wi that converges to the minimizer w⋆ of (31) when it is unique. The idea is to define at each step a new functional C′(w,wi) whose minimizer wi+1 will be the next estimate
wi+1 = arg minw C′(w, wi). (33)
Two constraints must be considered for the definition of C′: first (Constraint 1), it must upper-bound C and coincide with it at wi, so that each update decreases the cost; second (Constraint 2), its minimization must be simple to perform.
In accordance with Constraint 1, C′ can take the generic quadratically augmented form
C′(w, wi) = C(w) + ||w − wi||Λ−MHXM2, (34)
with the constraint that (Λ−MHXM) is positive definite, where the weighting matrix Λ plays the role of a tuning parameter.
Then, ISTA corresponds to the trivial choice Λ = (L/2)I, with the value of L chosen to be greater than or equal to the Lipschitz constant of the gradient of ||Mw||X2, so that L ≥ 2λmax(MHXM).
Let us define a = MHXm, A = MHXM, and
zi = wi + 2(a − Awi)/L. (35)
Then, using standard linear algebra, we can write
C′(w, wi) = (L/2)||w − zi||ℓ22 + λ||w||ℓ1 + K,
where the constant K does not depend on w, so that the minimizer of C′(·, wi) is wi+1 = T2λ/L(zi).
This shows that Constraint 2 is automatically satisfied.
Note that both the intermediate variable zi in (35) and the threshold value will vary depending on L.
Beck and Teboulle [20, Thm. 3.1] showed that this algorithm decreases the cost-function gap in inverse proportion to the number of iterations i, with
C(wi) − C(w⋆) ≤ (L/(2i))||w0 − w⋆||ℓ22. (36)
Selecting L as small as possible clearly favors the speed of convergence. The bound also highlights the importance of a “warm” starting point.
Among the variants of ISTA, FISTA, proposed by Beck and Teboulle [20], ensures state-of-the-art convergence properties while preserving a comparable computational cost per iteration. Thanks to a controlled over-relaxation at each step, FISTA decreases the cost-function gap quadratically with the iteration number, with
C(wi) − C(w⋆) ≤ 2L||w0 − w⋆||ℓ22/(i + 1)2. (37)
More details on FISTA, as a particular case of FWISTA with the trivial choice Λ=L/2I, can be found in Section 5.2.3.
An implementation of FISTA is given in Algorithm 10.
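To fix ideas, a minimal sketch of a FISTA iteration for the cost (31) with X = I is given below; it reuses the soft_threshold routine above, apply_M and apply_MH stand for multiplications with M and MH, and it follows the choice Λ = (L/2)I rather than reproducing Algorithm 10 exactly.

```python
# Sketch of FISTA: gradient step on the quadratic term, shrinkage, and over-relaxation (O(1/i^2) rate).
import numpy as np

def fista(apply_M, apply_MH, m, lam, L, shape, n_iter=100):
    w = np.zeros(shape, dtype=complex)                   # current estimate w_i
    v = w.copy()                                         # over-relaxed point
    t = 1.0
    for _ in range(n_iter):
        grad = 2.0 * apply_MH(apply_M(v) - m)            # gradient of ||m - Mw||^2 at v
        w_next = soft_threshold(v - grad / L, 2.0 * lam / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = w_next + ((t - 1.0) / t_next) * (w_next - w) # momentum that yields the rate (37)
        w, t = w_next, t_next
    return w
```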