1 Introduction
Dimension reduction is a fundamental concept in science and engineering for feature extraction and data visualization, and exploring the low-dimensional structures of high-dimensional data has attracted broad attention. Popular dimension reduction methods include Principal Component Analysis (PCA) [1,2], Non-negative Matrix Factorization (NMF) [3], and t-distributed Stochastic Neighbor Embedding (t-SNE) [4]. A main step in dimension reduction is to build a linear or nonlinear mapping from a high-dimensional space to a low-dimensional one that preserves important properties of the high-dimensional space, such as the distance between any two points [5].
Random Projection (RP) is a widely used method for dimension reduction. It is well known that the Johnson-Lindenstrauss (JL) transform [6,7] nearly preserves the distance between any two points after a random projection f, which is typically called the isometry property. The isometry property can be used to perform nearest neighbor search in high-dimensional datasets [8,9]. It can also be used to establish the Restricted Isometry Property (RIP) in compressed sensing [10,11], where a sparse signal can be reconstructed under a linear random projection [12]. The JL lemma [6] tells us that there exists a nearly isometric mapping f which maps high-dimensional datasets into a lower dimensional space. A typical choice for the mapping f is the linear random projection
$$f(x)=\frac{1}{\sqrt{M}}Rx,\tag{1}$$
where x∈RN, and R∈RM×N is a matrix whose entries are drawn from the mean-zero, variance-one Gaussian distribution, denoted by N(0,1). We call this the Gaussian random projection (Gaussian RP). The storage of the matrix R in Eq. (1) is O(MN), and the cost of computing Rx in Eq. (1) is also O(MN). For large M and N, this construction therefore becomes computationally prohibitive. To alleviate the difficulty, the sparse random projection method [13] and the very sparse random projection method [14] were proposed, where the random projection is constructed from a sparse random matrix, so that both the storage and the computational cost are reduced.
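As a point of reference, the following minimal NumPy sketch (the function and variable names are our own, not taken from the cited works) implements the Gaussian RP of Eq. (1) and illustrates its near-isometry on a pair of random points.

```python
import numpy as np

def gaussian_rp(X, M, seed=0):
    """Gaussian random projection f(x) = (1/sqrt(M)) R x applied to the rows of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    R = rng.standard_normal((M, N))      # entries are i.i.d. N(0, 1)
    return X @ R.T / np.sqrt(M)

# Pairwise distances are nearly preserved for moderately large M (JL lemma).
X = np.random.default_rng(1).standard_normal((10, 1000))
Y = gaussian_rp(X, M=200)
print(np.linalg.norm(Y[0] - Y[1]) / np.linalg.norm(X[0] - X[1]))   # close to 1
```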
To be specific, Achlioptas [13] replaced the dense matrix R by a sparse matrix whose entries follow
$$R_{ij}=\sqrt{s}\cdot\begin{cases}+1, & \text{with probability }\ \dfrac{1}{2s},\\[4pt] 0, & \text{with probability }\ 1-\dfrac{1}{s},\\[4pt] -1, & \text{with probability }\ \dfrac{1}{2s}.\end{cases}\tag{2}$$
This means that the matrix is sampled at a rate of 1/s; note that if s=1, the corresponding distribution is the Rademacher distribution. When s=3, the cost of computing Rx in Eq. (1) reduces to a third of the original one, but is still O(MN). When s=√N≫3, Li et al. [14] called this case the very sparse random projection (Very Sparse RP), which significantly speeds up the computation with little loss in accuracy; the storage of the very sparse random projection is then O(M√N). However, sparse random projections can substantially distort sparse vectors [9]. To achieve a low-distortion embedding, Ailon et al. [9,15] proposed the Fast Johnson-Lindenstrauss Transform (FJLT), which preconditions a sparse projection matrix with a randomized Fourier transform. To reduce randomness and storage requirements, Sun et al. [16] proposed the format R=(R1⊙…⊙Rd)^T, where ⊙ denotes the Khatri-Rao product, Ri∈R^{ni×M}, and N=∏_{i=1}^{d} ni. Each Ri is a random matrix whose entries are i.i.d. random variables drawn from N(0,1). This transformation is called the Gaussian tensor random projection (Gaussian TRP) throughout this paper. The storage of the Gaussian TRP is O(M∑_{i=1}^{d} ni), which is less than that of the Gaussian random projection (Gaussian RP). For example, when N=n1n2=40000, the storage of Gaussian TRP is only 1/20 of that of Gaussian RP. It has also been shown that Gaussian TRP satisfies expected isometry with vanishing variance [16].
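For concreteness, here is a minimal sketch of the Gaussian TRP as we read [16] (the khatri_rao helper and all names are ours, and the 1/√M scaling follows Eq. (1)); for clarity the sketch forms the N×M Khatri-Rao factor explicitly, which a practical implementation would avoid.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x M) and B (J x M), giving (I*J) x M."""
    I, M = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, M)

def gaussian_trp(x, factors):
    """Apply R = (R_1 ⊙ ... ⊙ R_d)^T to x, with each R_i of shape (n_i, M)."""
    K = factors[0]
    for F in factors[1:]:
        K = khatri_rao(K, F)                 # shape grows to (prod(n_i), M)
    M = K.shape[1]
    return K.T @ x / np.sqrt(M)

rng = np.random.default_rng(0)
n1, n2, M = 200, 200, 50                     # N = n1 * n2 = 40000
R1, R2 = rng.standard_normal((n1, M)), rng.standard_normal((n2, M))
x = rng.standard_normal(n1 * n2)
print(gaussian_trp(x, [R1, R2]).shape)       # (50,)
```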
Recently, matrix and tensor decompositions have been proposed to reduce the storage of projection matrices [17,18]. The main idea of these methods is to split the projection matrix into several small-scale matrices or tensors. In this work, we focus on the low-rank tensor train representation to construct the random projection f. Tensor decompositions are widely used for data compression [5,19–24]. The Tensor Train (TT) decomposition, also called Matrix Product States (MPS) [25–28], offers two benefits: low-rank TT-formats provide compact representations of projection matrices, and they support efficient basic linear algebra operations such as matrix-by-vector products [29]. Based on these benefits, we propose a Tensor Train Random Projection (TTRP) method, which requires significantly smaller storage and computational costs than existing methods (e.g., Gaussian TRP [16], Very Sparse RP [14] and Gaussian RP [30]). We note that constructing projection matrices using Tensor Train (TT) and Canonical Polyadic (CP) decompositions based on Gaussian random variables is proposed in [31]. The main contributions of our work are three-fold: first, our TTRP is based on a rank-one TT-format, which significantly reduces the storage of projection matrices; second, we provide a novel construction procedure for the rank-one TT-format in our TTRP based on i.i.d. Rademacher random variables; third, we prove that our construction of TTRP is unbiased with bounded variance.
The rest of the paper is organized as follows. The tensor train format is introduced in Section 2. Details of our TTRP approach are given in Section 3, where we prove that the approach is an expected isometric projection with bounded variance. In Section 4, we demonstrate the efficiency of TTRP with both synthetic and MNIST datasets. Finally, Section 5 concludes the paper.
2 Tensor Train Format
Let lowercase letters (x), boldface lowercase letters (x), boldface capital letters (X), and calligraphic letters (X) denote scalars, vectors, matrices and tensors, respectively. x(i) denotes element i of a vector x, and X(i,j) denotes element (i,j) of a matrix X. The i-th row and j-th column of a matrix X are denoted by X(i,:) and X(:,j), respectively. For a given d-th order tensor X, X(i1,i2,…,id) is its (i1,i2,…,id)-th component. For a vector x∈RN, we denote its ℓp norm by ‖x‖p=(∑_{i=1}^{N}|x(i)|^p)^{1/p}, for any p≥1. The Kronecker product of matrices A∈R^{I×J} and B∈R^{K×L} is denoted by A⊗B, which is an (IK)×(JL) matrix defined by
$$A\otimes B=\begin{bmatrix}A(1,1)B & A(1,2)B & \cdots & A(1,J)B\\ A(2,1)B & A(2,2)B & \cdots & A(2,J)B\\ \vdots & \vdots & \ddots & \vdots\\ A(I,1)B & A(I,2)B & \cdots & A(I,J)B\end{bmatrix}.$$
The Kronecker product satisfies the following laws [32]:
(AC)⊗(BD)=(A⊗B)(C⊗D),(3)
(A+B)⊗(C+D)=A⊗C+A⊗D+B⊗C+B⊗D,(4)
(kA)⊗B=A⊗(kB)=k(A⊗B).(5)
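The mixed-product rule (3) is used repeatedly in Sections 2.2 and 3; a quick NumPy check (our own illustration) confirms it on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
B, D = rng.standard_normal((5, 2)), rng.standard_normal((2, 3))

lhs = np.kron(A @ C, B @ D)              # (AC) ⊗ (BD)
rhs = np.kron(A, B) @ np.kron(C, D)      # (A ⊗ B)(C ⊗ D)
print(np.allclose(lhs, rhs))             # True
```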
2.1 Tensor Train Decomposition
Tensor Train (TT) decomposition [29] is a generalization of Singular Value Decomposition (SVD) from matrices to tensors. TT decomposition provides a compact representation for tensors, and allows for efficient application of linear algebra operations (discussed in Section 2.2 and Section 2.3).
Given a d-th order tensor G∈Rn1×⋯×nd and its vectorization g, the tensor train decomposition [29] is
g(i)=G(ν(i))=G(i1,i2,…,id)=G1(i1)G2(i2)⋯Gd(id),(6)
where ν:N↦Nd is a bijection, Gk∈Rrk−1×nk×rk are called TT-cores, Gk(ik)∈Rrk−1×rk is a slice of Gk, for k=1,2,…,d, ik=1,…,nk, and the “boundary condition” is r0=rd=1. The tensor G is said to be in the TT-format if each element of G can be represented by Eq. (6). The vector [r0,r1,r2,…,rd] is referred to as TT-ranks. Let Gk(αk−1,ik,αk) represent the element of Gk(ik) in the position (αk−1,αk). In the index form, the decomposition (6) is rewritten as the following TT-format:
G(i1,i2,…,id)=∑α0,⋯,αdG1(α0,i1,α1)G2(α1,i2,α2)⋯Gd(αd−1,id,αd).
Looking more closely at Eq. (6), an element G(i1,i2,…,id) is represented by a sequence of matrix-by-vector products. Fig. 1 illustrates the tensor train decomposition. It can be seen that the key ingredient in the tensor train (TT) decomposition is the TT-ranks. The TT-format uses only O(dnr²) memory to represent O(n^d) elements, where n=max{n1,…,nd} and r=max{r0,r1,…,rd}. Although the storage reduction is efficient only if the TT-ranks are small, tensors in data science and machine learning typically have low TT-ranks. Moreover, one can apply the TT-format to basic linear algebra operations, such as matrix-by-vector products and scalar multiplications. In later sections, it is shown that this can reduce the computational cost significantly when the data have low-rank structures (see [29] for details).
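As an illustration of Eq. (6), the following sketch (core shapes and names are our own choice) evaluates a single entry of a TT tensor by multiplying one slice per core.

```python
import numpy as np

def tt_element(cores, idx):
    """Evaluate G(i1,...,id) from TT-cores G_k of shape (r_{k-1}, n_k, r_k)."""
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]                # multiply the i-th slice of each core
    return v.item()                       # r_0 = r_d = 1, so the result is 1 x 1

# A random 3rd-order TT tensor with TT-ranks [1, 2, 3, 1] and mode sizes (4, 5, 6).
rng = np.random.default_rng(0)
ranks, modes = [1, 2, 3, 1], [4, 5, 6]
cores = [rng.standard_normal((ranks[k], modes[k], ranks[k + 1])) for k in range(3)]
print(tt_element(cores, (1, 2, 3)))
```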
Figure 1: Tensor train format (TT-format): extract an element G(i1,i2,…,id) via a sequence of matrix-by-vector products
2.2 Tensorizing Matrix-by-Vector Products
The tensor train format gives a compact representation of matrices and allows efficient computation of matrix-by-vector products. We first review the TT-format of large matrices and vectors following [29]. Defining two bijections ν:N↦Nd and μ:N↦Nd, a pair of indices (i,j)∈N2 is mapped to a pair of multi-indices (ν(i),μ(j))=((i1,i2,…,id),(j1,j2,…,jd)). Then a matrix R∈RM×N and a vector x∈RN can be tensorized in the TT-format as follows. Letting M=∏_{i=1}^{d} mi and N=∏_{j=1}^{d} nj, an element (i,j) of R can be written as Eq. (7) (see [29,33] for more details):
R(i,j)=R(ν(i),μ(j))=R(i1,…,id;j1,…,jd)=R1(i1,j1)⋯Rd(id,jd),(7)
where (i1,…,id) enumerate the rows of R, (j1,…,jd) enumerate the columns of R, and Rk(ik,jk) is an rk−1×rk matrix (and r0=rd=1), i.e., (ik,jk) is treated as one “long index” for k=1,…,d. An element j of x can be written as Eq. (8):
x(j)=X(μ(j))=X(j1,…,jd)=X1(j1)⋯Xd(jd),(8)
where Xk(jk) is an r̂_{k−1}×r̂_k matrix with r̂_0=r̂_d=1, for k=1,…,d. We consider the matrix-by-vector product y=Rx, and each element of y can be tensorized in the TT-format as Eq. (9):
$$\begin{aligned}
y(i)=\mathcal{Y}(i_1,\ldots,i_d)&=\sum_{j_1,\ldots,j_d}\mathcal{R}(i_1,\ldots,i_d,j_1,\ldots,j_d)\,\mathcal{X}(j_1,\ldots,j_d)\\
&=\sum_{j_1,\ldots,j_d}\big(R_1(i_1,j_1)\cdots R_d(i_d,j_d)\big)\big(X_1(j_1)\cdots X_d(j_d)\big)\\
&=\sum_{j_1,\ldots,j_d}\big(R_1(i_1,j_1)\cdots R_d(i_d,j_d)\big)\otimes\big(X_1(j_1)\cdots X_d(j_d)\big)\\
&=\sum_{j_1,\ldots,j_d}\underbrace{\big(R_1(i_1,j_1)\otimes X_1(j_1)\big)}_{O(r_0 r_1\hat r_0\hat r_1)}\cdots\underbrace{\big(R_d(i_d,j_d)\otimes X_d(j_d)\big)}_{O(r_{d-1} r_d\hat r_{d-1}\hat r_d)}\\
&=\underbrace{Y_1(i_1)}_{O(n_1 r_0 r_1\hat r_0\hat r_1)}\cdots\underbrace{Y_d(i_d)}_{O(n_d r_{d-1} r_d\hat r_{d-1}\hat r_d)},
\end{aligned}\tag{9}$$
where Yk(ik)=∑_{jk}Rk(ik,jk)⊗Xk(jk)∈R^{r_{k−1}r̂_{k−1}×r_k r̂_k}, for k=1,…,d. The complexity of computing each TT-core Yk∈R^{r_{k−1}r̂_{k−1}×m_k×r_k r̂_k} is O(m_k n_k r_{k−1} r_k r̂_{k−1} r̂_k), for k=1,…,d. Assuming that the TT-cores of x are known, the total cost of the matrix-by-vector product y=Rx in the TT-format reduces significantly from the original O(MN) to O(dmnr²r̂²), where m=max{m1,m2,…,md}, n=max{n1,n2,…,nd}, r=max{r0,r1,…,rd} and r̂=max{r̂0,r̂1,…,r̂d}; here N is typically large and r is small. When mk=nk and rk=r̂k for k=1,…,d, the cost of such a matrix-by-vector product in the TT-format is O(dn²r⁴) [29]. Note that when r equals one, the cost of such a matrix-by-vector product in the TT-format is O(dmnr̂²).
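The core-wise construction in Eq. (9) can be sketched as follows (a direct, unoptimized NumPy illustration with names of our own choosing); the check at the end uses the rank-one case, in which the tensorized matrix reduces to a Kronecker product of its cores (cf. Appendix A).

```python
import numpy as np

def tt_matvec_cores(R_cores, X_cores):
    """TT-cores of y = R x, following Eq. (9).
    R_cores[k]: (r_{k-1}, m_k, n_k, r_k); X_cores[k]: (rh_{k-1}, n_k, rh_k)."""
    Y_cores = []
    for Rk, Xk in zip(R_cores, X_cores):
        r0, mk, nk, r1 = Rk.shape
        rh0, _, rh1 = Xk.shape
        Yk = np.zeros((r0 * rh0, mk, r1 * rh1))
        for ik in range(mk):
            # Y_k(i_k) = sum_{j_k} R_k(i_k, j_k) ⊗ X_k(j_k)
            Yk[:, ik, :] = sum(np.kron(Rk[:, ik, jk, :], Xk[:, jk, :]) for jk in range(nk))
        Y_cores.append(Yk)
    return Y_cores

# Check against the dense product on a tiny rank-one example, where R = R_1 ⊗ R_2.
rng = np.random.default_rng(0)
m, n = (2, 3), (4, 5)
R_cores = [rng.standard_normal((1, m[k], n[k], 1)) for k in range(2)]
X_cores = [rng.standard_normal((1, n[k], 1)) for k in range(2)]
Y_cores = tt_matvec_cores(R_cores, X_cores)
y_tt = np.kron(Y_cores[0][0, :, 0], Y_cores[1][0, :, 0])
R_dense = np.kron(R_cores[0][0, :, :, 0], R_cores[1][0, :, :, 0])
x_dense = np.kron(X_cores[0][0, :, 0], X_cores[1][0, :, 0])
print(np.allclose(y_tt, R_dense @ x_dense))   # True
```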
2.3 Basic Operations in the TT-Format
In Section 2.2, the product of a matrix R and a vector x, both given in the TT-format, is computed efficiently. Many other important operations can also be readily implemented in the TT-format. For instance, computing the Euclidean distance between two vectors in the TT-format is more efficient, and requires less storage, than computing it directly in standard matrix and vector formats. In the following Eqs. (10)–(15), some important operations in the TT-format are discussed.
The subtraction of tensor Y∈Rm1×⋯×md and tensor Y^∈Rm1×⋯×md in the TT-format is
Z(i1,…,id):=Y(i1,…,id)−Y^(i1,…,id)=Y1(i1)Y2(i2)⋯Yd(id)−Y^1(i1)Y^2(i2)⋯Y^d(id)=Z1(i1)Z2(i2)⋯Zd(id),(10)
where
$$Z_k(i_k)=\begin{pmatrix}Y_k(i_k) & 0\\ 0 & \widehat{Y}_k(i_k)\end{pmatrix},\qquad k=2,\ldots,d-1,$$
and
$$Z_1(i_1)=\begin{pmatrix}Y_1(i_1) & -\widehat{Y}_1(i_1)\end{pmatrix},\qquad Z_d(i_d)=\begin{pmatrix}Y_d(i_d)\\ \widehat{Y}_d(i_d)\end{pmatrix},$$
and TT-ranks of Z equal the sum of TT-ranks of Y and Y^.
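A minimal sketch (the helper name is ours) of this core-merging construction: the boundary cores are concatenated and the interior cores are placed block-diagonally, so no arithmetic on tensor entries is performed.

```python
import numpy as np

def tt_subtract(Y_cores, Yh_cores):
    """TT-cores of Z = Y - Yh; the TT-ranks of Z are the sums of the ranks of Y and Yh.
    All cores have shape (r_{k-1}, m_k, r_k)."""
    d = len(Y_cores)
    Z_cores = []
    for k, (Yk, Yhk) in enumerate(zip(Y_cores, Yh_cores)):
        r0, mk, r1 = Yk.shape
        s0, _, s1 = Yhk.shape
        if k == 0:                        # Z_1(i_1) = [Y_1(i_1), -Yh_1(i_1)]
            Zk = np.concatenate([Yk, -Yhk], axis=2)
        elif k == d - 1:                  # Z_d(i_d) = [Y_d(i_d); Yh_d(i_d)]
            Zk = np.concatenate([Yk, Yhk], axis=0)
        else:                             # block-diagonal interior cores
            Zk = np.zeros((r0 + s0, mk, r1 + s1))
            Zk[:r0, :, :r1] = Yk
            Zk[r0:, :, r1:] = Yhk
        Z_cores.append(Zk)
    return Z_cores
```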
The dot product of tensor Y and tensor Y^ in the TT-format [29] is
⟨Y,Y^⟩:=∑i1,…,idY(i1,…,id)Y^(i1,…,id)=∑i1,…,id(Y1(i1)Y2(i2)⋯Yd(id))(Y^1(i1)Y^2(i2)⋯Y^d(id))=∑i1,…,id(Y1(i1)Y2(i2)⋯Yd(id))⊗(Y^1(i1)Y^2(i2)⋯Y^d(id))=(∑i1Y1(i1)⊗Y^1(i1))(∑i2Y2(i2)⊗Y^2(i2))…(∑idYd(id)⊗Y^d(id))=V1V2⋯Vd,(11)
where
Vk=∑ikYk(ik)⊗Y^k(ik),k=1,…,d.(12)
Since V1,Vd are vectors and V2,…,Vd−1 are matrices, we compute ⟨Y,Y^⟩ by a sequence of matrix-by-vector products:
v1=V1,(13)
vk=vk−1Vk=vk−1∑ikYk(ik)⊗Y^k(ik)=∑ikpk(ik),k=2,…,d,(14)
where
pk(ik)=vk−1(Yk(ik)⊗Y^k(ik)),(15)
and we obtain
⟨Y,Y^⟩=vd.
For simplicity, we assume that the TT-ranks of Y are the same as those of Ŷ. In Eq. (15), let B:=Yk(ik)∈Rr×r, C:=Ŷk(ik)∈Rr×r, x:=vk−1∈R1×r², y:=pk(ik)∈R1×r², for k=2,…,d−1, and use the reshaped Kronecker product expression [34] for Eq. (15):
y=x(B⊗C)⇔Y=CTXB,
where x and y are reshaped into X=[x1 x2 ⋯ xr]∈Rr×r and Y=[y1 y2 ⋯ yr]∈Rr×r, respectively. Note that the cost of computing Y=CTXB is O(r³), whereas ignoring the Kronecker structure in y=x(B⊗C) leads to an O(r⁴) calculation. Hence the complexity of computing pk(ik) in Eq. (15) is O(r³), thanks to this efficient Kronecker product computation. The cost of computing vk in Eq. (14) is then O(mr³), and the total cost of the dot product ⟨Y,Ŷ⟩ is O(dmr³).
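The reshaping trick can be checked numerically; the sketch below (our own illustration) compares the O(r⁴) evaluation that forms B⊗C explicitly with the O(r³) evaluation Y=CᵀXB.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 30
B, C = rng.standard_normal((r, r)), rng.standard_normal((r, r))
x = rng.standard_normal(r * r)           # the row vector v_{k-1} of length r^2

# Naive evaluation: form B ⊗ C explicitly, O(r^4) work and memory.
y_naive = x @ np.kron(B, C)

# Reshaped evaluation: y = x (B ⊗ C)  <=>  Y = C^T X B, only O(r^3) work.
X = x.reshape(r, r, order="F")           # column-major reshape, so vec(X) = x^T
Y = C.T @ X @ B
y_fast = Y.reshape(r * r, order="F")

print(np.allclose(y_naive, y_fast))      # True
```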
The Frobenius norm of a tensor Y is defined by
$$\|\mathcal{Y}\|_F=\sqrt{\langle\mathcal{Y},\mathcal{Y}\rangle}.$$
Computing the distance between tensor Y and tensor Y^ in the TT-format is computationally efficient by applying the dot product Eqs. (11) and (12),
$$\|\mathcal{Y}-\widehat{\mathcal{Y}}\|_F=\sqrt{\langle\mathcal{Y}-\widehat{\mathcal{Y}},\,\mathcal{Y}-\widehat{\mathcal{Y}}\rangle}.\tag{16}$$
The complexity of computing the distance is also O(dmr³). Algorithm 1 gives more details on computing the distance in Eq. (16) based on the Frobenius norm ‖Y−Ŷ‖F.
In summary, the subtraction of two tensors in the TT-format can be performed simply by merging their cores, instead of subtracting the two tensors directly in the standard tensor format, and their dot product can be computed by a sequence of matrix-by-vector products. As a result, the cost of computing the distance between two tensors in the TT-format reduces from the original complexity O(M) to O(dmr³), where M=∏_{i=1}^{d} mi, r≪M, and m=max{m1,m2,…,md}.
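Algorithm 1 itself is not reproduced here; the sketch below (our own, not the authors' implementation) realizes the dot product of Eqs. (11)–(15) directly and computes the distance through the expansion ‖Y−Ŷ‖F² = ⟨Y,Y⟩ − 2⟨Y,Ŷ⟩ + ⟨Ŷ,Ŷ⟩, which is equivalent to first forming the TT-format difference of Eq. (10) and then taking its norm.

```python
import numpy as np

def tt_dot(Y_cores, Yh_cores):
    """Dot product <Y, Yh> of two tensors in the TT-format, Eqs. (11)-(15).
    Each core has shape (r_{k-1}, n_k, r_k)."""
    v = None
    for Yk, Yhk in zip(Y_cores, Yh_cores):
        nk = Yk.shape[1]
        Vk = sum(np.kron(Yk[:, i, :], Yhk[:, i, :]) for i in range(nk))  # V_k in Eq. (12)
        v = Vk if v is None else v @ Vk   # sequence of matrix-by-vector products
    return v.item()                       # r_0 = r_d = 1, so v ends up 1 x 1

def tt_dist(Y_cores, Yh_cores):
    """Frobenius distance ||Y - Yh||_F via <Y,Y> - 2 <Y,Yh> + <Yh,Yh>."""
    return np.sqrt(tt_dot(Y_cores, Y_cores)
                   - 2.0 * tt_dot(Y_cores, Yh_cores)
                   + tt_dot(Yh_cores, Yh_cores))

# Sanity check against the dense inner product on a small random example.
def tt_full(cores):
    full = cores[0]
    for G in cores[1:]:
        full = np.einsum('...a,anb->...nb', full, G)
    return full.squeeze()

rng = np.random.default_rng(0)
ranks, modes = [1, 2, 3, 1], [4, 4, 4]
make = lambda: [rng.standard_normal((ranks[k], modes[k], ranks[k + 1])) for k in range(3)]
A, B = make(), make()
print(np.isclose(tt_dot(A, B), np.sum(tt_full(A) * tt_full(B))))   # True
```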
3 Tensor Train Random Projection
Due to the computational efficiency of TT-format discussed above, we consider the TT-format to construct projection matrices. Our tensor train random projection is defined as follows.
Definition 3.1. (Tensor Train Random Projection). For a data point x∈ RN, our tensor train random projection (TTRP) is
$$f_{\mathrm{TTRP}}(x):=\frac{1}{\sqrt{M}}Rx,\tag{17}$$
where the tensorized versions (through the TT-format) of R and x are denoted by R and X (see Eqs. (7) and (8)), the corresponding TT-cores are denoted by {Rk∈R^{r_{k−1}×m_k×n_k×r_k}}_{k=1}^{d} and {Xk∈R^{r̂_{k−1}×n_k×r̂_k}}_{k=1}^{d}, respectively, we set r0=r1=…=rd=1, and y:=Rx is specified by Eq. (9).
As shown in Section 2.2, for general TT-ranks the computational cost of Rx in Eq. (17) is O(dmnr²r̂²) and the storage cost is O(dmnr²), where m=max{m1,m2,…,md}, n=max{n1,n2,…,nd}, r=max{r0,r1,…,rd} and r̂=max{r̂0,r̂1,…,r̂d}. Our TTRP is based on a tensorized version of R with all TT-ranks equal to one, i.e., r0=r1=…=rd=1, which significantly reduces the computational and storage costs; that is, TTRP is constructed using a rank-one TT tensor. Comparisons of TTRP with different TT-ranks are investigated in Section 4. With rank one, the computational cost of computing Rx and the storage cost in Eq. (17) become O(dmnr̂²) and O(dmn), respectively. Moreover, from the analysis later in this section, the Rademacher distribution introduced in Section 1 is an optimal choice for generating the TT-cores of R. In the following, we prove that the TTRP defined by Eq. (17) is an expected isometric projection with bounded variance.
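To make Definition 3.1 concrete, here is a minimal NumPy sketch of the rank-one TTRP (the naming is ours and this is not the authors' Algorithm 2): each projection core reduces to an m_k×n_k Rademacher matrix, each output core is Y_k(i_k)=∑_{j_k}R_k(i_k,j_k)X_k(j_k), and the 1/√M factor is folded into the first output core.

```python
import numpy as np

def ttrp_cores(x_cores, m_dims, seed=0):
    """Rank-one tensor train random projection (a sketch of Definition 3.1).
    x_cores[k] has shape (rh_{k-1}, n_k, rh_k); each projection core is an
    m_k x n_k matrix with i.i.d. Rademacher entries.
    Returns the TT-cores of f_TTRP(x) = (1/sqrt(M)) R x."""
    rng = np.random.default_rng(seed)
    M = int(np.prod(m_dims))
    y_cores = []
    for k, Xk in enumerate(x_cores):
        nk = Xk.shape[1]
        Rk = rng.choice([-1.0, 1.0], size=(m_dims[k], nk))       # Rademacher entries
        Yk = np.einsum('ij,ajb->aib', Rk, Xk)                     # sum over j_k
        y_cores.append(Yk)
    y_cores[0] = y_cores[0] / np.sqrt(M)                          # overall 1/sqrt(M)
    return y_cores

# Example: project a vector of dimension N = 4*5*6 = 120 down to M = 2*3*2 = 12.
rng = np.random.default_rng(1)
rh, n_dims, m_dims = [1, 2, 2, 1], [4, 5, 6], [2, 3, 2]
x_cores = [rng.standard_normal((rh[k], n_dims[k], rh[k + 1])) for k in range(3)]
print([c.shape for c in ttrp_cores(x_cores, m_dims)])   # [(1, 2, 2), (2, 3, 2), (2, 2, 1)]
```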
Theorem 3.1. Given a vector x∈R^{∏_{j=1}^{d} nj}, if R in Eq. (17) is composed of d independent TT-cores R1,…,Rd whose entries are independent and identically distributed random variables with mean zero and variance one, then the following equation holds
E‖fTTRP(x)‖22=‖x‖22.
Proof. Denoting y=Rx gives
$$\mathbb{E}\|f_{\mathrm{TTRP}}(x)\|_2^2=\frac{1}{M}\mathbb{E}\|y\|_2^2=\frac{1}{M}\mathbb{E}\Big[\sum_{i=1}^{M}y^2(i)\Big]=\frac{1}{M}\mathbb{E}\Big[\sum_{i_1,\ldots,i_d}\mathcal{Y}^2(i_1,\ldots,i_d)\Big].\tag{18}$$
By the TT-format, Y(i1,…,id)=Y1(i1)⋯Yd(id), where Yk(ik)=∑jkRk(ik,jk)⊗Xk(jk), for k=1,…,d, it follows that
E[Y2(i1,…,id)]=E[(Y1(i1)⋯Yd(id))(Y1(i1)⋯Yd(id))]=E[(Y1(i1)⋯Yd(id))⊗(Y1(i1)⋯Yd(id))](19)
=E[(Y1(i1)⊗Y1(i1))⋯(Yd(id)⊗Yd(id))](20)
=E[Y1(i1)⊗Y1(i1)]⋯E[Yd(id)⊗Yd(id)],(21)
where Eq. (20) is derived using Eqs. (3) and (19), and then combining Eq. (20) and using the independence of TT-cores R1,…,Rd give Eq. (21). The k-th term of the right hand side of Eq. (21), for k=1,…,d, can be computed by
E[Yk(ik)⊗Yk(ik)]=E[[∑jkRk(ik,jk)⊗Xk(jk)]⊗[∑jkRk(ik,jk)⊗Xk(jk)]](22)
=E[[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(ik,jk)Xk(jk)]](23)
=∑jk,j′kE[Rk(ik,jk)Rk(ik,jk′)]Xk(jk)⊗Xk(jk′)(24)
=∑jkE[Rk2(ik,jk)]Xk(jk)⊗Xk(jk)(25)
=∑jkXk(jk)⊗Xk(jk).(26)
Here as we set the TT-ranks of R to be one, Rk(ik,jk) is scalar, and Eq. (22) then leads to Eq. (23). Using Eqs. (4), (5) and (23) gives Eq. (24), and we derive Eq. (26) from Eq. (24) by the assumption that E[Rk2(ik,jk)]=1 and E[Rk(ik,jk)Rk(ik,jk′)]=0, for jk,jk′=1,…,nk, jk≠jk′, k=1,…,d. Substituting Eq. (26) into Eq. (21) gives
E[Y2(i1,…,id)]=[∑j1X1(j1)⊗X1(j1)]⋯[∑jdXd(jd)⊗Xd(jd)] =∑j1,…,jd[X1(j1)⋯Xd(jd)]⊗[X1(j1)⋯Xd(jd)] =∑j1,…,jdX2(j1,…,jd) =‖x‖22.(27)
Substituting Eq. (27) into Eq. (18) concludes the proof:
$$\mathbb{E}\|f_{\mathrm{TTRP}}(x)\|_2^2=\frac{1}{M}\mathbb{E}\Big[\sum_{i_1,\ldots,i_d}\mathcal{Y}^2(i_1,\ldots,i_d)\Big]=\frac{1}{M}\times M\|x\|_2^2=\|x\|_2^2.$$
Theorem 3.2. Given a vector x∈R^{∏_{j=1}^{d} nj}, suppose that R in Eq. (17) is composed of d independent TT-cores R1,…,Rd whose entries are independent and identically distributed random variables with mean zero, variance one, and the same fourth moment Δ. Let M:=max_{i=1,…,N}|x(i)| (the largest entry magnitude of x; the same symbol M also denotes the reduced dimension in Eq. (17), and the intended meaning is clear from context), m=max{m1,m2,…,md} and n=max{n1,n2,…,nd}. Then
$$\mathrm{Var}\big(\|f_{\mathrm{TTRP}}(x)\|_2^2\big)\le\frac{1}{M}\big(\Delta+n(m+2)-3\big)^d N M^4-\|x\|_2^4.$$
Proof. By the property of the variance and using Theorem 3.1,
$$\mathrm{Var}\big(\|f_{\mathrm{TTRP}}(x)\|_2^2\big)=\mathbb{E}\big[\|f_{\mathrm{TTRP}}(x)\|_2^4\big]-\big[\mathbb{E}\|f_{\mathrm{TTRP}}(x)\|_2^2\big]^2=\mathbb{E}\Big[\Big\|\frac{1}{\sqrt{M}}y\Big\|_2^4\Big]-\|x\|_2^4=\frac{1}{M^2}\mathbb{E}\big[\|y\|_2^4\big]-\|x\|_2^4\tag{28}$$
$$=\frac{1}{M^2}\Big[\sum_{i=1}^{M}\mathbb{E}\big[y^4(i)\big]+\sum_{i\neq j}\mathbb{E}\big[y^2(i)y^2(j)\big]\Big]-\|x\|_2^4,\tag{29}$$
where note that E[y2(i)y2(j)]≠E[y2(i)]E[y2(j)] in general and a simple example can be found in Appendix A.
We compute the first term of the right hand side of Eq. (29),
E[y4(i)]=E[Y(i1,…,id)⊗Y(i1,…,id)⊗Y(i1,…,id)⊗Y(i1,…,id)](30)
=E[[Y1(i1)⊗Y1(i1)⊗Y1(i1)⊗Y1(i1)]⋯[Yd(id)⊗Yd(id)⊗Yd(id)⊗Yd(id)]](31)
=E[Y1(i1)⊗Y1(i1)⊗Y1(i1)⊗Y1(i1)]⋯E[Yd(id)⊗Yd(id)⊗Yd(id)⊗Yd(id)],(32)
where y(i)=Y(i1,…,id), applying Eq. (3) to Eq. (30) obtains Eq. (31), and we derive Eq. (32) from Eq. (31) by the independence of TT-cores {Rk}k=1d.
Considering the k-th term of the right hand side of Eq. (32), for k=1,…,d, we obtain that
E[Yk(ik)⊗Yk(ik)⊗Yk(ik)⊗Yk(ik)]=E[[∑jkRk(ik,jk)⊗Xk(jk)]⊗[∑jkRk(ik,jk)⊗Xk(jk)]⊗[∑jkRk(ik,jk)⊗Xk(jk)]⊗[∑jkRk(ik,jk)⊗Xk(jk)]](33)
=E[[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(ik,jk)Xk(jk)]](34)
=E[∑jkRk4(ik,jk)Xk(jk)⊗Xk(jk)⊗Xk(jk)⊗Xk(jk)]+E[∑jk≠j′kRk2(ik,jk)Rk2(ik,j′k)Xk(jk)⊗Xk(jk)⊗Xk(j′k)⊗Xk(j′k)]+E[∑jk≠j′kRk2(ik,jk)Rk2(ik,j′k)Xk(jk)⊗Xk(j′k)⊗Xk(jk)⊗Xk(j′k)]+E[∑jk≠j′kRk2(ik,jk)Rk2(ik,j′k)Xk(jk)⊗Xk(j′k)⊗Xk(j′k)⊗Xk(jk)](35)
=Δ∑jkXk(jk)⊗Xk(jk)⊗Xk(jk)⊗Xk(jk)+∑jk≠j′kXk(jk)⊗Xk(jk)⊗Xk(j′k)⊗Xk(j′k) +∑jk≠j′kXk(jk)⊗Xk(j′k)⊗Xk(jk)⊗Xk(j′k)+∑jk≠j′kXk(jk)⊗Xk(j′k)⊗Xk(j′k)⊗Xk(jk),(36)
where we infer Eq. (34) from Eq. (33) by scalar property of Rk(ik,jk), Eq. (35) is obtained by Eq. (4) and the independence of TT-cores {Rk}k=1d, and denoting the fourth moment Δ:=E[Rk4(ik,jk)], we deduce Eq. (36) by the assumption E[Rk2(ik,jk)]=1, for k=1,…,d.
Substituting Eq. (36) into Eq. (32), it implies that
E[Y4(i1,…,id)]=[Δ∑j1X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)+∑j1≠j1′X1(j1)⊗X1(j1)⊗X1(j1′)⊗X1(j1′)+∑j1≠j1′X1(j1)⊗X1(j1′)⊗X1(j1)⊗X1(j1′)+∑j1≠j1′X1(j1)⊗X1(j1′)⊗X1(j1′)⊗X1(j1)]⋯[Δ∑jdXd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)+∑jd≠jd′Xd(jd)⊗Xd(jd)⊗Xd(jd′)⊗Xd(jd′)+∑jd≠jd′Xd(jd)⊗Xd(jd′)⊗Xd(jd)⊗Xd(jd′)+∑jd≠jd′Xd(jd)⊗Xd(jd′)⊗Xd(jd′)⊗Xd(jd)]≤Δd∑j1,…,jd[[X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)]⋯[Xd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)]]+Δd−1Cd1maxk[∑j1,..,jk≠jk′,…,jd[X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)]⋯[Xk(jk)⊗Xk(jk)⊗Xk(jk′)⊗Xk(jk′)]⋯[Xd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)]]+Δd−1Cd1maxk[∑j1,..,jk≠jk′,…,jd[X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)]⋯[Xk(jk)⊗Xk(jk′)⊗Xk(jk)⊗Xk(jk′)]⋯[Xd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)]]+Δd−1Cd1maxk[∑j1,..,jk≠jk′,…,jd[X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)]⋯[Xk(jk)⊗Xk(jk′)⊗Xk(jk′)⊗Xk(jk)]⋯[Xd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)]]+⋯(37)
≤Δd∑j1,…,jdX4(j1,…,jd)+3Δd−1Cd1maxk[∑j1,..,jk≠j′k,…,jdX(j1,…,jk,…,jd)2X(j1,…,j′k,…,jd)2]+⋯(38)
$$\le\Delta^d\|x\|_4^4+3(n-1)\Delta^{d-1}C_d^1 N M^4+3^2(n-1)^2\Delta^{d-2}C_d^2 N M^4+\cdots+3^d(n-1)^d N M^4\le\big(\Delta+3(n-1)\big)^d N M^4,\tag{39}$$
where denoting M:=maxi=1,…,N|x(i)|,n=max{n1,n2,…,nd}, we derive Eq. (38) from Eq. (37) by Eq. (3).
Similarly, the second term E[y2(i)y2(j)] of the right hand side of Eq. (29), for i≠j,ν(i)=(i1,i2,…,id)≠ν(j)=(i1′,i2′,…,id′), is obtained by
E[y2(i)y2(j)] =E[Y1(i1)⊗Y1(i1)⊗Y1(i′1)⊗Y1(i′1)]⋯E[Yd(id)⊗Yd(id)⊗Yd(i′d)⊗Yd(i′d)].(40)
If ik≠ik′, for k=1,…,d, then the k-th term of the right hand side of Eq. (40) is computed by
E[Yk(ik)⊗Yk(ik)⊗Yk(i′k)⊗Yk(i′k)] =E[[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(ik,jk)Xk(jk)]⊗[∑jkRk(i′k,jk)Xk(jk)]⊗[∑jkRk(i′k,jk)Xk(jk)]]=E[∑jkRk2(ik,jk)Rk2(i′k,jk)Xk(jk)⊗Xk(jk)⊗Xk(jk)⊗Xk(jk)]+E[∑jk≠j′kRk2(ik,jk)Rk2(i′k,j′k)Xk(jk)⊗Xk(jk)⊗Xk(j′k)⊗Xk(j′k)]=∑jkXk(jk)⊗Xk(jk)⊗Xk(jk)⊗Xk(jk)+∑jk≠j′kXk(jk)⊗Xk(jk)⊗Xk(j′k)⊗Xk(j′k).(41)
Supposing that i1=i1′,…,ik≠ik′,…,id=id′ and substituting Eqs. (36) and (41) into Eq. (40), we obtain
E[y2(i)y2(j)]=E[Y1(i1)⊗Y1(i1)⊗Y1(i1)⊗Y1(i1)]⋯E[Yk(ik)⊗Yk(ik)⊗Yk(ik′)⊗Yk(ik′)]⋯E[Yd(id)⊗Yd(id)⊗Yd(id)⊗Yd(id)]=[Δ∑j1X1(j1)⊗X1(j1)⊗X1(j1)⊗X1(j1)+∑j1≠j1′X1(j1)⊗X1(j1)⊗X1(j1′)⊗X1(j1′)+∑j1≠j1′X1(j1)⊗X1(j1′)⊗X1(j1)⊗X1(j1′)+∑j1≠j1′X1(j1)⊗X1(j1′)⊗X1(j1′)⊗X1(j1)]⋯[∑jkXk(jk)⊗Xk(jk)⊗Xk(jk)⊗Xk(jk)+∑jk≠jk′Xk(jk)⊗Xk(jk)⊗Xk(jk′)⊗Xk(jk′)]⋯[Δ∑jdXd(jd)⊗Xd(jd)⊗Xd(jd)⊗Xd(jd)+∑jd≠jd′Xd(jd)⊗Xd(jd)⊗Xd(jd′)⊗Xd(jd′)+∑jd≠jd′Xd(jd)⊗Xd(jd′)⊗Xd(jd)⊗Xd(jd′)+∑jd≠jd′Xd(jd)⊗Xd(jd′)⊗Xd(jd′)⊗Xd(jd)]≤n(Δ+3(n−1))d−1NM4.(42)
Similarly, if for k∈S⊆{1,…,d},|S|=l, ik≠ik′, and for k∈S¯, ik=ik′, then
$$\mathbb{E}\big[y^2(i)y^2(j)\big]\le n^{l}\big(\Delta+3(n-1)\big)^{d-l}N M^4.\tag{43}$$
Hence, combining Eqs. (42) and (43) gives
∑i≠jE[y2(i)y2(j)]≤M[Cd1(m−1)n(Δ+3(n−1))d−1+⋯+Cdl(m−1)lnl(Δ+3(n−1))(d−l) +⋯+Cdd(m−1)dnd]NM4,(44)
where m=max{m1,m2,…,md}.
Therefore, using Eqs. (39) and (44) deduces
E[‖y‖24]≤M[(Δ+3(n−1))d+Cd1(m−1)n(Δ+3(n−1))d−1+⋯+Cdd(m−1)dnd]NM4=M((m−1)n+Δ+3(n−1))dNM4 =M(Δ+n(m+2)−3)dNM4.(45)
In summary, substituting Eq. (45) into Eq. (28) implies
$$\mathrm{Var}\big(\|f_{\mathrm{TTRP}}(x)\|_2^2\big)\le\frac{M\big(\Delta+n(m+2)-3\big)^d N M^4}{M^2}-\|x\|_2^4\le\frac{1}{M}\big(\Delta+n(m+2)-3\big)^d N M^4-\|x\|_2^4.\tag{46}$$
One can see that the upper bound on the variance in Eq. (46) decreases as M increases, as expected. When M=m^d and N=n^d, we have
$$\mathrm{Var}\big(\|f_{\mathrm{TTRP}}(x)\|_2^2\big)\le\Big(\frac{\Delta+2n-3}{m}+n\Big)^d N M^4-\|x\|_2^4.\tag{47}$$
Note that as m (and hence M=m^d) increases, the upper bound in Eq. (47) tends to N²M⁴−‖x‖₂⁴≥0, which vanishes if and only if |x(1)|=|x(2)|=…=|x(N)| (recall that here M denotes max_i|x(i)|). Also, the upper bound in Eq. (46) is affected by the fourth moment Δ=E[Rk⁴(ik,jk)]=Var(Rk²(ik,jk))+[E[Rk²(ik,jk)]]². To keep the expected isometry, we need E[Rk²(ik,jk)]=1. When the TT-cores follow the Rademacher distribution, Var(Rk²(ik,jk))=0 and the fourth moment Δ in Eq. (46) attains its minimum. The Rademacher distribution is therefore an optimal choice for generating the TT-cores, and we set it as our default choice for constructing TTRP (Definition 3.1).
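As a quick Monte Carlo sanity check of Theorems 3.1 and 3.2 (our own experiment, separate from the studies in Section 4), the sample mean of ‖fTTRP(x)‖₂²/‖x‖₂² for a rank-one Rademacher TTRP is close to one and its sample variance is small; for brevity the dense Kronecker form R=R₁⊗⋯⊗R_d of Appendix A is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, m_dims = [5, 5, 4], [3, 2, 2]       # N = 100, M = 12
N, M = int(np.prod(n_dims)), int(np.prod(m_dims))
x = rng.standard_normal(N)

ratios = []
for _ in range(2000):
    # Rank-one TT cores with Rademacher entries; R = R_1 ⊗ R_2 ⊗ R_3 (Appendix A).
    cores = [rng.choice([-1.0, 1.0], size=(m, n)) for m, n in zip(m_dims, n_dims)]
    R = cores[0]
    for Rk in cores[1:]:
        R = np.kron(R, Rk)
    y = R @ x / np.sqrt(M)
    ratios.append(np.sum(y ** 2) / np.sum(x ** 2))

print(np.mean(ratios))   # close to 1: expected isometry (Theorem 3.1)
print(np.var(ratios))    # small but nonzero, consistent with Theorem 3.2
```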
Proposition 3.1. (Hypercontractivity [35]) Consider a degree q polynomial f(Y)=f(Y1,…,Yn) of independent centered Gaussian or Rademacher random variables Y1,…,Yn. Then for any λ>0
$$\mathbb{P}\big(|f(Y)-\mathbb{E}[f(Y)]|\ge\lambda\big)\le e^2\cdot\exp\Big[-\Big(\frac{\lambda^2}{K\cdot\mathrm{Var}[f(Y)]}\Big)^{\frac{1}{q}}\Big],$$
where Var[f(Y)] is the variance of the random variable f(Y) and K>0 is an absolute constant.
Proposition 3.1 extends the Hanson-Wright inequality whose proof can be found in [35].
Proposition 3.2. Let fTTRP: RN↦RM be the tensor train random projection defined by Eq. (17). Suppose that, for i=1,…,d, all entries of the TT-cores Ri are independent standard Gaussian or Rademacher random variables with the same fourth moment Δ, and let M:=max_{i=1,…,N}|x(i)|, m=max{m1,m2,…,md}, n=max{n1,n2,…,nd}. For any x∈RN, there exist absolute constants C and K>0 such that the following claim holds
$$\mathbb{P}\Big(\big|\|f_{\mathrm{TTRP}}(x)\|_2^2-\|x\|_2^2\big|\ge\varepsilon\|x\|_2^2\Big)\le C\exp\Big[-\Big(\frac{M\cdot\varepsilon^2}{K\cdot\big[(\Delta+n(m+2)-3)^d N-M\big]}\Big)^{\frac{1}{2d}}\Big].$$
Proof. According to Theorem 3.1, E‖fTTRP(x)‖₂²=‖x‖₂². Since ‖fTTRP(x)‖₂² is a polynomial of degree 2d in the independent standard Gaussian or Rademacher random variables that form the entries of the TT-cores Ri, for i=1,…,d, we apply Proposition 3.1 and Theorem 3.2 to obtain
$$\begin{aligned}\mathbb{P}\Big(\big|\|f_{\mathrm{TTRP}}(x)\|_2^2-\|x\|_2^2\big|\ge\varepsilon\|x\|_2^2\Big)&\le e^2\cdot\exp\Big[-\Big(\frac{\varepsilon^2\|x\|_2^4}{K\cdot\mathrm{Var}(\|f_{\mathrm{TTRP}}(x)\|_2^2)}\Big)^{\frac{1}{2d}}\Big]\\&\le e^2\cdot\exp\Big[-\Big(\frac{\varepsilon^2}{K\cdot\big[\frac{1}{M}(\Delta+n(m+2)-3)^d N\frac{M^4}{\|x\|_2^4}-1\big]}\Big)^{\frac{1}{2d}}\Big]\\&\le e^2\cdot\exp\Big[-\Big(\frac{M\cdot\varepsilon^2}{K\cdot\big[(\Delta+n(m+2)-3)^d N-M\big]}\Big)^{\frac{1}{2d}}\Big]\\&\le C\exp\Big[-\Big(\frac{M\cdot\varepsilon^2}{K\cdot\big[(\Delta+n(m+2)-3)^d N-M\big]}\Big)^{\frac{1}{2d}}\Big],\end{aligned}$$
where M=max_{i=1,…,N}|x(i)|, and hence M^4/‖x‖_2^4≤1.
We note that the upper bound in Proposition 3.2 is not tight, as it involves the dimensionality N of the datasets. Deriving a tight bound for the corresponding concentration inequality that is independent of the dimensionality of the datasets is left for future work.
The procedure of TTRP is summarized in Algorithm 2. For the input of this algorithm, the TT-ranks of R (the tensorized version of the projection matrix R in Eq. (17)) are set to one, and from our above analysis, we generate entries of the corresponding TT-cores {Rk}k=1d through the Rademacher distribution. For a given data point x in the TT-format, Algorithm 2 gives the TT-cores of the corresponding output, and each element of fTTRP(x) in Eq. (17) can be represented as
$$f_{\mathrm{TTRP}}(x)(i)=f_{\mathrm{TTRP}}(x)(\nu(i))=f_{\mathrm{TTRP}}(x)(i_1,\ldots,i_d)=\frac{1}{\sqrt{M}}Y_1(i_1)\cdots Y_d(i_d),$$
where ν is a bijection from N to Nd.
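The element extraction above can be sketched as follows (hypothetical helpers of our own; since the text does not spell out ν, the row-major ordering consistent with the Kronecker indexing of Appendix A is assumed, and the 1/√M factor is assumed to be already folded into the output cores as in the sketch after Definition 3.1).

```python
import numpy as np

def ttrp_element(y_cores, multi_index):
    """Extract f_TTRP(x)(i) = Y_1(i_1) ... Y_d(i_d) from the output TT-cores."""
    v = np.ones((1, 1))
    for Yk, ik in zip(y_cores, multi_index):
        v = v @ Yk[:, ik, :]
    return v.item()

def linear_to_multi(i, m_dims):
    """One possible bijection nu: map a linear index i (0-based) to (i_1, ..., i_d),
    consistent with the row-major Kronecker ordering of Appendix A."""
    idx = []
    for m in reversed(m_dims):
        i, rem = divmod(i, m)
        idx.append(rem)
    return tuple(reversed(idx))

print(linear_to_multi(5, [2, 3]))   # (1, 2), since 5 = 1*3 + 2
```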
4 Numerical Experiments
We demonstrate the efficiency of TTRP using synthetic datasets and the MNIST dataset [36]. The quality of isometry is a key factor in assessing the performance of random projection methods; in our numerical studies it is estimated by the averaged ratio of pairwise distances:
$$\frac{2}{n_0(n_0-1)}\sum_{n_0\ge i>j}\frac{\big\|f_{\mathrm{TTRP}}\big(x^{(i)}\big)-f_{\mathrm{TTRP}}\big(x^{(j)}\big)\big\|_2}{\big\|x^{(i)}-x^{(j)}\big\|_2},\tag{48}$$
where n0 is the number of data points and x^(i) denotes the i-th data point. Since the output of our TTRP procedure (see Algorithm 2) is in the TT-format, it is efficient to apply TT-format operations to compute the pairwise distances in Eq. (48) through Algorithm 1. In order to assess the average quality of isometry, we repeat each numerical experiment 100 times (with different realizations of the TT-cores) and estimate the mean and the variance of the ratio of pairwise distances from these samples. The rest of this section is organized as follows. First, through a synthetic dataset, the effect of different TT-ranks of the tensorized version R of R in Eq. (17) is shown, which motivates our setting of the TT-ranks to one. After that, we focus on the case with TT-ranks equal to one and test the effect of different distributions for the TT-cores. Finally, based on both high-dimensional synthetic and MNIST datasets, our TTRP is compared with related projection methods, including Gaussian TRP [16], Very Sparse RP [14] and Gaussian RP [30]. All results reported below are obtained in MATLAB with TT-Toolbox 2.2.2 (https://github.com/oseledets/TT-Toolbox).
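For reference, the ratio in Eq. (48) can be computed as follows once the projected points are available (a dense-format sketch of our own; in the experiments the distances are instead evaluated directly in the TT-format via Algorithm 1).

```python
import numpy as np

def pairwise_distance_ratio(X, Y):
    """Average ratio of projected to original pairwise distances, Eq. (48).
    X: (n0, N) original data; Y: (n0, M) projected data."""
    n0 = X.shape[0]
    ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
              for i in range(n0) for j in range(i)]
    return np.mean(ratios)
```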
4.1 Effect of Different TT-Ranks
In Definition 3.1, we set the TT-ranks to one. To explain our motivation for this setting, we investigate the effect of different TT-ranks. Here we consider the situation where the TT-ranks take r0=rd=1 and rk=r for k=1,…,d−1, with rank r∈{1,2,…}, and we keep the other settings in Definition 3.1 unchanged. For comparison, two different distributions are considered to generate the TT-cores in this part, the Rademacher distribution (our default optimal choice) and the Gaussian distribution, and the corresponding tensor train projections are denoted by rank-r TTRP and Gaussian TT, respectively. To keep the expected isometry of rank-r TTRP, the entries of the TT-cores R1 and Rd are drawn from {+r^{−1/4}, −r^{−1/4}} with equal probability, and each entry of Rk, k=2,…,d−1, is independently drawn from {+r^{−1/2}, −r^{−1/2}} with equal probability. For Gaussian TT, the entries of Rk for k=2,…,d−1 and for k=1,d are drawn from N(0,1/r) and N(0,1/√r), respectively.
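A sketch of how such rank-r cores could be generated under the scalings stated above (the helper name is ours): boundary cores receive entries ±r^(−1/4) and interior cores ±r^(−1/2), so that the product of the d core variances times the r^(d−1) summed rank combinations equals one, preserving the expected isometry.

```python
import numpy as np

def rank_r_ttrp_cores(m_dims, n_dims, r, seed=0):
    """Rademacher TT-cores for rank-r TTRP with boundary scaling r**(-1/4)
    and interior scaling r**(-1/2), as described in the text."""
    rng = np.random.default_rng(seed)
    d = len(m_dims)
    cores = []
    for k in range(d):
        r0 = 1 if k == 0 else r
        r1 = 1 if k == d - 1 else r
        scale = r ** -0.25 if k in (0, d - 1) else r ** -0.5
        signs = rng.choice([-1.0, 1.0], size=(r0, m_dims[k], n_dims[k], r1))
        cores.append(scale * signs)
    return cores

cores = rank_r_ttrp_cores([4, 3, 2], [10, 10, 10], r=2)
print([c.shape for c in cores])   # [(1, 4, 10, 2), (2, 3, 10, 2), (2, 2, 10, 1)]
```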
Two synthetic datasets with dimension N=1000 are generated (the first has n0=10 samples and the second has n0=50), where each entry of each vector (a sample in the synthetic dataset) is independently generated from N(0,1). In this test problem, we set the reduced dimension to M=24, and the dimensions of the corresponding tensor representations are set to m1=4, m2=3, m3=2 and n1=n2=n3=10 (M=m1m2m3 and N=n1n2n3). Fig. 2 shows the ratio of the pairwise distance for the two projection methods (computed through Eq. (48)). It can be seen that the estimated mean of the ratio of the pairwise distance for rank-r TTRP is typically closer to one than that of Gaussian TT, i.e., rank-r TTRP has an advantage in preserving pairwise distances. Moreover, for each rank in Fig. 2, the estimated variance of the ratio for rank-r TTRP is smaller than that of Gaussian TT. Focusing on rank-r TTRP, the results for both the mean and the variance do not differ significantly across TT-ranks. In order to reduce the storage, we therefore focus on the rank-one case (as in Definition 3.1) in the rest of this paper.
Figure 2: Effect of different ranks based on synthetic data (M=24,N=1000,m1=4,m2=3,m3=2,n1=n2=n3=10)
4.2 Effect of Different TT-Cores
Two synthetic datasets with dimension N=2500 are used to assess the effect of different distributions for the TT-cores; their sizes are n0=10 and n0=50, respectively, and their elements are sampled from the standard Gaussian distribution. Three distributions are investigated for constructing TTRP (see Definition 3.1): the Rademacher distribution (our default choice), the standard Gaussian distribution, and the 1/3-sparse distribution (i.e., s=3 in Eq. (2)); the corresponding projection methods are denoted by TTRP-RD, TTRP-N(0,1), and TTRP-1/3-sparse, respectively. For this test problem, three TT-cores are used with m1=M/2, m2=2, m3=1 and n1=25, n2=10, n3=10. Fig. 3 shows that the estimated mean of the ratio of the pairwise distance for TTRP-RD is very close to one, and the estimated variance of TTRP-RD is at least one order of magnitude smaller than that of TTRP-N(0,1) and TTRP-1/3-sparse. These results are consistent with Theorem 3.2. In the rest of this paper, we use our default choice for TTRP: the TT-ranks are set to one, and each element of the TT-cores is independently sampled from the Rademacher distribution.
Figure 3: Three test distributions for TT-cores based on synthetic data (N = 2500)
4.3 Comparison with Gaussian TRP, Very Sparse RP and Gaussian RP
The storage of the projection matrix and the cost of computing Rx (see Eq. (17)) of our TTRP (TT-ranks equal one), Gaussian TRP [16], Very Sparse RP [14] and Gaussian RP [30], are shown in Table 1, where R∈RM×N,M=∏i=1dmi,N=∏j=1dnj,m=max{m1,m2,…,md} and n=max{n1,n2,…,nd}. Note that the matrix R in Eq. (17) is tensorized in the TT-format, and TTRP is efficiently achieved by the matrix-by-vector products in the TT-format (see Eq. (9)). From Table 1, it is clear that our TTRP has the smallest storage cost and requires the smallest computational cost for computing Rx.
Two synthetic datasets with size n0=10 are tested; the dimension of the first one is N=2500 and that of the second one is N=10⁴. Each entry of the samples is independently generated from N(0,1). For TTRP and Gaussian TRP, the dimensions of the tensor representations are set as follows: for N=2500, we set n1=25, n2=10, n3=10, m1=M/2, m2=2, m3=1; for N=10⁴, we set n1=n2=25, n3=n4=4, m1=M/2, m2=2, m3=1, m4=1, where M is the reduced dimension. We again focus on the ratio of the pairwise distance (substituting the outputs of the different projection methods into Eq. (48)), and estimate the mean and the variance of this ratio by repeating each numerical experiment 100 times (with different realizations of the random projections, e.g., different realizations of the Rademacher distribution for TTRP).
Fig. 4 shows that the performance of TTRP is very close to that of Very Sparse RP and Gaussian RP, while the variance for Gaussian TRP is larger than that of the other three projection methods. Moreover, the variance for TTRP generally decreases as the reduced dimension M increases, which is consistent with Theorem 3.2. Furthermore, more details are given for the case with M=24 and N=10⁴ in Tables 2 and 3, where the reported storage is the number of nonzero entries that need to be stored. It turns out that TTRP achieves a competitive performance with smaller storage costs compared with Very Sparse RP and Gaussian RP. In addition, from Table 3, for d>2, the variance of TTRP is clearly smaller than that of Gaussian TRP, and the storage cost of TTRP is much smaller than that of Gaussian TRP.
Figure 4: Mean and variance for the ratio of the pairwise distance, synthetic data
Next, the CPU times for projecting a data point using the four methods (TTRP, Gaussian TRP, Very Sparse RP and Gaussian RP) are assessed. Here, we set the reduced dimension to M=1000 and test four cases with N=10⁴, N=10⁵, N=2×10⁵ and N=10⁶, respectively. The dimensions of the tensorized output are set to m1=m2=m3=10 (such that M=m1m2m3), and the dimensions of the corresponding tensor representations of the original data points are set as follows: for N=10⁴, n1=25, n2=25, n3=16; for N=10⁵, n1=50, n2=50, n3=40; for N=2×10⁵, n1=80, n2=50, n3=50; for N=10⁶, n1=n2=n3=100. For each case, given a data point whose elements are sampled from the standard Gaussian distribution, the simulation of projecting it to the reduced dimensional space is repeated 100 times (with different realizations of the random projections), and the CPU time is defined as the average time over these 100 simulations. Fig. 5 shows the CPU times, where the results are obtained in MATLAB on a workstation with an Intel(R) Xeon(R) Gold 6130 CPU. It is clear that the computational cost of our TTRP is much smaller than that of Gaussian TRP and Gaussian RP for each data dimension N. Note that as the data dimension N increases, the computational costs of Gaussian TRP and Gaussian RP grow rapidly, while the computational cost of our TTRP grows slowly. When the data dimension is large (e.g., N=10⁶ in Fig. 5), the CPU time of TTRP becomes smaller than that of Very Sparse RP, which is consistent with the results in Table 1.
Figure 5: A comparison of CPU time for different random projections (M = 1000)
Finally, we validate the performance of our TTRP approach using the MNIST dataset [36]. From MNIST, we randomly take n0=50 data points, each of which is a vector with dimension N=784. We consider two cases for the dimensions of tensor representations: in the first case, we set m1=M/2,m2=2,n1=196,n2=4, and in the second case, we set m1=M/2,m2=2,m3=1,n1=49,n2=4,n3=4. Fig. 6 shows the properties of isometry and bounded variance of different random projections on MNIST. It can be seen that TTRP satisfies the isometry property with bounded variance. Although Gaussian RP has the smallest variance (note that it is not based on the tensor format), its computational and storage costs are large (see Fig. 5). It is clear that as the reduced dimension M increases, the variances of the four methods reduce, and the variance of our TTRP is close to that of Very Sparse RP.
Figure 6: Isometry and variance quality for MNIST data (N = 784)
5 Conclusion
Random projection plays a fundamental role in dimension reduction for high-dimensional datasets, where pairwise distances need to be approximately preserved. With a focus on efficient tensorized computation, this paper develops a tensor train random projection (TTRP) method. Based on our analysis of the bias and the variance, TTRP is proven to be an expected isometric projection with bounded variance. From the analysis in Theorem 3.2, the Rademacher distribution is shown to be an optimal choice for generating the TT-cores of TTRP. For computational convenience, the TT-ranks of TTRP are set to one, and our numerical results show that different TT-ranks do not lead to significant differences in the mean and the variance of the ratio of the pairwise distance. Our detailed numerical studies show that, compared with standard projection methods, TTRP with the default setting (TT-ranks equal to one and TT-cores generated from the Rademacher distribution) requires significantly smaller storage and computational costs to achieve a competitive performance. The numerical results also show that our TTRP has smaller variance than tensor train random projection methods based on Gaussian distributions. Although we have proven the properties of the mean and the variance of TTRP and the numerical results show that TTRP is efficient, the upper bound in Proposition 3.2 involves the dimensionality N of the datasets; our future work is to derive a tight bound for the concentration inequality that is independent of the dimensionality of the datasets.
Acknowledgement: The authors thank Osman Asif Malik and Stephen Becker for helpful suggestions and discussions.
Funding Statement: This work is supported by the National Natural Science Foundation of China (No. 12071291), the Science and Technology Commission of Shanghai Municipality (No. 20JC1414300) and the Natural Science Foundation of Shanghai (No. 20ZR1436200).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References
1. Wold, S., Esbensen, K., Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52. DOI 10.1016/0169-7439(87)80084-9. [Google Scholar] [CrossRef]
2. Vidal, R., Ma, Y., Sastry, S. S. (2016). Generalized principal component analysis. New York: Springer. [Google Scholar]
3. Sra, S., Dhillon, I. (2005). Generalized nonnegative matrix approximations with bregman divergences. Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 283–290. Vancouver, Canada. [Google Scholar]
4. van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605. [Google Scholar]
5. Pham, N., Pagh, R. (2013). Fast and scalable polynomial kernels via explicit feature maps. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 239–247. Chicago, USA. [Google Scholar]
6. Johnson, W. B., Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26. DOI 10.1090/conm/026/737400. [Google Scholar] [CrossRef]
7. Dasgupta, S., Gupta, A. (2003). An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1), 60–65. DOI 10.1002/(ISSN)1098-2418. [Google Scholar] [CrossRef]
8. Kleinberg, J. M. (1997). Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 599–608. El Paso, USA. [Google Scholar]
9. Ailon, N., Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson--Lindenstrauss transform. Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pp. 557–563. Seattle, USA. [Google Scholar]
10. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M. (2008). A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3), 253–263. DOI 10.1007/s00365-007-9003-x. [Google Scholar] [CrossRef]
11. Krahmer, F., Ward, R. (2011). New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3), 1269–1281. DOI 10.1137/100810447. [Google Scholar] [CrossRef]
12. Candès, E. J., Romberg, J., Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489–509. DOI 10.1109/TIT.2005.862083. [Google Scholar] [CrossRef]
13. Achlioptas, D. (2003). Database-friendly random projections: Johnson--Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. DOI 10.1016/S0022-0000(03)00025-4. [Google Scholar] [CrossRef]
14. Li, P., Hastie, T. J., Church, K. W. (2006). Very sparse random projections. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296. Philadelphia, USA. [Google Scholar]
15. Ailon, N., Chazelle, B. (2009). The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1), 302–322. DOI 10.1137/060673096. [Google Scholar] [CrossRef]
16. Sun, Y., Guo, Y., Tropp, J. A., Udell, M. (2021). Tensor random projection for low memory dimension reduction. arXiv preprint arXiv:2105.00105. [Google Scholar]
17. Jin, R., Kolda, T. G., Ward, R. (2021). Faster Johnson–Lindenstrauss transforms via Kronecker products. Information and Inference: A Journal of the IMA, 10(4), 1533–1562. DOI 10.1093/imaiai/iaaa028. [Google Scholar] [CrossRef]
18. Malik, O. A., Becker, S. (2020). Guarantees for the kronecker fast Johnson–Lindenstrauss transform using a coherence and sampling argument. Linear Algebra and its Applications, 602, 120–137. DOI 10.1016/j.laa.2020.05.004. [Google Scholar] [CrossRef]
19. Kolda, T. G., Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. DOI 10.1137/07070111X. [Google Scholar] [CrossRef]
20. Acar, E., Dunlavy, D. M., Kolda, T. G., MArup, M. (2010). Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1), 41–56. DOI 10.1016/j.chemolab.2010.08.004. [Google Scholar] [CrossRef]
21. Austin, W., Ballard, G., Kolda, T. G. (2016). Parallel tensor compression for large-scale scientific data. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 912–922. Chicago, USA. [Google Scholar]
22. Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A. et al. (2020). Oblivious sketching of high-degree polynomial kernels. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 141–160. DOI 10.1137/1.9781611975994.9. [Google Scholar] [CrossRef]
23. Tang, K., Liao, Q. (2020). Rank adaptive tensor recovery based model reduction for partial differential equations with high-dimensional random inputs. Journal of Computational Physics, 409, 109326. DOI 10.1016/j.jcp.2020.109326. [Google Scholar] [CrossRef]
24. Cui, T., Dolgov, S. (2021). Deep composition of tensor-trains using squared inverse rosenblatt transports. Foundations of Computational Mathematics, 21, 1–60. DOI 10.1007/s10208-021-09537-5. [Google Scholar] [CrossRef]
25. White, S. R. (1992). Density matrix formulation for quantum renormalization groups. Physical Review Letters, 69(19), 2863. DOI 10.1103/PhysRevLett.69.2863. [Google Scholar] [CrossRef]
26. Perez-Garcia, D., Verstraete, F., Wolf, M., Cirac, J. (2007). Matrix product state representations. Quantum Information & Computation, 7(5–6), 401–430. DOI 10.26421/QIC. [Google Scholar] [CrossRef]
27. Verstraete, F., Murg, V., Cirac, J. I. (2008). Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Advances in Physics, 57(2), 143–224. DOI 10.1080/14789940801912366. [Google Scholar] [CrossRef]
28. Orús, R. (2014). A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349, 117–158. DOI 10.1016/j.aop.2014.06.013. [Google Scholar] [CrossRef]
29. Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5), 2295–2317. DOI 10.1137/090752286. [Google Scholar] [CrossRef]
30. Achlioptas, D. (2001). Database-friendly random projections. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281. Santa Barbara, USA. [Google Scholar]
31. Rakhshan, B., Rabusseau, G. (2020). Tensorized random projections. International Conference on Artificial Intelligence and Statistics, pp. 3306–3316. Palermo, Italy. [Google Scholar]
32. van Loan, C. F. (2000). The ubiquitous kronecker product. Journal of Computational and Applied Mathematics, 123(1–2), 85–100. DOI 10.1016/S0377-0427(00)00393-9. [Google Scholar] [CrossRef]
33. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D. P. (2015). Tensorizing neural networks. Advances in Neural Information Processing Systems, 28, 442–450. [Google Scholar]
34. Golub, G. H., van Loan, C. F. (2013). Matrix computations. Baltimore: The Johns Hopkins University Press. [Google Scholar]
35. Schudy, W., Sviridenko, M. (2012). Concentration and moment inequalities for polynomials of independent random variables. Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 437–446. Kyoto, Japan. [Google Scholar]
36. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. DOI 10.1109/5.726791. [Google Scholar] [CrossRef]
Appendix A
Example for E[y2(i)y2(j)]≠E[y2(i)]E[y2(j)],i≠j.
If all TT-ranks of tensorized matrix R in Eq. (17) are equal to one, then R is represented as a Kronecker product of d matrices [29],
R=R1⊗R2⊗⋯⊗Rd,(49)
where Ri∈Rmi×ni, for i=1,2,…,d, whose entries are i.i.d. with mean zero and variance one.
We set d=2,m1=m2=n1=n2=2 in Eq. (49), then
y=Rx=(R1⊗R2)x,(50)
where
$$R_1=\begin{bmatrix}a_1 & a_2\\ b_1 & b_2\end{bmatrix},\qquad R_2=\begin{bmatrix}c_1 & c_2\\ d_1 & d_2\end{bmatrix}.$$
Substituting x:=[x(1)x(2)x(3)x(4)]T and y:=[y(1)y(2)y(3)y(4)]T into Eq. (50), we obtain
$$\begin{bmatrix}y(1)\\ y(2)\\ y(3)\\ y(4)\end{bmatrix}=\begin{bmatrix}a_1c_1x(1)+a_1c_2x(2)+a_2c_1x(3)+a_2c_2x(4)\\ a_1d_1x(1)+a_1d_2x(2)+a_2d_1x(3)+a_2d_2x(4)\\ b_1c_1x(1)+b_1c_2x(2)+b_2c_1x(3)+b_2c_2x(4)\\ b_1d_1x(1)+b_1d_2x(2)+b_2d_1x(3)+b_2d_2x(4)\end{bmatrix}.\tag{51}$$
Applying Eq. (51) to cov(y2(1),y2(3)) gives
cov(y2(1),y2(3))=cov((a1c1x(1)+a1c2x(2)+a2c1x(3)+a2c2x(4))2,(b1c1x(1)+b1c2x(2)+b2c1x(3)+b2c2x(4))2)=cov(a12c12x(1)2+a12c22x(2)2+a22c12x(3)2+a22c22x(4)2,b12c12x(1)2+b12c22x(2)2+b22c12x(3)2+b22c22x(4)2)+cov(2a12c1c2x(1)x(2)+2a22c1c2x(3)x(4),2b12c1c2x(1)x(2)+2b22c1c2x(3)x(4))=(x(1)2+x(3)2)2var(c12)+(x(2)2+x(4)2)2var(c22)+4(x(1)x(2)+x(3)x(4))2var(c1c2)=(x(1)2+x(3)2)2var(c12)+(x(2)2+x(4)2)2var(c22)+4(x(1)x(2)+x(3)x(4))2>0,(52)
and then Eq. (52) implies E[y2(1)y2(3)]≠E[y2(1)]E[y2(3)].
Generally, for some i≠j, E[y2(i)y2(j)]≠E[y2(i)]E[y2(j)].
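A small Monte Carlo check of this example (our own illustration, with Rademacher entries, for which var(c1²)=var(c2²)=0 and the covariance in Eq. (52) reduces to 4(x(1)x(2)+x(3)x(4))²): the sample estimate of E[y²(1)y²(3)] is clearly larger than E[y²(1)]E[y²(3)].

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
samples = []
for _ in range(50000):
    R1 = rng.choice([-1.0, 1.0], size=(2, 2))    # Rademacher: mean 0, variance 1
    R2 = rng.choice([-1.0, 1.0], size=(2, 2))
    samples.append(np.kron(R1, R2) @ x)
samples = np.array(samples)

lhs = np.mean(samples[:, 0] ** 2 * samples[:, 2] ** 2)            # E[y^2(1) y^2(3)]
rhs = np.mean(samples[:, 0] ** 2) * np.mean(samples[:, 2] ** 2)   # E[y^2(1)] E[y^2(3)]
print(lhs, rhs)   # lhs is noticeably larger than rhs (about 1684 vs 900 for this x)
```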