### Megasthenis Asteris

##### megas@utexas.edu

I am a graduate student in the ECE Department of the University of Texas at Austin affiliated with WNCG, working towards my PhD in EE under Alexandros G. Dimakis. Our group moved to Austin in Spring 2013. Prior to that, I spent the first two and a half years of graduate school in the Electrical Engineering Department of the University of Southern California in Los Angeles. I received my Diploma in Electronic and Computer Engineering in 2010 from the Technical University of Crete, Chania, Greece, under the supervision of Assistant Prof. George N. Karystinos.

## Publications

### Conferences

Given two sets of variables derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced \emph{canonical} variables are maximally correlated. Sparse CCA is NP-hard.

We propose a novel combinatorial algorithm for sparse CCA. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the ambient dimension, which makes it suitable for large scale settings. It is simple to implement, and embarrassingly parallelizable. In contrast to the majority of existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA, enforcing additional structure beyond sparsity.

We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.

- (.pdf available soon)
- code

In *Bipartite Correlation Clustering* (BCC)
we are given a complete *bipartite* graph \(G\) with `\(+\)' and `\(-\)' edges,
and we seek a vertex clustering that maximizes the number of *agreements*:
the number of all \(+\) edges within clusters plus all `\(-\)' edges cut across clusters.
BCC is known to be NP-hard.

We present a novel algorithm for \(k\)-BCC, a variant of BCC with an upper bound \(k\) on the number of clusters. Our algorithm outputs a \(k\)-clustering that provably achieves a number of agreements within a multiplicative \((1-\delta)\)-factor from the optimal, for any desired accuracy \(\delta\). It relies on solving a bilinear maximization on the bi-adjacency matrix of the graph, subject to combinatorial constraints. Our algorithm runs in time exponential in \(k\) and \(\delta^{-1}\), but linear in the size of the input.

In the (unconstrained) BCC setting, we show that \( O(\delta^{-1}) \) clusters suffice to achieve an \((1-\delta)\)-factor approximation regardless of the size of the graph. In turn, our \(k\)-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. These components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but as we show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to \(1\) from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data matrix, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtain polynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm on real data-sets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.

```
@inproceedings{spca:nips2015,
title = {Sparse PCA via Bipartite Matchings},
author = {Asteris, Megasthenis and Papailiopoulos, Dimitris and Kyrillidis, Anastasios and Dimakis, Alexandros G},
booktitle = {Advances in Neural Information Processing Systems 28},
editor = {C. Cortes and N.D. Lawrence and D.D. Lee and M. Sugiyama and R. Garnett and R. Garnett},
pages = {766--774},
year = {2015},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/5901-sparse-pca-via-bipartite-matchings.pdf},
}
```

Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two \(k\)-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as superposition of disjoint parts, while it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics, which despite their good empirical performance, lack provable performance guarantees.

We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension \(k\), we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component.

We evaluate our algorithms on several real and artificial datasets and show that their performance matches or outperforms the state of the art.

```
@inproceedings{onmf:nips2015,
title = {Orthogonal NMF through Subspace Exploration},
author = {Asteris, Megasthenis and Papailiopoulos, Dimitris and Dimakis, Alexandros G},
booktitle = {Advances in Neural Information Processing Systems 28},
editor = {C. Cortes and N.D. Lawrence and D.D. Lee and M. Sugiyama and R. Garnett and R. Garnett},
pages = {343--351},
year = {2015},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/5998-orthogonal-nmf-through-subspace-exploration.pdf},
}
```

We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph~\(G\) on \(p\) vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in \(G\).

From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity.

We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.

```
@inproceedings{pathpca:icml2015,
author = {Asteris, Megasthenis and Kyrillidis, Anastasios and Dimakis, Alex and Yi, Han-Gyol and Chandrasekaran, Bharath},
title = {Stay on path: PCA along graph paths},
year = {2015},
volume = {37},
pages = {},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML-15)},
editor = {Francis Bach, David Blei},
publisher = {JMLR Workshop and Conference Proceedings},
url = {http://jmlr.org/proceedings/papers/v37/asteris15.pdf},
}
```

We study upper bounds on the sum-rate of multiple-unicasts.
We approximate the Generalized Network Sharing Bound (GNS cut) of the multiple-unicasts network coding problem with \(k\) independent sources.
Our approximation algorithm runs in polynomial time and yields an upper bound on the joint source entropy rate, which is within an \(O(\log^2{k})\) factor from the GNS cut.
It further yields a vector-linear network code that achieves joint source entropy rate within an \(O(\log^2{k})\) factor from the GNS cut,
but *not* with independent sources: the code induces a correlation pattern among the sources.

Our second contribution is establishing a separation result for vector-linear network codes: for any given field \(\mathbb{F}\) there exist networks for which the optimum sum-rate supported by vector-linear codes over \(\mathbb{F}\) for independent sources can be multiplicatively separated by a factor of \(k^{1-\delta}\), for any constant \(\delta > 0\), from the optimum joint entropy rate supported by a code that allows correlation between sources. Finally, we establish a similar separation result for the asymmetric optimum vector-linear sum-rates achieved over two distinct fields \(\mathbb{F}_{p}\) and \(\mathbb{F}_{q}\) for independent sources, revealing that the choice of field can heavily impact the performance of a linear network code.

```
@inproceedings{shanmugam:sumrate:2015,
author = {Shanmugam, K. and Asteris, M. and Dimakis, A.G.},
booktitle = {Information Theory (ISIT), 2015 IEEE International Symposium on},
title = {On approximating the sum-rate for multiple-unicasts},
year = {2015},
pages = {381--385},
keywords = {combined source-channel coding;correlation methods;entropy;network coding;polynomial approximation;GNS cut;approximation algorithm;correlation pattern;generalized network sharing bound;joint source entropy rate;multiple-unicasts network coding problem;optimum sum-rate;polynomial time;vector-linear network code;Approximation methods;Encoding;Entropy;Indexes;Joints;Network coding;Upper bound;GNS-cut;Multiple unicasts;index coding;network coding;sum-rate},
doi = {10.1109/ISIT.2015.7282481},
month = {June},
}
```

We introduce a novel algorithm to compute nonnegative sparse principal components of positive semidefinite (PSD) matrices. Our algorithm comes with approximation guarantees contingent on the spectral profile of the input matrix \(\mathbf{A}\): the sharper the eigenvalue decay, the better the quality of the approximation.

If the eigenvalues decay like any asymptotically vanishing function,
we can approximate nonnegative sparse PCA within any accuracy \(\epsilon\) in time
polynomial in the matrix dimension \(n\) and desired sparsity \(k\), but not in \(1/\epsilon\).
Further, we obtain a data dependent bound that is computed by executing an algorithm on a given data set.
This bound is significantly tighter than *a-priori* bounds and
can be used to show that for all tested datasets our algorithm is provably
within 40%-90% from the unknown optimum.

Our algorithm is combinatorial and explores a subspace defined by the leading eigenvectors of \(\mathbf{A}\). We test our scheme on several data sets, showing that it matches or outperforms the previous state of the art.

```
@inproceedings{nnspca:2014,
author = {Asteris, Megasthenis and Papailiopoulos, Dimitris and Dimakis, Alexandros},
title = {Nonnegative Sparse PCA with Provable Guarantees},
year = {2014},
pages = {1728--1736},
booktitle = {Proceedings of the 31st International Conference on Machine Learning (ICML-14)},
editor = {Tony Jebara and Eric P. Xing},
publisher = {JMLR Workshop and Conference Proceedings},
url = {http://jmlr.org/proceedings/papers/v32/asteris14.pdf},
}
```

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.

We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.

```
@inproceedings{xorbas:2013,
author = {Sathiamoorthy, Maheswaran and Asteris, Megasthenis and Papailiopoulos, Dimitris and Dimakis, Alexandros G. and Vadali, Ramkumar and Chen, Scott and Borthakur, Dhruba},
title = {XORing elephants: novel erasure codes for big data},
booktitle = {Proceedings of the VLDB Endowment},
volume = {6},
number = {5},
pages = {325--336},
organization = {VLDB Endowment},
year = {2013},
}
```

We introduce a new family of Fountain codes that are systematic and also have sparse parities. Although this is impossible if we require the code to be MDS, we show it can be achieved if we relax our requirement into a near-MDS property. More concretely, for any \(\epsilon\) we construct codes that guarantee that a random subset of \((1 + \epsilon)k\) symbols sufﬁces to recover the original symbols with high probability. Our codes produce an unbounded number of output symbols, creating each parity independently by linearly combining a logarithmic number of input symbols. This structure has the additional beneﬁt of logarithmic locality: a single symbol loss can be repaired by accessing only \(O(\log k)\) other coded symbols. This is a desired property for distributed storage systems where symbols are spread over a network of storage nodes. Our mathematical contribution involves analyzing the rank of sparse random matrices over ﬁnite ﬁelds. We rely on establishing that a new family of sparse random bipartite graphs have large matchings with high probability.

```
@inproceedings{repfount:2012,
author = {Asteris, M. and Dimakis, A.G.},
title = {Repairable Fountain Codes},
booktitle = {Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on},
pages = {1752--1756},
year = {2012},
month = {July},
issn = {2157-8095},
doi = {10.1109/ISIT.2012.6283579},
}
```

We consider the problem of identifying the sparse principal component of a rank-deficient matrix.
We introduce auxiliary spherical variables and prove that there exists a set of candidate index-sets
(that is, sets of indices to the nonzero elements of the vector argument) whose size is polynomially bounded, in terms of rank,
and contains the optimal index-set, *i.e.,* the index-set of the nonzero elements of the optimal solution.
Finally, we develop an algorithm that computes the optimal sparse principal component in polynomial time for any sparsity degree.

```
@inproceedings{asteris:spca:2011,
author = {Asteris, M. and Papailiopoulos, D.S. and Karystinos, G.N.},
title = {Sparse principal component of a rank-deficient matrix},
booktitle = {Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on},
pages = {673--677},
year = {2011},
month = {March},
issn = {2157-8095},
doi = {10.1109/ISIT.2011.6034216},
}
```

### Journals

We introduce a new family of Fountain codes that are systematic and also have sparse parities. Given an input of \(k\) symbols, our codes produce an unbounded number of output symbols, generating each parity independently by linearly combining a logarithmic number of randomly selected input symbols. The construction guarantees that for any \(\epsilon>0\) accessing a random subset of \((1+\epsilon)k\) encoded symbols, asymptotically suffices to recover the \(k\) input symbols with high probability.

Our codes have the additional benefit of logarithmic locality: a single lost
symbol can be repaired by accessing a subset of \(O(\log k)\) of the remaining
encoded symbols.
This is a desired property for distributed storage systems where symbols are
spread over a network of storage nodes.
Beyond recovery upon loss, local reconstruction provides an efficient
alternative for reading symbols that cannot be accessed directly.
In our code, a logarithmic number of disjoint local groups is associated
with each systematic symbol, allowing *multiple parallel reads*.

Our main mathematical contribution involves analyzing the rank of sparse random matrices with specific structure over finite fields. We rely on establishing that a new family of sparse random bipartite graphs have perfect matchings with high probability.

```
@article{repfount:2014,
author = {Asteris, M. and Dimakis, AG.},
journal = {Selected Areas in Communications, IEEE Journal on},
title = {Repairable Fountain Codes},
year = {2014},
month = {May},
volume = {32},
number = {5},
pages = {1037--1047},
doi = {10.1109/JSAC.2014.140522},
issn = {0733-8716},
}
```

The computation of the sparse principal component of a matrix is equivalent to the identification of its principal submatrix with the largest maximum eigenvalue. Finding this optimal submatrix is what renders the problem NP-hard. In this work, we prove that, if the matrix is positive semidefinite and its rank is constant, then its sparse principal component is polynomially computable. Our proof utilizes the auxiliary unit vector technique that has been recently developed to identify problems that are polynomially solvable. Moreover, we use this technique to design an algorithm which, for any sparsity value, computes the sparse principal component with complexity \(O\left(N^{D+1}\right)\), where \(N\) and \(D\) are the matrix size and rank, respectively. Our algorithm is fully parallelizable and memory efficient.

```
@article{spca:2014,
author = {Asteris, Megasthenis and Papailiopoulos, Dimitris S. and Karystinos, Georgios N.},
journal = {Information Theory, IEEE Transactions on},
title = {The Sparse Principal Component of a Constant-Rank Matrix},
year = {2014},
month = {April},
volume = {60},
number = {4},
pages = {2281--2290},
keywords = {"computational complexity;eigenvalues and eigenfunctions;optimisation;
principal component analysis;signal processing;NP-hard;auxiliary unit vector;
constant-rank matrix;maximum eigenvalue;optimal submatrix;principal submatrix;
sparse principal component;sparsity value;Complexity theory;Indexes;
Optimization;Polynomials;Principal component analysis;Sparse matrices;Vectors;
Eigenvalues and eigenfunctions;feature extraction;information processing;
machine learning algorithms;principal component analysis;signal processing algorithms"
},
doi = {10.1109/TIT.2014.2303975},
issn = {0018-9448},
}
```

```
@article{digitalgarden:2013,
author = {Bletsas, A and Vlachaki, A and Kampianakis, E. and Sklivanitis, G. and Kimionis, J. and Tountas, K. and Asteris, M. and Markopoulos, P.},
journal = {Pervasive Computing, IEEE},
title = {Building a Low-Cost Digital Garden as a Telecom Lab Exercise},
year = {2013},
month = {Jan},
volume = {12},
number = {1},
pages = {48--57},
doi = {10.1109/MPRV.2011.83},
issn = {1536-1268},
}
```

### Other

Discrete integration over exponentially large combinatorial sets is a key task for approximate probabilistic inference.
Ermon *et al.* recently developed an approximation algorithm for discrete integration with provable guarantees using hashing techniques. These works reduce discrete integration to solving a polynomial number of MAP inference queries for models that include random linear parity check constraints.
Empirical evidence suggests these inference queries can be solved much more efficiently when the linear parity constraints involve only a few variables. The challenge is how to obtain the required hashing properties with codes that are as sparse as possible.

In this work, we establish a connection between Average Universal (AU) hashing and low-density parity-check codes; we show how the desired statistical properties of Average Universal (AU) hash families translate to properties of the average distance distribution of the code ensemble. Using this connection, we show that random independent LDPC codes with logarithmic degrees suffice to yield AU hash families with the required properties.

Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts with respect to the convergence rate, due to high variance introduced by the stochastic updates. The popular Stochastic Variance-Reduced Gradient (SVRG) method mitigates this shortcoming, introducing a new update rule which requires infrequent passes over the entire input dataset to compute the full-gradient.

In this work, we propose CheapSVRG, a stochastic variance-reduction optimization scheme. Our algorithm is similar to SVRG but instead of the full gradient, it uses a surrogate which can be efficiently computed on a small subset of the input data. It achieves a linear convergence rate ---up to some error level, depending on the nature of the optimization problem--- and features a trade-off between the computational complexity and the convergence rate. Empirical evaluation shows that CheapSVRG performs at least competitively compared to the state of the art.

```
@article{2016arXiv160306861S,
title = {Trading-off variance and complexity in stochastic gradient descent},
author = {{Shah}, V. and {Asteris}, M. and {Kyrillidis}, A. and {Sanghavi}, S.},
journal = {ArXiv e-prints},
archivePrefix = {arXiv},
eprint = {1603.06861},
primaryClass = {stat.ML},
keywords = {Statistics - Machine Learning, Computer Science - Information Theory, Computer Science - Learning, Mathematics - Optimization and Control},
year = {2016},
month = {Mar},
adsurl = {http://adsabs.harvard.edu/abs/2016arXiv160306861S},
adsnote = {Provided by the SAO/NASA Astrophysics Data System},
}
```

We consider the problem of identifying the sparse principal component of a rank-deficient matrix.
We introduce auxiliary spherical variables and prove that there exists a set of candidate index-sets
(that is, sets of indices to the nonzero elements of the vector argument) whose size is polynomially bounded,
in terms of rank, and contains the optimal index-set, *i.e.,* the index-set of the nonzero elements of the optimal solution.
Finally, we develop an algorithm that computes the optimal sparse principal component in polynomial time for any sparsity degree.