12 Big data

12.1 Clustering

An important part of many data analytics algorithms is a clustering stage, which sorts the data into a small number of groups that, presumably, belong together.

12.1.1 $k$-means clustering

The $k$-means clustering algorithm sorts data into a predefined number of clusters. Each cluster is defined by its center, and a data point belongs to a cluster if it is closer to the center of that cluster than to the center of any other. The algorithm goes as follows (a code sketch follows these steps):

  1. Find $k$ distinct points to serve as initial cluster centers.

  2. Now repeat until the clusters are stable:

  3. Group the points according to the center they are closest to;

  4. recompute the centers based on these points.
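Here is a minimal sketch of this iteration, assuming the data points are the rows of a numpy array; all names are illustrative, and empty clusters are not handled.

    import numpy as np

    def kmeans(points, k, max_iterations=100):
        # step 1: pick k distinct points as the initial cluster centers
        rng = np.random.default_rng(seed=0)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iterations):
            # group the points according to the center they are closest to
            distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # recompute the centers based on these points
            # (empty clusters are not handled in this sketch)
            new_centers = np.array([points[labels == c].mean(axis=0) for c in range(k)])
            if np.allclose(new_centers, centers):   # the clusters are stable
                break
            centers = new_centers
        return labels, centers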

12.1.2 Spectral clustering

The disadvantage of $k$-means clustering is that it presumes convex clusters. Other algorithms are more general. For instance, in spectral graph theory (section  ) the Fiedler vector sorts the nodes of a graph into a small number of connected groups. We could apply this method as follows:

  1. Let $D$ be the distance matrix $D_{ij}=|x_i-x_j|$ of the data points.

  2. Let $W$ be the diagonal matrix containing the row sums of $D$; then

  3. $W-D$ is a matrix for which we can find the Fiedler vector.

Normally we would use inverse iteration to find this vector, but it is also possible [LinCohen:PIC] to use the power method; see section  .
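The following is a minimal sketch of this construction, assuming the data points are the rows of a numpy array; it uses a dense eigendecomposition of $W-D$ rather than inverse iteration or the power method, and splits the points on the sign of the Fiedler vector.

    import numpy as np

    def fiedler_split(x):
        # distance matrix D_ij = |x_i - x_j|
        D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
        # diagonal matrix containing the row sums of D
        W = np.diag(D.sum(axis=1))
        L = W - D
        # eigenvectors of the symmetric matrix W - D, ordered by increasing eigenvalue
        eigenvalues, eigenvectors = np.linalg.eigh(L)
        fiedler = eigenvectors[:, 1]          # the Fiedler vector
        # split the points according to the sign of the Fiedler vector
        return fiedler > 0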

12.2 Classification

12.2.1 Decision trees

A decision tree is a classifier for a combination of ordered (numerical) and unordered (`categorical') attributes. For each attribute, we have a list of record-id $|$ class $|$ value triplets.

Building the decision tree now goes as follows:

    until all attributes are covered:
        find an attribute
        find a way to split the attribute
        for all other attributes:
            split the attribute lists into
                a `left' and a `right' part

Finding an attribute is typically done using the Gini index. After that, finding a split point in that attribute is relatively simple for numerical attributes; this is facilitated by keeping the lists sorted.
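As an illustration, here is a sketch of finding a split point in one numerical attribute list, already sorted by value, with a single scan that minimizes the weighted Gini index of the two parts; all names are illustrative.

    import numpy as np

    def gini(class_counts):
        # Gini index of a set with the given counts per class
        total = class_counts.sum()
        if total == 0:
            return 0.0
        fractions = class_counts / total
        return 1.0 - np.sum(fractions * fractions)

    def best_numeric_split(values, classes, n_classes):
        # values, classes: one attribute list, already sorted by value
        n = len(values)
        left = np.zeros(n_classes)
        right = np.bincount(classes, minlength=n_classes).astype(float)
        best_score, best_split = np.inf, None
        for i in range(n - 1):
            # move record i from the right part to the left part
            left[classes[i]] += 1
            right[classes[i]] -= 1
            if values[i] == values[i + 1]:
                continue                     # only split between distinct values
            score = ((i + 1) * gini(left) + (n - i - 1) * gini(right)) / n
            if score < best_score:
                best_score = score
                best_split = (values[i] + values[i + 1]) / 2
        return best_split, best_score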

Splitting non-numeric, or `categorical', attributes is harder. We count the records for each of the possible values, and then group the values so as to give a balanced split.

Splitting the attribute lists takes one scan per attribute list:

    for each item:
        use the record identifier to
            find its value of the split attribute
        based on this, assign the item to
            the `left' or `right' part of this attribute list.

If the splitting attribute is categorical, assigning a record is slightly harder. In that case the splitting stage typically builds a hash table that records, for each record number, in which subtree it is classified. For the other attribute lists the record number is then used to move the item left or right.
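A minimal sketch of this scan, assuming the hash table is a Python dictionary that maps each record identifier to 'left' or 'right'; the names are illustrative.

    def split_attribute_list(attribute_list, split_side):
        # attribute_list: a list of (record_id, class_label, value) triplets;
        # split_side: dict mapping record_id -> 'left' or 'right',
        #             built while splitting the chosen attribute
        left, right = [], []
        for record_id, class_label, value in attribute_list:
            # look up where the splitting attribute sent this record
            if split_side[record_id] == 'left':
                left.append((record_id, class_label, value))
            else:
                right.append((record_id, class_label, value))
        return left, right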

See [Sprint:classifier,ScalParC].

12.3 Recommender systems

The article http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf describes systems, such as the one used by Netflix, that try to predict how much a given person will like a given item.

The approach here is to identify a relatively low-dimensional space $\mathbb{R}^f$ of features. For movies one could for instance adopt two features: low-brow versus high-brow, and historic versus contemporary. (High-brow historic: Hamlet; low-brow historic: Robin Hood: Men in Tights; high-brow contemporary: anything by Godard; low-brow contemporary, eh, do I really need to find examples?) For each item (movie) $i$ there is then a vector $q_i\in\mathbb{R}^f$ that characterises the item, and for each person $j$ there is a vector $p_j\in\mathbb{R}^f$ that characterises how much that person likes such movies. The predicted rating of movie $i$ by user $j$ is then $r_{ij}=q_i^tp_j$.

The vectors $q_i,p_j$ are gradually learned from the actual ratings $r_{ij}$ that are known, for instance by minimizing \begin{equation} \min_{p,q} \sum_{ij} (r_{ij}-q_i^tp_j)^2+ \lambda \bigl( \|q_i\|^2+\|p_j\|^2 \bigr) \end{equation} where the sum is over the pairs $(i,j)$ for which a rating is known.

12.3.1 Stochastic gradient descent

Let $e_{ij}=r_{ij}-q_i^tp_j$ be the difference between the actual and the predicted rating. Each time a known rating is processed, the corresponding feature vectors are updated with a step size $\gamma$: \begin{equation} \begin{array} {l} q_i\leftarrow q_i+\gamma (e_{ij}p_j-\lambda q_i)\\ p_j\leftarrow p_j+\gamma (e_{ij}q_i-\lambda p_j) \end{array} \end{equation}
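A minimal sketch of these updates, assuming the known ratings are given as a list of $(i,j,r_{ij})$ triples and $q$, $p$ are numpy arrays of feature vectors; the values of $\gamma$ and $\lambda$ are only illustrative.

    def sgd_epoch(ratings, q, p, gamma=0.005, lam=0.02):
        # ratings: list of (i, j, r_ij) triples of known ratings;
        # q and p: numpy arrays of shape (number of items, f) and (number of users, f)
        for i, j, r in ratings:
            e = r - q[i] @ p[j]                  # e_ij = r_ij - q_i^t p_j
            qi = q[i].copy()                     # keep the old q_i for the p_j update
            q[i] += gamma * (e * p[j] - lam * q[i])
            p[j] += gamma * (e * qi - lam * p[j])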

Decentralized recommending:  [Zheng:2016arXiv:decentral-recommend]

12.3.2 Alternating least squares

First fix all the $p_j$ and minimize over the $q_i$ (which can be done in parallel over $i$); then fix the $q_i$ and minimize over the $p_j$. Each of these steps is an ordinary linear least squares problem. This handles missing data better.
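A minimal sketch of one such sweep, assuming the known ratings are stored in a Python dictionary keyed by $(i,j)$; each subproblem is solved through its regularized normal equations, and the value of $\lambda$ is only illustrative.

    import numpy as np

    def als_sweep(R, q, p, lam=0.02):
        # R: dictionary {(i, j): r_ij} of known ratings; q, p as before
        f = q.shape[1]
        # fix all p_j and solve for each q_i independently (parallel in principle)
        for i in range(q.shape[0]):
            js = [j for (ii, j) in R if ii == i]
            if not js:
                continue
            A = p[js]                                  # the fixed user vectors
            b = np.array([R[i, j] for j in js])
            q[i] = np.linalg.solve(A.T @ A + lam * np.eye(f), A.T @ b)
        # then fix all q_i and solve for each p_j
        for j in range(p.shape[0]):
            is_ = [i for (i, jj) in R if jj == j]
            if not is_:
                continue
            A = q[is_]                                 # the fixed item vectors
            b = np.array([R[i, j] for i in is_])
            p[j] = np.linalg.solve(A.T @ A + lam * np.eye(f), A.T @ b)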
