
15 Big data

\def\R{{\cal R}}

15.1 Clustering


An important component of many data analytics algorithms is a clustering stage, which sorts the data into a small number of groups that presumably belong together.

15.1.1 $k$-means clustering


The $k$-means clustering algorithm sorts data into a predefined number of clusters. Each cluster is defined by its center, and a data point belongs to a cluster if it's closer to the center of that cluster than to that of any other.

  1. Find $k$ distinct points to serve as initial cluster centers.

  2. Now repeat until the clusters are stable:
  3. group the points according to the center they are closest to;
  4. recompute each center as the mean of the points assigned to it.
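The steps above can be sketched in a few lines of Python. This is a hypothetical minimal implementation for one-dimensional data, where the distance is simply the absolute difference:

```python
import random

def kmeans(points, k, max_iters=100):
    # step 1: find k distinct points to serve as initial cluster centers
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # step 3: group the points according to the center they are closest to
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[closest].append(p)
        # step 4: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # step 2: stop when the clusters are stable
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Higher-dimensional data only requires replacing the absolute difference by a Euclidean distance, and the scalar mean by a componentwise one.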

15.1.2 Spectral clustering


The disadvantage of $k$-means clustering is that it presumes convex clusters. Other algorithms are more general. For instance, in spectral graph theory (section  25.5.5 ) the Fiedler vector sorts the nodes of a graph into a small number of connected groups. We could apply this method as follows:

  1. Let $D$ be the distance matrix $D_{ij}=|x_i-x_j|$ of the data points.

  2. Let $W$ be the diagonal matrix containing the row sums of $D$.
  3. Then $W-D$ has the form of a graph Laplacian, and we can compute its Fiedler vector.

Normally, we would use inverse iteration to find this vector, but it is also possible  [LinCohen:PIC] to use the power method; see section  20.3 .
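The procedure can be sketched in pure Python. Two assumptions go beyond the text: the distances are converted to Gaussian similarities, so that nearby points are strongly connected in the graph, and the Fiedler vector is found with the power method applied to the shifted matrix $\sigma I-(W-D)$, projecting out the constant vector at every step:

```python
import math

def fiedler_vector(points, iters=200):
    n = len(points)
    # assumption: turn distances |x_i - x_j| into Gaussian similarities,
    # so that nearby points are strongly coupled
    A = [[0.0 if i == j else math.exp(-abs(points[i] - points[j]) ** 2)
          for j in range(n)] for i in range(n)]
    rowsum = [sum(row) for row in A]
    # L = W - A with W the diagonal matrix of row sums: a graph Laplacian
    L = [[(rowsum[i] if i == j else 0.0) - A[i][j] for j in range(n)]
         for i in range(n)]
    # power method on sigma*I - L: its dominant eigenvector is the constant
    # vector (eigenvalue 0 of L); projecting that out leaves the Fiedler vector
    sigma = 2 * max(rowsum)
    v = [math.sin(i + 1) for i in range(n)]      # arbitrary start vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                # project out constant vector
        w = [sigma * v[i] - sum(L[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

The signs of the components then indicate the grouping: points in the same connected group receive the same sign.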

15.2 Classification


15.2.1 Decision trees


A decision tree is a classifier for a combination of ordered and unordered attributes. For each attribute, we have a list of

record-id $|$ class $|$ value


Building the decision tree now goes as follows:

    until all attributes are covered:
        find an attribute
        find a way to split the attribute
        for all other attributes:
            split the attribute lists in
            a `left' and `right' part

Finding an attribute to split on is typically done using an impurity measure such as the Gini index. After that, finding a split point in that attribute is relatively simple for numerical attributes; this is facilitated by always keeping the lists sorted.
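For a numerical attribute, a common criterion (the Gini index, used here as an illustrative choice) picks the cut that minimizes the weighted impurity of the two halves. A sketch, operating on a sorted attribute list of (record-id, class, value) triples:

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class fractions
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(attr_list):
    # attr_list: (record_id, class, value) tuples, kept sorted by value
    classes = [cls for _, cls, _ in attr_list]
    n = len(attr_list)
    best_cut, best_score = None, float('inf')
    for i in range(1, n):
        left, right = classes[:i], classes[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            # cut halfway between two consecutive attribute values
            best_cut = (attr_list[i - 1][2] + attr_list[i][2]) / 2
            best_score = score
    return best_cut, best_score
```

Because the list is sorted, every candidate cut corresponds to a prefix/suffix split, so all cuts can be evaluated in one pass over the list.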

Splitting non-numeric, or `categorical', attributes is harder. We count the records for each attribute value, and then group the values so as to give a balanced split.

Splitting the attribute lists takes one scan per attribute list:

    for each item:
        use the record identifier to
            find its value of the split attribute
        based on this, assign the item to
            the `left' or `right' part of this attribute list

If the splitting attribute is categorical, assigning a record is slightly harder. For this, the splitting stage typically uses a hash table, noting for each record number in which subtree it is classified. For the other attribute lists the record number is then used to move the item left or right.
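A sketch of this splitting stage, on hypothetical attribute lists: the hash table `side` records for each record id which subtree it went to, and each remaining attribute list is then split in a single scan by record id, preserving its sort order:

```python
# attribute lists: one (record_id, class, value) list per attribute,
# each kept sorted by value (hypothetical example data)
age    = [(0, 'yes', 25), (2, 'yes', 30), (1, 'no', 40), (3, 'no', 55)]
income = [(2, 'yes', 20), (0, 'yes', 30), (1, 'no', 60), (3, 'no', 80)]

# split on the `age' attribute at value 35
side = {}                                  # hash table: record id -> subtree
age_left, age_right = [], []
for rec, cls, val in age:
    part = 'left' if val < 35 else 'right'
    side[rec] = part
    (age_left if part == 'left' else age_right).append((rec, cls, val))

# one scan per remaining attribute list: the record id looks up the subtree
income_left  = [t for t in income if side[t[0]] == 'left']
income_right = [t for t in income if side[t[0]] == 'right']
```

The same record-id table works unchanged for a categorical split; only the test that fills `side` differs.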


15.3 Recommender systems


Recommender systems, such as the one made famous by Netflix, try to predict how much a given person will like a given item.

The approach here is to identify a relatively low-dimensional space $\R^f$ of features . For movies one could for instance adopt two features: low-brow versus high-brow, and historic versus contemporary. (High-brow historic: Hamlet; low-brow historic: Robin Hood: Men in Tights; high-brow contemporary: anything by Godard; low-brow contemporary: do I really need to find examples?) For each item (movie) $i$ there is then a vector $q_i\in\R^f$ that characterises the item, and for each person $j$ there is a vector $p_j\in\R^f$ that characterises how much that person likes such movies. The predicted rating of movie $i$ by user $j$ is then $r_{ij}=q_i^tp_j$.

The vectors $q_i,p_j$ are gradually learned from the actually observed ratings $r_{ij}$, for instance by minimizing the regularized least-squares functional \[ \min_{p,q} \sum_{ij} (r_{ij}-q_i^tp_j)^2+ \lambda \bigl( \|q_i\|^2+\|p_j\|^2 \bigr) \] where the sum runs over the known ratings only.

15.3.1 Stochastic gradient descent


Let $e_{ij}=r_{ij}-q_i^tp_j$ be the difference between the actual and the predicted rating. Stochastic gradient descent loops over the known ratings, and for each one updates both vectors by a step of size $\gamma$ against the gradient of the regularized error: \[ \begin{array} {l} q_i\leftarrow q_i+\gamma (e_{ij}p_j-\lambda q_i)\\ p_j\leftarrow p_j+\gamma (e_{ij}q_i-\lambda p_j) \end{array} \]
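These updates can be sketched on a toy set of known ratings; the step size $\gamma$, regularization $\lambda$, feature count $f$, and epoch count below are illustrative choices, not values from the text:

```python
import random

def sgd_factorize(ratings, n_items, n_users, f=2,
                  gamma=0.05, lam=0.01, epochs=1000):
    # ratings: list of known (i, j, r_ij) triples; learn the q_i and p_j
    random.seed(1)
    q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    p = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    for _ in range(epochs):
        for i, j, r in ratings:
            # e_ij = r_ij - q_i^t p_j
            e = r - sum(qc * pc for qc, pc in zip(q[i], p[j]))
            for c in range(f):
                qi, pj = q[i][c], p[j][c]
                q[i][c] = qi + gamma * (e * pj - lam * qi)
                p[j][c] = pj + gamma * (e * qi - lam * pj)
    return q, p
```

Note that both vectors are updated from their values at the start of the step, which is why the old components are saved before either update.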

Decentralized recommending:  [Zheng:2016arXiv:decentral-recommend]

15.3.2 Alternating least squares


First fix all $p_j$ and minimize over the $q_i$ (which can be done in parallel, since each $q_i$ is an independent least-squares problem), then fix the $q_i$ and minimize over the $p_j$. This alternation handles missing data better.
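With a single feature ($f=1$) each minimization is a scalar least-squares problem, which makes for a compact sketch; the data layout and $\lambda$ below are hypothetical, and `None` marks a missing rating:

```python
def als(ratings, lam=0.01, sweeps=20):
    # ratings[i][j] = known rating r_ij, or None if user j never rated item i
    n_items, n_users = len(ratings), len(ratings[0])
    q = [1.0] * n_items
    p = [1.0] * n_users
    for _ in range(sweeps):
        # fix the p_j, minimize over each q_i independently (parallelizable):
        # q_i = (sum_j r_ij p_j) / (lambda + sum_j p_j^2) over known ratings
        for i in range(n_items):
            js = [j for j in range(n_users) if ratings[i][j] is not None]
            q[i] = (sum(ratings[i][j] * p[j] for j in js)
                    / (lam + sum(p[j] ** 2 for j in js)))
        # then fix the q_i and minimize over each p_j the same way
        for j in range(n_users):
            is_ = [i for i in range(n_items) if ratings[i][j] is not None]
            p[j] = (sum(ratings[i][j] * q[i] for i in is_)
                    / (lam + sum(q[i] ** 2 for i in is_)))
    return q, p
```

Missing entries are simply skipped in the sums, which is how this approach copes with an incomplete rating matrix; the product $q_ip_j$ then predicts an unobserved rating.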
