\def\R{{\cal R}}
An important part of many data analytics algorithms is a clustering stage, which sorts the data into a small number of groups that, presumably, belong together.
The $k$-means clustering algorithm sorts data into a predefined number of clusters. Each cluster is defined by its center, and a data point belongs to a cluster if it is closer to the center of that cluster than to the center of any other.
The algorithm starts by finding $k$ distinct points to serve as initial cluster centers; it then repeatedly assigns each point to the nearest center, and recomputes each center as the mean of the points assigned to it.
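As an illustration, here is a minimal sketch of this scheme in Python with NumPy; the function name, the fixed iteration count, and the random choice of initial centers from the data are all arbitrary choices for the sake of the example:

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: 'points' is an (n, d) array, 'k' the number of clusters."""
    rng = np.random.default_rng(seed)
    # Find k distinct points to serve as initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to the nearest center.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers
```

In practice one would iterate until the assignment no longer changes, rather than for a fixed number of sweeps.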
The disadvantage of $k$-means clustering is that it presumes convex clusters. Other algorithms are more general. For instance, in spectral graph theory (section 25.5.5) the Fiedler vector sorts the nodes of a graph into a small number of connected groups. We could apply this method as follows:
Let $D$ be the distance matrix $D_{ij}=|x_i-x_j|$ of the data points. From $D$ we derive a weighted graph on the data points and compute the Fiedler vector of its Laplacian; the signs of the components of this vector then divide the points into groups.
Normally, we would use inverse iteration to find this vector, but it is also possible [LinCohen:PIC] to use the power method; section 20.3.
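A small sketch of this idea, under the assumption that distances are converted to edge weights by a Gaussian similarity (the text does not prescribe a particular conversion), and using a dense eigensolver for clarity where inverse iteration or the power method would be used at scale:

```python
import numpy as np

def fiedler_split(x, sigma=1.0):
    """Split data points (an (n, d) array) into two groups by the sign
    of the Fiedler vector of a similarity-graph Laplacian."""
    x = np.asarray(x, dtype=float)
    # Distance matrix D_ij = |x_i - x_j|, turned into Gaussian edge weights.
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    W = np.exp(-(D / sigma) ** 2)
    L = np.diag(W.sum(axis=1)) - W      # graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # second-smallest eigenvector
    return fiedler >= 0                 # boolean group labels
```

The parameter `sigma` controls how quickly similarity decays with distance; it would have to be tuned to the scale of the data.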
A decision tree is a classifier for a combination of ordered and unordered attributes. For each attribute, we have a list of
record-id $|$ class $|$ value
triplets.
Building the decision tree now goes as follows:
\begin{verbatim}
until all attributes are covered:
    find an attribute
    find a way to split the attribute
    for all other attributes:
        split the attribute lists in
        a `left' and `right' part
\end{verbatim}
Finding an attribute is typically done using the \emph{gini index}. After that, finding a split point in that attribute is relatively simple for numerical attributes, though this is facilitated by always keeping the lists sorted.
Splitting non-numeric, or `categorical', attributes is harder. We make a count of the records with the various values. We then group these to give a balanced split.
Splitting the attribute lists takes one scan per attribute list:
\begin{verbatim}
for each item:
    use the record identifier to
        find its value of the split attribute
    based on this assign the item to
        the `left' or `right' part of this attribute list
\end{verbatim}
If the splitting attribute is categorical, assigning a record is slightly harder. For this case, the splitting stage typically uses a hash table that notes for each record number in which subtree it is classified. For the other attribute lists the record number is then used to move the item left or right.
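The splitting scan could be sketched as follows; here the hash table `side` is an ordinary Python dictionary keyed on record id, and a numeric split is assumed (the data layout and names are illustrative, not from any particular implementation):

```python
def split_lists(attribute_lists, split_attr, split_value):
    """attribute_lists: dict mapping attribute name to a list of
    (record_id, class, value) triplets. Split every list according
    to a numeric split on one attribute."""
    # Hash table: for each record id, note which subtree it goes to.
    side = {rid: ('left' if value <= split_value else 'right')
            for rid, _, value in attribute_lists[split_attr]}
    result = {}
    for attr, triplets in attribute_lists.items():
        # One scan per attribute list; the record id gives the side.
        left = [t for t in triplets if side[t[0]] == 'left']
        right = [t for t in triplets if side[t[0]] == 'right']
        result[attr] = (left, right)
    return result
```

Note that each output list stays sorted, since the scan preserves the order of the input list.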
[Sprint:classifier,ScalParC]

The article http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf describes systems, such as the one used by Netflix, that try to predict how much a given person will like a given item.
The approach here is to identify a relatively low-dimensional space $\R^f$ of \emph{features}. For movies one could for instance adopt two features: low-brow versus high-brow, and historic versus contemporary. (High-brow historic: Hamlet; low-brow historic: Robin Hood, Men in Tights; high-brow contemporary: anything by Godard; low-brow contemporary, eh, do I really need to find examples?) Each item (movie) $i$ is then characterised by a vector $q_i\in\R^f$, and each person $j$ by a vector $p_j\in\R^f$ that expresses how much that person likes such movies. The predicted rating by person $j$ of movie $i$ is then $r_{ij}=q_i^tp_j$.
The vectors $q_i,p_j$ are gradually learned from actual ratings $r_{ij}$, for instance by minimizing the regularized functional \[ \min_{p,q} \sum_{ij} (r_{ij}-q_i^tp_j)^2+ \lambda \bigl( \|q_i\|^2+\|p_j\|^2 \bigr). \]
Let $e_{ij}=r_{ij}-q_i^tp_j$ be the difference between the actual and the predicted rating. A stochastic gradient descent step for a single observed rating then updates \[ \begin{array} {l} q_i\leftarrow q_i+\gamma (e_{ij}p_j-\lambda q_i)\\ p_j\leftarrow p_j+\gamma (e_{ij}q_i-\lambda p_j) \end{array} \]
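These update formulas can be sketched in a few lines of Python with NumPy; the function name, the representation of ratings as $(i,j,r_{ij})$ triples, and the initialization are illustrative choices:

```python
import numpy as np

def sgd_factorize(ratings, n_items, n_users, f=2,
                  gamma=0.02, lam=0.05, epochs=1000, seed=0):
    """ratings: list of (i, j, r_ij) triples of observed ratings.
    Learn item vectors q_i and person vectors p_j by stochastic
    gradient descent on the regularized squared error."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_items, f)) * 0.1
    p = rng.standard_normal((n_users, f)) * 0.1
    for _ in range(epochs):
        for i, j, r in ratings:
            e = r - q[i] @ p[j]          # error e_ij of current prediction
            qi = q[i].copy()             # keep old q_i for the p_j update
            q[i] += gamma * (e * p[j] - lam * q[i])
            p[j] += gamma * (e * qi - lam * p[j])
    return q, p
```

The step size $\gamma$ and the regularization parameter $\lambda$ have to be tuned; too large a step makes the iteration diverge.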
Decentralized recommending: [Zheng:2016arXiv:decentral-recommend]
Alternatively: first fix all $p_j$ and minimize over the $q_i$ (in parallel), then fix the $q_i$ and minimize over the $p_j$. This alternating least squares approach handles missing data better.
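One half-step of this alternating scheme is an independent regularized least squares solve per item, which could be sketched as follows (the masked-array representation of the partially observed rating matrix is an assumption of this example):

```python
import numpy as np

def als_step(R, mask, P, lam):
    """With the vectors P fixed, solve for all vectors Q in parallel:
    row Q[i] minimizes sum over observed j of
    (R[i,j] - Q[i] @ P[j])^2 + lam * ||Q[i]||^2."""
    f = P.shape[1]
    Q = np.zeros((R.shape[0], f))
    for i in range(R.shape[0]):
        Pj = P[mask[i]]                      # vectors with an observed rating
        A = Pj.T @ Pj + lam * np.eye(f)      # normal equations, regularized
        b = Pj.T @ R[i, mask[i]]
        Q[i] = np.linalg.solve(A, b)
    return Q
```

A full sweep then calls this once for the item vectors and once, with the transposed rating matrix, for the person vectors; missing ratings are simply excluded by the mask rather than treated as zeros.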