\def\R{{\cal R}}
An important part of many data analytics algorithms is a clustering stage, which sorts the data into a small number of groups that, presumably, belong together.
The $k$-means clustering algorithm sorts data into a predefined number of clusters. Each cluster is defined by its center, and a data point belongs to a cluster if it is closer to the center of that cluster than to that of any other.
Find $k$ distinct points to serve as initial cluster centers.
Now repeat until the clusters are stable:
Group the points according to the center they are closest to;
recompute the centers based on these points.
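The iteration above can be sketched in Python with NumPy; the function name, stopping criterion, and use of Euclidean distance are illustrative choices, not part of the algorithm's definition.

```python
import numpy as np

def kmeans(x, k, iters=100):
    """Sketch of k-means; x is an n-by-d array of data points."""
    # pick k distinct data points as initial centers
    centers = x[np.random.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign each point to its closest center
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each center as the mean of its points
        # (keeping the old center if a cluster happens to go empty)
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):       # clusters are stable: done
            return new, labels
        centers = new
    return centers, labels
```

Note that the result depends on the random initial centers; production implementations typically restart from several initializations and keep the best.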
The disadvantage of $k$-means clustering is that it presumes convex clusters. Other algorithms are more general. For instance, in spectral graph theory (section ) the Fiedler vector sorts the nodes of a graph into a small number of connected groups. We could apply this method as follows:
Let $D$ be the distance matrix $D_{ij}=|x_i-x_j|$ of the data points.
Let $W$ be the diagonal matrix containing the row sums of $D$; then
$W-D$ is a matrix for which we can find the Fiedler vector.
Normally, we would use inverse iteration to find this vector, but it is also possible~\cite{LinCohen:PIC} to use the power method; section .
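As a sketch, the following shifted-and-deflated power iteration finds the Fiedler vector of $W-D$ for one-dimensional data points; this is a simple stand-in, not the power iteration clustering scheme of the cited paper, which operates on a normalized affinity matrix instead.

```python
import numpy as np

def fiedler_power(x, iters=1000):
    # D_ij = |x_i - x_j|: distance matrix of the (here: scalar) data points
    D = np.abs(x[:, None] - x[None, :])
    L = np.diag(D.sum(axis=1)) - D        # L = W - D; L @ ones == 0
    n = len(x)
    sigma = 2 * D.sum(axis=1).max()       # Gershgorin: all eigenvalues of L <= sigma
    M = sigma * np.eye(n) - L             # largest eigenvalues of M <-> smallest of L
    ones = np.ones(n) / np.sqrt(n)
    v = np.random.default_rng(0).standard_normal(n)
    for _ in range(iters):
        v -= (v @ ones) * ones            # deflate the trivial eigenvector of eigenvalue 0
        v = M @ v
        v /= np.linalg.norm(v)
    return v                              # approximates the Fiedler vector of L
```

The shift by $\sigma$ turns the second-smallest eigenvalue of $L$ into the dominant one of $M$ once the known null vector of ones is projected out.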
A
decision tree
is a classifier for a combination of ordered and
unordered attributes. For each attribute, we have a list of
record-id $|$ class $|$ value
triplets.
Building the decision tree now goes as follows:\\
until all attributes are covered:
\>find the best attribute to split on
\>find a split point for that attribute
\>for all other attributes
\>\>split the attribute lists in
\>\>\>a `left' and `right' part
Finding an attribute is typically done using the \indexterm{Gini index}. After that, finding a split point in that attribute is relatively simple for numerical attributes, though this is facilitated by keeping the lists sorted on value.
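For a numeric attribute, finding the split point can be sketched as follows, using the Gini impurity $1-\sum_c p_c^2$ of the class proportions on each side; the function names are illustrative.

```python
from collections import Counter

def gini(counts):
    # Gini impurity 1 - sum_c p_c^2 of a class-count mapping
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(values, classes):
    # values[i], classes[i]: value and class of record i for one numeric attribute
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(classes)
    best_g, best_t = float('inf'), None
    n = len(values)
    for k, i in enumerate(order[:-1], start=1):
        left[classes[i]] += 1              # move record i to the left side
        right[classes[i]] -= 1
        # weighted impurity of splitting after the k-th smallest value
        g = k / n * gini(left) + (n - k) / n * gini(right)
        if g < best_g:
            best_g = g
            best_t = (values[i] + values[order[k]]) / 2
    return best_g, best_t
```

Because the list is processed in sorted order, all $n-1$ candidate split points are evaluated in a single scan.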
Splitting non-numeric, or `categorical', attributes is harder. We make a count of the records with the various values, and then group these values so as to give a balanced split.
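One simple way of grouping the value counts, sketched here under the assumption that balance of record counts is the only criterion, is to assign values greedily, largest count first, to whichever side is currently lighter.

```python
from collections import Counter

def balanced_categorical_split(values):
    # Count the records per categorical value, then greedily assign each
    # value (largest count first) to the side with fewer records so far.
    counts = Counter(values)
    left, right = set(), set()
    nl = nr = 0
    for val, c in counts.most_common():
        if nl <= nr:
            left.add(val); nl += c
        else:
            right.add(val); nr += c
    return left, right
```

An actual classifier would also weigh the impurity of the resulting split, not just its balance.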
Splitting the attribute lists takes one
scan per attribute list:\\
for each item:
\>use the record identifier to
\>\>find its value of the split attribute
\>based on this assign the item to
\>\>the `left' or `right' part of this attribute list.
If the splitting attribute is categorical, assigning a record is slightly harder. In that case the splitting stage typically uses a hash table, noting for each record number in which subtree it is classified. For the other attribute lists the record number is then used to move the item left or right.
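The two-pass scheme above can be sketched as follows; the dictionary `side` plays the role of the hash table from record number to subtree, and the data layout (a triplet list per attribute) is an illustrative choice.

```python
def split_attribute_lists(lists, split_attr, predicate):
    # lists: {attr: [(record_id, cls, value), ...]}, each in its own order.
    # First pass: decide left/right from the splitting attribute only,
    # recording the outcome per record id in a hash table.
    side = {rid: ('L' if predicate(val) else 'R')
            for rid, cls, val in lists[split_attr]}
    # Second pass: one scan per attribute list, routed by record id,
    # preserving each list's internal (e.g. sorted) order.
    out = {'L': {}, 'R': {}}
    for attr, triplets in lists.items():
        for part in out.values():
            part[attr] = []
        for rid, cls, val in triplets:
            out[side[rid]][attr].append((rid, cls, val))
    return out['L'], out['R']
```

Note that each attribute list keeps its order, so numerical lists stay sorted after the split.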
For more on parallel decision tree classifiers, see \cite{Sprint:classifier,ScalParC}.

The article \url{http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf} describes systems, such as that of Netflix, that try to predict how much a given person will like a given item.
The approach here is to identify a relatively low-dimensional space $\R^f$ of features. For movies one could for instance adopt two features: low-brow versus high-brow, and historic versus contemporary. (High-brow historic: Hamlet; low-brow historic: Robin Hood: Men in Tights; high-brow contemporary: anything by Godard; low-brow contemporary, eh, do I really need to find examples?) For each item (movie) $i$ there is then a vector $q_i\in\R^f$ that characterizes the item, and for each person $j$ there is a vector $p_j\in\R^f$ that characterizes how much that person likes such movies. The predicted rating of item $i$ by user $j$ is then $r_{ij}=q_i^tp_j$.
The vectors $q_i,p_j$ are gradually learned from the actually observed ratings $r_{ij}$, for instance by minimizing the regularized least-squares functional \begin{equation} \min_{p,q} \sum_{ij} (r_{ij}-q_i^tp_j)^2+ \lambda \bigl( \|q_i\|^2+\|p_j\|^2 \bigr) \end{equation} where the sum ranges over the known ratings only, and $\lambda$ is a regularization parameter.
Let $e_{ij}=r_{ij}-q_i^tp_j$ be the difference between actual and predicted rating. A stochastic gradient descent method then applies, for each known rating, the updates \begin{equation} \begin{array} {l} q_i\leftarrow q_i+\gamma (e_{ij}p_j-\lambda q_i)\\ p_j\leftarrow p_j+\gamma (e_{ij}q_i-\lambda p_j) \end{array} \end{equation} where $\gamma$ is the step size.
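These updates can be sketched as follows; the function name, the initialization scale, and the fixed number of sweeps are illustrative assumptions.

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, gamma=0.01, lam=0.1, sweeps=200):
    # ratings: list of (i, j, r_ij) triplets for the known entries only
    rng = np.random.default_rng(0)
    q = rng.normal(scale=0.1, size=(n_items, f))   # item feature vectors q_i
    p = rng.normal(scale=0.1, size=(n_users, f))   # user feature vectors p_j
    for _ in range(sweeps):
        for i, j, r in ratings:
            e = r - q[i] @ p[j]                    # e_ij = r_ij - q_i^t p_j
            q[i] += gamma * (e * p[j] - lam * q[i])
            p[j] += gamma * (e * q[i] - lam * p[j])
    return q, p
```

As written, the update of $p_j$ uses the already-updated $q_i$; applying both updates from the old values is an equally valid reading of the equations.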
Decentralized recommending is discussed in \cite{Zheng:2016arXiv:decentral-recommend}.
Alternatively, first fix all $p_j$ and minimize over the $q_i$ (in parallel), then fix the $q_i$ and minimize over the $p_j$, and so on alternately; this \indexterm{alternating least squares} approach handles missing data better.
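One half-sweep of this alternating scheme can be sketched as follows, with a boolean mask marking the known ratings; the data layout and function name are illustrative.

```python
import numpy as np

def als_step(R, mask, q, p, lam):
    # With all rows of p fixed, each q[i] solves an independent regularized
    # least-squares problem over the ratings of item i that are known
    # (mask[i]); the loop iterations are independent, hence parallelizable.
    f = q.shape[1]
    for i in range(q.shape[0]):
        j = np.flatnonzero(mask[i])              # users who rated item i
        A = p[j].T @ p[j] + lam * np.eye(f)
        q[i] = np.linalg.solve(A, p[j].T @ R[i, j])
    return q
```

A full sweep alternates `als_step(R, mask, q, p, lam)` with `als_step(R.T, mask.T, p, q, lam)`; missing entries simply never appear in any of the small least-squares systems.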