Saturday, January 18, 2020

Equivalence Relations and Deduping your Music Library

 Motivation

One way or another, I always end up with duplicate songs or pictures in my digital libraries, as well as duplicate contacts in my address book.  For music, two songs having the same title does not necessarily make them the same song, so a more complex relation is required. I have developed some techniques for deduping and what they all have in common is that they are looking for equivalence relations. In this article I will introduce equivalence relations and apply them to the simple example of deduping a music library.

The Equivalence Relation

Relations are extensively used in Computer Science. For example, in a Relational Database, a relation is established from the table Song to the table Artist by creating a foreign key. In a graph, node x is related to node y if there is an edge from x to y. In Object Oriented design, class Song can be related to class Artist via Composition if Song contains Artist as a field. The examples are countless.

In Set Theory a relation R on a Set A is a set of ordered pairs $\{(x,y)|x,y \in A\}$. Consequently, a relation is simply a subset of $A \times A$, the Cartesian product of A with itself.

Some relations have interesting properties that make them special. A relation $R$ is reflexive if $(x,x) \in R, \forall x \in A$. In other words, if each element of A is related to itself. For example, the relation $\geq$ on the set of real numbers is reflexive since $x \geq x$ is true for every real number. Similarly, the relation "x is as fast as y" defined on the set of people is reflexive, since I am as fast as myself, even though sometimes I get ahead of myself.

A relation R is symmetric if $(x,y) \in R \implies (y,x) \in R$, which means that if
$x$ is related to $y$ then $y$ is also related to $x$. Friendships on Facebook are considered to be symmetric since if Bob is friends with Tim, then Tim is also a friend of Bob, even though in real life friendships may not be as reciprocal. Following on Twitter is not symmetric since you may be following Kim Kardashian but, since you are reading this blog, it is very unlikely that she is following you.

A relation R is transitive if whenever $(x,y) \in R$ and $(y,z) \in R$, then $(x,z) \in R$. For example, "is taller than" is a transitive relation since if Tim is taller than Bob and Bob is taller than Greg, then Tim is also taller than Greg. The relations $\geq, \leq, \subset$, $\supset$ are other famous transitive relations, and in general, every type of comparison is expected to be transitive.

This leads us to the relation that has it all: a relation that is reflexive, symmetric, and transitive is an Equivalence Relation. Equivalence relations are omnipresent in Math and Software, though not so much in real life, where some animals are always more equal than others. For example, congruence modulo n is an equivalence on the set of integers, equality of norm defines an equivalence on a set of vectors, and connectivity defines an equivalence on a graph.

When two elements of the set $A$ are related via an equivalence relation, we say that they are equivalent and the relationship is denoted by $x \equiv y$.

Given a set $A$ and an equivalence relation R, for any element $x \in A$, the set of all the elements equivalent to x is called the equivalence class of x and is denoted by $[x]$. Each equivalence class is a subset of $A$. The following theorem is very important, albeit rather obvious. I supply the proof at the end of this article.

Theorem: The collection of all equivalence classes under an equivalence relation $R$ partitions $A$ into disjoint subsets.

Note: A partition of a set $A$ is a collection of disjoint subsets $B_i$, such that $\cup_i B_i = A$.

Finding duplicates

Now let's look at how equivalences are used to find duplicates in a music library. As I mentioned in the introduction,  the same title does not necessarily identify two songs as duplicates, especially if the title is "Track 1". I have noticed that if both the title and the size of the track match, then they are almost certainly duplicates. 

So I can identify duplicates by creating an equivalence relation on the set of my music files. I define two mp3s to be equivalent if they have the same title and the same size. Then I partition the set of songs into subsets using this equivalence relation. Any subset that has more than one element consists of duplicate songs.

So if my song class is this (Kotlin):
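A minimal sketch of such a data class (the property names, including the extra path field, are my own choices; title and size are the fields that define the equivalence):

```kotlin
// Sketch: a data class gets equals() and hashCode() generated automatically.
data class Song(val title: String, val size: Long, val path: String)
```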


Then I can use this code to find duplicates
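A sketch of how the grouping might be written, repeating the Song class so the snippet is self-contained (names are my own):

```kotlin
data class Song(val title: String, val size: Long, val path: String)

// Group songs by their equivalence class (title, size) and keep only
// the classes that contain more than one song.
fun findDuplicates(songs: List<Song>): List<List<Song>> =
    songs.groupBy { it.title to it.size }
         .values
         .filter { it.size > 1 }
```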

The succinctness of Kotlin can make it difficult to follow how the task is accomplished. Here is the same code in Java:

I represent the equivalence class by an object identified by title and size. The data class in Kotlin gives me equals() and hashCode() for free, but in Java I need to explicitly define them to use this object as a key to a map.

In the following code, I construct a map where the key is the equivalence class and the value is the list of songs that belong to that class, which I build as I iterate through the list of songs. Finally, I retain only the equivalence classes that contain more than one element. These are the ones that contain duplicates.
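A sketch of that Java version (class and method names are my own):

```java
import java.util.*;

// Sketch of the map-based deduping described above; names are illustrative.
public class DedupeExample {

    static class Song {
        final String title;
        final long size;
        final String path;
        Song(String title, long size, String path) {
            this.title = title;
            this.size = size;
            this.path = path;
        }
    }

    // The equivalence class: two songs are equivalent iff title and size match,
    // so equals() and hashCode() are defined over exactly those two fields.
    static class SongKey {
        final String title;
        final long size;
        SongKey(String title, long size) {
            this.title = title;
            this.size = size;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof SongKey)) return false;
            SongKey k = (SongKey) o;
            return size == k.size && title.equals(k.title);
        }
        @Override public int hashCode() {
            return Objects.hash(title, size);
        }
    }

    static List<List<Song>> findDuplicates(List<Song> songs) {
        // Partition the songs into equivalence classes keyed by (title, size).
        Map<SongKey, List<Song>> classes = new HashMap<>();
        for (Song s : songs) {
            classes.computeIfAbsent(new SongKey(s.title, s.size),
                                    k -> new ArrayList<>()).add(s);
        }
        // Retain only the classes with more than one element.
        List<List<Song>> duplicates = new ArrayList<>();
        for (List<Song> group : classes.values()) {
            if (group.size() > 1) duplicates.add(group);
        }
        return duplicates;
    }
}
```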

Admittedly, in this case I got lucky in terms of performance. Because I am able to identify each equivalence class by looking up a key in a map, which is an $O(1)$ operation, the complexity of the code is $O(n)$. In other situations, looking up the equivalence class may not be as straightforward. For example, when I am looking for duplicate contacts in an address book, two contacts are duplicates if they share one of many email addresses or one of many phone numbers. It is not possible to determine this relation by a single lookup, so I would have to get more creative, since a brute force approach would result in $O(n^2)$ complexity. I will leave this harder problem for another blog post.

Proof of the Theorem

Let $A$ be a set and $R$ an equivalence relation on $A$. If $[x]$ and $[y]$ are two equivalence classes of $R$, I must show that either they are disjoint, or that $[x] = [y]$.

If $[x]$ and $[y]$ are not disjoint, then they have a common element. Let's call it $z$. 
To show that  $[x] = [y]$, I must show that $[x] \subset [y]$ and $[y] \subset [x]$.
To show that $[x] \subset [y]$, it suffices to pick any member of $[x]$ and show that it is also a member of $[y]$.

Pick any $x' \in [x]$. Both $x'$ and $z$ belong to $[x]$, so $x' \equiv x$ and $z \equiv x$; by the symmetry property of equivalences, $x \equiv z$, and by transitivity, $x' \equiv z$. Similarly, since $z$ is in $[y]$, we have $z \equiv y$. Using the transitive property once more, $x' \equiv z$ and $z \equiv y$ imply $x' \equiv y$, which means that $x'$ is in the equivalence class $[y]$.

I have shown that  $[x] \subset [y]$. A similar argument, reversing the roles of $[x]$ and $[y]$, shows that $[y] \subset [x]$. Therefore, $[x] = [y]$.


Share:

Wednesday, July 26, 2017

Elegant Image Resizing with Graph Theory


We often need to resize images to upload them, share them, or print them. However, occasionally we need to reduce the image size only in one direction. For example, if I want to post a panoramic picture in my blog, like the one below, the disproportionate length makes the details hard to see.


Standard image manipulation tools do not offer many choices. I can crop, thus losing detail at the edges, or resize in one direction, making everything look distorted.

Ideally, I'd like to have a smaller picture where the essential details are not distorted, as in the one below:

In this post, I will present a brilliant, yet relatively low tech algorithm to achieve this result, called Seam Carving. The original paper describing the algorithm by Avidan and Shamir is here.

If you are interested in the implementation details, check out the Java source code I used to create these examples in my GitHub repository.

Representation of Digital Images

A digital image is represented by a matrix of pixels. If the image is in grayscale, each pixel has a value between 0 and 255 to represent its level of grayness with 0 being black and 255 white. For color images, each pixel has three values, representing the intensities of red, green, and blue.

Consequently, we can consider an image as a function in two dimensions that assigns to each point (x,y) a color intensity f(x,y). The figure below shows this functional representation of a simple grayscale image.

 Notice that the curvature of the 3D contour is nearly flat for areas of the image with similar color, but it is steep when there are significant color variations. Therefore if we compute the derivative - or gradient - of the image at every point, points appearing in uniformly colored areas, such as the sky or ocean, will have small gradient values. In contrast, points at the boundaries of different elements will have large gradients.  We can use this information to tell us which pixels we can remove without affecting the image very much.

Images as Graphs

The idea behind Seam Carving is to remove vertical lines from the image until it reaches the desired width (or horizontal lines until it reaches the desired height). The lines need not be straight. In fact, we are looking for a jagged line going from top to bottom such that removing it will cause the minimum alteration to the image.  We already know that the gradient of the image will give us information about the least valuable points to remove. Now we need a way to figure out which, among all the possible jagged lines, is the best. In order to do this, we use concepts from graphs and shortest paths.

A finite graph consists of a set of points called vertices and a set of edges connecting some of these points. For our purposes, we can view an image as a graph where each point is connected to the immediate neighbors right below it. So each interior point has 3 neighbors, whereas boundary points have between 0 and 3 neighbors.

A path in a graph is a sequence of vertices, such that each vertex in the sequence is connected by an edge to the vertex preceding it and the vertex following it. Remember that each vertex on the image graph has a value defined by the magnitude of the gradient. Our goal is to find a path starting from the top of the image to the bottom such that the total value of the path vertices is minimal. In Graph theory this is called finding the Shortest Path.

Dynamic Programming

There are many algorithms for computing shortest paths in graphs, and many of them are derived from Dynamic Programming (DP). Dynamic Programming is a generic optimization framework for solving a vast family of problems, often involving systems that evolve over time and are subject to randomness. In the case of shortest paths in graphs, there is no time evolution or randomness, so we get a very simple version of DP. The idea of the algorithm is very straightforward. I will present it here after introducing a bit of notation:

Let $v_{ij}$ be the vertex in the image graph at row $i$ and column $j$. Let $N(v_{ij})$ be the set of neighbors of a vertex $v_{ij}$. Let $g(v_{ij})$ denote the magnitude of the gradient for that vertex. We introduce a terminal vertex $v_T$ such that all the vertices in the last row are connected to it. This vertex has gradient 0. We want to find shortest paths from all the vertices in the top of the graph to this terminal vertex $v_T$. Finally, let $d(v_{ij})$ denote the length of the shortest path between vertex $v_{ij}$ and $v_T$.

We can now set up a recursive computation for the shortest path from any vertex. Let $n$ be the height of the graph. For all vertices in the last row $n$, we have

$$
d(v_{nj}) = g(v_{nj})
$$

For any arbitrary row $i$, we have:

$$
d(v_{ij}) =  g(v_{ij}) + \min\limits_{v_{i+1,k} \in N(v_{ij})} \{d(v_{i+1,k})\}
$$

Using this recursive relationship we can compute the shortest path row by row, starting at the bottom and moving all the way to the first row. Once we have all the shortest paths from every pixel in the top row, we just choose the shortest one.
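As a sketch (not the repository code), assuming the gradient magnitudes are already available in a 2D array g, the bottom-up computation of the table could look like this:

```java
// Sketch: g[i][j] holds the gradient magnitude of pixel (i, j) in an h x w image.
// seamCosts returns d[i][j], the cost of the cheapest seam from (i, j) to the bottom.
public class SeamCost {

    static double[][] seamCosts(double[][] g) {
        int h = g.length, w = g[0].length;
        double[][] d = new double[h][w];

        // Base case: the cost of a last-row vertex is its own gradient.
        d[h - 1] = g[h - 1].clone();

        // Fill the table bottom-up using the recursion from the text:
        // each vertex adds its gradient to the cheapest of its (up to 3) neighbors below.
        for (int i = h - 2; i >= 0; i--) {
            for (int j = 0; j < w; j++) {
                double best = d[i + 1][j];
                if (j > 0)     best = Math.min(best, d[i + 1][j - 1]);
                if (j < w - 1) best = Math.min(best, d[i + 1][j + 1]);
                d[i][j] = g[i][j] + best;
            }
        }
        return d;
    }
}
```

The cheapest seam then starts at the minimum entry of the top row of d.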

Putting it all together: The Seam Carving Algorithm 

The key to seam carving is finding low gradient paths through the picture and successively removing them:

Step 1: Compute the gradient at each point of the image
Step 2: Compute shortest paths using Dynamic Programming
Step 3: Remove the shortest path (the seam)
Repeat Steps 1 to 3 until the image has reached the desired dimensions.
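Step 3 could be sketched like this, given a cost table d computed by the dynamic program (array and method names are my own):

```java
// Sketch: trace the cheapest seam through the cost table d, then remove
// one pixel per row from the image.
public class SeamRemoval {

    static int[] traceSeam(double[][] d) {
        int h = d.length, w = d[0].length;
        int[] seam = new int[h];
        // Start at the cheapest entry in the top row.
        int j = 0;
        for (int c = 1; c < w; c++) if (d[0][c] < d[0][j]) j = c;
        seam[0] = j;
        // Follow the cheapest reachable entry in each subsequent row;
        // by the DP recursion this reconstructs an optimal seam.
        for (int i = 1; i < h; i++) {
            int best = j;
            if (j > 0 && d[i][j - 1] < d[i][best]) best = j - 1;
            if (j < w - 1 && d[i][j + 1] < d[i][best]) best = j + 1;
            seam[i] = j = best;
        }
        return seam;
    }

    static int[][] removeSeam(int[][] pixels, int[] seam) {
        int h = pixels.length, w = pixels[0].length;
        int[][] out = new int[h][w - 1];
        for (int i = 0; i < h; i++) {
            // Copy each row, skipping the seam pixel.
            System.arraycopy(pixels[i], 0, out[i], 0, seam[i]);
            System.arraycopy(pixels[i], seam[i] + 1, out[i], seam[i], w - seam[i] - 1);
        }
        return out;
    }
}
```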

I will demonstrate the algorithm on the following picture. Its initial dimensions are 350x263, and I want to reduce the width by 80 pixels and the height by 13.



Here is what the image would look like if I simply resized by changing the aspect ratio.


Here is what the picture looks like when I use Seam Carving. Note that the apples remain proportionate.
 

Here is a visual representation of the gradient, or energy, of the initial picture:


We can also visualize the seams that the algorithm removed by adding them back in the final picture one by one:

Further Applications of Seam Carving

A variation of the algorithm can be used to expand an image instead of shrinking it. We can also coerce the algorithm's behavior by providing user input to the image. For example, we can mark portions of the image that we want to keep intact and portions that we want completely removed. In the energy representation of the image, the regions to keep will appear as high energy and the areas to remove will appear as low energy.

I will probably explore these modifications in a future article. You are also welcome to clone my GitHub repo and play around with your own modifications!





Share:

Wednesday, June 17, 2015

K-Means Clustering for Image Compression

K-Means Clustering Overview
 
A long time ago I attended a Coursera course on Machine Learning that presented an interesting application of clustering to image compression. In this post I will present the K-means clustering algorithm and demonstrate how it can be used to color-compress an image.

K-means clustering is an algorithm for unsupervised learning that aims to identify clusters of similar objects in a data set. A clustering of objects into K clusters simply means partitioning the data set into K subsets. The algorithm assumes, like many others, that the objects in the data set can be directly mapped into a vector space equipped with a norm. The objective of the algorithm is to find a clustering such that objects belonging to the same cluster are close together and objects belonging to different clusters are distant.
Mathematically, given a partition of the data set $S$ into K disjoint subsets $S_1, S_2, \cdots, S_K$, the total intra-cluster distance is given by

$$
\sum_{k=1}^K\sum_{i,j \in S_k} d(x_i,x_j)
$$

where $d(x_i,x_j) := ||x_i-x_j||^2$ is the squared Euclidean distance between the two points. The objective of K-means clustering is to find the choice of clusters that minimizes the total intra-cluster distance. Finding an exact solution to this optimization problem becomes computationally infeasible for large numbers of samples, so K-means clustering proceeds by making an initial guess and then iteratively improving it. The details of the algorithm are as follows:

Initialization: Randomly select $K$ points from the data set of $n$ points and make them the cluster centers $\mu_1, \cdots, \mu_K$.
Repeat:
1. Cluster Assignment Step: For each point $x$ in the dataset, assign it to the cluster $k \in \{1,\cdots,K\}$ that minimizes the distance $d(x, \mu_k)$.
2. Recompute each cluster center by setting it to the mean of all points assigned to the cluster: $\mu'_k := \frac{1}{|S_k|}\sum_{i \in S_k}x_i$
Termination: The algorithm terminates when the centers converge, in other words, when the distance between the old and new centers is smaller than some specified threshold: $d(\mu'_k, \mu_k) < \epsilon, \forall k$.

Implementing K-Means Clustering

It is possible to find several implementations of K-means clustering in Java, for example in open source libraries such as Apache Commons Math, Weka, and Mahout. Invariably, what you will find when trying to use these implementations is that you have to make your data conform to each library's API. Here, I provide a simple implementation that operates on data in the form of double arrays. The source code is available on GitHub.
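As an illustration (a sketch under my own naming, not the exact repository code), the algorithm on double arrays could look like this:

```java
import java.util.Random;

// Sketch: K-means on n points of dimension dim, stored as double[n][dim].
public class KMeans {

    static double[][] cluster(double[][] points, int k, double eps, long seed) {
        int n = points.length, dim = points[0].length;
        Random rnd = new Random(seed);

        // Initialization: pick k points at random as the initial centers
        // (duplicate picks are possible in this simple sketch).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[rnd.nextInt(n)].clone();

        int[] assignment = new int[n];
        boolean converged = false;
        while (!converged) {
            // Cluster assignment step: nearest center for each point.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[i], centers[c]) < dist2(points[i], centers[best]))
                        best = c;
                assignment[i] = best;
            }
            // Recompute each center as the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;  // keep an empty cluster's old center
                for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
                if (dist2(sums[c], centers[c]) > eps * eps) converged = false;
                centers[c] = sums[c];
            }
        }
        return centers;
    }

    // Squared Euclidean distance between two points.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }
}
```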

Application to Image Compression
K-means clustering can be used in a few creative ways to compress image content. In this particular case compression is achieved by using only a subset of all available colors. The algorithm proceeds as follows:
  1. Reshape the image into a list of pixels of size width*height. Each pixel is a 3D vector corresponding to the RGB value of the image.
  2. Run K-means clustering with 16 clusters. The resulting centers correspond to the RGB values of the selected colors.
  3. Create a look-up table with the resulting colors.
  4. Create an image where each pixel is represented by its index into the look-up table of step 3. This means that each pixel can be represented with only 4 bits instead of the 24 required for the original RGB.
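Steps 3 and 4 reduce to a nearest-center lookup; as a sketch (names are my own):

```java
// Sketch: replace each RGB pixel with the index of its nearest palette color.
public class Palette {

    static int[] toIndices(double[][] pixels, double[][] palette) {
        int[] indices = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < palette.length; c++) {
                // Squared distance in RGB space.
                double s = 0;
                for (int d = 0; d < 3; d++) {
                    double diff = pixels[i][d] - palette[c][d];
                    s += diff * diff;
                }
                if (s < bestDist) { bestDist = s; best = c; }
            }
            indices[i] = best;  // fits in 4 bits when palette.length <= 16
        }
        return indices;
    }
}
```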
Original Picture occupies 983 KB

Compressed image with 16 colors occupies 75 KB



Share: