Tuesday, March 29, 2016

Calculating cosine similarity between each pair of vectors using SQL




Today we are going to show how to calculate the cosine similarity between two vectors using SQL.

The cosine similarity between two vectors is given by this formula:





where A and B are the respective vectors.


In order to calculate this formula, first we need to efficiently store the vectors into a relational database.

Thus, we need to store the value of dimensions for each vector. Our only entity is the vector with its properties, the dimensions. Thus we need to store this vector efficiently. The database schema should allow us to efficiently calculate the cosine distance between each pair of vectors. Many DBMS don't have an array data type in their official release. PostgreSQL has the array data type, but it does not have complex functionalities over this data type e.g. PRODUCT. Using the predefined functions: SUM, PRODUCT and SQRT, we can efficiently calculate the cosine similarities between the words, because these functions are optimized to handle operations on grouped rows. Thus our schema should be designed in a way that the queries can make use of these functions.

The model consists of these tables:

  • The vectors(vid, did, dimension) table. In this table we store the information about the dimensions of the vectors. For each vector we store the information regarding each dimension.
  • The distances(vid1,vid2, cosine) table. This is an optional table and stores all the information regarding the distance between the vectors. It is better to process the information and store it in this table, because the query used to calculate these distances requires a lot of time and resources.

The query used in the process of calculation of the cosine similarities between the given vectors is:


select v1.vid as vid1,v2.vid as vid2,SUM(v1.dimension*v2.dimension)/(SQRT(SUM(v1.dimension*v1.dimension))*SQRT(SUM(v2.dimension*v2.dimension))) as cosine
from vectors v1 inner join vectors v2 on v1.did=v2.did and v1.vid<>v2.vid
group by v1.vid,v2.vid



As we can see this query produces duplicates e.g. it calculates the cosine similarity between the vector with id 1 and the vector with id 2 and again it calculates the cosine similarity between the vector with id 2 and the vector with id 1.


To eliminate this problem we add a new WHERE condition into the query:


select v1.vid as vid1,v2.vid as vid2,SUM(v1.dimension*v2.dimension)/(SQRT(SUM(v1.dimension*v1.dimension))*SQRT(SUM(v2.dimension*v2.dimension))) as cosine
from vectors v1 inner join vectors v2 on v1.did=v2.did and v1.vid<>v2.vid
where v1.vid<v2.vid
group by v1.vid,v2.vid



2 comments:

  1. Hey, thanks for the tutorial, it's really helpful. Do you mind explaining what 'did' is in the vector schema? Thank you

    ReplyDelete
    Replies
    1. `did` is the dimension id: a unique number assigned to each dimension. Within the vector table this is a weak key i.e. it is not unique, but it is unique within the same vid (vector id).

      For example assume each vector has 2 dimensions `length`, `width` & `height` with did's 0, 1 and 2 respectively. Then in your vector table if you have 2 vectors with vid's 0 and 1 then each one would have entries for each of the 3 dimensions.

      So the rows of the vector table would be:
      [0,0,'length']
      [0,1,'width' ]
      [0,2,'height']
      [1,0,'length']
      [1,1,'width' ]
      [1,2,'height']

      Delete