Ben Gorman

Ben Gorman

Life's a garden. Dig it.

You operate an online business called tindercoach.com where you give people advice on their Tinder profiles ❤️‍🔥. You have a dictionary of visits indicating how many times each visitor_id visited each page on your site.

import random
import string
from collections import defaultdict
 
Npages = 10
Nvisitors = 10
Nvisits = 100
 
random.seed(2357)
visitor_ids = list(set(random.randint(1000, 9999) for i in range(Nvisitors)))
pages = list(set('tindercoach.com/' + ''.join(random.choices(string.ascii_lowercase, k=3)) for i in range(Nvisits)))
 
visits = defaultdict(int)
for i in range(Nvisits):
    key = (random.choice(visitor_ids), random.choice(pages))
    visits[key] += 1
 
print(visits)
# defaultdict(<class 'int'>, {
#  (3654, 'tindercoach.com/bgr'): 1, 
#  (1443, 'tindercoach.com/nky'): 1, 
#  (3654, 'tindercoach.com/wpb'): 1, 
#  ..., 
#  (3181, 'tindercoach.com/jam'): 1, 
#  (5502, 'tindercoach.com/cjp'): 1, 
#  (5502, 'tindercoach.com/tjk'): 1
# })

Convert visits into a Compressed Sparse Column (CSC) matrix where element (i,j) stores the number of times visitor i visited page j.

Then print the sub-matrix showing how many times visitors 1443, 6584, and 7040 visited pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.


Solution

import numpy as np
from scipy import sparse
 
# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
  row_keys[val] = idx
 
# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
  col_keys[val] = idx
 
# determine the row & col index for each data point
row_idxs = []
col_idxs = []
for key in visits.keys():
    visitor_id = key[0]
    page = key[1]
    row_idxs.append(row_keys[visitor_id])
    col_idxs.append(col_keys[page])
 
# Build the csc matrix
data = list(visits.values())
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
 
# print subset
print_visitor_ids = [3654, 1443, 3654]
print_pages = ['tindercoach.com/bgr', 'tindercoach.com/nky', 'tindercoach.com/wpb']
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
print(mat[np.ix_(print_row_idxs, print_col_idxs)].todense())
# [[0 1 0]
# [2 0 1]
# [1 0 0]]

Explanation

Our strategy is to build three lists:

  • data: contains the non-zero elements of the matrix
  • row_idxs: contains the row index of each non-zero element
  • col_idxs: contains the column index of each non-zero element

Then we can instantiate a CSC matrix with sparse.csc_matrix((data, (row_idxs, col_idxs))).

  1. Build mappings for visitor_id:row_index and page:col_index.

    # build visitor_id:row_index mapping
    row_keys = {}
    for idx, val in enumerate(visitor_ids):
      row_keys[val] = idx
     
    print(row_keys)
    # {
    #  7040: 0, 
    #  4545: 1, 
    #  ..., 
    #  6584: 8, 
    #  5502: 9
    # }
     
    # build page:col_index mapping
    col_keys = {}
    for idx, val in enumerate(pages):
      col_keys[val] = idx
     
    print(col_keys)
    # {
    #  'tindercoach.com/gqy': 0, 
    #  'tindercoach.com/yez': 1, 
    #  ..., 
    #  'tindercoach.com/eaw': 98, 
    #  'tindercoach.com/eow': 99
    # }

    These mappings allow us to get the row / column index for any visitor_id / page.

  2. Determine the row and column index for each data point.

    row_idxs = []
    col_idxs = []
    for key in visits.keys():
        visitor_id = key[0]                   
        page = key[1]                         
        row_idxs.append(row_keys[visitor_id]) 
        col_idxs.append(col_keys[page])       
  3. Build the csc_matrix.

    data = list(visits.values())  
    mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
  4. Print the sub matrix showing visitors 1443, 6584, and 7040 and pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.

    We can fetch the appropriate row and column indices as follows:

    print_visitor_ids = [1443, 6584, 7040]
    print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
     
    print_row_idxs = [row_keys[x] for x in print_visitor_ids]
    print_col_idxs = [col_keys[x] for x in print_pages]
     
    print(print_row_idxs)
    # [2, 8, 0]
     
    print(print_col_idxs)
    # [38, 17, 86]

    To fetch the sub matrix indexed by these row and column indices, we can do

    submat = mat[np.ix_(print_row_idxs, print_col_idxs)]
     
    print(submat.todense())
    # [[0 1 0]
    #  [2 0 1]
    #  [1 0 0]]

    If we simply did mat[print_row_idxs, print_col_idxs], scipy would fetch three elements from the matrix; the elements at positions: (2, 38), (8, 17), and (0, 86). This is consistent with NumPy array indexing behavior, but it's not the behavior we desire.

    Rather, we want to fetch all combinations of (row, col) indices from our lists print_row_idxs and print_col_idxs. (9 combinations in total, forming a 3x3 sub matrix.)

    We use np.ix_() to accomplish this. np.ix_(print_row_idxs, print_col_idxs) constructs an open mesh from the input lists / arrays.

    np.ix_(print_row_idxs, print_col_idxs)
    # (array([[2],
    #        [8],
    #        [0]]), array([[38, 17, 86]]))

    When these arrays are used to index mat, they are essentially broadcasted into the 3x3 i and j index arrays.

    i index array (after broadcasting)
    [[2 2 2]
      [8 8 8]
      [0 0 0]]
     
    j index array (after broadcasting)
    [[38 17 86]
      [38 17 86]
      [38 17 86]]