roam/daily/2022-03-29.org
2022-04-23 00:39:06 -04:00

2022-03-29

Get list of subscribers with non-normalized tags on active accounts

Investigating the impact of the new Tag Normalization rules on existing subscribers on active accounts.

Gathering data

I imported a dump of the subscriber_tags table from AppDB, along with the list.subscribers rows for all active accounts (~SELECT s.* FROM list.subscribers s JOIN accounts a ON (a.a_id = s.account_id) WHERE a.status_id < 7~).

I then built a table of subscribers having tags that do not match our validation rules.

  -- One row per tag that differs from its normalized form.
  CREATE TABLE invalid_tags AS
  SELECT s.list_id, s.account_id, t.subscriber_id, tag
  FROM subscribers s
  JOIN subscriber_tags AS t ON (s.id = t.subscriber_id)
     , unnest(tags) AS tag
  WHERE tag != normalize_tag(tag);
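
The definition of normalize_tag is not recorded in these notes. As a rough sketch only, the rules listed in the breakdown further down could be combined like this. The name normalize_tag_sketch is hypothetical, and whether the real function removes or replaces offending characters is an assumption.

```sql
-- Hypothetical sketch, not the real normalize_tag.
-- Assumes offending characters are dropped rather than replaced.
CREATE OR REPLACE FUNCTION normalize_tag_sketch(tag text) RETURNS text AS $$
  SELECT lower(                                   -- fold upper-case characters
           btrim(                                 -- strip leading/trailing spaces
             regexp_replace(                      -- collapse whitespace runs to one space
               regexp_replace(                    -- drop non-printables, commas, quotes
                 tag,
                 '[^[:print:][:space:]]|[,''"‘’“”]', '', 'g'),
               '[[:space:]]+', ' ', 'g')))
$$ LANGUAGE sql IMMUTABLE;

SELECT normalize_tag_sketch('  "Hello,   World"  ');  -- hello world
```

The cleanup runs before trimming so that whitespace left behind by a removed comma or quote is still collapsed.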

Active accounts

  SELECT COUNT(DISTINCT account_id) FROM subscribers
count
103,357

Subscribers on active accounts

  SELECT COUNT(id) FROM subscribers
count
259,745,858

Subscribers with invalid tags

  SELECT COUNT(DISTINCT subscriber_id) FROM invalid_tags
count
1,331,220

/correlr/roam/media/commit/26fa2f81176946116dc94dd448704178d388fbb0/daily/2022-03-29-subscribers-with-invalid-tags.png

Accounts with subscribers with invalid tags

  SELECT COUNT(DISTINCT account_id) FROM invalid_tags;
count
3,220

/correlr/roam/media/commit/26fa2f81176946116dc94dd448704178d388fbb0/daily/2022-03-29-accounts-with-invalid-tags.png
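
For scale, the counts above can be turned into shares of the active population. A quick sketch using the figures already reported (the column aliases are illustrative):

```sql
-- Share of active subscribers and active accounts affected,
-- using the counts reported above.
SELECT round(100.0 * 1331220 / 259745858, 2) AS pct_subscribers  -- 0.51
     , round(100.0 * 3220 / 103357, 2)       AS pct_accounts;    -- 3.12
```

So roughly 0.5% of subscribers and about 3.1% of active accounts have at least one tag that would change under normalization.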

Normalized tag breakdown

  SELECT 'Non-printable characters' AS "Rule"
       , COUNT(DISTINCT account_id) AS "Accounts"
       , COUNT(subscriber_id) AS "Subscribers"
  FROM invalid_tags
  WHERE tag ~ '[^[:print:]]'
  UNION SELECT 'Commas' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
        WHERE tag ~ ','
  UNION SELECT 'ASCII quotation marks' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
        WHERE tag ~ '[''""]'
  UNION SELECT 'Unicode quotation marks' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
        WHERE tag ~ '[‘’“”]'
  UNION SELECT 'Leading or trailing whitespace' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
        WHERE TRIM(tag) != tag
  UNION SELECT 'Repeated whitespace' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
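        -- The POSIX class must sit inside a bracket expression: a bare [:space:]
        -- matches any of the literal characters ':', 's', 'p', 'a', 'c', 'e',
        -- which inflates the count (any tag containing "ss" or "ce" would match).
        WHERE TRIM(tag) ~ '[[:space:]]{2,}'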
  UNION SELECT 'Upper-case characters' AS "Rule"
             , COUNT(DISTINCT account_id) AS "Accounts"
             , COUNT(subscriber_id) AS "Subscribers"
        FROM invalid_tags
        WHERE LOWER(tag) != tag
| Rule                           | Accounts | Subscribers |
|--------------------------------+----------+-------------|
| Leading or trailing whitespace |      119 |      66,788 |
| Repeated whitespace            |    2,404 |   1,234,651 |
| Unicode quotation marks        |      126 |      21,343 |
| Commas                         |      378 |      54,567 |
| ASCII quotation marks          |    2,507 |   1,544,607 |
| Upper-case characters          |        0 |           0 |
| Non-printable characters       |       58 |       1,749 |

/correlr/roam/media/commit/26fa2f81176946116dc94dd448704178d388fbb0/daily/2022-03-29-invalid-tag-breakdown.png
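
One caveat on the "Subscribers" column: invalid_tags holds one row per offending tag, so COUNT(subscriber_id) counts tag rows rather than people. This shows in the ASCII quotation marks figure (1,544,607), which exceeds the 1,331,220 distinct subscribers with invalid tags. A possible variant that counts each subscriber at most once per rule, in a single pass, is sketched below. It assumes PostgreSQL 9.4 or later for FILTER, the column aliases are illustrative, and the whitespace class is written in its bracketed form [[:space:]].

```sql
-- Sketch: distinct subscribers per rule, one scan of invalid_tags.
SELECT COUNT(DISTINCT subscriber_id) FILTER (WHERE tag ~ '[^[:print:]]')            AS non_printable
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE tag ~ ',')                       AS commas
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE tag ~ '[''"]')                   AS ascii_quotes
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE tag ~ '[‘’“”]')                  AS unicode_quotes
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE TRIM(tag) != tag)                AS edge_whitespace
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE TRIM(tag) ~ '[[:space:]]{2,}')   AS repeated_whitespace
     , COUNT(DISTINCT subscriber_id) FILTER (WHERE LOWER(tag) != tag)               AS upper_case
FROM invalid_tags;
```

A subscriber can still appear under several rules, so the per-rule figures are not expected to sum to the overall total.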