
Build a CrUX origin search tool #3

Open
rviscomi opened this issue May 23, 2020 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@rviscomi
Member

rviscomi commented May 23, 2020

As of now, there are 18,352,960 distinct origins in the CrUX BigQuery dataset since October 2017. That's a lot of websites, but clearly not all of the websites out in the wild. One of the common problems I see from CrUX users is that they're not sure if their websites' origins exist in the dataset.

I'd like to design a tool to help CrUX users quickly and easily discover origins in the dataset and make it clear when a particular origin is not found. Here's how I envision the UX working:

  • lightweight web app
  • prominent text input field
  • autosuggest known origins as you type
  • selecting a known origin will give you options for viewing its CrUX data
    • deep-linked custom CrUX Dashboard for that origin
    • deep-linked PageSpeed Insights report
    • customized BigQuery SQL snippet
  • if an origin is not found, offer boilerplate suggestions
    • ensure it is the canonical origin users actually visit (e.g. https not http, and the correct www or bare-domain variant)
    • ensure it is publicly accessible (suggest Lighthouse SEO indexability audits)
    • ensure the origin receives a healthy amount of traffic (exact thresholds can't be given)
    • use first party RUM to more closely observe UX in the field
  • explain what an "origin" is (protocol + subdomain + domain, no path)
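The origin reduction in that last bullet can be sketched in a few lines of Python (an illustration only; the function name is mine, and a production version would also normalize case and default ports):

```python
from urllib.parse import urlsplit

def to_origin(url: str) -> str:
    """Reduce a full URL to its origin: protocol + host, no path or query."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

print(to_origin("https://www.google.com/search?q=crux"))  # https://www.google.com
```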

I imagine the backend will work by using a fast, in-memory storage solution for the ~18M origins. In total, the data is a bit over 500 MB. However, more advanced/faster search functionality (e.g. n-grams) might require more storage space. An autosuggest endpoint will take the user's input, scan the origin list, and return matches as JSON. The list of origins can be refreshed monthly by mirroring the chrome-ux-report:materialized.origin_summary table on BigQuery.
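A minimal sketch of that endpoint's core (names and sample data are hypothetical; a real version would sit behind an HTTP handler and hold all ~18M origins):

```python
import json

# Hypothetical in-memory store; in production it would be refreshed monthly
# from the chrome-ux-report:materialized.origin_summary table.
ORIGINS = [
    "https://www.google.com",
    "https://mail.google.com",
    "https://example.com",
]

def autosuggest(query: str, limit: int = 20) -> str:
    """Naive substring scan over the in-memory list, returned as JSON."""
    matches = [o for o in ORIGINS if query in o][:limit]
    return json.dumps({"query": query, "origins": matches})

print(autosuggest("google"))
```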

Finding matches is the magic part. If a user types google it should return origins whose domain name (eTLD+1) starts with google, like https://www.google.com or https://mail.google.com or https://www.google.co.jp. It should also return matches whose host names (eTLD+2) are prefixed by the query, for example mail should return https://mail.google.com or https://mail.yahoo.com. Searches starting with the protocol (http[s]://) should only match origins prefixed with that input, like https://example.com matching a search for https://ex. I think this can be simplified to a regular expression where the user input is preceded by a boundary character \b, but the backend might need to tokenize origins into host names and domain names instead for performance. My goal is for the median autosuggestion to complete buttery smooth in under 100ms from the user's keyup to suggestion rendered.
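The `\b`-boundary simplification can be prototyped directly (the origin list here is hypothetical sample data standing in for the full dataset):

```python
import re

ORIGINS = [
    "https://www.google.com",
    "https://mail.google.com",
    "https://www.google.co.jp",
    "https://mail.yahoo.com",
    "https://example.com",
]

def matches(query: str, origins: list[str]) -> list[str]:
    # \b fires wherever a word character follows a non-word character (or
    # the string start), i.e. after "://" and after each ".", which
    # approximates "starts a host-name or domain-name token".
    pattern = re.compile(r"\b" + re.escape(query))
    return [o for o in origins if pattern.search(o)]

print(matches("mail", ORIGINS))    # ['https://mail.google.com', 'https://mail.yahoo.com']
print(matches("google", ORIGINS))  # ['https://www.google.com', 'https://mail.google.com', 'https://www.google.co.jp']
```

As noted above, a tokenized index over host names and domain names would likely be needed to hit the sub-100ms target instead of scanning 18M strings per keystroke.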

For demonstration purposes, a really naive implementation would be for the backend to query the BigQuery origin_summary table directly:

SELECT
  origin
FROM
  `chrome-ux-report.materialized.origin_summary`
WHERE
  REGEXP_CONTAINS(origin, CONCAT(r'\b', @input))
LIMIT
  20

This query processes 505 MB in 4.7 seconds, which is obviously not fast enough for a production-ready solution, but it demonstrates the simplest possible approach.
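One detail regardless of backend: the raw user input should be regex-escaped before it is combined with the boundary pattern, otherwise metacharacters in the query are interpreted (e.g. "google.co" would also match "googleXco"). A sketch, with a hypothetical helper name:

```python
import re

def boundary_pattern(user_input: str) -> str:
    """Escape regex metacharacters in raw user input before it is sent
    to the backend as the @input query parameter."""
    return r"\b" + re.escape(user_input)

print(boundary_pattern("google.co"))  # \bgoogle\.co
```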

Any technology recommendations for the backend of the app?

@rviscomi rviscomi changed the title Built a CrUX origin search tool Build a CrUX origin search tool May 23, 2020
@rviscomi rviscomi added the help wanted Extra attention is needed label May 23, 2020
@rviscomi
Member Author

Some tips for a good autocomplete UX: https://blog.algolia.com/search-autocomplete-on-mobile/

@peter-up

peter-up commented Nov 7, 2023

Two solutions come to mind:

  1. Relational database (SQL): the cheaper option; queries should complete within 100 ms.
    The table design:

    Column name   Index type
    id            primary key
    prefix        index
    origin        -

To insert an origin like https://mail.google.com, insert the following records:

    prefix                    origin
    mail.google.com           https://mail.google.com
    google.com                https://mail.google.com
    https://mail.google.com   https://mail.google.com

To search for xxx:

    select * from table where prefix >= 'xxx' and prefix < 'xxx{'

  2. In-memory database (Redis): better performance.
    Sorted-set scores in Redis are numeric, so the origin can't be carried in the score; instead, keep every member at score 0 and embed the origin in the member itself, delimited from its prefix:

    key, "https://mail.google.com" (member, score 0)
    key, "mail.google.com|https://mail.google.com" (member, score 0)
    key, "google.com|https://mail.google.com" (member, score 0)

To search for xxx, run a lexicographic range query, zrangebylex key "[xxx" "(xxx{", and split each returned member on "|" to recover the origin.
https://redis.io/commands/zrangebylex/
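Both suggestions rely on the same half-open range trick: "{" is the first ASCII character after "z", so the range [query, query + "{") captures every string prefixed by the query (valid here because origin characters all sort below "{"). A runnable sketch of suggestion 1 using in-memory SQLite, with a naive last-two-labels split standing in for eTLD+1 (a real implementation would use the Public Suffix List):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origins (prefix TEXT, origin TEXT)")
conn.execute("CREATE INDEX idx_prefix ON origins (prefix)")

def insert_origin(origin: str) -> None:
    """Index an origin under each searchable prefix, as in suggestion 1."""
    host = origin.split("://", 1)[1]          # mail.google.com
    domain = ".".join(host.split(".")[-2:])   # google.com (naive eTLD+1)
    for prefix in {host, domain, origin}:     # set dedupes host == domain
        conn.execute("INSERT INTO origins VALUES (?, ?)", (prefix, origin))

def search(query: str, limit: int = 20) -> list[str]:
    # Half-open range [query, query + "{") == "prefix starts with query",
    # answered entirely from the index on prefix.
    rows = conn.execute(
        "SELECT DISTINCT origin FROM origins "
        "WHERE prefix >= ? AND prefix < ? LIMIT ?",
        (query, query + "{", limit),
    )
    return [r[0] for r in rows]

insert_origin("https://mail.google.com")
insert_origin("https://mail.yahoo.com")
print(sorted(search("mail")))  # ['https://mail.google.com', 'https://mail.yahoo.com']
```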
