Split the custom metrics JSON into structured fields #262

Open
rviscomi opened this issue Apr 9, 2024 · 1 comment

rviscomi commented Apr 9, 2024

The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 4.35 TB and takes about 4 minutes.

The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.

An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.

So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.


As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
  // Parse the blob. On malformed JSON, return an empty object.
  let metrics;
  try {
    metrics = JSON.parse(custom_metrics);
  } catch {
    return {};
  }

  // A JSON null (or NULL input) parses to null, so guard before reading properties.
  if (!metrics) {
    return {};
  }

  // Extract the core metric into its own field, then drop it from the rest.
  const performance = JSON.stringify(metrics.performance);
  delete metrics.performance;

  // Everything that remains goes into the catch-all blob.
  const other = JSON.stringify(metrics);

  return {
    performance,
    other
  };
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'
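
The proof of concept above extracts only performance. As a rough sketch of how the same logic might generalize to several core custom metrics, here is a standalone JavaScript version of the UDF body. The function name splitCustomMetrics, the default list of core metric names, and the sample values are assumptions for illustration only; in BigQuery, the RETURNS STRUCT declaration would presumably also need a matching STRING field for each name.

```javascript
// Hypothetical helper (not part of the original proof of concept): splits a
// custom metrics JSON string into one string field per core metric, plus a
// catch-all "other" field holding everything else.
function splitCustomMetrics(
  customMetricsJson,
  coreMetrics = ['javascript', 'media', 'performance'] // assumed core subset
) {
  let metrics;
  try {
    metrics = JSON.parse(customMetricsJson);
  } catch {
    return {};
  }
  if (!metrics || typeof metrics !== 'object') {
    return {};
  }

  const result = {};
  for (const name of coreMetrics) {
    // JSON.stringify(undefined) is undefined, so absent metrics stay unset.
    result[name] = JSON.stringify(metrics[name]);
    delete metrics[name];
  }
  // Whatever was not extracted above remains in the "other" blob.
  result.other = JSON.stringify(metrics);
  return result;
}

// Made-up sample input; only lcp_elem_stats.nodeName mirrors a real field
// used in the example query in this issue.
const sample = JSON.stringify({
  performance: {lcp_elem_stats: {nodeName: 'IMG'}},
  media: {example_value: 0},
  some_other_metric: {example_value: 200}
});
const split = splitCustomMetrics(sample);
console.log(split.performance);
console.log(split.other);
```

Driving the extraction from a list keeps the core subset in one place, so promoting another custom metric out of other would be a one-line change in the function body.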

Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.

Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:

WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)

This query returns the same result but processes only 123.78 GB in 53 sec. That is about 3% of the data in about 19% of the time, with no loss of quality.
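
For reference, the percentages can be re-derived from the figures reported in this issue. This is only a quick arithmetic check; it assumes 1 TB = 1000 GB (a binary conversion of 1024 gives nearly the same ratio).

```javascript
// Figures reported in this issue for the 2024-03-01 dataset.
// Assumption: 1 TB = 1000 GB.
const before = {gigabytes: 4.35 * 1000, seconds: 4 * 60 + 33};
const after = {gigabytes: 123.78, seconds: 53};

// Fraction of data processed and of wall-clock time, new schema vs. old.
const dataRatio = after.gigabytes / before.gigabytes;
const timeRatio = after.seconds / before.seconds;

console.log(`${(dataRatio * 100).toFixed(1)}% of the data`); // 2.8% of the data
console.log(`${(timeRatio * 100).toFixed(1)}% of the time`); // 19.4% of the time
```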

rviscomi commented Apr 9, 2024

I'd consider this, #189 (add rank to the requests table), and #149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.
