The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 4.35 TB and takes about 4 minutes.
The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.
An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.
So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.
As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:
```sql
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
const topLevelMetrics = new Set(['performance']);

try {
  custom_metrics = JSON.parse(custom_metrics);
} catch {
  return {};
}

if (!custom_metrics) {
  return {};
}

const performance = JSON.stringify(custom_metrics.performance);
delete custom_metrics.performance;
const other = JSON.stringify(custom_metrics);

return {performance, other};
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS
SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'
```
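The splitting logic inside the UDF can also be exercised outside BigQuery as plain JavaScript. Here's a minimal sketch; the sample payload is made up for illustration and is much smaller than a real `custom_metrics` blob:

```javascript
// Split a custom_metrics JSON string into a core `performance` blob and an
// `other` blob containing everything else (mirrors the UDF body above).
function getCustomMetrics(customMetrics) {
  let parsed;
  try {
    parsed = JSON.parse(customMetrics);
  } catch {
    return {};
  }
  if (!parsed) {
    return {};
  }
  const performance = JSON.stringify(parsed.performance);
  delete parsed.performance;
  const other = JSON.stringify(parsed);
  return {performance, other};
}

// Hypothetical payload for illustration only.
const input = JSON.stringify({
  performance: {lcp_elem_stats: {nodeName: 'IMG'}},
  javascript: {ajax_requests: 12}
});
const result = getCustomMetrics(input);
console.log(result.performance); // {"lcp_elem_stats":{"nodeName":"IMG"}}
console.log(result.other);       // {"javascript":{"ajax_requests":12}}
```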
Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.
Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:
```sql
WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)
```
This query returns the same result, but only processes 123.78 GB in 53 sec. In other words, it processes 3% of the data in 19% of the time, with no loss of quality.
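As a quick sanity check on those percentages (assuming 1 TB = 1,000 GB):

```javascript
// Compare bytes processed and runtime before/after the schema change.
const beforeGB = 4350;          // 4.35 TB
const afterGB = 123.78;
const beforeSec = 4 * 60 + 33;  // 4 min 33 sec
const afterSec = 53;

console.log(`${(afterGB / beforeGB * 100).toFixed(1)}% of the data`);   // 2.8% of the data
console.log(`${(afterSec / beforeSec * 100).toFixed(1)}% of the time`); // 19.4% of the time
```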
I'd consider this and #189 (add rank to the requests table) and #149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.