The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 4.35 TB and takes about 4 minutes.
The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.
An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.
So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.
As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:
```sql
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
const topLevelMetrics = new Set(['performance']);

try {
  custom_metrics = JSON.parse(custom_metrics);
} catch {
  return {};
}

if (!custom_metrics) {
  return {};
}

const performance = JSON.stringify(custom_metrics.performance);
delete custom_metrics.performance;
const other = JSON.stringify(custom_metrics);

return {performance, other};
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS
SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'
```
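The splitting logic inside the UDF can also be exercised outside BigQuery as plain JavaScript. Here's a minimal sketch; the sample payload is made up for illustration and is much smaller than a real `custom_metrics` blob:

```javascript
// Split a custom_metrics JSON string into a core `performance` blob and an
// `other` blob containing everything else (mirrors the UDF body above).
function getCustomMetrics(customMetrics) {
  let parsed;
  try {
    parsed = JSON.parse(customMetrics);
  } catch {
    return {};
  }
  if (!parsed) {
    return {};
  }
  const performance = JSON.stringify(parsed.performance);
  delete parsed.performance;
  const other = JSON.stringify(parsed);
  return {performance, other};
}

// Hypothetical payload for illustration only.
const input = JSON.stringify({
  performance: {lcp_elem_stats: {nodeName: 'IMG'}},
  javascript: {ajax_requests: 12}
});
const result = getCustomMetrics(input);
console.log(result.performance); // {"lcp_elem_stats":{"nodeName":"IMG"}}
console.log(result.other);       // {"javascript":{"ajax_requests":12}}
```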
Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.
Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:
```sql
WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)
```
This query returns the same result, but only processes 123.78 GB in 53 sec. In other words, it processes 3% of the data in 19% of the time, with no loss of quality.
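As a quick sanity check on those percentages (assuming 1 TB = 1,000 GB):

```javascript
// Compare bytes processed and runtime before/after the schema change.
const beforeGB = 4350;          // 4.35 TB
const afterGB = 123.78;
const beforeSec = 4 * 60 + 33;  // 4 min 33 sec
const afterSec = 53;

console.log(`${(afterGB / beforeGB * 100).toFixed(1)}% of the data`);   // 2.8% of the data
console.log(`${(afterSec / beforeSec * 100).toFixed(1)}% of the time`); // 19.4% of the time
```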
I'd consider this and #189 (add rank to the requests table) and #149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.