Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[uplink] Download failing for JavaSystemProperties with error: missing chunk number 0 for toast value xx in pg_toast_xxx #3883

Open
dduportal opened this issue Jan 2, 2024 · 18 comments
Assignees
Labels

Comments

@dduportal
Copy link
Contributor

dduportal commented Jan 2, 2024

Service(s)

Uplink

Summary

As reported by @daniel-beck in IM discussion (ref. https://matrix.to/#/!JLUOInpEYmxJIYXlzs:matrix.org/$VlAjMDeZOPjLhMwkevPtM7P1zklTyH_wvJ9eXfpkHaE?via=g4v.dev&via=gitter.im&via=matrix.org):

Hi folks, I am unable to download the JavaSystemProperties telemetry trial data from uplink.jenkins.io, downloads abort. Other trials' data can be downloaded. Do we have data(base) problems? Anything interesting being logged?
(...)
Specifically, any date range including Dec 24 fails. Earlier and later work.
(...)
Correction, 2023-12-31 also fails.

Reproduction steps

  • With an authorized GitHub account, log-in to uplink.jenkins.io
  • Open the logs in "follow" mode on publick8s (kubectl -n uplink logs -l 'app.kubernetes.io/name=uplink' -f)
  • Select the jenkins.telemetry.impl.JavaSystemProperties type and the 2023-12-24 to 2023-12-25 time range and click "Export"
@dduportal dduportal added the triage Incoming issues that need review label Jan 2, 2024
@dduportal dduportal self-assigned this Jan 2, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Jan 2, 2024
@dduportal dduportal added this to the infra-team-sync-2024-01-02 milestone Jan 2, 2024
@dduportal
Copy link
Contributor Author

Reproduced the error:

  • The download stop on client side with an error
  • The logs shows the following error message:
error: missing chunk number 0 for toast value 21681973 in pg_toast_17895
    at Connection.parseE (/srv/uplink/node_modules/pg/lib/connection.js:553:11)
    at Connection.parseMessage (/srv/uplink/node_modules/pg/lib/connection.js:378:19)
    at Socket.<anonymous> (/srv/uplink/node_modules/pg/lib/connection.js:119:22)
    at Socket.emit (events.js:180:13)
    at Socket.emit (domain.js:422:20)
    at addChunk (_stream_readable.js:274:12)
    at readableAddChunk (_stream_readable.js:261:11)
    at Socket.Readable.push (_stream_readable.js:218:10)
    at TCP.onread (net.js:581:20)

@dduportal
Copy link
Contributor Author

Triggered a REINDEX of the table events (will take some time and resources) from a screen named helpdesk-3883 on the VPN private machine)

@dduportal
Copy link
Contributor Author

dduportal commented Jan 8, 2024

Triggered a REINDEX of the table events (will take some time and resources) from a screen named helpdesk-3883 on the VPN private machine)

Update:

  • REINDEX table events; finished ith success (step 1/3) after ~6 hours
  • Started the pg_toast re-index now (step 2/3):
> select reltoastrelid::regclass from pg_class where relname = 'events';
      reltoastrelid      
-------------------------
 pg_toast.pg_toast_17895
> REINDEX table pg_toast.pg_toast_17895;

TODO:

  • Run an anlayse of possible cleanups on the table events with VACUUM analyze events; (step 3/3)
  • Check if it works otherwise run the same steps on the 4 other tables

@dduportal
Copy link
Contributor Author

Update:

  • Started the VACUUM analyze events; step (in progress)

@dduportal
Copy link
Contributor Author

Update:

@dduportal dduportal modified the milestones: infra-team-sync-2024-01-09, infra-team-sync-2024-01-16 Jan 10, 2024
@dduportal
Copy link
Contributor Author

Update: still trying to find the corrupted table segment

@dduportal
Copy link
Contributor Author

dduportal commented Jan 11, 2024

Update:

> select reltoastrelid::regclass from pg_class where relname = 'events';
reltoastrelid      
-------------------------
pg_toast.pg_toast_17895

> select count(*) from events;
  count
----------
 34673551
(1 row)

# Some retries
> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

=> now to use dichotomy to locate the failing segment

@dduportal
Copy link
Contributor Author

Update: dichotomy

> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2500000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1250000 offset 30000000; # 30000000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 625000 offset 30000000; # 30000000 -> 30625000
(no error)


> select * from events order by id limit 625000 offset 30625000; # 30625000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 312500 offset 30625000; # 30625000 -> 30937500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

@smerle33 smerle33 modified the milestones: infra-team-sync-2024-01-16, infra-team-sync-2024-01-23 Jan 16, 2024
@dduportal
Copy link
Contributor Author

Update:

> select * from events order by id limit 156250 offset 30625000; # 30625000 -> 30781250
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 35000 offset 30625000; # 30625000 -> 30660000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10000 offset 30625000; # 30625000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5000 offset 30625000; # 30625000 -> 30630000
No error

Almost there!

@dduportal
Copy link
Contributor Author

dduportal commented Jan 18, 2024

Update:

> select * from events order by id limit 5000 offset 30630000; # 30630000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895


> select * from events order by id limit 2500 offset 30630000; # 30630000 -> 30632500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1000 offset 30630000; # 30630000 -> 30631000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 500 offset 30630000; # 30630000 -> 30630500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 100 offset 30630000; # 30630000 -> 30630100
No error

> select * from events order by id limit 100 offset 30630100; # 30630100 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
# In progress

@dduportal
Copy link
Contributor Author

Finally!

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
No error

> select * from events order by id limit 50 offset 30630150; # 30630150 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 25 offset 30630150; # 30630150 -> 30630175
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10 offset 30630150; # 30630150 -> 30630160
No error

> select * from events order by id limit 10 offset 30630160; # 30630160 -> 30630170
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5 offset 30630160; # 30630160 -> 30630165
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

@dduportal
Copy link
Contributor Author

Update:

  • Removed the line detected as corrupted:
> select id from events order by id limit 1 offset 30630160;
id
----------
 30648884
(1 row)

> delete from events where id = 30648884;
DELETE 1
uplink=> select id from events order by id limit 1 offset 30630160;
    id
----------
 30648885
(1 row)
  • The download goes further until 3Mb (I was blocked around 1.4Mb) but still errors. Let's check for other corrupted entries

@dduportal
Copy link
Contributor Author

Here we go again, let's fin the other corrupted records (same screen session, same VM):

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
# WiP

@dduportal dduportal modified the milestones: infra-team-sync-2024-01-23, infra-team-sync-2024-01-30 Jan 24, 2024
@dduportal
Copy link
Contributor Author

Deleted a few more (corrupted) records and ran a VACUUM:

> delete from events where id = 30648885;
DELETE 1
> delete from events where id = 30648886;
DELETE 1
> delete from events where id = 30648887;
DELETE 1
> delete from events where id = 30648888;
DELETE 1

> VACUUM analyze events;

=> download goes further but still errors. Had to reboot the machine (and loose the screen session) so the research of other corrupted records will have to wait after FOSDEM.

@daniel-beck
Copy link

FWIW this particular data collection is now (mostly) finished. It was how I got file.encoding stats for the advisory. The other system properties might be interesting but not an immediate concern.

Since it's not just one record it might be interesting to find out what's happening here though, to prevent it from occurring in the future.

@smerle33 smerle33 modified the milestones: infra-team-sync-2024-01-30, infra-team-sync-2024-02-13 Jan 30, 2024
@dduportal
Copy link
Contributor Author

Reseach for corrupted records still running

@dduportal
Copy link
Contributor Author

dduportal commented Feb 20, 2024

Update:

  • Found another corrupted record and deleted it:

    uplink=> select * from events order by id limit 1 offset 30630160;
    ERROR:  missing chunk number 0 for toast value 19019693 in pg_toast_17895
    uplink=> select id from events order by id limit 1 offset 30630160;
       id
    ----------
    30648910
    (1 row)
    
    uplink=> delete from events where id = 30648910;
    DELETE 1
    
  • However, a new error message appears:

    select * from events order by id limit 1 offset 30630160;
    ERROR:  unexpected chunk number 1 (expected 0) for toast value 19019698 in pg_toast_17895
    
  • Started a new reindex + vacuum yesterday around 18:16: REINDEX table events; REINDEX table pg_toast.pg_toast_17895; VACUUM analyze events; trying to fix this new error (by regenerating the toast index after records deletion). This taks 10 to 14 hours usually.

  • But today we had an incident on uplink.jenkins.io:

@dduportal
Copy link
Contributor Author

Of course:

> REINDEX table pg_toast.pg_toast_17895;
ERROR:  permission denied for schema pg_toast

because Azure Flexible Server does NOT provide super user access.

Gotta check https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-perform-fullvacuum-pg-repack

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants