Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed gravity update loses whole DB / configuration #5168

Open
Zocker1999NET opened this issue Feb 11, 2023 · 21 comments
Open

Failed gravity update loses whole DB / configuration #5168

Zocker1999NET opened this issue Feb 11, 2023 · 21 comments
Assignees
Labels
Bug: Confirmed Verified as a bug to be resolved

Comments

@Zocker1999NET
Copy link

Zocker1999NET commented Feb 11, 2023

Versions

  • Pi-hole: v5.15.5
  • AdminLTE: v5.18.4
  • FTL: v5.21

Platform

  • OS and version: Armbian Buster (Debian Buster but with custom kernel for Banana Pi)
  • Platform: Banana Pi M2 Berry (comparable to Raspberry Pi 2)

Expected behavior

When the update of the gravity fails due to a full /tmp filesystem, I expect the update to fail.
I expect that Pi-Hole can resume working with the old gravity version.
I also expect that Pi-Hole keeps its configuration (which should be stored in /etc/pihole, which had plenty of free storage left, ca. 1.7 GiB).

Actual behavior / bug

Pi-Hole losts its full configuration (at least groups and configured adlists) and the "blocked domains" counter dropped to 0. Any restart of Pi-Hole did not restore the old database.

Steps to reproduce

Steps to reproduce the behavior:

  1. Add an example group & add a (preferably small) adlist.
  2. Update the gravity successfully (because of small adlist, /tmp was sufficient for now)
  3. Add a whole bunch of adlists (e.g. all from here)
  4. Update the gravity again, in my case with pihole -g (because of huge adlists, the gravity update "hopefully fails")
  5. See that the "blocked domains" counter dropped to 0 and Pi-Hole forgot all adlists (including the working one before) and all groups (including the example one).

Debug Token

Screenshots

all screenshots

Before Failed Update

Mind the working list & already added (but not yet downloaded) new adlists:
image
image
image
image

During Update

Screenshot_20230211_102114

After Failed Update

image
image
image

Additional context

I continuously monitored the free storage of my root and temp filesystem during the gravity update. Only the temp filesystem was fulled, not the root one.

If it wasn't clear enough from above: The bug report is not about that the gravity update failed due to a full /tmp. The bug report is about that Pi-Hole losts / forgot huge parts of its configuration without being able to restore it (automatically).

part of logs during failed gravity update

logs from during failed gravity update
  [i] Neutrino emissions detected...                                                                                                                                                                                                                                                                                          
  [✓] Pulling blocklist source list into range                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                              
  [✓] Preparing new gravity database                                                                                                                                                                                                                                                                                          
  [i] Using libz compression                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://dbl.oisd.nl/                                                                                                                                                                                                                                                                                            
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 889998 domains                                                                                                                                                                                                                                                                                                 
  [i] List has been updated                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts                                                                                                                                                                                                                                                
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 177888 domains, ignoring 1 non-domain entries                                                                                                                                                                                                                                                                  
      Sample of non-domain entries:                                                                                                                                                                                                                                                                                           
        - 0.0.0.0                                                                                                                                                                                                                                                                                                             
  [i] List has been updated                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.github.com/notracking/hosts-blocklists/master/hostnames.txt                                                                                                                                                                                                                                         
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 227412 domains                                                                                                                                                                                                                                                                                                 
                                   
<<< logs continued with success messages like that >>>

  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/pornblock2                                                                                                                                                                                                                                
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 1700000 domains                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/pornblock3                                                                                                                                                                                                                                
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 1700000 domains                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/pornblock4                                                                                                                                                                                                                                
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Imported 1700000 domains                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/pornblock5                                                                                                                                                                                                                                
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
sed: couldn't write 31 items to stdout: No space left on device                                                                                                                                                                                                                                                               
/opt/pihole/gravity.sh: line 538: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 539: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 540: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 540: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 568: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 571: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
/opt/pihole/gravity.sh: line 571: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
  [i] Imported 84842 domains                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                              
  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/pornblock6                                                                                                                                                                                                                                
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device                                                                                                                                                                                                                          
  [✓] Status: Retrieval successful                                                                                                                                                                                                                                                                                            
  [i] Received empty file                                                                                                                                                                                                                                                                                                     
  [✗] List download failed: no cached list available

  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/gambling
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device
  [✓] Status: Retrieval successful
  [i] Received empty file
  [✗] List download failed: no cached list available

  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/child-protection
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device
  [✓] Status: Retrieval successful
  [i] Received empty file
  [✗] List download failed: no cached list available

<<< logs continued with "cannot create temp file" & "no cached list available" >>>

  [i] Target: https://raw.githubusercontent.com/RPiList/specials/master/Blocklisten/Fake-Science
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device
  [✓] Status: Retrieval successful
  [i] Received empty file
  [✗] List download failed: no cached list available

  [i] Target: https://raw.githubusercontent.com/hagezi/dns-blocklists/main/domains/multi.txt
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device
  [✓] Status: Retrieval successful
  [i] Received empty file
  [✗] List download failed: no cached list available

  [i] Target: https://raw.githubusercontent.com/elliotwutingfeng/Inversion-DNSBL-Blocklists/main/Google_hostnames.txt
/opt/pihole/gravity.sh: line 459: cannot create temp file for here-document: No space left on device
  [✓] Status: Retrieval successful
  [i] Received empty file
  [✗] List download failed: no cached list available

  [i] Creating new gravity databases...
  [✗] Unable to copy data from /etc/pihole/gravity.db to /etc/pihole/gravity.db_temp
  /opt/pihole/gravity.sh: line 479: cannot create temp file for here-document: No space left on device
  [✓] Building tree
  [✓] Swapping databases
  [✓] The old database remains available.
  [i] Number of gravity domains: 0 (0 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 0
  [i] Number of exact whitelisted domains: 0
  [i] Number of regex whitelist filters: 0
  [✓] Flushing DNS cache
  [✓] Cleaning up stray matter

  [✓] FTL is listening on port 53
     [✓] UDP (IPv4)
     [✓] TCP (IPv4)
     [✓] UDP (IPv6)
     [✓] TCP (IPv6)

  [✓] Pi-hole blocking is enabled
@dschaper
Copy link
Member

Can you check your /etc/pihole directory and see if there is a gravity_old.db file there?

  [✓] Building tree
  [✓] Swapping databases
  [✓] The old database remains available.

The gravity update process builds a new database with the updated domains and then swaps the existing database with the new database to keep any potential downtime to a minimum.

The return 1 at https://github.com/pi-hole/pi-hole/blob/master/gravity.sh#L484 should have prevented the databases from being swapped, but apparently did not. @DL6ER it looks like this is your code section, I don't know how you'd like to handle cases like this. I wouldn't expect output=$( { pihole-FTL sqlite3 "${gravityTEMPfile}" <<< "${copyGravity}"; } 2>&1 ) to involve /tmp at all?

@DL6ER
Copy link
Member

DL6ER commented Feb 12, 2023

/opt/pihole/gravity.sh: line 479: cannot create temp file for here-document: No space left on device

looks much rather like a bash issue so the SQL commands could not be compiled and passed to SQLite3 in the first place. We need to handle this better in case of a full disk, I thought we'd already trip for any non-handled bash errors, apparently, this special case isn't included. A very quick websearch did not reveal how to handle so - if nobody else finds something - we need to do one of the following:

  1. Restructure the code to not rely on here-docs
  2. Store here-documents in a different location than the downloaded lists (by setting export TMPDIR=/a/b/c) but this won't help on containers where everything is just one single partition
  3. Check between each step above that there are at least X MB of memory free in /tmp

All of these suggestions are somewhat hacky, the appropriate solution would be catching this error and terminating, possibly even cleaning up behind us a bit (delete already downloaded lists) so the disk won't stay filled to the brim possibly causing issues with system logins.

@Zocker1999NET
Copy link
Author

Zocker1999NET commented Feb 12, 2023

Can you check your /etc/pihole directory and see if there is a gravity_old.db file there?

The gravity update process builds a new database with the updated domains and then swaps the existing database with the new database to keep any potential downtime to a minimum.

It was there. At the time I detected Pi-Hole keps a backup, I already had reissued an update with the most important lists to reduce the downtime as well, so I couldn't use it to restore my config from before. However, for my case, that's not a that big loss (1 user Pi-Hole with well-known adlist synchronized from another instance), it was more important for me to issue that bug report so others don't loose their highly configured Pi-Hole instances.

From a developer standpoint (but haven't worked on Pi-Hole before), why is the integrated transaction mechanism of SQLite not used? Looking from the outside, it looks like you're simulating a transaction mechanism as good as possible. (no blame intended, I'm considering there is a valid reason, I'm asking just in case).

@dschaper dschaper added the Bug: Confirmed Verified as a bug to be resolved label Feb 14, 2023
@dschaper
Copy link
Member

why is the integrated transaction mechanism of SQLite not used?

Good question, is this something you might know of @PromoFaux

@dschaper dschaper self-assigned this Feb 14, 2023
@NickDeBeenSAE
Copy link

To fix this:

sudo pihole -r if your using a supported operating system, but if you have multiple Pi-Hole servers, and they don't run a supported operating system, then you can run this:

sudo pihole PIHOLE_SKIP_OS_CHECK=true -r

The above 2 commands in my experience re-installed Pi-Hole just fine.

@dschaper
Copy link
Member

@PromoFaux bumping your notification. I know MM has been eating notifications lately.

@dschaper
Copy link
Member

Store here-documents in a different location than the downloaded lists (by setting export TMPDIR=/a/b/c) but this won't help on containers where everything is just one single partition

Yeah, that sounds sane. Along with something to trap sed: couldn't write 31 items to stdout: No space left on device and then clean-up things afterwards.

@dschaper
Copy link
Member

why is the integrated transaction mechanism of SQLite not used

Can you expand a little bit on that? Is there a difference between that process (I'm not familiar with it) and our current process of creating a new database and then swapping in the new file once things are "ready"? (That may be broken here but it is intended for it to be ready.)

The reason for that is to have as little downtime as possible with the resolver. It takes some time to build the database and doing that live will block the DNS resolving process until everything is good to go.

@dschaper
Copy link
Member

The way we have now swaps the files so there is zero downtime as FTL will simply stick with the "old" file as the file descriptor stays the same even when renaming a file on disk. Only when the new database is in place, we signal FTL to reopen (= use the newer file) and from then on use the new file

@Zocker1999NET
Copy link
Author

Zocker1999NET commented Feb 21, 2023

Can you expand a little bit on that?

Database transactions in a nutshell are a way to be able to use a database normally but only you see the changes you apply (and these are not really stored) until you commit them.
Basically you can just start a transaction, truncate the table, add all entries from the imported lists and them commit it without having any downtime and without needing to do more than START TRANSACTION at the beginning, COMMIT after a successful end and ROLLBACK when any error happens. And if the process crashes without any handling, ROLLBACK is applied automatically, which would have prevented this bug from ever happening.

Also, transactions are nearly the same for all DB systems, so Pi-Hole could begin to support more DB vendors without needing to resort to copying files around.

Please correct me if I'm wrong: As I understand it, Pi-Hole currently creates a new DB file, connects to both the old & new one and transfers all config from the old DB to the new DB. This would also be no longer required, as you wouldn't need to touch any data not required to be changed on a gravity update.

@Zocker1999NET
Copy link
Author

Zocker1999NET commented Feb 21, 2023

Also, a further point: I now probably understand, why my Pi-Hole lost its data. Because SQLite is a single file DB with no persistent daemon like PostgreSQL etc., it has automatic transactions. I expect that following had happened:

  • the SQLite DB was moved to another location called .old or so
  • Pi-Hole created a new DB and opened it for writing, hence starting a new transaction on that empty DB
  • It copied over the config from the old one
  • It downloaded all the lists and failed during that
  • Because the process crashed, it closed the "connection" to the DB and because the loss was not intentional, SQLite rolled back the transaction hence the new DB stayed empty.
  • Pi-Hole was non the less reloaded and loaded the new & empty DB.

Have I missed something or could that be it?
If so, it can have following fixes:

  • use DB transactions intentionally & do not create the DB a new from scratch
    • which is the recommend way when working with DBs
    • should make Pi-Hole more robust against failures during gravity updates in general
    • would make code which copies config over manually obsolete, reducing complexity
  • create the new DB at the other location and move it in place after it was completed
    • not before as of now
    • should have the same behavior in this error case as my recommendation)
  • send a COMMIT SQL command after the config has been copied over
    • maybe the easiest to be implemented right away
    • still the blocklist would be empty after the same kind of error

@DL6ER
Copy link
Member

DL6ER commented Feb 22, 2023

When we first drafted the new database-based gravity, we started with the obvious choice of using a transaction very much in the way you described:

BEGIN TRANSACTION;
DROP INDEX idx_gravity;
DELETE FROM gravity;
INSERT INTO gravity ...;
CREATE INDEX idx_gravity ON gravity (domain, adlist_id);
COMMIT;

and also something like

BEGIN TRANSACTION;
CREATE TABLE gravity_new ...;
INSERT INTO gravity_new ...;
CREATE INDEX idx_gravity_1234´ ON gravity_new (domain, adlist_id);
DROP TABLE IF EXISTS gravity_old;
ALTER TABLE gravity RENAME TO gravity_old;
ALTER TABLE gravity_new RENAME TO gravity;
COMMIT;

Hints:

  1. Externalizing the creation of the index instead of doing it while inserting is much less work = a lot quicker
  2. Renaming a table does not rename the attached indices and renaming an index is not possible (at least not without without low-level modification of sqlite_master and friends)
  3. The string 1234 is a placeholder for a sufficiently long string (we have to specify an index name and uniqueness is enforced of this name). We have to use external scripting to ensure we have a unique name.

Now we come to the problem:

Both approaches above do suffer from locking issues with the database file. I agree that they shouldn't but they still did and resorting to the "two completely independent" files solution we have right now was the only viable solution we found back then. It's been slightly over three years since we changed this so don't ask me for the exact details, but I think it was the (lengthy) creation of the index which lead to (concurrency?) issues with the database getting locked during the index creation. This part of my memory seems to be inline with File Locking And Concurrency In SQLite Version 3 (mind the warning but it is still accurate in our case):

Eventually, the writing process will want to update the database file, either because its memory cache has filled up or because it is ready to commit its changes. Before this happens, the writer must make sure no other process is reading the database and that the rollback journal data is safely on the disk surface so that it can be used to rollback incomplete changes in the event of a power failure. The steps are as follows:

  1. Make sure all rollback journal data has actually been written to the surface of the disk (and is not just being held in the operating system's or disk controllers cache) so that if a power failure occurs the data will still be there after power is restored.
  2. Obtain a PENDING lock and then an EXCLUSIVE lock on the database file. If other processes still have SHARED locks, the writer might have to wait until those SHARED locks clear before it is able to obtain an EXCLUSIVE lock.
  3. Write all page modifications currently held in memory out to the original database disk file.

This is exactly the point where the database becomes unavailable for us. FTL would not know which domains to block (unless this information is cached because such a domain was requested recently). The issue here is that this can take up to minutes thinking about users with millions of domains on their lists but working on a Raspberry Pi Zero. As said, resorting to a two-independent-files solution was the only thing we found working reliably and without any downtime at all even on the low-end hardware.

Of course, we could start experimenting with switching from ROLLBACK to WAL mode and see if the locking situation improves. However, WAL has its own downsides on its own. For one, it is known to be somewhat slower (I don't have any real numbers here) for applications that are read-intense but write only occasionally (which is the case for us). Also, WAL has the downside of utilizing shared-memory, i.e., there is no way the database can be stored transparently on a network mount (e.g., NAS) which we know is done at least by a few users.

Summing this all up, I think the best strategy is to improve the gravity script to handle this situation better. Looking though the script with this problem in mind, I see we handle issues with sqlite3 pretty well, however, we do not handle errors thrown by sed which cannot continue due to a full disk at all. Furthermore, I have not found a way to catch the bash error cannot create temp file for here-document: No space left on device at all making this pretty tough.

A well-enough compromise might be to ensure that we have trice the amount of space of the current gravity.db (if no gravity.db.old exists even four-fold) but at least X MB before even starting the gravity.db run. I know this is a compromise and may not work in any case so periodically rechecking this after the list download steps is a must, too. What do you think @dschaper ?

@DL6ER
Copy link
Member

DL6ER commented Feb 22, 2023

Addendum: I just read the SQlite3 WAL documentation. What made me sceptical is that they do not mention what happens during "checkpointing". It's mostly a magical mystic black box. Diving deeper, we finally see in the description of the implementation of wal_checkpoint_v2():

All calls obtain an exclusive "checkpoint" lock on the database file. [...]

The EXCLUSIVE lock is called exclusive because ... well ... we have the same issue as described with the traditional ROLLBACK journal above. It may be less severe as the lock will not be held during the entirety of index creation (at the point of hitting the memory cache limit) but only once at the end. However, writing a database worth of, say, 200 MB (this is not exaggerating) will still take noteworthy amounts of time on slow devices. Think of maybe not an hour of index generation but still a minute or two database downtime while checkpointing your WAL.


One other thing I forgot to mention:

[...] Pi-Hole could begin to support more DB vendors without needing to resort to copying files around.

It's not that simple. Pi-hole implemented SQL language extensions such as subnet_match() used in

https://github.com/pi-hole/FTL/blob/6075a70accdd16a9647a986fce8af51dce03a66f/src/database/gravity-db.c#L264-L273

Furthermore, using SQLite3 in the way Pi-hole does have some performance benefits you will not be able to find with a database server attached over network (not even on localhost) as we keep the entire index trees in memory, allowing for checking if a domain is blocked or not most often by actually not having to do a single disk I/O (or socket) operation.

@yubiuser
Copy link
Member

Part of the issue will be mitigated with #5179 as it reduces the number of here-documents

@yubiuser
Copy link
Member

And this should further mitigate the littering 600540a

@Zocker1999NET
Copy link
Author

Zocker1999NET commented Feb 22, 2023

Okay, I see there are good reasons why transactions are not feasible for Pi-Hole, at most due to restrictions from SQLite I wasn't aware (other DB's should be able to handle reads during index creation, at least it was never a problem for systems I wrote).

When suggesting other DB systems, I was mainly referring to PostreSQL, which has feature parity with SQLite, but in the case of subnet_match, they require a different syntax to SQLite's one.

I asked about it because I couldn't find any discussion about these issues on GitHub and because normal transactions are way less error prone. Thanks for the explanation.

@dschaper
Copy link
Member

FTL has sqlite3 embedded in the binary and is what we use for the engine.

Keep in mind that Pi-hole will run on a single core ARM device with 512M RAM, PSQL isn't that light.

@dschaper
Copy link
Member

dschaper commented Feb 22, 2023

What do you think @dschaper ?

I think there's a couple of options for how to run gravity.sh. But I think the primary issue here that needs to be addressed is properly handling gravity.sh when there is no more space to run any error. And that is, simply, to halt at first sign of trouble.

The rest of the discussion is just gravy. Do we move the temp storage to /etc/pihole? Do we use some other kind of structure or memory to handle the files? Do we redesign how the download and parsing process is run? Do we look at other database options?

Those are all great questions but to prevent wandering from the point, lets pin this down to one specific issue. There is nothing to prevent gravity.sh from swapping in a bad database. That is a bug.

@pralor-bot
Copy link

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/pihole-tmp-no-space-left-on-device/61797/1

@dschaper
Copy link
Member

The temporary directory for gravity functions is now user configurable.

I'm leaving this issue open because it addresses more than just a full /tmp but also some design issues with sqlite3.

@dschaper
Copy link
Member

The latest hotfix now includes a check to prevent empty databases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug: Confirmed Verified as a bug to be resolved
Projects
None yet
Development

No branches or pull requests

6 participants