Some Drupal database tables do not support 4 byte utf8 characters #18123

Open
4 tasks
omahane opened this issue May 15, 2024 · 0 comments
Labels
Defect (Something isn't working, issue type), Needs refining (Issue status)

omahane commented May 15, 2024

Describe the defect

Drupal allows editors to enter multibyte data in text fields, but some database tables cannot store 4-byte characters. So far we know of one affected table, search_api_db_content; an audit is needed to find every table and column that still uses the utf8 character set.
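
The audit could start from a query like the one below. It is only a sketch: it assumes the schema is named `db`, as in the queries under Engineering Notes, and it checks for both `utf8` and `utf8mb3` because servers report the 3-byte character set under either name depending on version.

```sql
-- List every text column in the schema that still uses the 3-byte utf8 character set.
-- Assumption: the schema is named 'db'; adjust for the environment.
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.COLUMNS
WHERE table_schema = 'db'
  AND character_set_name IN ('utf8', 'utf8mb3')
ORDER BY table_name, column_name;
```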

Both Drupal and MySQL (and by extension, MariaDB) recommend setting up the database to use the utf8mb4 character set and the utf8mb4_unicode_ci collation across the database. In fact, MySQL has deprecated utf8 (an alias for utf8mb3) in favor of utf8mb4.

To Reproduce

Steps to reproduce the behavior, as an administrator:

  1. Go to /media/add/image
  2. In the description, enter: 𝐂𝐚𝐫𝐞𝐠𝐢𝐯𝐞𝐫 𝐒𝐮𝐦𝐦𝐢𝐭 (copy this text, do not type it!)
  3. Save the image after filling required fields.
  4. View the error log at /admin/reports/dblog?type%5B%5D=search_api_db (filtered by search_api_db errors)
  5. Note that there is an error for the media entity like the one in the screenshot below.

AC / Expected behavior

No errors are logged to watchdog when saving entities that contain 4-byte characters, and the data persists.

Screenshots

Screenshot 2024-05-16 at 2 27 42 PM

Additional context

This issue impacts the following pages:

It's also producing a number of warning log entries every time cron runs:
Screenshot 2024-05-15 at 10 00 26 AM

Conversation around this on Slack.

Engineering Notes

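When troubleshooting, we noted that the search_api_db_content table was using the utf8 character set and the utf8_general_ci collation. This character set stores at most three bytes per character, so it cannot hold 4-byte characters such as the mathematical bold letters used in the reproduction steps.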
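
A quick way to confirm that the text from the reproduction steps needs four bytes per character is to compare character and byte lengths. This is a sketch that assumes the client connection can be switched to utf8mb4 with SET NAMES; the first letter of the sample text is used as the literal.

```sql
-- Make sure the literal below is sent to the server as utf8mb4.
SET NAMES utf8mb4;

-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- For a 4-byte character, bytes should be 4 while chars is 1.
SELECT CHAR_LENGTH('𝐂') AS chars, LENGTH('𝐂') AS bytes, HEX('𝐂') AS utf8_hex;
```

A column defined with the utf8 character set stores at most three bytes per character, so a value like this cannot be written to it.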

Some helpful SQL queries
# Get the current character set and collation of the field_description column in search_api_db_content.
SELECT CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.`COLUMNS`
WHERE table_schema = 'db'
  AND table_name = 'search_api_db_content'
  AND column_name = 'field_description';

# Change the character set (mutates existing data)
ALTER TABLE search_api_db_content CONVERT TO CHARACTER SET utf8mb4;

# View the default character set and collation of the current database
SELECT @@character_set_database, @@collation_database;

# Show the available character sets.
SHOW CHARACTER SET;
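
If the audit turns up more than one table, the ALTER statement above could be generated per table rather than written by hand. The following is a sketch, not a tested migration: it assumes the schema name `db` and the utf8mb4_unicode_ci collation mentioned in the defect description, and it only prints statements for review instead of running them.

```sql
-- Build one CONVERT statement for each base table whose default collation is not utf8mb4.
-- Assumptions: schema 'db', target collation utf8mb4_unicode_ci.
SELECT CONCAT(
         'ALTER TABLE `', table_name,
         '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
       ) AS alter_statement
FROM information_schema.TABLES
WHERE table_schema = 'db'
  AND table_type = 'BASE TABLE'
  AND table_collation NOT LIKE 'utf8mb4%'
ORDER BY table_name;
```

The output should be reviewed before anything is executed, since CONVERT TO rewrites the table and changes its column definitions. Tables whose default collation is already utf8mb4 but that still contain individual utf8 columns are not caught by this filter, so a column-level check is still needed.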

ACs

@omahane added the Defect (Something isn't working, issue type) and Needs refining (Issue status) labels May 15, 2024
@dsasser changed the title from "Stub: search_api_db warning due to character set mismatch" to "Drupal database does not support multi-byte strings" May 16, 2024
@dsasser changed the title from "Drupal database does not support multi-byte strings" to "Drupal database does not support multibyte characters" May 16, 2024
@dsasser changed the title from "Drupal database does not support multibyte characters" to "Some Drupal database tables do not support 4 byte utf8 characters" May 16, 2024