Technical Advisory 144650


Publication date: April 28, 2025

Description

In CockroachDB v23.2.0 to v23.2.23, v24.1.0 to v24.1.16, all versions of v24.2, v24.3.0 to v24.3.10, v25.1.0 to v25.1.4, and testing versions of v25.2 through v25.2.0-beta.1, the following bulk write operations can succeed without ingesting all expected data:

  • IMPORT
  • MOLT Fetch (without MOLT Verify)
  • Physical Cluster Replication
  • CREATE TABLE ... AS (CTAS)
  • Materialized view refreshes

Other bulk write operations are not affected. RESTORE is unaffected because it never uses the rare code path affected by this bug. Index backfill schema changes avoid this problem because of post-ingestion index validation, which fails the index creation if it finds any missing data.

Bulk write operations have two code paths for flushing data to disk: one common, one rare. The issue described on this page impacts only the rare code path, on which errors are incorrectly handled and not retried, allowing bulk write operations to improperly succeed. The issue can occur only if errors (e.g., lack of disk space or node unavailability) are encountered on the rare code path and no errors are encountered outside it, since errors on the common path would cause the bulk write operation to retry or roll back.

For steps to identify and mitigate affected operations, refer to Mitigation.

Statement

This issue is resolved in CockroachDB by #144646, which disables the rare (async flush) code path for bulk write operations. The fix has been applied to the v23.2.24, v24.1.17, v24.3.11, and v25.1.5 production releases, and to the v25.2.0-beta.2 testing release.

The issue is tracked publicly by #144650.

Mitigation

Users of CockroachDB v23.2.0 to v23.2.23, v24.1.0 to v24.1.16, all versions of v24.2, v24.3.0 to v24.3.10, v25.1.0 to v25.1.4, and testing versions of v25.2 through v25.2.0-beta.1 are encouraged to upgrade to v23.2.24, v24.1.17, v24.3.11, v25.1.5, v25.2.0-beta.2, or a later version.

This issue only affects the following operations: IMPORT, MOLT Fetch (without Verify), Physical Cluster Replication, CREATE TABLE ... AS (CTAS), and materialized view refreshes.

Users on affected versions can determine whether they may have encountered the problem by running the script detect_144650.sh to detect any bulk ingestion jobs that improperly succeeded. If the script detects a potential issue, users should contact support and work through the appropriate manual validation steps.

For details, refer to Detection via logs and Manual validation.

Detection via logs

detect_144650.sh requires the following dependencies:

Run detect_144650.sh from within a decompressed debug.zip directory, or pass the path to the directory as an argument. For example:

/path/to/script/detect_144650.sh /path/of/unzipped/debug

Do not run the script on a production system. Instead, move the debug.zip file to a separate system for analysis and run the script there.

detect_144650.sh examines the CockroachDB logs for entries that indicate that a job may have been affected.

  • If a job is likely to have corruption, the output will look like the following:

    ❌ Potentially Corrupt Jobs found. Contact Support.
    
    +---------------------+----------+
    |       job_id        | num_rows |
    +---------------------+----------+
    | 1065722008981340161 | 2964940  |
    +---------------------+----------+
    

    If a job ID is reported in the output, contact support and validate the job using the relevant method in Manual validation.

  • If a problematic error is detected, but cannot be linked to a job ID, a warning is returned:

    Searching for silent async flush errors
    WARNING: detected bad error without job id tag. Cannot check if job succeeded. See jobless_errs.txt. Contact support.
    Searching for succeeded jobs
    

    If the script returns a warning, contact support to determine next steps.

If the script does not find the applicable log entries, your cluster was not affected by this issue during the log retention period. However, the script only identifies corrupt jobs for which you have logs. If a job ran during a timespan for which you have no logs, you can manually validate the job using one of the relevant methods described in Manual validation.

Manual validation

The following sections contain mitigation steps for the bulk write operations affected by this issue.

Follow these steps after detect_144650.sh finds a corrupted job or problematic error and you have contacted support, or if you want to manually validate a job for which you have no logs.

IMPORT

IMPORT jobs affected by this issue could be missing rows that were present in the source data. To check whether a suspected IMPORT job was affected, compare the number of rows imported with the number of rows in the source data.

  1. If the output of the IMPORT statement showed the expected number of rows, then the import of the primary index was unaffected, but secondary indexes may be corrupted. Cockroach Labs suggests dropping and recreating secondary indexes.

  2. After a successful IMPORT job, there should be a RecoveryEvent logged to the TELEMETRY logging channel with "RecoveryType": "import_job". If NumRows from this event shows the expected number of rows, then the import of the primary index was unaffected, but secondary indexes may be corrupted. Cockroach Labs suggests dropping and recreating secondary indexes.

  3. If the preceding two steps do not show the expected number of rows, or cannot be performed due to the time elapsed since the IMPORT job, then use RESTORE ... AS OF SYSTEM TIME to reconstruct the table as it was immediately after the import finished, and run a SELECT count(*) query to count the rows in the restored table. (If the table contained any rows before the IMPORT statement was run, use a second RESTORE ... AS OF SYSTEM TIME to reconstruct the table as it was immediately before the IMPORT started, and subtract that table's SELECT count(*) from the first count.) A SQL sketch of this step follows this list.

    • If this count shows the expected number of rows, then the import of the primary index was unaffected, but secondary indexes may be corrupted. As in the preceding cases, Cockroach Labs suggests dropping and recreating secondary indexes.
    • If this count does not show the expected number of rows, this indicates a discrepancy affecting the primary index. After upgrading to a version with the fix, run the IMPORT a second time. Compare the two imported datasets with a JOIN, and manually add any missing rows.
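
The following SQL sketch illustrates step 3 and the secondary-index remediation. All database, table, index, column, backup, and timestamp values are hypothetical placeholders, and the RESTORE ... AS OF SYSTEM TIME step assumes your backups were taken WITH revision_history and cover the relevant time.

    -- Reconstruct the table as it was immediately after the IMPORT finished.
    -- Requires a backup taken WITH revision_history covering that time.
    RESTORE TABLE defaultdb.orders FROM LATEST IN 'external://backup_collection'
      AS OF SYSTEM TIME '2025-04-01 12:05:00+00'
      WITH into_db = 'validation';

    -- Count the rows that were present right after the import:
    SELECT count(*) FROM validation.orders;

    -- If the count falls short, upgrade, re-run the IMPORT into a second
    -- table, and find rows missing from the original (assuming `id` is the
    -- primary key):
    SELECT r.id
    FROM reimport.orders AS r
    LEFT JOIN defaultdb.orders AS o ON r.id = o.id
    WHERE o.id IS NULL;

    -- If only secondary indexes are suspect, drop and recreate them:
    SHOW INDEXES FROM defaultdb.orders;
    DROP INDEX defaultdb.orders@orders_customer_id_idx;
    CREATE INDEX orders_customer_id_idx ON defaultdb.orders (customer_id);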

MOLT Fetch

By default, MOLT Fetch uses IMPORT INTO to load data into CockroachDB, and can therefore be affected by this issue. As recommended in the migration documentation, a run of MOLT Fetch should be followed by a run of MOLT Verify to ensure that all data on the target side matches the data on the source side.

  • If you ran MOLT Verify after completing your MOLT Fetch run, and Verify did not find mismatches, then MOLT Fetch was unaffected by this issue.

  • If you did not run Verify after Fetch, analyze the exported files that exist in your configured Fetch data path to determine the expected number of rows. Then follow steps 1-3 in the IMPORT section.

Physical Cluster Replication

After upgrading both the primary and standby clusters to a version with the fix, run a point-in-time fingerprint on both clusters to verify that the data on the primary and standby clusters is identical as of a given replicated time.

If the fingerprints do not match, ensure both clusters are upgraded to a version with the fix, then cancel your PCR stream by dropping the replicating virtual cluster and restart the PCR stream. Cockroach Labs encourages users to contact support for assistance with these steps if a fingerprint mismatch is found.
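
The following is a minimal sketch of cancelling and restarting the stream, run from the standby cluster's system virtual cluster. The virtual cluster name and connection string are hypothetical placeholders; refer to the Physical Cluster Replication documentation for the fingerprinting and setup statements that apply to your version.

    -- Cancel the stream by dropping the replicating virtual cluster:
    DROP VIRTUAL CLUSTER main IMMEDIATE;

    -- Restart replication from the primary cluster (connection string is a
    -- placeholder):
    CREATE VIRTUAL CLUSTER main
      FROM REPLICATION OF main
      ON 'postgresql://user@primary-host:26257?options=-ccluster%3Dsystem&sslmode=verify-full';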

CREATE TABLE ... AS (CTAS)

CREATE TABLE ... AS (CTAS) operations can be validated similarly to IMPORT jobs. The number of rows in the created table indicates whether the CTAS operation was successful.

To check whether a CTAS operation was affected, compare the number of rows in the created table with the output of the CTAS query. If you cannot perform this comparison due to the time elapsed since the CTAS operation, then either re-run the CTAS query with AS OF SYSTEM TIME or use RESTORE ... AS OF SYSTEM TIME to reconstruct the table as it was immediately after the CTAS operation finished.
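
As a sketch, assuming a table created with CTAS from source_table and timestamps that are still within the garbage-collection window (all names and times are placeholders):

    -- Row count of the created table as of just after the CTAS finished:
    SELECT count(*) FROM new_table AS OF SYSTEM TIME '2025-04-01 12:05:00+00';

    -- Row count the CTAS query should have produced, evaluated as of just
    -- before the CTAS started:
    SELECT count(*) FROM source_table AS OF SYSTEM TIME '2025-04-01 12:00:00+00';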

If you find a discrepancy, then upgrade to a version with the fix, drop the affected table, and re-run the CTAS operation.

Materialized view refreshes

Materialized view refreshes can be validated similarly to CTAS operations. To check whether a REFRESH was affected, compare the number of rows in the materialized view with the source data.

If you find a discrepancy, then upgrade to a version with the fix and refresh the materialized view to ensure it has the most up-to-date data.
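
As a sketch, assuming a materialized view mv_orders whose defining query reads from orders (both names are placeholders):

    -- Compare the view's row count against its defining query:
    SELECT count(*) FROM mv_orders;
    SELECT count(*) FROM orders;  -- substitute the view's full defining query

    -- If the counts differ, upgrade to a version with the fix first, then
    -- rebuild the view's contents:
    REFRESH MATERIALIZED VIEW mv_orders;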

Impact

In rare cases, some bulk write operations can succeed without ingesting all expected data. Versions affected include v23.2.0 to v23.2.23, v24.1.0 to v24.1.16, all versions of v24.2, v24.3.0 to v24.3.10, v25.1.0 to v25.1.4, and testing versions of v25.2 through v25.2.0-beta.1.

Contact the support team if more information or assistance is needed.

