Commit 723f5f36 authored by Tobias Thierer

Fix state deletion for transient backup issues.

Since Android 10, backupPm() includes sendDataToTransport(); in
Android <= 9 it did not. As a result, the error handling logic that
deletes the backup state file (causing initialize_device() on the next
attempt, which deletes any existing backup) is now also triggered by
errors during sendDataToTransport().

This can make an existing temporary outage much worse:

  1. A few devices might run into temporary issues, e.g. a B&R server
     returning HTTP 503 Service Unavailable. This is treated as a
     TransientHttpStatusException (a NetworkException), which is
     mapped to TRANSPORT_ERROR during handleTransportStatus(). The
     result is a TaskException with stateCompromised==false, but
     backupPm() wraps it in another TaskException that forces
     stateCompromised=true.
  2. On their next backup attempt, those devices throw away any
     existing backup and start from scratch (initialize_device()),
     increasing the load on the server.
  3. This leads to a positive-feedback loop where more devices than
     before run into HTTP 503 Service Unavailable.
  4. As a result, masses of devices delete their backups and then
     hammer the B&R server with attempts to upload new backups.
  5. Backups are unavailable to any users who would otherwise rely
     on them during this outage.

To improve on this dangerous situation, this CL changes the code to
force stateCompromised=true only for TaskExceptions thrown
specifically during extractPmAgentData(), and (as before) for all
AgentExceptions.
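
The difference between the old and new wrapping can be sketched as
follows. This is a simplified illustration, not the actual
KeyValueBackupTask code: the names TaskException, AgentException,
backupPm(), extractPmAgentData() and sendDataToTransport() come from
this message, but the class shapes, the Step interface and the two
backupPm* method variants are assumptions made for the example.

```java
public class BackupPmSketch {
    /** Simplified stand-in for TaskException (assumed shape). */
    static class TaskException extends Exception {
        final boolean stateCompromised;

        TaskException(Throwable cause, boolean stateCompromised) {
            super(cause);
            this.stateCompromised = stateCompromised;
        }
    }

    /** Simplified stand-in for AgentException (assumed shape). */
    static class AgentException extends Exception {
        AgentException(String message) {
            super(message);
        }
    }

    /** Hypothetical interface for one step of the PM backup. */
    interface Step {
        void run() throws TaskException, AgentException;
    }

    /** Old behavior: any failure in either step forces stateCompromised=true. */
    static void backupPmBeforeFix(Step extractPmAgentData, Step sendDataToTransport)
            throws TaskException {
        try {
            extractPmAgentData.run();
            sendDataToTransport.run();
        } catch (TaskException | AgentException e) {
            throw new TaskException(e, true);
        }
    }

    /**
     * New behavior: TaskExceptions are only rewrapped when thrown by
     * extractPmAgentData(); AgentExceptions are rewrapped as before.
     */
    static void backupPmAfterFix(Step extractPmAgentData, Step sendDataToTransport)
            throws TaskException {
        try {
            try {
                extractPmAgentData.run();
            } catch (TaskException e) {
                throw new TaskException(e, true);
            }
            // A TaskException from here propagates with its original flag.
            sendDataToTransport.run();
        } catch (AgentException e) {
            throw new TaskException(e, true);
        }
    }
}
```

In this sketch, a transient transport error (a TaskException with
stateCompromised==false thrown from the sendDataToTransport step)
leaves backupPmAfterFix() with the flag still false, whereas
backupPmBeforeFix() turns it into true.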

Note that the code is still quite brittle. We are probably still
forcing stateCompromised=true in too many situations, but that is
hard to confirm, so this CL is conservative about what it changes.
Changing back to the old behavior could be done through a local
change around KeyValueBackupTask.java:676; a future CL may do this
to provide a safety hatch in case we want to cherry-pick this CL
into an upcoming Android release late in the release cycle.

[1] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/pie-dev/services/backup/java/com/android/server/backup/internal/PerformBackupTask.java#1035
[2] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/master/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java#1040
[3] https://source.corp.google.com/piper///depot/google3/java/com/google/android/gmscore/integ/modules/backup/transport/src/com/google/android/gms/backup/transport/GmsBackupTransport.java;l=770;rcl=281845876

Bug: 144030477
Test: Checked that the following passes after this CL, but
  testRunTask_whenTransportReturnsErrorForPm_updatesFilesAndCleansUp()
  fails if I revert the state of KeyValueBackupTask.java to before this CL:
    ROBOTEST_FILTER=KeyValueBackupTaskTest make \
    RunBackupFrameworksServicesRoboTests

Change-Id: I6c622c55fbd804ec0a12e0bea7ade1308f7a3877
(cherry picked from commit 819ed81f)
parent 03b0c47b