Fix state deletion for transient backup issues.

Since Android 10, backupPm() includes sendDataToTransport(), which was
not the case on Android <= 9. This means that the error handling logic
that deletes the backup state file (causing initialize_device() on the
next attempt, which deletes any existing backup) is now also triggered
by errors during sendDataToTransport().

This has the potential of making an existing temporary outage much
worse:

 1. A few devices might run into temporary issues, e.g. a B&R server
    returning HTTP 503 Service Unavailable. This is treated as a
    TransientHttpStatusException (instanceof NetworkException), which
    is mapped to TRANSPORT_ERROR during handleTransportStatus(). That
    results in a TaskException with stateCompromised==false, but
    backupPm() wraps it in another TaskException that forces
    stateCompromised=true.
 2. On their next backup attempt, those devices throw away any
    existing backup and start from scratch (initialize_device()),
    increasing the load on the server.
 3. This leads to a positive-feedback loop where more devices than
    before run into HTTP 503 Service Unavailable.
 4. As a result, masses of devices delete their backups and then
    hammer the B&R server with attempts to upload new backups.
 5. Backups are unavailable to any users who would otherwise rely on
    them during this outage.

To improve on this dangerous situation, this CL changes the code to
force stateCompromised=true only for TaskExceptions thrown specifically
during extractPmAgentData(), and (as before) for all AgentExceptions.

Note that the code is still quite brittle. We are probably still
forcing stateCompromised=true in too many situations, but it is hard to
be sure, so this CL is conservative about the changes.

Changing back to the old behavior could be done through a local change
around KeyValueBackupTask.java:676; a future CL may do this to provide
a safety hatch in case we want to cherry-pick this CL into an upcoming
Android release late in the release cycle.
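
The intended rule can be illustrated with a small, self-contained
sketch. It is not the actual KeyValueBackupTask code: the Step
interface, the wouldResetState() helper and the simplified exception
classes are hypothetical and exist only to show which failures force
stateCompromised=true after this CL.

```java
// Simplified, hypothetical model of the error handling described above.
// Names mirror the commit message; bodies are illustrative only and are
// not the real KeyValueBackupTask implementation.
final class BackupPmSketch {

    /** Carries the flag that decides whether local backup state is deleted. */
    static final class TaskException extends Exception {
        final boolean stateCompromised;

        TaskException(boolean stateCompromised, Throwable cause) {
            super(cause);
            this.stateCompromised = stateCompromised;
        }
    }

    /** Stand-in for failures raised by the backup agent. */
    static final class AgentException extends Exception {}

    /** One phase of backupPm(); hypothetical helper type for this sketch. */
    interface Step {
        void run() throws TaskException, AgentException;
    }

    static void backupPm(Step extractPmAgentData, Step sendDataToTransport)
            throws TaskException {
        try {
            extractPmAgentData.run();
        } catch (TaskException | AgentException e) {
            // Failure while extracting PM data: always force the flag.
            throw new TaskException(true, e);
        }
        try {
            sendDataToTransport.run();
        } catch (AgentException e) {
            // As before the CL, every AgentException forces the flag.
            throw new TaskException(true, e);
        }
        // A TaskException from sendDataToTransport propagates unchanged, so a
        // transient TRANSPORT_ERROR keeps stateCompromised == false.
    }

    /** Returns true if the next attempt would start from initialize_device(). */
    static boolean wouldResetState(Step extract, Step send) {
        try {
            backupPm(extract, send);
            return false;
        } catch (TaskException e) {
            return e.stateCompromised;
        }
    }

    public static void main(String[] args) {
        Step ok = () -> {};
        Step transportError = () -> { throw new TaskException(false, null); };
        Step extractFailure = () -> { throw new TaskException(false, null); };
        Step agentCrash = () -> { throw new AgentException(); };

        System.out.println("transport error resets state: "
                + wouldResetState(ok, transportError));
        System.out.println("extract failure resets state: "
                + wouldResetState(extractFailure, ok));
        System.out.println("agent crash during send resets state: "
                + wouldResetState(ok, agentCrash));
    }
}
```

In this model, only the first case leaves the existing backup in place,
which is the behavior that avoids the feedback loop described above.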

[1] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/pie-dev/services/backup/java/com/android/server/backup/internal/PerformBackupTask.java#1035
[2] https://android.googlesource.com/platform/frameworks/base/+/refs/heads/master/services/backup/java/com/android/server/backup/keyvalue/KeyValueBackupTask.java#1040
[3] https://source.corp.google.com/piper///depot/google3/java/com/google/android/gmscore/integ/modules/backup/transport/src/com/google/android/gms/backup/transport/GmsBackupTransport.java;l=770;rcl=281845876

Bug: 144030477
Test: Checked that the following passes after this CL, but
      testRunTask_whenTransportReturnsErrorForPm_updatesFilesAndCleansUp()
      fails if I revert the state of KeyValueBackupTask.java to before
      this CL:
      ROBOTEST_FILTER=KeyValueBackupTaskTest make \
          RunBackupFrameworksServicesRoboTests
Change-Id: I6c622c55fbd804ec0a12e0bea7ade1308f7a3877
(cherry picked from commit 819ed81f)