Problem Recovery and Reporting

An operating system shell should be able to inform the user about problematic or unrecoverable conditions in the system. This is different from error logging. Logging serves a system technician, and its goals and usage scenarios are significantly different: archival, remote operation, aggregation, trending, pattern matching, and report generation. Those are clearly important and useful, but the goals here are much simpler.

Primary Goals

Die with dignity
Inform and apologize to the user that things have gone awry
Stop things before they become unrecoverable / scary / pathological
Reset the system to an operational state as quickly as possible

Secondary Goals

Allow the user to participate in the improvement of the system
When the user kindly permits it, send automatically generated diagnostic feedback to developers

Relevant Art

OS X

Fatal system error

Basic Mode

Developer Mode

Notes

Server mode does no reporting to the user but probably still logs the event
Crashing system services (e.g. cupsd) are not reported to the user in any mode
All reports seem to be filed anonymously

Windows 7

Windows 8

Fatal system error

BSOD

Ubuntu

Details

Fedora

Firefox

Chrome

Aw Snap

Twitter

Discussion

Layers of the system

Application
Session
Operating System
- OS Core Services
- OS Kernel
- Boot System

The user must not be exposed to any finer-grained detail about the composition of the system.

In general, each type of trouble should be handled by the layer that supervises the one where it occurred: application problems by the Shell, Shell problems by Core Services (probably GDM or plymouth), kernel failures by the boot system, and boot system failures by the firmware.
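
The hand-off between layers can be written down as a small lookup table. The following is only an illustrative sketch: the names `HANDLED_BY` and `escalation_chain` are hypothetical, and the table records only the pairings the text states (nothing is assumed about who handles a Core Services problem).

```python
# Which layer handles trouble in which other layer, as stated in the text.
# "Firmware" is not in the layer list but is named as the handler of last resort.
HANDLED_BY = {
    "Application": "Shell",
    "Shell": "OS Core Services",
    "OS Kernel": "Boot System",
    "Boot System": "Firmware",
}


def escalation_chain(layer):
    """Return the handlers that would be tried in turn for trouble in `layer`."""
    chain = []
    while layer in HANDLED_BY:
        layer = HANDLED_BY[layer]
        chain.append(layer)
    return chain


if __name__ == "__main__":
    print(escalation_chain("Application"))
    print(escalation_chain("OS Kernel"))
```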

Forms of trouble

Crash
Misbehavior
Misconfiguration
Failure

Crash

An unhandled exception where the process exits abruptly. It may be possible to generate useful stack traces if debugging symbols are available.

Note: Crashes range from very easy to reproduce and highly visible to very difficult to reproduce, e.g. a crash at startup versus one that happens once in a blue moon. If a crash is very easy to repeat, the same dialog(s) popping up again and again will agitate the user.

Misbehavior

The process has stopped responding or has done something that it wasn't supposed to. The action may have been denied, or permitted with a warning. This may involve SELinux or similar. It may be the result of misconfiguration by either a technician or a vendor; the two cases may be distinguishable by whether default configuration values were in use. Examples include:

Process looping and using 100% CPU
UI not responding
Attempt to access something it wasn't allowed to

Misconfiguration

The process cannot handle the configuration information that it has been provided. This may be in the form of files on disk (/etc) or in a database (dconf). The cause may be the way the program saves settings, the user, a technician, or a vendor. When non-default values are the cause, the remedy is often to reset them to their defaults. When default values are the cause, this can be interpreted as a failure.
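
The default-versus-non-default rule above could be sketched roughly as follows. This is a minimal illustration, not a proposed API: `Remedy`, `remedies_for`, and the idea that the process can name the keys it rejected are all assumptions made for the example.

```python
from enum import Enum


class Remedy(Enum):
    RESET_TO_DEFAULT = "reset to default"
    TREAT_AS_FAILURE = "treat as failure"


def remedies_for(settings, defaults, rejected_keys):
    """Pick a remedy for each configuration key the process could not handle.

    settings: values currently stored (e.g. read from /etc or dconf)
    defaults: values shipped with the program
    rejected_keys: keys the process reported as unusable (hypothetical input)
    """
    remedies = {}
    for key in rejected_keys:
        if key in settings and settings[key] != defaults.get(key):
            # A non-default value is the cause: resetting is the usual remedy.
            remedies[key] = Remedy.RESET_TO_DEFAULT
        else:
            # The default itself is unusable: interpret this as a failure.
            remedies[key] = Remedy.TREAT_AS_FAILURE
    return remedies
```

For example, a rejected key whose stored value differs from the shipped default maps to `RESET_TO_DEFAULT`, while a rejected key still at its default maps to `TREAT_AS_FAILURE`.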

Failure

An error where the process has no choice but to bail out because it cannot continue. This differs from a Crash in that the program can often provide a specific reason why it cannot continue. Stack traces may also be available if the program aborts with SIGABRT. For an application this may include OS version mismatches. For an OS Shell this may include hardware capability mismatches, etc.

Tentative Design

Client Side

It seems that a solution may have a few reporting modes:

Normal
Developer
Managed
Unattended

Normal Mode

The default mode where the user may be informed of trouble conditions in System or Application, prompted to reset default settings in System or Application, and asked to kindly submit trouble reports. Since sending trouble reports is a secondary goal and resetting the system to an operational state is a primary goal, the sending process must be highly efficient, simple, and clear. The user should not be required to gather details beyond what the system can gather automatically. The user may be permitted to supply additional details about what they were doing at the time, but this should not be required since it conflicts with the primary goals and is of doubtful use. The process must not wait for additional debugging information to download. That conflicts with the primary goals and is frankly really irritating. The screens shown to the user must not contain technical details like stack traces.

Developer Mode

Same as Normal Mode with these exceptions: the prompts may contain stack trace information, and information about background processes and services may be shown.

Managed Mode

Possibly useful for managed clients or servers. Operation is similar to Normal mode except that the user is not prompted to submit reports. The user should still be informed of trouble and may be prompted to reset defaults. Administrators may hook into the low level system to extract details automatically (via push or pull).

We may also be able to use this mode if the user has elected to automatically provide feedback during the initial system setup.

Unattended Mode

Possibly useful for kiosks or similar. Failsafe fallbacks should operate without user intervention. No crash logging or reporting should occur. Failures should not expose system details to passers-by.

Fatal system errors should automatically trigger a restart (up to a certain number of retries).
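
A bounded restart loop of this kind might look like the sketch below. The function name `run_with_restarts`, the `start_session` callback, and the default of three retries are assumptions for illustration; the text only says "a certain number of retries".

```python
def run_with_restarts(start_session, max_retries=3):
    """Run the session, restarting after each fatal error.

    start_session: hypothetical callable returning True on a clean exit
                   and False on a fatal system error.
    Returns the number of restarts that were needed.
    Raises RuntimeError once max_retries restarts have all failed.
    """
    restarts = 0
    while True:
        if start_session():
            return restarts
        if restarts >= max_retries:
            # Give up quietly: Unattended Mode does no logging or reporting,
            # so the caller decides what the final fallback is.
            raise RuntimeError("restart limit reached")
        restarts += 1
```

Capping the retries keeps a kiosk from looping forever on a fault that a restart cannot fix.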

Private Mode?

Perhaps there should also be a private mode for when the user is doing something that shouldn't be tracked, such as using a web browser in incognito/private mode. In this mode the system would notify the user of problems but make no attempt to report them.
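
The differences between the modes described above can be summarized as a policy table. This is a sketch only: the `Policy` field names are invented for the example, and the values marked as assumptions are not stated in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEVELOPER = "developer"
    MANAGED = "managed"
    UNATTENDED = "unattended"
    PRIVATE = "private"  # tentative, see "Private Mode?"


@dataclass(frozen=True)
class Policy:
    inform_user: bool        # tell the user that something went wrong
    offer_reset: bool        # prompt to reset default settings
    prompt_to_report: bool   # ask the user to submit a trouble report
    show_stack_traces: bool  # include technical details in prompts


POLICIES = {
    Mode.NORMAL: Policy(inform_user=True, offer_reset=True,
                        prompt_to_report=True, show_stack_traces=False),
    Mode.DEVELOPER: Policy(inform_user=True, offer_reset=True,
                           prompt_to_report=True, show_stack_traces=True),
    Mode.MANAGED: Policy(inform_user=True, offer_reset=True,
                         prompt_to_report=False, show_stack_traces=False),
    # Assumption: an unattended system shows nothing and asks nothing.
    Mode.UNATTENDED: Policy(inform_user=False, offer_reset=False,
                            prompt_to_report=False, show_stack_traces=False),
    # Assumption: Private Mode still offers a reset; the text only says
    # that the user is notified and nothing is reported.
    Mode.PRIVATE: Policy(inform_user=True, offer_reset=True,
                         prompt_to_report=False, show_stack_traces=False),
}
```

Keeping the policy in one table means the reporting UI only has to consult `POLICIES[mode]` rather than branch on the mode in several places.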

Guidelines

Also see Design/Apps/Oops

Fatal System Error

Server Side

A crash reporting server should:

Allow anonymous crash report submissions
Scrub sensitive user data from reports, removing:
- real names
- usernames
- email addresses
- social security numbers
- phone numbers
- IP addresses
- document titles
- user filenames (especially in $HOME)
- URLs
Support filing reports in Bugzilla for developer review
Avoid duplicate report filing
Support linking crash reports to Bugzilla to allow Developer Mode direct access to bug reports
Perform coredump analysis and backtrace generation
Support symbol resolution for applications not "packaged" directly by the OS vendor
- applications may be distributed as bundles and crash server would need a way to find debuginfo for them
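
Part of the scrubbing step could be approximated with pattern matching, as in the rough sketch below. The function `scrub`, its placeholders, and the regular expressions are illustrative assumptions. It covers only URLs, paths under $HOME, email addresses, IPv4 addresses, and the username; real names, document titles, phone numbers, and social security numbers would need more than simple patterns, and paths containing spaces are only partly removed.

```python
import re


def scrub(text, username, home):
    """Replace some kinds of sensitive data in a report with placeholders."""
    # URLs go first because they can contain usernames, emails, or IPs.
    text = re.sub(r"https?://\S+", "<URL>", text)
    # Anything under the user's home directory, up to the next whitespace.
    text = re.sub(re.escape(home) + r"(?:/\S*)?", "<HOME-PATH>", text)
    text = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL>", text)
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", "<IP>", text)
    # Remaining whole-word occurrences of the login name.
    text = re.sub(r"\b" + re.escape(username) + r"\b", "<USER>", text)
    return text


if __name__ == "__main__":
    report = ("alice opened /home/alice/notes.txt, mailed bob@example.com "
              "from 192.168.1.20, see https://example.com/u/alice")
    print(scrub(report, username="alice", home="/home/alice"))
```

Scrubbing on the server protects the stored report, but the data has already left the machine by then, so a client-side pass before submission may also be worth considering.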

Implementation Details

Please see a proposed architecture.

Comments

OlavVitters:

Server Side questions
- Backtraces:
  - "C" / gdb stuff?
  - Python?
  - which platforms? Different on non-Linux
Provide some kind of error resolution:
- Tell the user that they aren't running the latest software
- For common crashers, provide custom feedback text. Ideally translated, but doesn't have to be.
- Crash application should check every so often for feedback.

JamesCape:

Throttling + batching + whatnot is also important, to avoid deluging the user with "XYZ crashed" "ABC crashed" "XYZ crashed again" notifications (assume all times are open to discussion):
- Be smart about repeat traces
  - For reported crashes, just ignore (opt-in, per-bugzilla "send a comment on repeat crashes" would possibly be useful)
  - For unreported crashes, be smart about it:
    - If the app was running for less than 3 seconds, throttle
    - If the app crashed that way within the last hour, let them know that it was the same crash (e.g. "Firefox has crashed again. This crash appears to be the same as the one which occurred $reltime ago.")
- Throttle any notifications that occur within 30 seconds of each other (e.g. cascade/hardware-related/etc. failures)
- All throttled/batched/etc. reports should get an omnibus "20 applications have crashed in the last X minutes"-style notification.
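
One possible reading of these throttling rules, as a sketch: the class and field names are hypothetical, the thresholds are the ones suggested above (all open to discussion), and `signature` stands in for whatever identifier the system uses to recognize a repeated backtrace.

```python
from dataclasses import dataclass, field
from typing import Optional

MIN_RUNTIME = 3.0       # seconds; shorter-lived apps are throttled
REPEAT_WINDOW = 3600.0  # seconds; "same crash within the last hour"
QUIET_PERIOD = 30.0     # seconds between user-visible notifications


@dataclass
class Crash:
    app: str
    signature: str   # hypothetical identifier for the backtrace
    runtime: float   # how long the app ran before crashing, in seconds
    time: float      # when the crash happened, in seconds
    reported: bool = False


@dataclass
class Throttler:
    last_seen: dict = field(default_factory=dict)  # signature -> crash time
    last_notification: Optional[float] = None
    batched: list = field(default_factory=list)

    def handle(self, crash):
        """Return a notification text, or None if the crash is suppressed."""
        previous = self.last_seen.get(crash.signature)
        self.last_seen[crash.signature] = crash.time
        if crash.reported:
            return None  # already reported: just ignore
        too_soon = (self.last_notification is not None
                    and crash.time - self.last_notification < QUIET_PERIOD)
        if crash.runtime < MIN_RUNTIME or too_soon:
            self.batched.append(crash)  # held back for the omnibus notice
            return None
        self.last_notification = crash.time
        if previous is not None and crash.time - previous < REPEAT_WINDOW:
            ago = int(crash.time - previous)
            return (f"{crash.app} has crashed again. This crash appears to be "
                    f"the same as the one which occurred {ago} seconds ago.")
        return f"{crash.app} has crashed."

    def flush(self):
        """Return one omnibus notification for all batched crashes, if any."""
        if not self.batched:
            return None
        count = len({c.app for c in self.batched})
        self.batched.clear()
        return f"{count} application(s) have crashed recently."
```

A caller would pass every crash to `handle` and call `flush` on a timer to emit the omnibus notification.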
