Problem Recovery and Reporting

An operating system shell should be able to inform the user about problematic or unrecoverable conditions in the system. This is different from error logging, which serves a system technician and has significantly different goals and usage scenarios: archival, remote operation, aggregation, trending, pattern matching, and report generation. Those are clearly important and useful, but the goals here are much simpler.

Primary Goals

  • Die with dignity
  • Inform and apologize to the user that things have gone awry
  • Stop things before they become unrecoverable / scary / pathological
  • Reset the system to an operational state as quickly as possible

Secondary Goals

  • Allow the user to participate in the improvement of the system
  • When the user kindly permits it, send automatically generated diagnostic feedback to developers

Relevant Art

OS X

Fatal system error

http://upload.wikimedia.org/wikipedia/commons/8/8a/Panic10.6.png

Basic Mode

https://lh4.googleusercontent.com/-QgG4u6YbDvU/Tp841I70QEI/AAAAAAAAXcI/o4L5s54NrAM/s400/Screen%2520Shot%25202011-10-19%2520at%25204.29.22%2520PM.jpg https://lh3.googleusercontent.com/-VOj3HJ39Wrs/Tp841f2WdxI/AAAAAAAAXcM/UKvgqm1rfy0/s400/Screen%2520Shot%25202011-10-19%2520at%25204.34.15%2520PM.jpg https://lh4.googleusercontent.com/-NPxkuBsToUQ/Tp8418Q9rnI/AAAAAAAAXcU/A0-ANt12Jrk/s400/Screen%2520Shot%25202011-10-19%2520at%25204.37.37%2520PM.jpg https://lh3.googleusercontent.com/-Amx5_6j1q1I/Tp842EW7-CI/AAAAAAAAXc8/-pninWk7G9k/s640/Screen%2520Shot%25202011-10-19%2520at%25204.37.56%2520PM.jpg

Developer Mode

https://lh3.googleusercontent.com/-CnBQopMqIGI/Tp84y7IYGkI/AAAAAAAAXbk/8c_M2z_nyhc/s400/Screen%2520Shot%25202011-10-19%2520at%25204.40.38%2520PM.jpg https://lh6.googleusercontent.com/--XyDmTHsPJo/Tp84z36H53I/AAAAAAAAXb0/i-YQFhVcWTI/s400/Screen%2520Shot%25202011-10-19%2520at%25204.45.51%2520PM.jpg https://lh3.googleusercontent.com/-nqA3FerYjrg/Tp840BILLbI/AAAAAAAAXc0/21aIrK3pX-E/s640/Screen%2520Shot%25202011-10-19%2520at%25204.46.03%2520PM.jpg

Notes

  • Server mode does no reporting to the user but probably still logs the event
  • Crashing system services (e.g. cupsd) are not reported to the user in any mode
  • All reports seem to be filed anonymously

Windows 7

http://upload.wikimedia.org/wikipedia/en/thumb/1/13/Windows_Error_Reporting_problem_details.png/640px-Windows_Error_Reporting_problem_details.png

Windows 8

Fatal system error

BSOD

Ubuntu

https://wiki.ubuntu.com/Apport?action=AttachFile&do=get&target=apport-gtk-desktopfile.png https://wiki.ubuntu.com/Apport?action=AttachFile&do=get&target=apport-gtk-report.png Details

Fedora

https://lh5.googleusercontent.com/-tYY_l8yORHc/Tpc3MC-9_jI/AAAAAAAAXGw/WJdKZZSWJVY/s640/Screenshot%2520at%25202011-10-13%252014%253A45%253A25.png https://lh5.googleusercontent.com/-QYt6xNWnDkU/Tpc3L8UqbSI/AAAAAAAAXGk/IBIyYnZ0HEg/s640/Screenshot%2520at%25202011-10-13%252014%253A45%253A44.png

Firefox

http://www.squarefree.com/blogimages/crashreportdialog.png

Chrome

Aw Snap

Twitter

http://upload.wikimedia.org/wikipedia/en/d/de/Failwhale.png

Discussion

Layers of the system

  • Application
  • Session
  • Operating System
    • OS Core Services
    • OS Kernel
    • Boot System

The user must not be exposed to any finer-grained detail about the composition of the system.

In general, trouble in one layer should be handled by the layer that supervises it. Application problems should be handled by the Shell, Shell problems by Core Services (probably GDM or Plymouth), kernel failures by the boot system, and boot system failures by the firmware.
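
As a rough illustration, the hand-offs named above can be written as a lookup table. This is only a sketch: the layer names come from the list above, "Firmware" is added because the text names it as the last resort, and `HANDLED_BY` and `escalation_path` are hypothetical names. No handler is listed for OS Core Services because the text does not name one.

```python
from typing import Dict, List

# Which layer handles trouble in which other layer, as stated in the text.
# "Session" stands for the Shell; "Firmware" is not in the layer list but is
# named as the handler for boot system failures.
HANDLED_BY: Dict[str, str] = {
    "Application": "Session",
    "Session": "OS Core Services",
    "OS Kernel": "Boot System",
    "Boot System": "Firmware",
}


def escalation_path(layer: str) -> List[str]:
    """Return the successive handlers for trouble that starts in `layer`."""
    path: List[str] = []
    while layer in HANDLED_BY:
        layer = HANDLED_BY[layer]
        path.append(layer)
    return path
```

For example, `escalation_path("OS Kernel")` yields the boot system followed by the firmware.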

Forms of trouble

  • Crash
  • Misbehavior
  • Misconfiguration
  • Failure

Crash

An unhandled exception where the process exits abruptly. May be able to generate useful stack traces if debugging symbols are available.

Note: Crashes range from very easy to reproduce and very visible to very difficult to reproduce, e.g. a crash at startup versus one that happens once in a blue moon. If a crash is very easy to reproduce, it will cause agitation when the same dialog(s) pop up again and again.

Misbehavior

The process has stopped responding or has done something that it wasn't supposed to do. In the latter case the action may have been denied, or permitted with a warning. This may involve SELinux or similar. It may be a result of misconfiguration by either a technician or a vendor; the two cases can perhaps be told apart by whether default configuration values were used. Examples include:

  • Process looping and using 100% CPU
  • UI not responding
  • Attempt to access something it wasn't allowed to

Misconfiguration

The process cannot handle the configuration information that it has been provided. This may be in the form of files on disk (/etc) or in a database (dconf). The cause may be the way the program saves settings, the user, a technician, or a vendor. When caused by non-default values, the remedy is often to reset to the default value. When caused by default values, this can be interpreted as a failure.
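
A minimal sketch of that decision, assuming settings can be read as plain key/value mappings. The function names and the remedy strings are hypothetical, and nothing here reflects how dconf or files in /etc are actually accessed.

```python
from typing import Dict, List

Settings = Dict[str, object]


def non_default_keys(current: Settings, defaults: Settings) -> List[str]:
    """Return the keys whose values differ from the shipped defaults."""
    return sorted(
        key for key, value in current.items()
        if key not in defaults or defaults[key] != value
    )


def suggest_remedy(current: Settings, defaults: Settings) -> str:
    """Pick a remedy for a process that rejected its configuration."""
    if non_default_keys(current, defaults):
        # Something was changed from the defaults: offer to reset it.
        return "reset-to-defaults"
    # The defaults themselves are broken: this is a failure, not a misconfiguration.
    return "treat-as-failure"
```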

Failure

An error where the process has no choice but to bail out because it cannot continue. This differs from a Crash in that the program can often provide a specific reason why it cannot continue. Stack traces may also be available if the program aborts with SIGABRT. For an application this may include OS version mismatches. For an OS Shell this may include hardware capability mismatches, etc.
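
The four forms of trouble could be told apart roughly as follows. This is a sketch under stated assumptions: `Observation` and its fields are hypothetical facts a supervising layer might gather, and the ordering of the checks is one possible reading of the definitions above, not a settled design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trouble(Enum):
    CRASH = "crash"
    MISBEHAVIOR = "misbehavior"
    MISCONFIGURATION = "misconfiguration"
    FAILURE = "failure"


@dataclass
class Observation:
    """Hypothetical facts gathered about a troubled process."""
    exited: bool = False                  # the process is no longer running
    stated_reason: Optional[str] = None   # reason the program gave before bailing out
    config_rejected: bool = False         # the program could not handle its configuration
    config_is_default: bool = True        # only default configuration values were in use
    unresponsive: bool = False            # UI not responding, or looping at 100% CPU
    access_denied: bool = False           # e.g. a denial from SELinux or similar


def classify(obs: Observation) -> Optional[Trouble]:
    """Map an observation to one of the four forms of trouble, or None."""
    if obs.config_rejected:
        # Broken defaults count as a failure; broken custom values as misconfiguration.
        return Trouble.FAILURE if obs.config_is_default else Trouble.MISCONFIGURATION
    if obs.exited:
        # A stated reason separates an orderly bail-out from an abrupt crash.
        return Trouble.FAILURE if obs.stated_reason else Trouble.CRASH
    if obs.unresponsive or obs.access_denied:
        return Trouble.MISBEHAVIOR
    return None
```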

Tentative Design

Client Side

It seems that a solution may have a few reporting modes:

  • Normal
  • Developer
  • Managed
  • Unattended

Normal Mode

The default mode, where the user may be informed of trouble conditions in System or Application, prompted to reset settings to their defaults in System or Application, and kindly asked to submit trouble reports. Since sending trouble reports is a secondary goal and resetting the system to an operational state is a primary goal, the sending process must be highly efficient, simple, and clear. The user should not be required to gather details beyond those the system can gather automatically. The user may be permitted to supply additional details about what they were doing at the time, but this should not be required since it conflicts with primary goals and is of doubtful use. The process must not wait for additional debugging information to download; that conflicts with primary goals and is frankly really irritating. The screens shown to the user must not contain technical details like stack traces.

Developer Mode

Same as Normal Mode with these exceptions: the prompts may contain stack trace information, and information about background processes and services may be shown.

Managed Mode

Possibly useful for managed clients or servers. Operation is similar to Normal Mode except that the user is not prompted to submit reports. The user should still be informed of trouble and may be prompted to reset defaults. Administrators may hook into the low-level system to extract details automatically (via push or pull).

We may also be able to use this mode if the user has elected to automatically provide feedback during the initial system setup.

Unattended Mode

Possibly useful for kiosks or similar. Failsafe fallbacks should operate without user intervention. No crash logging or reporting should occur. Failures should not expose system details to passers-by.

Fatal system errors should automatically trigger a restart (up to a certain number of retries).
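
A bounded restart loop might look like the sketch below. `start_session` is a hypothetical callable standing in for whatever actually starts the system, and the default of three retries is an arbitrary placeholder, since the text leaves the number open.

```python
from typing import Callable


def run_with_restarts(start_session: Callable[[], bool], max_retries: int = 3) -> int:
    """Start the session, restarting after each fatal error.

    `start_session` returns True on a clean exit and False on a fatal
    system error. Returns the number of restarts performed, or raises
    RuntimeError once the retry budget is exhausted.
    """
    restarts = 0
    while True:
        if start_session():
            return restarts
        if restarts >= max_retries:
            raise RuntimeError("giving up after %d restarts" % restarts)
        restarts += 1
```

Raising once the budget is spent leaves the next step (for example, a static failure screen) to the caller.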

Private Mode?

Perhaps there should also be a private mode for when the user is doing something that shouldn't be tracked, such as using a web browser in incognito/private mode. In this mode the system would notify the user of problems but make no attempt to report them.
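
The differences between the modes can be summarized as a small policy table. This is only a sketch: the field names are hypothetical, and each flag is one reading of the mode descriptions above, not a decided behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePolicy:
    prompt_to_report: bool   # ask the user to submit a trouble report
    show_stack_traces: bool  # prompts may contain technical details
    collect_reports: bool    # reports may be gathered or sent at all


POLICIES = {
    "normal":     ModePolicy(prompt_to_report=True,  show_stack_traces=False, collect_reports=True),
    "developer":  ModePolicy(prompt_to_report=True,  show_stack_traces=True,  collect_reports=True),
    # Managed: no prompt; administrators extract details automatically.
    "managed":    ModePolicy(prompt_to_report=False, show_stack_traces=False, collect_reports=True),
    # Unattended: no crash logging or reporting, no system details exposed.
    "unattended": ModePolicy(prompt_to_report=False, show_stack_traces=False, collect_reports=False),
    # Private (tentative): notify the user but make no attempt to report.
    "private":    ModePolicy(prompt_to_report=False, show_stack_traces=False, collect_reports=False),
}
```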

Guidelines

attachment:ProblemReporting.pdf

Also see Design/Apps/Oops

Fatal System Error

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-normal.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-normal-notification.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-developer.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-unattended.png

Server Side

A crash reporting server should:

  • Allow anonymous crash report submissions
  • Scrub sensitive user data from reports, removing:
    • the user's real name
    • usernames
    • email addresses
    • social security numbers
    • phone numbers
    • IP addresses
    • document titles
    • user filenames (especially in $HOME)
    • URLs
  • Support filing reports in Bugzilla for developer review
  • Avoid duplicate report filing
  • Support linking crash reports to Bugzilla to allow Developer Mode direct access to bug reports
  • Perform coredump analysis and backtrace generation
  • Support symbol resolution for applications not "packaged" directly by the OS vendor
    • applications may be distributed as bundles and crash server would need a way to find debuginfo for them

Implementation Details

Please see a proposed architecture.

Comments

OlavVitters:

  • Server Side questions
    • Backtraces:
      • "C" / gdb stuff?
      • Python?
      • which platforms? Different on non-Linux
  • Provide some kind of error resolution:
    • Tell the user when they are not running the latest software
    • For common crashers, provide custom feedback text. Ideally translated, but doesn't have to be.
    • Crash application should check every so often for feedback.

JamesCape:

  • Throttling + batching + whatnot is also important, to avoid deluging the user with "XYZ crashed" "ABC crashed" "XYZ crashed again" notifications (assume all times are open to discussion):
    • Be smart about repeat traces
      • For reported crashes, just ignore (opt-in, per-bugzilla "send a comment on repeat crashes" would possibly be useful)
      • For unreported crashes, be smart about it:
        • If the app was running for less than 3 seconds, throttle
        • If the app crashed that way within the last hour, let them know that it was the same crash (e.g. "Firefox has crashed again. This crash appears to be the same as the one which occurred $reltime ago.")
    • Throttle any notifications that occur within 30 seconds of each other (e.g. cascade/hardware-related/etc. failures)
    • All throttled/batched/etc. reports should get an omnibus "20 applications have crashed in the last X minutes"-style notification.
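
The throttling rules above could be sketched as follows. Everything here is an assumption for illustration: the class and field names are hypothetical, `signature` stands in for whatever identifies "crashed that way" (perhaps a hash of the backtrace), timestamps are passed in as plain seconds, and the three constants are the open-to-discussion times from the comment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

MIN_UPTIME = 3.0        # seconds; apps that ran for less than this are throttled
REPEAT_WINDOW = 3600.0  # seconds; the same crash within this window is a repeat
QUIET_PERIOD = 30.0     # seconds required between individual notifications


@dataclass
class Crash:
    app: str
    signature: str  # hypothetical identifier for "crashed that way"
    uptime: float   # seconds the app ran before crashing
    time: float     # when the crash happened, in seconds


@dataclass
class Notifier:
    reported: Set[str] = field(default_factory=set)  # signatures already reported
    last_seen: Dict[str, float] = field(default_factory=dict)
    last_notified: Optional[float] = None
    held_back: List[Crash] = field(default_factory=list)

    def handle(self, crash: Crash) -> Optional[str]:
        """Return the notification text to show now, or None if throttled."""
        if crash.signature in self.reported:
            return None  # already reported: just ignore
        previous = self.last_seen.get(crash.signature)
        self.last_seen[crash.signature] = crash.time
        too_soon = (self.last_notified is not None
                    and crash.time - self.last_notified < QUIET_PERIOD)
        if crash.uptime < MIN_UPTIME or too_soon:
            self.held_back.append(crash)  # saved for the omnibus notification
            return None
        self.last_notified = crash.time
        if previous is not None and crash.time - previous <= REPEAT_WINDOW:
            minutes = int((crash.time - previous) // 60)
            return (f"{crash.app} has crashed again. This crash appears to be "
                    f"the same as the one which occurred {minutes} minutes ago.")
        return f"{crash.app} has crashed."

    def flush(self) -> Optional[str]:
        """Return one omnibus notification for everything held back."""
        if not self.held_back:
            return None
        count = len({c.app for c in self.held_back})
        self.held_back.clear()
        noun = "application has" if count == 1 else "applications have"
        return f"{count} {noun} crashed recently."
```

A caller would invoke `flush()` on a timer so that throttled crashes still surface, just in one batch.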

See Also

Design/OS/ProblemReporting (last edited 2018-02-18 23:43:13 by JohnMcHugh)