OCEOS Operation

Operations basics

OCEOS is provided as an object library, header file(s), and application source files that can be compiled and linked with the standard BCC gcc or clang tools and libraries to provide a real-time multitasking application. OCEOS uses BCC library calls for some system functions. These functions are provided in the BCC board support package for the GR716. The minimum memory footprint for OCEOS is shown in the table below. Core modules occupy less than 10 Kbytes.

OCEOSsize.png

Operations manual

General

OCEOS is structured as a set of managers each providing OCEOS service calls.
The activities supported by these managers are:

Initialization manager : initializing the hardware and OCEOS itself
Task manager : handling tasks and jobs, including scheduling
Interrupt manager : setting up trap handlers
Clock manager : providing time of day, dates, and alarms
Timer manager : interval timing and alarms, timed outputs and jobs
Mutex manager : mutual exclusion management
Semaphore manager : counting semaphore management
Data queue manager : data queue management

The directives provided by these managers are described in the following sections.

Set‐up and initialisation

Setting up OCEOS requires the user to prepare the development environment. At a minimum this requires the installation of compilers, a debug tool, editors, etc.

Getting started

DMON OCEOS Support

OCE added OCEOS support to DMON to help users debug application software. A detailed manual describing the available functionality can be found here.

Cpu Load View Cpu Profile View

Files provided with OCEOS

Unpack the OCEOS software into an appropriate folder on the computer. The folders are as follows:

Unzip folders.jpg

  • Documentation OCEOS User Manual and Quick Reference.
  • include Include files required by the application
  • OCEOS_GCC_LIB oceos.a object library for applications to be built with gcc
  • OCEOS_LLNM_LIB oceos.a object library for applications to be built with llvm (32-bit)
  • OCEOS_LLVM_REX_LIB oceos.a object library for applications to be built with llvm (16-bit REX mode)
  • Sample Project c sources, include files, and make files for an example project
  • UserManualTutorials c sources, include files, and make files for User Manual tutorials and application example
  • Excel configuration file to help produce oceos_config.h can be downloaded from Media:OCEOS_Config.xlsx

The source and build files associated with the example application referred to in this manual are located in the folder <installation-folder>\UserManualTutorials\asw.

asw.h

Simple application header.

asw.c

Simple application example. This file is a self-contained example of an OCEOS application including tasks, mutexes, semaphores, and dataqueues.

oceos_config.h

This is the OCEOS application header file where the developer specifies the number of tasks, mutexes, semaphores, dataqs, log entries, and readyq entries. These numbers are used to allocate the memory required for storing OCEOS variables. The definitions to be specified follow.

Oceos configuration.jpg
Notes:

  1. The application must create the number of tasks, mutexes, semaphores or dataqs specified in this header, otherwise an error will be returned from oceos_init_finish().
  2. LOG_DATA_ARRAY_SIZE_U32S - see section OCEOS Data Areas
  3. FIXED_DATA_ARRAY_SIZE_U32S - see section OCEOS Data Areas
  4. DYN_DATA_ARRAY_SIZE_U32S - see section OCEOS Data Areas
  5. Enumerated types are recommended for names of tasks, mutexes, semaphores and dataqs. This header file contains some examples.
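A minimal sketch of note 5, assuming three tasks. Only t_setup appears in this manual; t_worker, t_idle, DEMO_TASK_COUNT and DEMO_NUMBER_OF_TASKS are hypothetical names used for illustration, and the real configuration macro names are those in oceos_config.h.

```c
#include <assert.h>

/* Hypothetical task names; only t_setup is taken from the manual. */
enum DEMO_TASK_NAME {
    t_setup,         /* first task started by OCEOS */
    t_worker,        /* hypothetical application task */
    t_idle,          /* hypothetical idle task */
    DEMO_TASK_COUNT  /* trailing enumerator equals the number of tasks */
};

/* Hypothetical stand-in for the task count configured in oceos_config.h. */
#define DEMO_NUMBER_OF_TASKS 3

/* Returns 1 when the enumeration and the configured count agree. */
int demo_task_count_matches(void)
{
    return DEMO_TASK_COUNT == DEMO_NUMBER_OF_TASKS;
}
```

Keeping a trailing count enumerator next to the names makes it easier to spot a mismatch between the names used in the code and the number declared in the header.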

oceos_config.c

This file contains the OCEOS data areas and the following functions:

  1. int application_init() This function initialises the oceos data area.
  2. void oceos_on_error(void* ptr) The developer should code this function to implement actions in case of a system error.
  3. void oceos_on_full_log(void* ptr) The developer should code this function, which is called when the system log is two-thirds full.

makefile & makefile.mk

These files are provided to facilitate building the asw application. Some local edits are required to point to the local compiler folders.

Building the example application

Download and install the BCC2 compiler from here.

Use the Makefile provided or use the following commands to compile and link the application.

sparc-gaisler-elf-clang -I. -Iinclude -IC:/opt/bcc-2.1.1-llvm/sparc-gaisler-elf/bsp/gr716/include -qbsp=gr716 -mcpu=leon3 -Wall -c asw.c -o asw.o
sparc-gaisler-elf-clang -I. -Iinclude -IC:/opt/bcc-2.1.1-llvm/sparc-gaisler-elf/bsp/gr716/include -qbsp=gr716 -mcpu=leon3 -Wall -c oceos_config.c -o oceos_config.o
sparc-gaisler-elf-clang asw.o oceos_config.o -o asw-gr716.elf -qbsp=gr716 -mcpu=leon3 -qsvt -qnano -Wall -LC:/projects/oce/oceos/trunk/oceos_product/oceoc/build/ -loceos -Wl,-Map=asw-gr716.map

Notes:

  1. sparc-gaisler-elf-clang Ensure the path to the compiler has been configured. The llvm clang compiler is recommended as the OCEOS library was compiled with it.
  2. -I. -Iinclude -I<bcc-install-folder>/bcc-2.1.1-llvm/sparc-gaisler-elf/bsp/gr716/include This -I option allows paths to include files to be passed to the compiler.
  3. -qbsp=gr716 This option specifies the compiler gr716 board support package.
  4. -mcpu=leon3 This option specifies the processor type.
  5. -Wall Display all warning messages.
  6. -c Compile only (linking not required).
  7. -o <filename> Specify the output filename.
  8. -qsvt Use the single-vector trap model described in “SPARC-V8 Supplement, SPARC-V8 Embedded (V8E) Architecture Specification”.
  9. -qnano Use a version of the newlib C library compiled for a reduced footprint. The nano implementations of the fprintf() and fscanf() families of functions are not fully C standard compliant. Code size can decrease by up to # KiB when printf() is used.
  10. -L<path-to-oceos-library-file> Specify the path where liboceos.a is located.
  11. -loceos Instructs the linker to use the library liboceos.a.
  12. -Wl,-Map=asw-gr716.map Instructs the linker to generate a map file with the name asw-gr716.map. This optional map file details the memory map of all functions and data.
  13. Consult the BCC manual for other options.

The application executable (in this case asw-gr716.elf) can now be downloaded and executed on the target GR716 hardware. A debug tool such as DMON or GRMON is required for this. DMON provides specific support for OCEOS.

Important Note: Up to BCC version 2.1.2, bcc_inline.h needs to be changed because it does not function as expected on the GR716. The file is located in …\bcc-2.x.x-llvm\sparc-gaisler-elf\bsp\gr716\include\bcc\ for llvm and …\bcc-2.x.x-gcc\sparc-gaisler-elf\bsp\gr716\include\bcc for gcc.

/* OCEOS comment out until BCC fix
#ifdef __BCC_BSP_HAS_PWRPSR
#include <bcc/leon.h>
static inline int bcc_set_pil_inline(int newpil)
{
        uint32_t ret;
        __asm__ volatile (
              "sll %4, %1, %%o1\n\t"
              "or %%o1, %2, %4\n\t"
              ".word 0x83882000\n\t"
              "rd %%psr, %%o1\n\t"
              "andn %%o1, %3, %%o2\n\t"
              "wr %4, %%o2, %%psr\n\t"
              "and %%o1, %3, %%o2\n\t"
              "srl %%o2, %1, %0\n\t"
                : "=r" (ret)
                : "i" (PSR_PIL_BIT), "i" (PSR_ET), "i" (PSR_PIL), "r" (newpil)
                : "o1", "o2"
        );
        return ret;
}
#else 
*/
#include <bcc/leon.h>
static inline int bcc_set_pil_inline(int newpil)
{
        register uint32_t _val __asm__("o0") = newpil;
        /* NOTE: nop for GRLIB-TN-0018 */
        __asm__ volatile (
                "ta %1\nnop\n" :
                "=r" (_val) :
                "i" (BCC_SW_TRAP_SET_PIL), "r" (_val)
        );
        return _val;
} 
/* For OCEOS 
#endif
*/

If bcc_inline.h is not modified, traps may be left disabled or, in REX mode, an illegal opcode trap is generated. [GRLIB]

Mode selection and control

OCEOS is provided as an object library with header files for applications wishing to use it. There are two primary modes of operation on the GR716 as follows:

  • SPARC V8 32-bit Mode
    32-bit mode is the standard instruction word size on the SPARC V8. An OCEOS object library is provided for 32-bit mode (gcc & llvm)
  • REX 16-bit mode
    REX mode allows the GR716 to execute 16-bit instructions. Applications running in REX mode occupy a smaller memory footprint. The compiler option -mrex should be used, with -Oz for the smallest footprint.

It is an application design choice as to which mode is used.

Normal operations

This section describes the OCEOS directives used in most applications. See file asw.c for an example.

The calls to get OCEOS running are:

  1. application_init – uses oceos_init to initialise fixed data and start system time and log
  2. oceos_task_create – create a task, setting priority, number of jobs, etc.
  3. oceos_init_finish – initialise dynamic data area
  4. oceos_start – start the scheduler and pass control to first task

After steps 1 to 4 above tasks implement application functionality. If mutexes, semaphores, dataqs and/or timed actions are required they also are created at step 2.
Note: it is mandatory to create the number of mutexes, semaphores, and dataqs declared otherwise oceos_init_finish() will return an error.
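The rule in the note above can be sketched as a small host-side model. This is not OCEOS code: struct demo_counts, the demo_ functions and the DEMO_ constants are hypothetical names, and the real check is performed inside oceos_init_finish().

```c
#include <assert.h>

/* Hypothetical counts, standing in for the values declared in oceos_config.h. */
#define DEMO_TASKS_DECLARED   2u
#define DEMO_MUTEXES_DECLARED 1u

/* Number of objects the application has created so far. */
struct demo_counts {
    unsigned tasks_created;
    unsigned mutexes_created;
};

/* Stand-ins for the bookkeeping done when a create directive succeeds. */
void demo_task_created(struct demo_counts *c)  { c->tasks_created++; }
void demo_mutex_created(struct demo_counts *c) { c->mutexes_created++; }

/* Returns 0 when created counts equal declared counts, -1 otherwise. */
int demo_init_finish_check(const struct demo_counts *c)
{
    if (c->tasks_created != DEMO_TASKS_DECLARED) {
        return -1;
    }
    if (c->mutexes_created != DEMO_MUTEXES_DECLARED) {
        return -1;
    }
    return 0;
}
```

In this model the check fails after only one of the two declared tasks has been created, and passes once both tasks and the mutex exist.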

application_init()

Application init.jpg

oceos_task_create (…)

Oceos task create.jpg

oceos_task_create prepares a task for execution. In the case above the parameters are:

  • t_setup is an integer defined in the enumerated type in oceos_config.h. t_setup is used to identify this task to OCEOS thereafter.

Enum task name.jpg

  • 3 is the priority allocated to this task. 1 is the highest and 254 the lowest priority.
  • 3 is the priority threshold. The scheduler will only allow tasks with a priority higher than this to pre-empt task t_setup.
  • 1 is the maximum number of jobs for task t_setup. 15 is the maximum number of jobs that can be ready-to-run for any task.
  • 0 indicates that floating point is not used by this task. Enabling floating point for a task causes the scheduler to store all floating point registers on the stack when the job is pre-empted. If the job is not using floating point this causes inefficiency.
  • 1 causes the task to be enabled when OCEOS is started.
  • setup_start is the name of the function to be executed for this task. The code for setup_start is in asw.c and the function declared in asw.h.

Setup start.jpg

  • setup_end is the function called when task t_setup terminates.

Setup end.jpg

  • 0 instructs the scheduler that there is no maximum execution time for this task. If a value (in microseconds) is specified the scheduler will raise an error if the task exceeds this execution time.
  • 0 instructs the scheduler that there is no minimum time between executions of this task. If a time in microseconds is specified the scheduler will raise an error if this value is violated.
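The priority and threshold parameters above can be illustrated with a short sketch. It assumes only what the list states: 1 is the highest priority, 254 the lowest, and a task is pre-empted only by tasks whose priority is higher than its threshold. The demo_ names are hypothetical and are not part of the OCEOS API.

```c
#include <assert.h>

#define DEMO_PRIORITY_HIGHEST 1u    /* numerically smallest value */
#define DEMO_PRIORITY_LOWEST  254u  /* numerically largest value */

/* Returns 1 if the priority lies in the range given in the manual. */
int demo_priority_valid(unsigned priority)
{
    return priority >= DEMO_PRIORITY_HIGHEST && priority <= DEMO_PRIORITY_LOWEST;
}

/*
 * Returns 1 if a ready task may pre-empt the running task.
 * A lower number means a higher priority, so the candidate must be
 * numerically below the running task's pre-emption threshold.
 */
int demo_can_preempt(unsigned candidate_priority, unsigned running_threshold)
{
    return candidate_priority < running_threshold;
}
```

With the threshold of 3 used for t_setup, a priority 2 task could pre-empt it in this model, while tasks of priority 3 or 4 could not.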

oceos_init_finish()

This function must be called before OCEOS is started. It completes the initialisation of OCEOS data for the application.

Oceos init finish.jpg

oceos_start()

This function starts task scheduling in OCEOS. Control should never return from this function.

Oceos start.jpg

The parameters passed to oceos_start are:

  • fixed_data is a pointer to the start of the fixed data area. It is declared in oceos_config.c.
  • t_setup is the ID of the task to be set executing when OCEOS starts (in oceos_config.h).
  • Optional pointer that will be passed to the task started, here t_setup.

oceos_task_start(..)

This directive starts an instance of a task, called a job in OCEOS. Here 'start' means that an instance of the task, a job, is put on the ready-to-run queue and a reschedule is performed.

If it is the highest priority job ready-to-run, it will be set running by the scheduler.

Oceos task start.jpg

The parameters passed to it are the task ID and a pointer. It returns an integer indicating the status of the call.

In many cases the first task to be run by OCEOS will perform application initialisation and then start the other application tasks including the idle task.

oceos_mutex_create(..)

This directive creates a mutex. It must be called before oceos_init_finish(). Parameters passed are the mutex ID and the priority ceiling for the mutex.

The priority ceiling is the priority of the highest task that uses the mutex.

Note that the user must determine this priority ceiling as part of the application design. OCEOS relies on the developer to specify the correct priority ceiling.

Oceos mutex create.jpg
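Since the developer must work out the priority ceiling by hand, a small helper can document the calculation. The sketch below assumes the definition given above (the ceiling is the priority of the highest priority task using the mutex, with 1 the highest priority). demo_priority_ceiling is a hypothetical helper, not an OCEOS directive.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the priority ceiling for a mutex, given the priorities of all
 * tasks that use it. A lower number is a higher priority, so the ceiling
 * is the numerically smallest entry. Returns 254 (lowest priority) when
 * the list is empty.
 */
unsigned demo_priority_ceiling(const unsigned *user_priorities, size_t count)
{
    unsigned ceiling = 254u;

    for (size_t i = 0; i < count; i++) {
        if (user_priorities[i] < ceiling) {
            ceiling = user_priorities[i];
        }
    }
    return ceiling;
}
```

For example, a mutex shared by tasks of priority 5, 3 and 7 would be given a ceiling of 3.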

oceos_mutex_wait (..)

This directive assigns the mutex to the calling job. It should always be successful because a job should only be set running if the mutex is available to it.

Oceos mutex wait.jpg

The mutex ID is the only parameter.

oceos_mutex_signal (..)

This directive releases a mutex from a job. It will cause a reschedule, and if a higher priority job has been waiting on the mutex that job will start, causing this job to be pre-empted.

Oceos mutex signal.jpg

The only parameter to the directive is the mutex ID.

oceos_sem_create(..)

This directive creates a counting semaphore.

Oceos sem create.jpg

  • Semaphore ID (0 to 62)
  • Maximum number of permits (1 to 4094)
  • Starting number of permits (1 to 4094)
  • Maximum number of jobs on this semaphore queue (1 to 254)

oceos_sem_wait_restart (..)

This directive decrements the semaphore and the job continues if the semaphore is greater than zero.

If the semaphore is equal to zero then the job is put on a queue until the semaphore is incremented (signalled), at which time the job is restarted.

Note that the job is restarted from scratch and not just continued from the directive call.

Oceos sem wait restart.jpg

The parameters are:

  • Semaphore ID (0 to 62)
  • Timeout in microseconds (64-bit integer, long long)
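The wait behaviour described above can be modelled on a host as follows. This is a simplified sketch, not the OCEOS implementation: the struct, the result codes and demo_sem_wait are hypothetical, timeouts are ignored, and the restart of queued jobs on a signal is not modelled.

```c
#include <assert.h>

/* Hypothetical outcome codes for the model. */
enum demo_sem_result {
    DEMO_SEM_CONTINUE,   /* permit taken, job carries on */
    DEMO_SEM_QUEUED,     /* no permit, job queued to be restarted later */
    DEMO_SEM_JOBS_FULL   /* no permit and the pending queue is full */
};

/* Minimal model of a counting semaphore and its pending job queue. */
struct demo_sem {
    unsigned permits;      /* current number of permits */
    unsigned max_permits;  /* maximum number of permits */
    unsigned pending;      /* jobs currently queued on the semaphore */
    unsigned max_pending;  /* maximum number of queued jobs */
};

/* Models a wait: take a permit if one is available, otherwise queue the job. */
enum demo_sem_result demo_sem_wait(struct demo_sem *s)
{
    if (s->permits > 0u) {
        s->permits--;
        return DEMO_SEM_CONTINUE;
    }
    if (s->pending < s->max_pending) {
        s->pending++;
        return DEMO_SEM_QUEUED;
    }
    return DEMO_SEM_JOBS_FULL;
}
```

Because a queued job is restarted from scratch rather than resumed, any work done before the wait call is repeated, which is worth bearing in mind when placing the call in a task.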

oceos_sem_signal (..)

This directive increments (signals) the specified semaphore.

Oceos sem signal.jpg

The semaphore ID is the sole parameter.

oceos_dataq_create (..)

This directive creates a data queue with the specified ID.

Oceos dataq create.jpg

  • Data Queue ID (0 to 62)
  • Data Queue Size (1 to 255)
  • Data Queue maximum number of pending jobs (1 to 255)
  • Data Queue roll-overs enable/disable (0 or 1)

oceos_dataq_read_restart(..)

This directive reads a pointer from the specified data queue. If the data queue is empty the task is put on a queue to be restarted when a new pointer has been added to the queue.

A timeout in microseconds can be set, after which the task will not be restarted.

Oceos dataq read restart.jpg

  • Data Queue ID (0 to 62)
  • Timeout in microseconds (64-bit integer)

oceos_dataq_write(..)

This directive puts a pointer on the specified data queue.

Oceos dataq write.jpg

Parameters are:

  • Data queue ID (0 to 62)
  • Pointer to be put on queue (void *ptr)

Normal termination

It is not normal for OCEOS to terminate. Application tasks usually continue to execute without stopping. If all tasks terminate the system goes to sleep mode and only an interrupt can restart activity.

Error conditions

Error conditions are handled in 4 different ways by OCEOS as follows:

  • Directive return codes allowing the application to take appropriate action
  • The System Log keeps a record of problems and when they occurred
  • The System State variable indicates the current state of OCEOS
  • The watchdog provides a fallback reset in unpredictable situations

Directive return codes

OCEOS is driven by directives described in the Reference section below. All directives return a status to be checked by the application. The directive return status is either success or a code indicating why it failed. The table below is taken from oceos.h. It is the responsibility of the application to take appropriate action if a directive call fails.

Enum directive status.jpg

The System Log

When a problem arises, either in application code or in OCEOS itself, it is useful to have a record of

  1. when it happened
  2. what kind of problem it was
  3. where it happened.

The primary purpose of the System Log is to provide this ‘when’, ‘what’, and ‘where’ information.

It does so as follows:

when: the low 32-bits of the system time are automatically recorded when a log entry is made.

what: an 8-bit number allows 256 different types of entry to be specified. Half of these entry types (0x80 to 0xff) are reserved for internal use in OCEOS. The other 128 entry types (0 to 0x7f) are assigned their meanings by the application. (An OCEOS internal code reserved for that purpose is used if an application attempts to use an OCEOS internal code.)

where: a 24-bit number is included in each log entry. OCEOS itself uses this to indicate the OCEOS function where the log entry call was made and other information. Application code can use this number in whatever way is appropriate for the application.

The log is located in the log area. This area starts at an address specified by the application developer, begins with a 32-bit constant sentinel value followed by a 32-bit count of the number of 32-bit words making up the area (including the sentinels) and ends with a 32-bit constant sentinel.
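The layout just described (leading sentinel, word count, trailing sentinel) lends itself to a simple integrity check. The sketch below assumes that layout only; DEMO_SENTINEL is a hypothetical value, as the real constant is defined in oceos.h, and demo_area_intact is not an OCEOS directive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel value; the real constant is given in oceos.h. */
#define DEMO_SENTINEL 0xA5A5A5A5u

/*
 * Returns 1 if the area starts with the sentinel, holds a plausible word
 * count in its second word, and ends with the sentinel; 0 otherwise.
 * capacity_words is the number of 32-bit words reserved for the area.
 */
int demo_area_intact(const uint32_t *area, uint32_t capacity_words)
{
    if (capacity_words < 3u) {
        return 0;               /* too small to hold both sentinels and a count */
    }
    if (area[0] != DEMO_SENTINEL) {
        return 0;               /* leading sentinel corrupted */
    }
    uint32_t words = area[1];   /* count includes both sentinels */
    if (words < 3u || words > capacity_words) {
        return 0;               /* count corrupted */
    }
    return area[words - 1u] == DEMO_SENTINEL;
}
```

A check of this kind could be run by the application after a restart, before it relies on log contents preserved in external memory.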

As frequent access to the log area is not likely to be needed the log area can be located in memory external to the GR716. The log area contents will then usually be preserved across system resets, and across system power on/off cycles if the external memory is non-volatile.

The log is structured as a circular buffer, with its read and write indices also stored in the log area. It is set up initially by oceos_init(). This does not overwrite the data currently in the log but resets the read and write indices to be equal and, as far as can be determined, positioned after the most recent log entry. As the log indices are equal, the log now appears empty when an attempt is made to read it in the usual way, but a special directive allows the application program to read the data at any position in the log. Another directive allows the application to reset all log entries.

Any system that uses OCEOS must have a system log. The number of entries in the log is specified in the system configuration structure, and must be within the specified range or 0, in which case a default value is used.

The header file oceos.h gives the structure of the log area and the values of the associated constants.

struct log_entry{
 U32_t time32;
 unsigned int entry_type :8;
 unsigned int entry_comment :24;
};
/*
* Log entry types used when creating a log entry (8 bits, from 0 to 255 only)
*/
enum LOG_ENTRY_TYPE{
        NOT_VALID_ENTRY,
        ALL_OK,
        NO_JOB_FREE,
        READYQ_FULL,
        MUTEX_WAIT_REPEAT,
        MUTEX_SIGNAL_REPEAT
};
// add a log entry (32-bit time is included automatically)
enum DIRECTIVE_STATUS oceos_log_add_entry(
enum LOG_ENTRY_TYPE,// 8 bits
const unsigned int // information (24 bits)
);
// read and remove the oldest unread log entry
enum DIRECTIVE_STATUS oceos_log_remove_entry(struct log_entry * const);

// read the log entry at the given index
enum DIRECTIVE_STATUS oceos_log_get_indexed_entry(
const unsigned int,
struct log_entry * const
);
// clear all log entries and reset to empty
enum DIRECTIVE_STATUS oceos_log_reset();
// get the number of log entries
unsigned int oceos_log_get_size();
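As a sketch of the 'what' and 'where' fields described earlier, the helpers below classify an 8-bit entry type by the ranges stated in this section and trim an information value to 24 bits. The demo_ function names are hypothetical and are not part of oceos.h.

```c
#include <assert.h>

/* Returns 1 if the entry type is in the application range (0 to 0x7f). */
int demo_log_type_is_application(unsigned type)
{
    return type <= 0x7fu;
}

/* Returns 1 if the entry type is in the range reserved for OCEOS (0x80 to 0xff). */
int demo_log_type_is_oceos(unsigned type)
{
    return type >= 0x80u && type <= 0xffu;
}

/* Keeps only the low 24 bits, the width of the entry_comment field. */
unsigned demo_log_comment(unsigned info)
{
    return info & 0x00ffffffu;
}
```

An application could use checks like these before calling oceos_log_add_entry, so that an out-of-range type or an oversized information value is caught in application code.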

The System State variable

OCEOS maintains a system state variable which shows the current state of the system. States are shown below. If the system state deviates from normal (value 0) the user defined function oceos_on_error(..) is called by OCEOS. The source code for this function is located in oceos_config.c and should be written by the application developer to take appropriate action.

Directives oceos_system_state_get and oceos_system_state_set are provided to allow the application to access the system state variable.

void oceos_on_error(void* ptr) {
 // Application dependent error handling code should be included here
 return;
}
/*
* The 32-bit system state variable indicates the current system state
*/
#define STATUS_NORMAL 0u // System State Normal
#define STATUS_EDAC_INTERNAL 1u // Uncorrectable error in internal memory
#define STATUS_EDAC_EXTERNAL 2u // Uncorrectable error in external memory
#define STATUS_MEM_PROT_1 4u // Invalid access memory protection unit 1
#define STATUS_MEM_PROT_2 8u // Invalid access memory protection unit 2
#define STATUS_DISABLED_TASK_START 0x10u // An attempt to start a disabled task
#define STATUS_TASK_JOB_LIMIT_OVER 0x20u // An attempt to execute a task when its jobs limit is already reached.
#define STATUS_JOB_OVER_TIME 0x40u // Job time from creation to completion exceeds allowed maximum for a task.
#define STATUS_JOB_INTERVAL_SHORT 0x80u // Minimum time between job creations is less than the allowed minimum for task
#define STATUS_READYQ_FULL 0x100u // Ready queue unable to accept job as result of being full
#define STATUS_MUTEX_ALREADY_HELD 0x200u // Mutex wait() when mutex already held
#define STATUS_MUTEX_NOT_HELD 0x400u // Mutex signal() when not already held
#define STATUS_MUTEX_NOT_RETURNED 0x800u // Mutex not returned before job terminates
#define STATUS_SEMAPHORE_JOBS_FULL 0x1000u // Attempt to add job to semaphore pending list when list full
#define STATUS_DATAQ_FULL 0x2000u // Data queue write when queue already full
#define STATUS_DATAQ_JOBS_FULL 0x4000u // Attempt to add job to queue pending list when list full
#define STATUS_TIMED_JOBS_FULL 0x8000u // Timed jobs queue write when queue already full
#define STATUS_TIMED_JOB_LATE 0x10000u // Timed jobs queue late job transfer to scheduler
#define STATUS_TIMED_OUTPUT_FULL 0x20000u // Timed output queue write when queue already full
#define STATUS_TIMED_OUTPUT_LATE 0x40000u // Timed output late
#define STATUS_BAD_LOG 0x80000000u // System Log corruption
#define STATUS_INVALID 0xffffffffu // Invalid system state
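Because most of the values above are single-bit flags, several problems can be recorded in the state variable at once and tested individually. The sketch below copies three values from the listing under DEMO_ names to stay self-contained; demo_state_has and demo_state_count_flags are hypothetical helpers, and the special value STATUS_INVALID is not treated separately.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as STATUS_READYQ_FULL, STATUS_MUTEX_NOT_HELD and STATUS_DATAQ_FULL. */
#define DEMO_STATUS_READYQ_FULL    0x100u
#define DEMO_STATUS_MUTEX_NOT_HELD 0x400u
#define DEMO_STATUS_DATAQ_FULL     0x2000u

/* Returns 1 if the given flag is set in the state word. */
int demo_state_has(uint32_t state, uint32_t flag)
{
    return (state & flag) != 0u;
}

/* Counts how many flag bits are set in the state word. */
unsigned demo_state_count_flags(uint32_t state)
{
    unsigned count = 0u;

    while (state != 0u) {
        count += state & 1u;
        state >>= 1;
    }
    return count;
}
```

In an application the state word would typically come from oceos_system_state_get, for instance inside oceos_on_error.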

The watchdog

The GR716 provides watchdog functionality (ref: GR716 Advanced Data Sheet and User’s Manual). OCEOS provides functions to allow applications to use this watchdog functionality (see section 10 for more details). It is expected that OCEOS applications using the watchdog have a low priority task to continually reset the watchdog timer using the directive oceos_system_watchdog_reset(..). If the watchdog reaches its timeout value this is an indication that a severe error has occurred and a hardware reset is generated by the watchdog unit.

The application when rebooted can optionally check the system log, and last system state to determine the cause of the watchdog reboot and take appropriate action.

Hardware errors

Hardware errors can come from a number of sources including the following:

  • Uncorrectable memory errors
  • Memory protection violations
  • Brown-out or low power

The OCEOS philosophy to dealing with these and other hardware faults causing interrupts is to provide directives enabling the application to take appropriate actions. OCEOS directives available for application recovery are:

  • oceos_interrupt_handle_register This directive may be used to setup application recovery interrupt service routines.
  • oceos_exit This directive may be called to exit OCEOS back to the application for appropriate recovery actions.
  • oceos_log_get_indexed_entry This directive may be used as part of the application fault analysis.
  • oceos_system_state_get This directive allows the application to obtain the system status to determine the appropriate recovery action.

Overload and system failure

Overloads and system failure are different but have some common features. Both can lead to an unexpected system restart, and to the question ‘what caused that?’. Both can sometimes be anticipated if warning signs are noticed. It may even be possible to avoid them by taking appropriate action.

Some thoughts and suggestions about them and OCEOS follow below.

Overload

Overload generally refers to insufficient CPU time to handle a situation. It can also refer to a memory shortage, for example insufficient stack space.

Results in:

  1. A task missing a soft deadline - needs to be noted
  2. A task missing a hard deadline - may be catastrophic
  3. Watchdog timeout - hardware reset
  4. Inability to start a task due to memory shortage

Typical causes:

  1. Interrupts more often than expected causing task completion to be delayed
  2. Task pre-empted more often than expected causing task completion to be delayed
  3. Task start requests more frequent than expected
  4. Inappropriate allocation of task priorities
  5. One or more tasks using more than the expected CPU time (should not happen)
  6. Holding one or more mutexes for too long
  7. Using mutexes in a non-nested manner
  8. Insufficient stack space
  9. (With OCEOS unbounded priority inversion cannot occur.)
  10. (With OCEOS chained blocking cannot occur)
  11. (With OCEOS deadlocks cannot occur)

Information available for comparison to expected values to detect potential overload:

  1. Task maximum duration – job creation time to termination time
  2. Task maximum CPU time
  3. Task minimum time between start requests
  4. Task number of times pre-empted
  5. Task maximum current jobs for a task
  6. Watchdog timer nearly timed out when reset
  7. Stack pointer approaching limit of allowed range
  8. Et alia

Run time avoidance actions:

  1. Disable certain interrupts
  2. Disable a task or tasks
  3. Kill currently executing job

System failure

System failure generally refers to the CPU ceasing to execute instructions.

Can also be a reference to the operating system (OCEOS) having to exit.

Usually associated with hardware problems, but software can do it too.

Results in:

  1. OCEOS exit (in some cases)
  2. CPU stops execution (in some cases)
  3. A subsequent restart (hopefully)

Typical causes:

  1. Critical OCEOS data corrupted
  2. Un-correctable memory error
  3. Memory hardware failure
  4. Power failure
  5. Environment more demanding than design anticipated causing hardware problems

Warning signs that can indicate something wrong at system level:

  1. Sentinels on OCEOS critical data areas corrupted
  2. Stack pointer approaching limit of range
  3. Corrected memory error counts
  4. Uncorrected memory error counts
  5. Voltage levels - brown out detection interrupt

Run time avoidance actions:

  1. Use EDAC scrubber units to continually regenerate all RAM
  2. Use MEMPROT unit to restrict OCEOS fixed data write access to the scrubber only

Current System State

Before discussing reports it seems appropriate to describe how the system state is stored and accessed.

Current System State - OCEOS and the Application software:

For OCEOS this is given by its fixed data area, its dynamic data area, and its log data area. Also by some heap items and the stack and registers, including the PC. The latter are also relevant to the application state.

The locations and sizes of the three OCEOS areas are determined by the application. Their lay-outs are known. A meta-data structure in the fixed area provides pointers to the principal fields within each data area and simplifies access by OCEOS and by the application.

The three OCEOS areas are available for inspection after OCEOS exits and before it restarts as well as while it is running. This helps after a restart in assessing what went wrong, as well as allowing an application to check for lurking problems while OCEOS is running.

(OCEOS does not allow tasks etc. to be created once scheduling has begun, and as a result does not need to use dynamic memory allocation. The memory areas it uses can thus be fixed in advance, their locations and sizes determined by the application.)

The task information items mentioned above are automatically updated by OCEOS in the dynamic data area and can be accessed by the application at any time.

The OCEOS system state variable contains flags that indicate that a certain problem has occurred. It is automatically updated by OCEOS using ‘OR’ to avoid losing information; typically a user defined function is then called to deal with the problem.

Two sets of flags are kept as part of the system state variable. One accumulates indicators of all problems that have occurred, and typically is reset by the application only after a restart. The other indicates current problems, and typically is reset by the user defined problem handling function.
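The two flag sets can be pictured with the following sketch. The struct layout and function names are hypothetical and only illustrate the OR-based update described above; the actual storage format is defined by OCEOS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two flag sets in the system state variable. */
struct demo_system_state {
    uint32_t accumulated;  /* every problem seen since the last full reset */
    uint32_t current;      /* problems not yet dealt with by the handler */
};

/* Records a problem in both sets using OR, so earlier flags are kept. */
void demo_state_raise(struct demo_system_state *s, uint32_t flag)
{
    s->accumulated |= flag;
    s->current |= flag;
}

/* Clears a flag from the current set only, as a problem handler might. */
void demo_state_clear_current(struct demo_system_state *s, uint32_t flag)
{
    s->current &= ~flag;
}
```

After a handler clears a flag from the current set, the accumulated set still shows that the problem occurred, which is the information wanted after a restart.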

The OCEOS system log records when, what, and in some cases where something occurred. It is automatically written to by OCEOS in certain conditions and can also be read and written by the application. (Codes for what from 0x80 to 0xff are reserved for use by OCEOS, codes 0 to 0x7f are available for use by the application. At present only 24 bits are used to indicate where, this can be expanded to e.g. include the PC or stack pointer.)

At present the log area contains the system state variable as well as the system log. The system state variable could equally well be stored in the dynamic data area.

Current System State - Hardware:

Registers

The GR716 has 504 32-bit IU registers, organized in 31 windows. In addition there are the usual processor registers PC, SP, Y, %psr, %wim and %tbr, and ancillary state registers %asr16, %asr17, %asr20, %asr22-23, %asr24-31. There are also 32 floating point registers, each 32-bits.
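The figure of 504 follows from the SPARC V8 register window scheme, in which each window contributes 16 distinct registers (8 locals plus 8 ins, with the outs shared with the neighbouring window) on top of 8 global registers. A small check of that arithmetic, using hypothetical names:

```c
#include <assert.h>

enum {
    DEMO_WINDOWS         = 31, /* register windows, from the text above */
    DEMO_REGS_PER_WINDOW = 16, /* 8 locals + 8 ins; outs overlap the next window's ins */
    DEMO_GLOBALS         = 8   /* global registers shared by all windows */
};

/* Returns the total number of integer unit registers. */
unsigned demo_iu_register_count(void)
{
    return (unsigned)(DEMO_WINDOWS * DEMO_REGS_PER_WINDOW + DEMO_GLOBALS);
}
```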

Other units with registers that have a key role in determining the current hardware state include the interrupt controller, the clock gating units, the EDAC and memory scrubber units, and the statistics unit.

After a reset registers may have been automatically reset to some initial value and their previous content lost.

EDAC

The internal data and instruction RAMs and the external RAM have individual EDAC and scrubber units.

Normally the scrubber units will be configured to cycle through the memory, reading each location and writing it back if an error is detected, thus helping reduce the likelihood of a double bit error.

If a double bit error is detected, the scrubber units can be configured to halt with the current address in a scrubber register and generate an interrupt.

The two scrubbers for the internal memory each have a 4-bit counter (ECNT) for the number of single bit errors corrected. This can be reset by writing 0 to it. Correctable errors do not cause an interrupt.

The external scrubber unit counts both correctable and un-correctable errors it has encountered, and allows thresholds to be set for each of these so that an interrupt is generated if either threshold is exceeded.

EDAC errors give rise to an interrupt on interrupt line 63 which is mapped by default to interrupt 19. A handler for this interrupt needs to be put in place by the ASW as what should be done depends on the application.

If set up appropriately, the EDAC system can also be used to get an indication of background radiation levels by periodically checking the number of single bit errors that have been corrected in a given time period. Again, this should be done, if required, by the ASW.

Reporting

Two main types of report

  1. Current system state, including checking for potential problems
  2. System state when a problem has occurred
    1. Non-fatal problem
    2. Short death report
    3. Long death report

Reporting the Current System State

As all OCEOS data is readily available, and the application itself is running in kernel/supervisor mode and is able to access all registers, there is no obvious help required from OCEOS to enable an application to analyse the current state of the system whenever it wishes and report on this.

A different question is what conditions OCEOS itself should monitor and bring to the attention of the application so that the application can take action and perhaps prevent the emergence of a problem.

The standard OCEOS mechanism for bringing an anomaly to the attention of the application is to call a user defined problem handling function after updating the system state variable and making a system log entry.


The question then becomes in what circumstances OCEOS should do this.

OCEOS will always notify the application if it detects a problem such as a task using a mutex whose ceiling is lower than the task’s priority (an application configuration error), or a task exiting while holding a mutex.

When a task is created, a deadline for the task (the number of microseconds after the start request by which it should have completed) can be specified. OCEOS can automatically notify the application if a task has missed its deadline or, alternatively, if the task completed dangerously close to its deadline.

When a task is created, an expected minimum time between task start requests can be specified. OCEOS can automatically notify the application if a start request occurs sooner than this or, alternatively, if the time between start requests was too close to the expected value.
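The two timing checks above amount to comparing a measured time with a configured limit and a warning margin. The following is a minimal sketch of that comparison only; the type and function names and the idea of a single margin parameter are assumptions made for illustration, not the OCEOS API.

```c
#include <stdint.h>

/* Hypothetical classification of a timing measurement. */
typedef enum {
    TIMING_OK,          /* comfortably within the limit */
    TIMING_NEAR_LIMIT,  /* within the limit but inside the warning margin */
    TIMING_VIOLATION    /* limit exceeded */
} timing_status_t;

/* elapsed_us: time from start request to completion.
 * deadline_us: configured deadline. margin_us: warning margin (assumed). */
static timing_status_t check_deadline(uint32_t elapsed_us,
                                      uint32_t deadline_us,
                                      uint32_t margin_us)
{
    if (elapsed_us > deadline_us) {
        return TIMING_VIOLATION;          /* deadline missed */
    }
    if (deadline_us - elapsed_us < margin_us) {
        return TIMING_NEAR_LIMIT;         /* completed close to the deadline */
    }
    return TIMING_OK;
}

/* gap_us: measured time between two start requests.
 * min_gap_us: configured minimum time between start requests. */
static timing_status_t check_start_gap(uint32_t gap_us,
                                       uint32_t min_gap_us,
                                       uint32_t margin_us)
{
    if (gap_us < min_gap_us) {
        return TIMING_VIOLATION;          /* request arrived too soon */
    }
    if (gap_us - min_gap_us < margin_us) {
        return TIMING_NEAR_LIMIT;         /* request close to the minimum */
    }
    return TIMING_OK;
}
```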

OCEOS can check the state of the system when it does a context switch, or when any of its directives are used, or when it performs a timed action. (Unlike many operating systems OCEOS does not depend on a regular timer tick interrupt.)

Reporting System State after a problem has occurred

  1. If the problem is non-fatal, so that OCEOS can continue, when OCEOS detects the problem it will call the user defined problem handling function after updating the system state variable and system log.
  2. If the CPU continues to function but OCEOS cannot continue, e.g. due to an OCEOS data area being corrupted by the application, OCEOS will exit. The application can consult the OCEOS data areas and various hardware registers to determine why OCEOS had to exit. It can then restart OCEOS.
  3. If the CPU itself must shut down, and a warning is given beforehand, e.g. by a brown-out detector, then two reports are involved:
    1. a short death report for use before the CPU shuts down
    2. a long death report put together after the system restarts
  4. Short death report: Comments
    1. Usually generated in an interrupt handler (e.g. brown-out detected)
    2. The type of problem is identified by the interrupt caused
    3. OCEOS does not seem to have any role in reading registers, whether in the EDAC or scrubber units or elsewhere; it is simpler for the handler to read them directly.
    4. The same applies to the system time if needed, which only requires a timer register read.
    5. Current task and job (equivalent of thread) can be stored by OCEOS in same area as OCEOS system state variable to make both readable as one block.
    6. Other items that might be stored in this block by OCEOS include
      1. system priority ceiling (1 byte)
      2. interrupt nesting level (1 byte)
      3. number of jobs on ready queue (2 bytes)
      4. current number of mutexes held (1 byte)
      5. number of jobs on counting semaphore pending queues (2 bytes)
      6. number of jobs on data queue pending queues (2 bytes)
      7. number of actions on timed actions queue (1 byte)
    7. The interrupt handler might also store items in the block, e.g.
      1. PC when the interrupt was called
      2. Stack pointer
      3. EDAC/scrubber registers
    8. OCEOS can keep the block up to date each time a new job begins executing (not done at present, but straightforward to add). What exactly the block should include remains an open question.
    9. The block data will be preserved across resets depending on the memory used. If this is non-volatile it will be preserved across power fail/restore cycles, and accessible in creating the long death report after restart.
    10. OCEOS does not appear to have a role in identifying the line or function where the error happened (using GCC/LLVM predefined macros such as __LINE__).
    11. What other OCEOS summary information, if any, should be kept in the short death report data block? (The detailed information on task times etc. is accessible but is probably too much for the short death report.)
  5. Long death report: Comments
    1. The application chooses the memory to be used for the OCEOS dynamic area and for the OCEOS log area.
    2. If both of these are non-volatile a very detailed description of the state of the system when the reset occurred is available to the application after the restart and can be used by it to generate a report before it restarts OCEOS (which will overwrite the dynamic area).
    3. If the short death report block is stored in one of these areas it will also be available.
    4. If the two areas are volatile the above will still be true but only for resets that did not involve power failure.
    5. If the areas are in GR716 internal memory the GR716 boot sequence needs to be taken into account.
    6. After a reset it is likely that register contents, whether in the CPU or elsewhere, will not have been preserved. This is true a fortiori if the reset was caused by a power failure. Critical register information, such as the PC and stack pointer, will need to be captured in the short death report data block.
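To make the discussion of the short death report block more concrete, one possible layout is sketched below. It is an illustration only: the structure, its field names, the field order, and the widths of the system state, task ID, and job ID fields are assumptions. Only the byte sizes quoted in the list above are taken from the text, and EDAC/scrubber register copies are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical short death report block; ordered so no implicit padding
 * is needed on a 32-bit target. */
typedef struct {
    uint32_t system_state;       /* OCEOS system state variable (width assumed) */
    uint32_t trap_pc;            /* PC when the interrupt was called */
    uint32_t stack_pointer;      /* stack pointer at the interrupt */
    uint16_t jobs_ready;         /* jobs on ready queue (2 bytes) */
    uint16_t jobs_pending_sem;   /* jobs on semaphore pending queues (2 bytes) */
    uint16_t jobs_pending_dataq; /* jobs on data queue pending queues (2 bytes) */
    uint16_t current_job_id;     /* current job (width assumed) */
    uint8_t  current_task_id;    /* current task (width assumed) */
    uint8_t  priority_ceiling;   /* system priority ceiling (1 byte) */
    uint8_t  interrupt_nesting;  /* interrupt nesting level (1 byte) */
    uint8_t  mutexes_held;       /* current number of mutexes held (1 byte) */
    uint8_t  timed_actions;      /* actions on timed actions queue (1 byte) */
    uint8_t  reserved[3];        /* explicit padding to a 4-byte multiple */
} short_death_block_t;

/* Items the interrupt handler itself might store (hypothetical helper). */
static void short_death_store_trap_context(volatile short_death_block_t *block,
                                           uint32_t pc, uint32_t sp)
{
    block->trap_pc = pc;
    block->stack_pointer = sp;
}
```

Keeping the block small and contiguous means it can be copied or transmitted as a single unit in the short time available before shutdown.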

Changing the default trap handler

The default trap handler provided by BCC may be changed by modifying the file trap_table_svt_tables.S provided with BCC. The default BCC trap handler __bcc_trap_table_svt_bad should be replaced with the ASW trap handler or with the OCEOS default trap handler (__oceos_default_trap_handler) provided in the example applications (shown below). The application should then be recompiled and linked with the new trap_table_svt_tables.S. When the application starts, all trap handlers will be in memory ready for use by the application.

/** 
* OCEOS default trap handler. 
*/ 
void __oceos_default_trap_handler () 
{ 
  U32_t tt; 
  __asm__ volatile ( "mov %%tbr, %0" : "=&r" (tt));
  tt = (tt & 0xff0) >> 4; 
  // Log ERROR and return 
  oceos_log_add_entry(LOG_SYSTEM_ERROR, tt); 
  // Update System Variable 
  U32_t current_state = oceos_system_state_get(); 
  current_state |= STATUS_SYSTEM_ERROR; 
  oceos_system_state_set(current_state); 
  __asm__ volatile ("restore"); 
  __asm__ volatile ("jmp %l2"); 
  __asm__ volatile ("rett %l2 + 0x4"); 
}

Context Switch Logging

Basic approach:

If enabled, each time a context switch or interrupt occurs four 32-bit words are added to the context switch log.

The first of these describes the type of switch and also gives the current window pointer (CWP) and the low 8 bits of the high 32 bits of the 64-bit system time when the switch occurred. The second gives the 32-bit system time in microseconds at which the switch occurred.

The third gives the start delay if a task is being started. Otherwise it is a signed int giving the remaining time until the deadline, or INT_MAX if the task has no deadline; a negative value gives the amount by which the deadline has been missed. The fourth gives the current system stack pointer.

The user specifies the number of entries in the context switch log in the application configuration. This size is given as a number in the range 3 to 12, resulting in from 2^3 to 2^12 entries. If a size outside this range is specified no context switch log is present (no header and no entries).

If the size is valid it is stored in sysMetaPtr->CS_log_entries_base2. If it is invalid this field holds 0 and CS logging is off.
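The relationship between the configured size and the storage used by the entries can be sketched as below. The macro and function names are hypothetical, and the size of the CS header is not included because its layout is not given here.

```c
#include <stdint.h>

#define CS_LOG_SIZE_MIN        3u   /* smallest valid size: 2^3 entries */
#define CS_LOG_SIZE_MAX        12u  /* largest valid size: 2^12 entries */
#define CS_LOG_WORDS_PER_ENTRY 4u   /* four 32-bit words per entry */

/* Number of entries for a configured size, or 0 if the size is invalid
 * (in which case no context switch log is present). */
static uint32_t cs_log_entry_count(uint32_t size_base2)
{
    if (size_base2 < CS_LOG_SIZE_MIN || size_base2 > CS_LOG_SIZE_MAX) {
        return 0u;
    }
    return 1u << size_base2;
}

/* Bytes occupied by the entries alone (header excluded). */
static uint32_t cs_log_entry_bytes(uint32_t size_base2)
{
    return cs_log_entry_count(size_base2)
           * CS_LOG_WORDS_PER_ENTRY * (uint32_t)sizeof(uint32_t);
}
```

For example, a size of 3 gives 8 entries occupying 128 bytes, and a size of 12 gives 4096 entries occupying 64 KiB.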

CS Log Structure:

CS Header followed by circular buffer, each buffer entry four 32-bit words.
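Because the number of entries is a power of two, the write position in such a circular buffer can be wrapped with a bit mask instead of a division. The sketch below illustrates that general technique; the header structure and its fields are hypothetical, as the actual CS header layout is not described here.

```c
#include <stdint.h>

#define CS_WORDS_PER_ENTRY 4u  /* each buffer entry is four 32-bit words */

/* Hypothetical header: the real CS header fields are not specified here. */
typedef struct {
    uint32_t size_base2;  /* log2 of the number of entries (3 to 12) */
    uint32_t next;        /* index of the next entry to be written */
} cs_log_header_t;

/* Write one entry at the current position, then advance with wrap-around.
 * The oldest entry is overwritten once the buffer is full. */
static void cs_log_add(cs_log_header_t *hdr, uint32_t *entries,
                       const uint32_t words[CS_WORDS_PER_ENTRY])
{
    uint32_t mask = (1u << hdr->size_base2) - 1u;
    uint32_t *slot = &entries[hdr->next * CS_WORDS_PER_ENTRY];

    for (uint32_t i = 0u; i < CS_WORDS_PER_ENTRY; i++) {
        slot[i] = words[i];
    }
    hdr->next = (hdr->next + 1u) & mask;
}
```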

Switch information (first word):

  • 8 bits high time bits (31 to 24)
  • 5 bits CWP (23 to 19)
  • 1 bit (18) switch type, 0=> starting, 1 => interrupt or ending
  • 1 bit (17) ‘from’ ID type, 0=> from task, 1 => from interrupt
  • 1 bit (16) ‘to’ ID type, 0 => to task, 1 => to interrupt
  • 8 bits (15 to 8) ID of task/interrupt from which switching
  • 8 bits (7 to 0) ID of task/interrupt to which switching

TASK_ID_INVALID (0xff) is used to indicate switching to or from the scheduler itself. Interrupt ID 0 is not expected to occur, so it is used as the ‘to’ ID when the CPU is put to sleep. The word is initialised by oceos_start() to 0x0000ffff, which is invalid as the scheduler does not switch to itself.
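The bit layout of the switch information word can be expressed as a pair of pack and unpack helpers. This is a sketch for decoding log entries offline or in the ASW; the structure and function names are hypothetical, and TASK_ID_INVALID is redefined locally only to keep the example self-contained.

```c
#include <stdint.h>

#define TASK_ID_INVALID 0xffu  /* local copy for this example only */

/* Decoded form of the first word of a CS log entry (hypothetical type). */
typedef struct {
    uint8_t time_high;          /* bits 31..24: next 8 bits of system time */
    uint8_t cwp;                /* bits 23..19: current window pointer */
    uint8_t interrupt_or_end;   /* bit 18: 0 starting, 1 interrupt or ending */
    uint8_t from_is_interrupt;  /* bit 17: 0 from task, 1 from interrupt */
    uint8_t to_is_interrupt;    /* bit 16: 0 to task, 1 to interrupt */
    uint8_t from_id;            /* bits 15..8 */
    uint8_t to_id;              /* bits 7..0 */
} cs_switch_info_t;

static uint32_t cs_switch_info_pack(const cs_switch_info_t *s)
{
    return ((uint32_t)s->time_high << 24)
         | ((uint32_t)(s->cwp & 0x1fu) << 19)
         | ((uint32_t)(s->interrupt_or_end & 1u) << 18)
         | ((uint32_t)(s->from_is_interrupt & 1u) << 17)
         | ((uint32_t)(s->to_is_interrupt & 1u) << 16)
         | ((uint32_t)s->from_id << 8)
         | (uint32_t)s->to_id;
}

static cs_switch_info_t cs_switch_info_unpack(uint32_t word)
{
    cs_switch_info_t s;
    s.time_high         = (uint8_t)(word >> 24);
    s.cwp               = (uint8_t)((word >> 19) & 0x1fu);
    s.interrupt_or_end  = (uint8_t)((word >> 18) & 1u);
    s.from_is_interrupt = (uint8_t)((word >> 17) & 1u);
    s.to_is_interrupt   = (uint8_t)((word >> 16) & 1u);
    s.from_id           = (uint8_t)((word >> 8) & 0xffu);
    s.to_id             = (uint8_t)(word & 0xffu);
    return s;
}
```

Unpacking the initial value 0x0000ffff yields TASK_ID_INVALID for both the ‘from’ and ‘to’ IDs, which is how an unused entry can be recognised.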

Context switch time word (second word):

Low 32 bits of the time at which switch happened. The next 8 bits of the 64-bit system time are given in the switch information word, giving an overall 40-bit time value. Initialised to 0.
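Combining the two words into the 40-bit time value can be sketched as follows; the function name is hypothetical.

```c
#include <stdint.h>

/* Rebuild the 40-bit switch time in microseconds from the first two words
 * of a CS log entry: bits 31..24 of the switch information word supply
 * bits 39..32 of the time, and the second word supplies bits 31..0. */
static uint64_t cs_entry_time_us(uint32_t switch_info_word,
                                 uint32_t time_low_word)
{
    uint64_t high = (uint64_t)(switch_info_word >> 24);
    return (high << 32) | (uint64_t)time_low_word;
}
```

A 40-bit microsecond count wraps after roughly 12.7 days, so entries further apart than that cannot be ordered from the log alone.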

Start wait time / deadline time margin word (third word):

When first starting a task, the delay between the start request and execution beginning. Otherwise a signed int giving remaining time until deadline or 0x7fffffff if no deadline. Initialised to 0.
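Interpreting the third word depends on what kind of switch the entry records. The sketch below assumes that the switch type bit (bit 18 of the first word, 0 meaning starting) is what distinguishes a start delay from a deadline margin; that reading, and all names used, are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical classification of the third word of a CS log entry. */
typedef enum {
    CS_W3_START_DELAY,      /* word holds delay from start request to start */
    CS_W3_NO_DEADLINE,      /* task has no deadline (0x7fffffff) */
    CS_W3_DEADLINE_MET,     /* word holds remaining time until the deadline */
    CS_W3_DEADLINE_MISSED   /* word is negative: deadline was missed */
} cs_word3_kind_t;

static cs_word3_kind_t cs_word3_classify(uint32_t switch_info_word,
                                         uint32_t word3)
{
    int32_t margin;

    /* ASSUMPTION: bit 18 == 0 identifies an entry recording a task start. */
    if (((switch_info_word >> 18) & 1u) == 0u) {
        return CS_W3_START_DELAY;
    }
    if (word3 == (uint32_t)INT32_MAX) {
        return CS_W3_NO_DEADLINE;
    }
    margin = (int32_t)word3;  /* reinterpret as a signed margin */
    return (margin < 0) ? CS_W3_DEADLINE_MISSED : CS_W3_DEADLINE_MET;
}
```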

System stack pointer (fourth word):

System stack pointer when switch was made. Initialised to 0.

Storage:

Stored in log area immediately after last system log entry and before END_SENTINEL for log area.

Space needed for CS log is included in log area size recorded at the start of the log area.

Number of CS log entries is 2^size unless size is invalid, in which case no storage is used.

If sysMetaPtr->CS_log_entries_base2 is outside the allowed range (typically 0), no context switch logging is done. In that case the CS header is not present and END_SENTINEL is at that position instead.

Recover runs

The previous section describes the error diagnostic facilities provided by OCEOS. The recovery procedure is the responsibility of the application. In the worst case a system reset can be initiated by the application. The table below shows some failure modes and recommended actions to recover.

SFMEA ID | Failure cause | Recommended action
--- | --- | ---
SFMEA-4.4.1.FM-0010 | ASW sets up a non-schedulable set of tasks | ASW uses oceos_task_get_info to determine schedulability headroom
SFMEA-4.4.1.FM-0020 | ASW sets up a task that may demand excessive CPU or memory resources | ASW uses oceos_task_get_info to determine schedulability headroom and the map file to view memory usage
SFMEA-4.4.1.FM-0040 | Human error in OCEOS code | Email details to support@ocetechnology.com
SFMEA-4.4.1.FM-0045 | Divide by zero occurs in OCEOS or in ASW | ASW to implement user defined trap
SFMEA-4.4.1.FM-0060 | Single Event Upset, temporary in RAM | ASW should use EDAC scrubber to correct error and report log
SFMEA-4.4.1.FM-0070 | Single Event Upset, permanent (burnt bit) in RAM | ASW to refactor memory location and report error
SFMEA-4.4.1.FM-0080 | Single Event Upset, temporary in PROM housing master image of OCEOS for initialisation | ASW to rewrite PROM location and report error
SFMEA-4.4.1.FM-0090 | Single Event Upset, permanent (burnt-proton impact etc) in PROM housing master image of OCEOS for initialisation | ASW to set indicator to boot from backup image
SFMEA-4.4.1.FM-0100 | Double bit Single Event Upset, temporary in RAM | ASW to initiate warm or cold restart
SFMEA-4.4.1.FM-0110 | RAM memory corruption for any reason, e.g. incorrect write address | ASW dependent, but warm or cold restart are options
SFMEA-4.4.1.16FM-0 | Continuous false alarm | ASW to reduce alarm check interval and mark as ignore
SFMEA-4.4.1.FM-0130 | Deadlock occurs | Deadlock cannot occur from OCEOS directives
SFMEA-4.4.1.FM-0140 | Clock corrupted, frozen or rate drifts (HSIA crystal failure) | ASW to trigger watchdog restart and report issue
SFMEA-4.4.1.FM-0150 | Output before required time | ASW to report log
SFMEA-4.4.1.FM-0160 | Output after required time | ASW to report log