System Logs
The runner uses AWS CloudWatch for logging. Some log groups are created automatically during deployment and are not typically monitored. The following logs are relevant for services (other logs are used primarily for ancillary concerns such as service startup and are rarely consulted):
Log Name | Logged Information |
---|---|
/internal/wrangle/<clusterArn>/WrangleService/application | This log group contains the application logs for the cloud runner service cluster. For example: /internal/wrangle/arn.aws.ecs.eu-west-1.850858299741.cluster/WrangleService/application |
/internal/wrangle/WrangleWorkerCluster-<poolId>/application | This log group contains the application logs for the wrangle worker cluster. There is one log group per worker cluster defined in configuration. For example, the log group /internal/wrangle/WrangleWorkerCluster-public/application contains the logs for the worker pool with ID public |
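The naming patterns above can be sketched with a couple of small helpers. These function names are illustrative only (they are not part of the runner's API); they simply encode the documented log group conventions:

```python
# Hypothetical helpers that build CloudWatch log group names following
# the naming patterns documented above. The function names are
# illustrative, not part of the runner's codebase.

def service_log_group(cluster_arn: str) -> str:
    """Application log group for the cloud runner service cluster."""
    return f"/internal/wrangle/{cluster_arn}/WrangleService/application"


def worker_log_group(pool_id: str) -> str:
    """Application log group for a worker pool (one group per pool)."""
    return f"/internal/wrangle/WrangleWorkerCluster-{pool_id}/application"


print(worker_log_group("public"))
# /internal/wrangle/WrangleWorkerCluster-public/application
```

These names can be passed directly to CloudWatch Logs queries or the AWS CLI when investigating a specific pool.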
Alarms
Alarms start with the prefix <appId>-nextmv to make it clear that they relate to the Nextmv runner. There are two alarms for each condition: one at the alert level and one at the critical level. Additionally, alarms whose names start with TargetTracking are created by the Elastic Container Service (ECS) and Auto Scaling groups to manage automatic scaling; they should not be altered.
Alarm Name | Purpose |
---|---|
EvaluateErrorCompositeAlarm | A composite alarm for all error conditions related to failing to evaluate runs (panics, solver errors, evaluation failures) |
ScaleErrorCompositeAlarm | A composite alarm for all error conditions that indicate the system is not scaling to handle load (duration, memory, queue depth). This generally indicates that the system has either hit its scaling limits or is not scaling fast enough. It may also indicate memory leaks. |
ApiErrorCompositeAlarm | A composite alarm for gRPC-related error or warning conditions. These generally indicate unexpected latency for API actions; for example, something causing a long delay in command execution. |
DataErrorCompositeAlarm | Failures reading/writing to data stores. The data stores are S3 and DynamoDB. This typically means there is some issue with AWS. |
Api_GetRunErrorAlarm | Failure occurred when performing a GetRun action. This is an internal error that is not expected to occur. |
Api_GetStartRunErrorAlarm | Failure occurred when performing a StartRun action. This is an internal error that is not expected to occur. |
APIEngine_InvalidJobTypeIDAlarm | The job type ID submitted with the start run request wasn’t a supported type |
APIGRPC_wrangle.v1.WrangleService_GetRunAlarm | The duration of the GetRun request exceeded the maximum expected duration |
APIGRPC_wrangle.v1.WrangleService_StartRunAlarm | The duration of the StartRun request exceeded the maximum expected duration |
Worker-<poolId>ApproximateNumberOfMessagesVisibleAlarm | The depth of a run work queue exceeds expected range. This indicates that either scaling limits have been reached, or the cluster is not scaling fast enough to keep up with the queue. |
Worker-<poolId>MemoryUtilizationAlarm | The amount of memory being consumed by a worker pool exceeds the expected maximum level |
WorkerEngine_JobPanicAlarm | A worker received a panic during execution. This indicates an unexpected internal failure |
WorkerIntegration_OnfleetErrorAlarm | An error was received for a job executing an Onfleet run |
WorkerJob_EvaluateErrorAlarm | An unexpected error occurred during the evaluation phase of a job |
WorkerJob_InputFileErrorAlarm | An error occurred when reading an input file from S3. This indicates an issue with reading the object from S3. |
WorkerJob_PostEvaluateErrorAlarm | An unexpected error occurred during the post evaluation phase of a job |
WorkerJob_PreEvaluateErrorAlarm | An unexpected error occurred during the pre evaluation phase of a job |
WorkerJob_ProcessingDurationAlarm | The processing of an optimization exceeded the expected p90 threshold for successfully completed jobs. This could either indicate that something systemic is slowing down runs, or that there are an inordinate number of long runs being conducted. |
WorkerJob_RunTimeoutAlarm | The processing of a job timed out. There is an absolute threshold of 20 minutes, after which a run will be terminated before completion. |
WorkerJob_SolverErrorAlarm | The solver unexpectedly failed during evaluating a run. This indicates an internal error. |
WorkerJob_UpdateStatusErrorAlarm | A worker failed to update the status of a run record during a state change |
WorkerJobFailure_Binary_1_evaluationAlarm | A binary execution job failed during any phase of job execution (pre, execute, post). |
WorkerJobFailure_Fleet_1_evaluationAlarm | A fleet execution job failed during any phase of job execution (pre, execute, post). |
WorkerJobFailure_Onfleet_1_evaluationAlarm | An Onfleet execution job failed during any phase of job execution (pre, execute, post). |
WorkerStore_CredentialsErrorAlarm | Failed to retrieve a credential from the secrets store (used for Onfleet processing) |
WorkerStore_DecisionCountWriteErrorAlarm | Failure for a fleet job writing the decision count to the DynamoDB store. Decision count is an internal artifact of Nextmv Cloud and can be ignored. |
WorkerStore_RunErrorAlarm | Failure when saving run information to the run store (DynamoDB) |
WorkerStore_ObjectErrorAlarm | An error occurred when reading or writing input/output files to S3 |
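When triaging, the prefix convention described above is enough to separate runner alarms from the autoscaling alarms that must be left alone. A minimal sketch of that classification, assuming the documented prefixes (the function name is hypothetical):

```python
# Illustrative sketch of the alarm naming convention documented above:
# runner alarms carry the "<appId>-nextmv" prefix, while alarms created
# by ECS/Auto Scaling target tracking start with "TargetTracking" and
# should not be altered. The function name is hypothetical.

def classify_alarm(name: str, app_id: str) -> str:
    if name.startswith("TargetTracking"):
        return "autoscaling"  # managed by ECS/Auto Scaling; do not alter
    if name.startswith(f"{app_id}-nextmv"):
        return "runner"       # Nextmv runner alarm; triage per the table above
    return "other"


print(classify_alarm("myapp-nextmvWorkerJob_SolverErrorAlarm", "myapp"))
# runner
```

The same prefix can be used to list only runner alarms, e.g. via the AWS CLI's `describe-alarms` with its alarm-name-prefix filter.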