Nextmv private cloud operations

System Logs

The runner uses AWS CloudWatch for logging. Some of the automatically created logs relate to deployment and are not typically monitored. The following logs are relevant for services; the remaining logs mainly cover ancillary service startup and are rarely used:

| Log Name | Logged Information |
| --- | --- |
| /internal/wrangle/<clusterArn>/WrangleService/application | Application logs for the cloud runner service cluster. Example: /internal/wrangle/arn.aws.ecs.eu-west-1.850858299741.cluster/WrangleService/application |
| /internal/wrangle/WrangleWorkerCluster-<poolId>/application | Application logs for the wrangle worker cluster. There is one log group per worker cluster defined in configuration. For example, the log group /internal/wrangle/WrangleWorkerCluster-public/application contains the logs for the worker pool with ID public. |
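
The naming pattern in the table can be turned into a small helper for locating the right log group. The following is a minimal Python sketch, not part of the Nextmv tooling: the function names are hypothetical, the `<clusterArn>` segment is passed in exactly as it appears in the log group name (no conversion from a real ARN is attempted), and `logs_client` is assumed to behave like a boto3 CloudWatch Logs client.

```python
from typing import List

# Common prefix shared by the runner log groups listed in the table.
LOG_GROUP_PREFIX = "/internal/wrangle/"


def service_log_group(cluster_arn_segment: str) -> str:
    """Log group name for the cloud runner service cluster.

    cluster_arn_segment is the <clusterArn> part exactly as it appears
    in the log group name (assumption: the caller already knows it).
    """
    return f"{LOG_GROUP_PREFIX}{cluster_arn_segment}/WrangleService/application"


def worker_log_group(pool_id: str) -> str:
    """Log group name for the worker cluster of one worker pool."""
    return f"{LOG_GROUP_PREFIX}WrangleWorkerCluster-{pool_id}/application"


def list_runner_log_groups(logs_client) -> List[str]:
    """List all log groups under /internal/wrangle/.

    logs_client is assumed to be boto3.client("logs") or an object
    exposing the same paginator interface.
    """
    paginator = logs_client.get_paginator("describe_log_groups")
    names: List[str] = []
    for page in paginator.paginate(logGroupNamePrefix=LOG_GROUP_PREFIX):
        names.extend(group["logGroupName"] for group in page.get("logGroups", []))
    return names
```

For example, `worker_log_group("public")` yields the log group name shown in the table for the pool with ID public.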


Alarms start with the prefix <appId>-nextmv to make it clear that they relate to the Nextmv runner. Each condition has two alarms, one at an alert level and one at a critical level. Additionally, alarms that start with TargetTracking are created by the Elastic Container Service (ECS) and Auto Scaling groups to manage automatic scaling and should not be altered.
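
When scripting against the alarm list, the two prefixes described above are enough to separate runner alarms from the autoscaling alarms that must be left alone. The sketch below is illustrative only: `classify_alarm` is a hypothetical helper, and it relies solely on the documented prefixes, making no assumption about how the rest of an alarm name is formed.

```python
def classify_alarm(alarm_name: str, app_id: str) -> str:
    """Classify an alarm name by its documented prefix.

    Returns "autoscaling" for TargetTracking alarms (managed by ECS and
    Auto Scaling, not to be altered), "nextmv" for alarms carrying the
    <appId>-nextmv prefix, and "other" for anything else.
    """
    if alarm_name.startswith("TargetTracking"):
        return "autoscaling"
    if alarm_name.startswith(f"{app_id}-nextmv"):
        return "nextmv"
    return "other"


def editable_alarms(alarm_names, app_id: str):
    """Keep only Nextmv runner alarms, dropping autoscaling-managed ones."""
    return [name for name in alarm_names if classify_alarm(name, app_id) == "nextmv"]
```

A script that tunes thresholds or notification targets could run its input through `editable_alarms` first so that TargetTracking alarms are never modified by accident.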

| Alarm Name | Purpose |
| --- | --- |
| EvaluateErrorCompositeAlarm | A composite alarm for all error conditions related to failing to evaluate runs (panics, solver errors, evaluation failures). |
| ScaleErrorCompositeAlarm | A composite alarm for all error conditions that indicate the system is not scaling to handle load (duration, memory, queue depth). This generally indicates that the system has hit scale limits or is not scaling fast enough. It may also indicate memory leaks. |
| ApiErrorCompositeAlarm | A composite alarm for gRPC-related error or warning conditions. These generally indicate unexpected latency for API actions, for example something causing a long delay in command execution. |
| DataErrorCompositeAlarm | Failures reading from or writing to the data stores (S3 and DynamoDB). This typically means there is an issue with AWS. |
| Api_GetRunErrorAlarm | A failure occurred when performing a GetRun action. This is an internal error that is not expected to occur. |
| Api_GetStartRunErrorAlarm | A failure occurred when performing a StartRun action. This is an internal error that is not expected to occur. |
| APIEngine_InvalidJobTypeIDAlarm | The job type ID submitted with the start run request was not a supported type. |
| APIGRPC_wrangle.v1.WrangleService_GetRunAlarm | The duration of the GetRun request exceeded the maximum expected duration. |
| APIGRPC_wrangle.v1.WrangleService_StartRunAlarm | The duration of the StartRun request exceeded the maximum expected duration. |
| Worker-<poolId>ApproximateNumberOfMessagesVisibleAlarm | The depth of a run work queue exceeds the expected range. This indicates that either scaling limits have been reached or the cluster is not scaling fast enough to keep up with the queue. |
| Worker-<poolId>MemoryUtilizationAlarm | The amount of memory consumed by a worker pool exceeds the expected maximum level. |
| WorkerEngine_JobPanicAlarm | A worker panicked during execution. This indicates an unexpected internal failure. |
| WorkerIntegration_OnfleetErrorAlarm | An error was received for a job executing an Onfleet run. |
| WorkerJob_EvaluateErrorAlarm | An unexpected error occurred during the evaluation phase of a job. |
| WorkerJob_InputFileErrorAlarm | An error occurred when reading an input file from S3. This indicates an issue with reading the object from S3. |
| WorkerJob_PostEvaluateErrorAlarm | An unexpected error occurred during the post-evaluation phase of a job. |
| WorkerJob_PreEvaluateErrorAlarm | An unexpected error occurred during the pre-evaluation phase of a job. |
| WorkerJob_ProcessingDurationAlarm | The processing of an optimization exceeded the expected p90 threshold for successfully completed jobs. This could indicate either that something systemic is slowing down runs or that an unusually large number of long runs are being conducted. |
| WorkerJob_RunTimeoutAlarm | The processing of a job timed out. There is an absolute threshold of 20 minutes, after which a run is terminated before completion. |
| WorkerJob_SolverErrorAlarm | The solver unexpectedly failed while evaluating a run. This indicates an internal error. |
| WorkerJob_UpdateStatusErrorAlarm | A worker failed to update the status of a run record during a state change. |
| WorkerJobFailure_Binary_1_evaluationAlarm | A binary execution job failed during any phase of job execution (pre, execute, post). |
| WorkerJobFailure_Fleet_1_evaluationAlarm | A fleet execution job failed during any phase of job execution (pre, execute, post). |
| WorkerJobFailure_Onfleet_1_evaluationAlarm | An Onfleet execution job failed during any phase of job execution (pre, execute, post). |
| WorkerStore_CredentialsErrorAlarm | Failed to retrieve a credential from the secrets store (used for Onfleet processing). |
| WorkerStore_DecisionCountWriteErrorAlarm | A fleet job failed to write the decision count to the DynamoDB store. Decision count is an internal artifact of Nextmv Cloud and can be ignored. |
| WorkerStore_RunErrorAlarm | Failure when saving run information to the run store (DynamoDB). |
| WorkerStore_ObjectErrorAlarm | An error occurred when reading or writing input/output files to S3. |
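
The two per-pool alarms embed the worker pool ID in their names, which makes it possible to route a notification to the affected pool. The following Python sketch shows one way to extract it. It assumes the pattern Worker-<poolId>...Alarm appears verbatim somewhere in the full alarm name; how the <appId>-nextmv prefix and any level suffix are joined to it is not specified here, so the regular expression searches rather than matching the whole string. The helper name `parse_pool_alarm` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the two per-pool alarm patterns from the table:
#   Worker-<poolId>ApproximateNumberOfMessagesVisibleAlarm
#   Worker-<poolId>MemoryUtilizationAlarm
_POOL_ALARM = re.compile(
    r"Worker-(?P<pool>.+?)"
    r"(?P<metric>ApproximateNumberOfMessagesVisible|MemoryUtilization)Alarm"
)


def parse_pool_alarm(alarm_name: str) -> Optional[Tuple[str, str]]:
    """Return (pool_id, metric) for a per-pool alarm, or None otherwise.

    The search is unanchored because the surrounding prefix and suffix
    format is an assumption left open by this sketch.
    """
    match = _POOL_ALARM.search(alarm_name)
    if match is None:
        return None
    return match.group("pool"), match.group("metric")
```

Alarms such as WorkerEngine_JobPanicAlarm do not contain the Worker- segment, so the helper returns None for them and they can be handled separately.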
