Orchestrator stuck in running when saving large custom status values #2918

Open
cliedeman opened this issue Sep 25, 2024 · 5 comments

@cliedeman

Description


I have several instances of the same orchestrator that, more often than not, get stuck in Running.

The orchestrator calls 5 sub-orchestrators and takes about 2 hours in total. The inputs are not large, so nothing suspicious there.
If I check the history table, I can see that an OrchestratorComplete event is fired with a null instanceId, indicating that it should be in the Completed state, but it is not.


Expected behavior


The orchestrator leaves the Running state and becomes Completed.

Actual behavior


The orchestrator remains in the Running state.

App Details

Dotnet 8
Isolated Worker

  <ItemGroup Label="Azure Functions Worker">
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.4.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker" Version="1.23.0" />
    <!-- Don't upgrade this library because of this issue -> https://github.com/microsoft/durabletask-dotnet/issues/247 -->
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.4" />
  </ItemGroup>
  <ItemGroup Label="Azure Functions Worker Extensions">
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.2.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.3.2" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.1" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Storage" Version="6.6.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.5.0" />
    <!-- https://github.com/Azure/azure-sdk-for-net/pull/34783 -->
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.5" />
    <PackageVersion Include="Microsoft.DurableTask.Generators" Version="1.0.0-preview.1" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.11.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.22.0" />
  </ItemGroup>

If deployed to Azure


  • Timeframe issue observed: Past Week
  • Function App name: functions-prod-fulcrum-mobilemart
  • Function name(s): DataLakeDataExportOrchestrate
  • Azure region: North Europe
  • Orchestration instance ID(s): a053d5a6566544539670bb04989d7c6b, 999a607c87074b7b9a08d3b825f29622, ec599cef5cd545bcb126ed3e4f94bfc8, 86608ce0275c4b35ae1807f5021bdaa9, f168030031c24d5a9ab31bc8e13d18e1
  • Azure storage account name: fulcrumprodfunctionapp


@cgillum
Member

cgillum commented Sep 25, 2024

Hi @cliedeman. Are you using Application Insights? If so, can you try enabling the Durable Task Framework logging (warnings and errors as shown in the sample should be fine) and then querying the traces collection in App Insights to see if there are any clues about what might be going on?
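
For reference, a minimal host.json sketch that turns on warning-level logging for the Durable Task Framework. The category names `DurableTask.AzureStorage` and `DurableTask.Core` follow the Durable Functions diagnostics documentation; everything else in the file is assumed to be default, so merge the `logLevel` entries into your existing settings rather than replacing the file.

```json
{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "DurableTask.AzureStorage": "Warning",
      "DurableTask.Core": "Warning"
    }
  }
}
```

With this in place, warnings and errors from those categories should show up in the traces collection in Application Insights, subject to any sampling you have configured.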

@cgillum cgillum added Needs: Author Feedback Waiting for the author of the issue to respond to a question and removed Needs: Triage 🔍 labels Sep 25, 2024
@cliedeman
Author

@cgillum I do. When I run another batch in the coming days, I will try to get some extra logging output.

@microsoft-github-policy-service microsoft-github-policy-service bot added Needs: Attention 👋 and removed Needs: Author Feedback Waiting for the author of the issue to respond to a question labels Sep 26, 2024
@cgillum cgillum added Needs: Author Feedback Waiting for the author of the issue to respond to a question and removed Needs: Attention 👋 labels Sep 26, 2024
@cliedeman
Author

@cgillum I found this error in the logs:

An unexpected failure occurred while processing instance '36790fe4a8e3465ca3ca68f210483553': DurableTask.AzureStorage.Storage.DurableTaskStorageException: Bad Request
 ---> Microsoft.WindowsAzure.Storage.StorageException: Bad Request
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId)
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force)
Request Information
RequestID:7f4f8aad-3002-001f-59d8-11bb64000000
RequestDate:Sat, 28 Sep 2024 18:59:54 GMT
StatusMessage:Bad Request
ErrorCode:
ErrorMessage:The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:7f4f8aad-3002-001f-59d8-11bb64000000
Time:2024-09-28T18:59:54.6725330Z

   --- End of inner exception stack trace ---
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 141
   at DurableTask.AzureStorage.Storage.Table.ExecuteAsync(TableOperation operation, String operationType) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 113
   at DurableTask.AzureStorage.Storage.Table.InsertOrMergeAsync(DynamicTableEntity tableEntity) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 101
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, String instanceId, String executionId, String eTagValue, Object trackingStoreContext) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1159
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1179

I confirmed that 36790fe4a8e3465ca3ca68f210483553 is my newest instance, so this error does not fail the orchestration.

I suspect that it is my customStatus (which reports on the job progress) that is exceeding the limit.

Ciaran

@microsoft-github-policy-service microsoft-github-policy-service bot added Needs: Attention 👋 and removed Needs: Author Feedback Waiting for the author of the issue to respond to a question labels Sep 28, 2024
@cgillum
Member

cgillum commented Sep 30, 2024

@cliedeman thanks for this info! I checked the code, and I think you're right that this could be caused by a large custom status value. We have checks to ensure that it doesn't exceed 16 KB, but it looks like there aren't any checks to ensure that the custom status value combined with other semi-large values (like inputs or outputs) doesn't exceed the 64 KB limit imposed by Azure Storage.

I'm labeling this as a bug that needs to be fixed. In the meantime, I recommend reducing the size of your custom status values to avoid this issue in the future. For the current stuck instance, you can terminate it to get it out of the "Running" status.

@cgillum cgillum added bug P1 Priority 1 and removed Needs: Attention 👋 labels Sep 30, 2024
@cgillum cgillum changed the title Orchestrator Stuck in Running Orchestrator stuck in running when saving large custom status values Sep 30, 2024
@cgillum
Member

cgillum commented Sep 30, 2024

> We have checks to ensure that it doesn't exceed 16 KB, but it looks like there aren't any checks to ensure that the custom status value combined with other semi-large values (like inputs or outputs) doesn't exceed the 64 KB limit imposed by Azure Storage.

I was wrong - while we do have checks for the custom status size for the .NET in-proc SDK, we don't have any such checks in the .NET Isolated SDK, which otherwise would have caught this kind of issue. We may need to introduce a breaking change to ensure that the serialized custom status payload size matches the in-proc limit: 16 KB.
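
Until such a check exists in the .NET Isolated SDK, a guard along these lines could be applied in user code before setting the custom status. This is only a sketch under stated assumptions: `CustomStatusGuard` and its members are hypothetical names, not part of any SDK; it assumes the status is serialized with System.Text.Json, and it measures the result as UTF-16 because that is how the Azure Storage error in this thread describes string properties. The 16 KB figure is the in-proc limit mentioned above.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical helper, not part of the Durable Functions SDK.
public static class CustomStatusGuard
{
    // 16 KB, mirroring the in-proc limit discussed in this thread.
    public const int MaxUtf16Bytes = 16 * 1024;

    // Size in bytes of the JSON-serialized status when encoded as UTF-16.
    // Assumption: the real serializer produces comparable JSON.
    public static int GetSerializedSizeInBytes(object status)
    {
        string json = JsonSerializer.Serialize(status);
        return Encoding.Unicode.GetByteCount(json);
    }

    // True when the serialized status fits within the 16 KB budget.
    public static bool IsWithinLimit(object status)
    {
        return GetSerializedSizeInBytes(status) <= MaxUtf16Bytes;
    }
}
```

In an orchestrator, one might call `CustomStatusGuard.IsWithinLimit(status)` first and fall back to a shorter progress summary when it returns false, rather than letting the oversized value reach the storage layer.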
