tf-runner-azure image broken #1020

Closed
aurel333 opened this issue Sep 28, 2023 · 5 comments

@aurel333

aurel333 commented Sep 28, 2023

TF-runner image for Azure is not working

Context

I am trying to use TF-controller version 0.15.1, deployed from the Helm chart, to deploy a Storage Account on Azure as a PoC.

Expected behavior

I expected the runner to try to create the resource, and maybe fail due to authentication not being set up yet.

Observed behavior

The runner starts processing the request and completes initialization correctly, then fails during the plan phase with the following error:

Error: unable to build authorizer for Resource Manager API: could not configure AzureCli Authorizer: could not parse Azure CLI version: running Azure CLI: exit status 1: ERROR: The command failed with an unexpected error. Here is the traceback:
ERROR: argument _command_package: conflicting subparser: version
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/knack/cli.py", line 233, in invoke
    cmd_result = self.invocation.execute(args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py", line 565, in execute
    self.parser.load_command_table(self.commands_loader)
  File "/usr/lib/python3.11/site-packages/azure/cli/core/parser.py", line 100, in load_command_table
    command_parser = subparser.add_parser(command_verb,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/argparse.py", line 1192, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument _command_package: conflicting subparser: version
To open an issue, please run: 'az feedback'

  with provider["registry.terraform.io/hashicorp/azurerm"],
  on provider.tf line 14, in provider "azurerm":
  14: provider "azurerm" {

{"level":"error","ts":"2023-09-28T08:45:24.114Z","logger":"runner.terraform","msg":"error creating the plan","instance-id":"1a6efb83-5572-41c7-be3b-422980d84efd","error":"exit status 1\n\nError: unable to build authorizer for Resource Manager API: could not configure AzureCli Authorizer: could not parse Azure CLI version: running Azure CLI: exit status 1: ERROR: The command failed with an unexpected error. Here is the traceback:\nERROR: argument _command_package: conflicting subparser: version\nTraceback (most recent call last):\n  File \"/usr/lib/python3.11/site-packages/knack/cli.py\", line 233, in invoke\n    cmd_result = self.invocation.execute(args)\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py\", line 565, in execute\n    self.parser.load_command_table(self.commands_loader)\n  File \"/usr/lib/python3.11/site-packages/azure/cli/core/parser.py\", line 100, in load_command_table\n    command_parser = subparser.add_parser(command_verb,\n                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/lib/python3.11/argparse.py\", line 1192, in add_parser\n    raise ArgumentError(self, _('conflicting subparser: %s') % name)\nargparse.ArgumentError: argument _command_package: conflicting subparser: version\nTo open an issue, please run: 'az feedback'\n\n  with provider[\"registry.terraform.io/hashicorp/azurerm\"],\n  on provider.tf line 14, in provider \"azurerm\":\n  14: provider \"azurerm\" {\n\n"}
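
The JSON log line above is hard to read because the whole Terraform error is escaped into a single string. A minimal Python sketch for unpacking such a line, assuming only the fields visible above (`level`, `msg`, `error`); the helper name `format_runner_log` and the shortened sample line are made up for illustration:

```python
import json


def format_runner_log(line: str) -> str:
    """Render one JSON log line from the runner as readable text.

    Assumes the fields seen in the log above (level, msg, error);
    any field that is missing is rendered as an empty string.
    """
    record = json.loads(line)
    header = f"[{record.get('level', '')}] {record.get('msg', '')}"
    # json.loads has already turned the escaped "\n" sequences into real
    # newlines, so the error text can be printed as-is.
    error = record.get("error", "").strip()
    return f"{header}\n{error}" if error else header


if __name__ == "__main__":
    # Shortened, hand-written sample in the same shape as the log line above.
    sample = (
        '{"level":"error","msg":"error creating the plan",'
        '"error":"exit status 1\\n\\nError: unable to build authorizer"}'
    )
    print(format_runner_log(sample))
```

Piping the runner pod's log output through a helper like this, one line at a time, should make the embedded traceback readable without manual unescaping.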

Elements that can help

I searched for this error and found that it is probably caused by the Python version used in the image (3.11) being incompatible with the Azure CLI. I also looked at the Dockerfile for the image on the master branch: the Python version is not pinned, so I assume it uses the latest available release, which means the problem is still present.
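
The traceback ends in argparse's `add_parser` raising `conflicting subparser: version`. Python 3.11 appears to be the first release in which argparse rejects a subcommand name that is registered twice, which would fit the suspicion about the Python version. A minimal sketch that reproduces only this argparse behavior; it is not the Azure CLI's code, and the function name `register_twice` is made up for illustration:

```python
import argparse
import sys
from typing import Optional


def register_twice(name: str = "version") -> Optional[str]:
    """Register the same subcommand name twice on one subparser group.

    Returns the error message if argparse rejects the duplicate,
    or None if the duplicate is silently accepted.
    """
    parser = argparse.ArgumentParser(prog="demo")
    # dest mirrors the "_command_package" name from the traceback above.
    subparsers = parser.add_subparsers(dest="_command_package")
    subparsers.add_parser(name)
    try:
        subparsers.add_parser(name)  # second registration of the same name
    except argparse.ArgumentError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    result = register_twice()
    version = ".".join(map(str, sys.version_info[:3]))
    if result is None:
        print(f"Python {version}: duplicate subparser accepted")
    else:
        print(f"Python {version}: {result}")
```

Running this under both Python 3.10 and 3.11 should show whether the interpreter alone accounts for the difference in behavior.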

Used manifests

Terraform resource:

apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: poc-tf
spec:
  interval: 1m
  approvePlan: auto
  path: ./terraform/
  sourceRef:
    kind: GitRepository
    name: poc-tf
    namespace: flux-system
  runnerPodTemplate:
    spec:
      image: ghcr.io/weaveworks/tf-runner-azure:v0.15.1

TF-controller chart values (basically all default besides the AWS package):

# -- If `true`, install CRDs as part of the helm installation
installCRDs: true
# ServiceAccount
serviceAccount:
  # -- If `true`, create a new service account
  create: true
  # -- Additional Service Account annotations
  annotations: {}
  # -- Service account to be used
  # @default -- tf-controller
  name: ""
# Controller
# -- Provide a name
nameOverride: ""
# -- Provide a fullname
fullnameOverride: ""
# -- Additional pod annotations
podAnnotations: {}
# -- Additional pod labels
podLabels: {}
# -- Number of TF-Controller pods to deploy
replicaCount: 1
image:
  # -- Controller image repository
  repository: ghcr.io/weaveworks/tf-controller
  # -- Controller image pull policy
  pullPolicy: IfNotPresent
  # -- Overrides the image tag whose default is the chart appVersion.
  # @default -- `.Chart.AppVersion`
  tag: "v0.15.1"
# -- Controller image pull secret
imagePullSecrets: []
# -- Resource limits and requests
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 64Mi
# -- Additional container environment variables.
extraEnv: {}
# -- Pod-level security context
podSecurityContext:
  fsGroup: 1337
# -- Container-level security context
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65532
  seccompProfile:
    type: RuntimeDefault
rbac:
  # -- If `true`, create and use RBAC resources
  create: true
# -- Node Selector properties for the TF-Controller deployment
nodeSelector: {}
# -- Tolerations properties for the TF-Controller deployment
tolerations: []
# -- Affinity properties for the TF-Controller deployment
affinity: {}
# -- Volumes properties for the TF-Controller deployment
volumes: []
# - name: policy-agent
#   secret:
#     secretName: policy-agent-tls
#     optional: false
# -- Volume mounts properties for the TF-Controller deployment
volumeMounts: []
# - name: policy-agent
#   mountPath: "/etc/certs/policy-agent.policy-system.svc/"
#   readOnly: true

# -- PriorityClassName property for the TF-Controller deployment
priorityClassName: ""
# -- Argument for `--log-encoding`. Can be 'json' or 'console'. (Controller)
logEncoding: json
# -- Level of logging of the controller (Controller)
logLevel: info
# -- Concurrency of the controller (Controller)
concurrency: 24
# -- Argument for `--cert-rotation-check-frequency` (Controller)
certRotationCheckFrequency: 30m0s
# -- Argument for `--cert-validity-duration` (Controller)
certValidityDuration: 6h0m
# -- Argument for `--ca-cert-validity-duration` (Controller)
caCertValidityDuration: 168h0m
# -- Argument for `--events-addr` (Controller). The event address, default to the address of the Notification Controller
eventsAddress: http://notification-controller.flux-system.svc.cluster.local./
# -- Argument for `--kube-api-qps` (Controller).
#  Kube API QPS indicates the maximum queries-per-second of requests sent to the Kubernetes API, defaults to 50.
kubeAPIQPS: 50
# -- Argument for `--kube-api-burst` (Controller).
#  Burst indicates the maximum burst queries-per-second of requests sent to the Kubernetes API, defaults to 100.
kubeAPIBurst: 100
# -- Argument for `--allow-break-the-glass` (Controller).
#  AllowBreakTheGlass allows the controller to break the glass and modify Terraform states when the sync loop is broken.
allowBreakTheGlass: false
# -- Argument for `--cluster-domain` (Controller).
#  ClusterDomain indicates the cluster domain, defaults to cluster.local.
clusterDomain: cluster.local
awsPackage:
  install: false
  tag: v4.38.0-v1alpha11
  repository: ghcr.io/tf-controller/aws-primitive-modules
# -- Runner-specific configurations
runner:
  image:
    # -- Runner image repository
    repository: ghcr.io/weaveworks/tf-runner
    # -- Runner image tag
    # @default -- `.Chart.AppVersion`
    tag: "v0.15.1"
  grpc:
    # -- Maximum GRPC message size (Controller)
    maxMessageSize: 4
  # -- Timeout for runner-creation (Controller)
  creationTimeout: 5m0s
  serviceAccount:
    # -- If `true`, create a new runner service account
    create: true
    # -- Additional runner service Account annotations
    annotations: {}
    # -- Runner service account to be used
    name: ""
    # -- List of namespaces that the runner may run within
    allowedNamespaces: []
# EKS-specific configurations
# -- Create an AWS EKS Security Group Policy with the supplied Security Group IDs [See](https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html#deploy-securitygrouppolicy)
eksSecurityGroupPolicy:
  # -- Create the EKS SecurityGroupPolicy
  create: false
  # -- List of AWS Security Group IDs
  ids: []
  # For example:
  # - sg-1234567890
  # - sg-1234567891
  # - sg-1234567892
# Metrics
metrics:
  # -- Enable Metrics Service
  enabled: false
  # ServiceMonitor
  serviceMonitor:
    # -- Enable ServiceMonitor
    enabled: false
    # -- Install the ServiceMonitor into a different Namespace, as the monitoring stack one
    # @default -- `.Release.Namespace`
    namespace: ''
    # -- Assign additional labels according to Prometheus' serviceMonitorSelector matching labels
    labels: {}
    # -- Assign additional Annotations
    annotations: {}
    # -- Change matching labels
    matchLabels: {}
    # -- Set targetLabels for the serviceMonitor
    targetLabels: []
    endpoint:
      # -- Set the scrape interval for the endpoint of the serviceMonitor
      interval: "15s"
      # -- Set the scrape timeout for the endpoint of the serviceMonitor
      scrapeTimeout: ""
      # -- Set metricRelabelings for the endpoint of the serviceMonitor
      metricRelabelings: []
      # -- Set relabelings for the endpoint of the serviceMonitor
      relabelings: []
# -- Branch Based Planner-specific configurations
branchBasedPlanner:
  enabled: false
  image:
    repository: ghcr.io/weaveworks/branch-based-planner
    pullPolicy: IfNotPresent
    tag: ""
@chanwit
Collaborator

chanwit commented Sep 29, 2023

Please feel free to try changing the version of Azure CLI here: https://github.com/weaveworks/tf-controller/blob/main/runner-azure.Dockerfile

Contributions are welcome!

@aurel333
Author

OK, I built a quick-and-dirty test image and it seems to work. I will open a proper pull request in a few days.

@aurel333
Author

aurel333 commented Sep 30, 2023

I think we have a bigger issue than I thought. My quick-and-dirty image was built from the release 0.15.1 code, and the runner images have since been reworked to share a common base, which I think is good.

However, the common base distro is Alpine 3.18, which ships no Python version other than 3.11, and some packages installed in the common base are tied to Python 3.11 (py3-pip, for example). This prevents downgrading Python and thus blocks a clean install of the Azure CLI. Also, the Azure CLI has no release that supports Python 3.11 yet, but it may have one soon (link to the issue tracking the progress).

With all this taken into account, there are two possibilities:

  1. Downgrade the runner-base Alpine version to 3.17, which is the last Alpine release with Python 3.10
  2. Wait for the release of an Azure CLI version supporting Python 3.11.

What do you think is best?

@chanwit
Collaborator

chanwit commented Oct 2, 2023

We upgraded to 3.18 because of many CVEs, so downgrading to 3.17 wouldn't be an option for us.

@aurel333
Author

aurel333 commented Nov 2, 2023

OK to close, as we cannot make progress currently, and I fully agree with your decision to keep Alpine 3.18. I will open another issue when a version of azure-cli that officially supports Python 3.11 is available.
Until then, the workaround is to build a custom runner image with Python 3.10 in order to use the latest tf-controller version on Azure.
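
When building such a custom image, a small sanity check can confirm that the interpreter inside it is the intended one. A sketch, assuming (per this thread) that Python 3.11 is the first version the Azure CLI does not yet support; the cutoff constant and the function name are hypothetical:

```python
import sys
from typing import Tuple

# Assumption taken from this thread: the Azure CLI bundled in the runner
# works on Python 3.10 but not on 3.11 or later.
FIRST_UNSUPPORTED = (3, 11)


def python_ok_for_azure_cli(version: Tuple[int, ...]) -> bool:
    """Return True if the interpreter version is below the assumed cutoff."""
    return tuple(version[:2]) < FIRST_UNSUPPORTED


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    status = "ok" if python_ok_for_azure_cli(current) else "too new"
    print(f"Python {'.'.join(map(str, current))}: {status}")
```

The cutoff would need adjusting once an azure-cli release officially supports Python 3.11.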
