pin_cpu & poa(operation="cpupin")

  • #9126
    yoursunny
    Participant

      I’m poking around the CPU pinning feature and noticed some problems around pin_cpu() and poa(operation="cpupin") APIs.

      pin_cpu(cpu_range_to_pin=) syntax

      In Node.pin_cpu function, the cpu_range_to_pin= parameter is described as:

      cpu_range_to_pin: range of the cpus to pin; example: 0-1 or 0

      However, passing cpu_range_to_pin="0" raises ValueError: not enough values to unpack (expected 2, got 1) at this line:

      start, end = map(int, cpu_range_to_pin.split("-"))
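
For illustration, a parser that accepts both documented forms could look like the sketch below. parse_cpu_range is a hypothetical helper, not part of fablib, and it assumes the range is inclusive on both ends.

```python
from typing import List


def parse_cpu_range(cpu_range_to_pin: str) -> List[int]:
    """Hypothetical helper: expand "0-1" or "0" into a list of CPU numbers."""
    # partition() does not fail when "-" is missing; sep is "" for a single CPU.
    start_text, sep, end_text = cpu_range_to_pin.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if end < start:
        raise ValueError(f"invalid CPU range: {cpu_range_to_pin!r}")
    # Assumption: the range is inclusive on both ends.
    return list(range(start, end + 1))


print(parse_cpu_range("0"))    # [0]
print(parse_cpu_range("0-1"))  # [0, 1]
```

Using str.partition instead of unpacking the result of split("-") avoids the ValueError for the single-CPU form while still rejecting malformed input through int().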
      

      pinned CPUs still in other VM’s affinity list

      I created two slices each having a node on the same worker host.
      After successfully pinning two physical CPUs to VCPUs on node1, I checked the output of node2.get_cpu_info():

      {
        "atla-w2.fabric-testbed.net": {
          "pinned_cpus": ["117", "53"]
        },
        "instance-0000187f": [
          {
            "CPU": "116",
            "CPU Affinity": "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127",
            "CPU time": "4.6s",
            "State": "running",
            "VCPU": "0"
          }
        ]
      }
      

      Notably, the other node’s CPU affinity list still includes the pinned CPUs.
      I’m hoping that the pinned CPUs can be reserved exclusively for node1 and removed from the CPU affinity list of all other nodes.
      This would allow CPU-intensive workloads on node1 to be executed more accurately.
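
The check I did by eye can be sketched in a few lines. This assumes the get_cpu_info() result has already been parsed into a dict shaped like the output above; leaked_pinned_cpus is a hypothetical helper, and the sample data is abbreviated from that output. It is meant to be run on the output of a node that does not own the pinned CPUs.

```python
from typing import Dict, List


def leaked_pinned_cpus(cpu_info: dict) -> Dict[str, List[int]]:
    """Hypothetical helper: map each VCPU to the pinned CPUs still in its affinity."""
    pinned = set()
    for value in cpu_info.values():
        # Assumption: host entries are dicts that carry a "pinned_cpus" list.
        if isinstance(value, dict):
            pinned.update(int(cpu) for cpu in value.get("pinned_cpus", []))

    leaks = {}
    for value in cpu_info.values():
        # Assumption: instance entries are lists with one dict per VCPU.
        if isinstance(value, list):
            for vcpu in value:
                affinity = {int(c) for c in vcpu["CPU Affinity"].split(",") if c}
                overlap = sorted(pinned & affinity)
                if overlap:
                    leaks[vcpu["VCPU"]] = overlap
    return leaks


# Abbreviated sample; the real affinity list covers CPUs 0 through 127.
sample = {
    "atla-w2.fabric-testbed.net": {"pinned_cpus": ["117", "53"]},
    "instance-0000187f": [
        {"CPU": "116", "CPU Affinity": "52,53,116,117", "VCPU": "0"},
    ],
}
print(leaked_pinned_cpus(sample))  # {'0': [53, 117]}
```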

      poa_cpupin with a CPU range

      Normally, the pin_cpu function would send a POA command like this:

      node.poa(operation="cpupin", vcpu_cpu_map=[
        {"vcpu":"2","cpu":"15"},
        {"vcpu":"3","cpu":"79"},
      ])
      

      I attempted to send a variation of this command:

      node.poa(operation="cpupin", vcpu_cpu_map=[
        {"vcpu":"2","cpu":"14-15"},
        {"vcpu":"3","cpu":"78-79"},
      ])
      

      The latter command would return "SentToAuthority" instead of "Success".
      Subsequently, further POA commands including node.get_cpu_info() would return 500 errors.

      When I adjust a QEMU process with the taskset command on my own server, I can set the affinity of a VCPU to multiple physical CPUs.
      If FABRIC cannot support that, the server side should have rejected the poa_cpupin command, instead of letting the node fall into an error state.
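
Until the server rejects such requests, a client-side guard can keep a node out of the error state. The sketch below is a hypothetical helper, not part of fablib. It assumes that only single CPU numbers are currently accepted, which is what the behavior above suggests but is not confirmed.

```python
import re
from typing import Dict, List

_SINGLE_NUMBER = re.compile(r"[0-9]+")


def validate_vcpu_cpu_map(vcpu_cpu_map: List[Dict[str, str]]) -> None:
    """Hypothetical guard: raise ValueError unless every value is one integer."""
    for entry in vcpu_cpu_map:
        for key in ("vcpu", "cpu"):
            value = str(entry.get(key, ""))
            # A range such as "14-15" or a list such as "14,15" fails this match.
            if not _SINGLE_NUMBER.fullmatch(value):
                raise ValueError(f"{key}={value!r} is not a single CPU number")


validate_vcpu_cpu_map([{"vcpu": "2", "cpu": "15"}])  # passes silently
try:
    validate_vcpu_cpu_map([{"vcpu": "2", "cpu": "14-15"}])
except ValueError as err:
    print("rejected before sending the POA:", err)
```

Calling this before node.poa(operation="cpupin", ...) would turn the range request into a local error rather than a stuck node.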

      poa_cpupin with in-use CPUs

      I created two slices each having a node on the same worker host.
      Then, I sent POA commands pinning the CPUs of these two nodes to the same physical CPUs:

      node1.poa(operation="cpupin", vcpu_cpu_map=[{"vcpu":"2","cpu":"15"}])
      node2.poa(operation="cpupin", vcpu_cpu_map=[{"vcpu":"2","cpu":"15"}])
      

      The first POA completes successfully, while the second POA returns "SentToAuthority".
      Subsequently, further POA commands on the second node would return 500 errors.

      This error could happen even if the user is only calling the high-level API node.pin_cpu().
      When pin_cpu() calls are running concurrently against two separate nodes, which could belong to different users, they could pick the same physical CPU cores and send the conflicting POAs.

      Again, I’d suggest that the server side reject any poa_cpupin command that causes a conflict, instead of letting the second node fall into an error state.
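
As a partial client-side mitigation, a caller could compare the requested CPUs against the "pinned_cpus" list before sending the POA. The sketch below assumes that get_cpu_info() output parsed into a dict reports host-wide pinned CPUs in the shape shown earlier; find_pin_conflicts is a hypothetical helper. It only narrows the race window, because another node can still pin the same CPU between the check and the POA, so a server-side check is still needed.

```python
from typing import Dict, List


def find_pin_conflicts(vcpu_cpu_map: List[Dict[str, str]], cpu_info: dict) -> List[Dict[str, str]]:
    """Hypothetical helper: return the entries whose CPU is already pinned."""
    already_pinned = set()
    for value in cpu_info.values():
        # Assumption: host entries are dicts that carry a "pinned_cpus" list.
        if isinstance(value, dict):
            already_pinned.update(str(cpu) for cpu in value.get("pinned_cpus", []))
    return [entry for entry in vcpu_cpu_map if str(entry["cpu"]) in already_pinned]


cpu_info = {"atla-w2.fabric-testbed.net": {"pinned_cpus": ["117", "53"]}}
requested = [{"vcpu": "2", "cpu": "53"}, {"vcpu": "3", "cpu": "15"}]
print(find_pin_conflicts(requested, cpu_info))  # [{'vcpu': '2', 'cpu': '53'}]
```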

      #9131
      Komal Thareja
      Participant

        Thank you, @yoursunny, for sharing these observations and the detailed steps to reproduce them. This appears to be a bug. I’ll work on addressing it and will update you once the patch is deployed.

        Best,
        Komal