Mert Cevik

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 222 total)
  • in reply to: Maintenance on SEAT node on May 8th #9774
    Mert Cevik
    Moderator

      Work is completed.

      in reply to: I cannot access some of my nodes #9741
      Mert Cevik
      Moderator

        Same situation. Rebooted, devices attached.

        I’m not sure what is causing this. The worker node is not heavily loaded, but inside the VMs there appear to be Mellanox driver issues. If you share some context about the actual experiment and the traffic generated/exchanged, we can try to understand it and find a way to keep it running reliably. Otherwise, I don’t have any clues right now. You can reach out directly if you prefer.

        in reply to: I cannot access some of my nodes #9739
        Mert Cevik
        Moderator

          Both VMs had crashed. I’m attaching the console outputs.
          console.7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e-r-2-1
          console.c834417a-7393-4cae-bd62-722358b6451f-r-2-3

          I restarted them, and they are back online. I also attached their PCI devices (IP addresses need to be re-assigned).

          in reply to: Cannot allocate GPU + ConnectX-6 on same node #9724
          Mert Cevik
          Moderator

            We are checking the status information for cern-w2 with respect to a potential mismatch
            caused by a reservation that is currently consuming the resource, although the health of that reservation is not clear.
            We will send updates.

            in reply to: Cannot allocate GPU + ConnectX-6 on same node #9722
            Mert Cevik
            Moderator

              An easy way that works for me is checking the portal for the specific worker node’s resources. At CERN, cern-w2 seems to match your needs. I will attach a screenshot from the portal, but I’m not sure how it will show up in this comment. You can go to portal.fabric-testbed.net, click a link that leads to the CERN page (either from the map or from the table), and then see the available resources. (If this is already known to you, please disregard.)

              To target a specific worker node that has the desired resources, there may be some example functions within the example Jupyter notebooks that show how to filter the worker nodes and list their resources. The Fablib API documentation may also reveal some ways; I don’t know much about that part. I expect knowledgeable users from the community may share their methods.

              For scheduling resources in advance, this resource may reveal some ways -> https://artifacts.fabric-testbed.net/artifacts/32938b00-5036-4a1e-84b5-063283618669

              There may be other ways to show resource availability, but I will leave that to more advanced users or the FABRIC team, who may have better pointers.
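As a rough illustration of the filtering idea (this is not the actual Fablib API; the worker names, field names, and counts below are hypothetical stand-ins for what the portal listing shows), one could filter per-worker resource records like this:

```python
# Hypothetical per-worker records shaped loosely like the portal's resource
# listing; names and counts are illustrative, not real FABRIC data.
workers = [
    {"name": "cern-w1", "site": "CERN", "gpu_a30": 0, "nic_connectx6": 2},
    {"name": "cern-w2", "site": "CERN", "gpu_a30": 1, "nic_connectx6": 2},
    {"name": "max-w1",  "site": "MAX",  "gpu_a30": 1, "nic_connectx6": 0},
]

def find_workers(workers, **minimums):
    """Return workers that have at least the requested amount of each resource."""
    return [
        w for w in workers
        if all(w.get(resource, 0) >= count for resource, count in minimums.items())
    ]

matches = find_workers(workers, gpu_a30=1, nic_connectx6=1)
print([w["name"] for w in matches])  # → ['cern-w2']
```

A real slice request would then target the matching worker by name; consult the Fablib documentation for the supported way to do that.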


              in reply to: Issue Accessing Nodes Across My FABRIC Slices #9719
              Mert Cevik
              Moderator

                You need to provide the slice IDs.

                in reply to: Cannot allocate GPU + ConnectX-6 on same node #9718
                Mert Cevik
                Moderator

                  ConnectX-6 SmartNICs are located on the “FastNet Worker”
                  GPUs are located on “GPU Worker” and “SlowNet Worker”

                  You can find information on this page -> https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/

                  So it will not be possible to have both a GPU and a ConnectX-6 on the same VM.
                  However, CERN is an exception: it has 3x “FastNet Worker” servers, each with 2x ConnectX-6 SmartNICs and 1x A30 GPU.
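The placement rule above can be captured as a small lookup. This is just a sketch: the worker-type component lists come from this reply, and everything else is illustrative.

```python
# Components per worker type, per the hardware-configuration rule above.
components = {
    "FastNet Worker": {"ConnectX-6"},
    "GPU Worker":     {"GPU"},
    "SlowNet Worker": {"GPU"},
}
# The stated exception: CERN's FastNet workers also carry one A30 GPU.
cern_fastnet = components["FastNet Worker"] | {"GPU"}

def can_colocate(worker_components, needed=frozenset({"GPU", "ConnectX-6"})):
    """True if a single worker carries every needed component."""
    return needed <= worker_components

print(any(can_colocate(c) for c in components.values()))  # → False (regular sites)
print(can_colocate(cern_fastnet))                         # → True (CERN exception)
```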

                  in reply to: Slices stuck at configuring…….state #9698
                  Mert Cevik
                  Moderator

                    MAX is available for the experiments.

                    in reply to: Inquiry Regarding MAX Site Maintenance Completion Timeline #9697
                    Mert Cevik
                    Moderator

                      Hi Ajay,

                      MAX is back online. The maintenance status is released, but it may take some time for the portal to show it as available. Regardless, it’s available for experiments now.

                      Best regards,
                      Mert

                      in reply to: FABRIC SEAT – Outage on seat-w1 #9690
                      Mert Cevik
                      Moderator

                        The problem on the server (seat-w1) was caused by the Nvidia BlueField-3 DPU card. The server is back online (active VM slivers are recovered); however, we took the DPU card out for investigation. All other resources on the SEAT node are available for experiments.

                        in reply to: Slices stuck at configuring…….state #9687
                        Mert Cevik
                        Moderator

                          Hi Ajay,

                          The problem is caused by a hardware failure on the head node of the MAX site. Work is in progress to recover the server; however, it is very likely to require some extra time. I wanted to let you know in case these are the slices for your demo; you may need to re-create them on other FABRIC nodes/sites.

                          I will notify you if we are able to resolve the problem on MAX and your current slices can be recovered.


                          Best regards,

                          Mert

                          in reply to: BlueField-3 host-DPU communication issue on FABRIC #9664
                          Mert Cevik
                          Moderator

                            Hi Plabon,

                            The BlueField-3 DPU on the UCSD node is the one on which you can test your work.

                            I’m attaching some outputs, and I confirmed that there is improvement in the Accelerated UPF Reference Application runtime. Please let us know about your status. (Also, due to the upcoming KNIT12, other experimenters may specifically request the UCSD DPU resource. Please watch the availability and try it out as soon as possible.)

                            ubuntu@localhost:~$ sudo mlxfwmanager --query
                            Querying Mellanox devices firmware ...

                            Device #1:
                            ----------

                            Device Type: BlueField3
                            Part Number: 900-9D3B6-00CC-EA_Ax
                            Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
                            PSID: MT_0000001115
                            PCI Device Name: /dev/mst/mt41692_pciconf0
                            Base MAC: cc40f38f0356
                            Versions:            Current       Available
                               FW                32.48.1000    N/A
                               PXE               3.9.0101      N/A
                               UEFI              14.41.0014    N/A
                               UEFI Virtio blk   22.4.0014     N/A
                               UEFI Virtio net   21.4.0013     N/A

                            Status: No matching image found

                            ubuntu@localhost:~$ sudo mlxconfig -d 03:00.0 q
                            Device #1:
                            ----------

                            Device type: BlueField3
                            Name: 900-9D3B6-00CC-EA_Ax
                            Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
                            Device: 03:00.0

                            Configurations: Next Boot

                            . . .

                            FLEX_PARSER_PROFILE_ENABLE 3

                            . . .

                            in reply to: BlueField-3 host-DPU communication issue on FABRIC #9663
                            Mert Cevik
                            Moderator

                              Sorry for the trouble, and thank you. The Google Drive link worked well.

                              in reply to: BlueField-3 host-DPU communication issue on FABRIC #9660
                              Mert Cevik
                              Moderator

                                Hi Plabon,

                                Thank you for sharing the updates on my inquiry.

                                For the firmware and settings, I will follow up later today or tomorrow morning.

                                For the attachment failure of the Jupyter notebook, the following is the suggestion from the FABRIC team:
                                – rename the extension to .txt to upload the notebook.
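For example (the notebook filename is hypothetical; this just makes a .txt copy to upload):

```shell
cd "$(mktemp -d)"                        # scratch directory for the demo
echo '{"cells": []}' > analysis.ipynb    # stand-in for your notebook
cp analysis.ipynb analysis.ipynb.txt     # upload this .txt copy instead
ls analysis.ipynb.txt
```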

                                in reply to: BlueField-3 host-DPU communication issue on FABRIC #9656
                                Mert Cevik
                                Moderator

                                  Hello Plabon,

                                  Can you please let me know whether you were able to use the steps I shared on April 1st (the attached PDF file), and whether the issues you indicated last week were resolved?

                                  For the firmware update procedure, there seems to be some discrepancy between the documentation and the actual outcome from the DOCA framework. Without a cold reboot of the server, new firmware cannot be activated. I will need to clarify this with Nvidia.

                                  I haven’t read the UPF Reference Application Guide yet, but from your descriptions I understand that you need newer firmware versions for some additional features (although under Test Environment and Setup, the firmware is listed as 32.43.1014). You also need to enable the Flex Parser Profile. Both items require a cold reboot of the server.

                                  Rebooting servers on the FABRIC Testbed is not a straightforward task, as all resources are shared by users. I will see what I can do and let you know.

                                  Best regards,
                                  Mert

