Intel® DevCloud for oneAPI


Get started with the Intel oneAPI Base Toolkit on the DevCloud

What is the Intel® DevCloud

The Intel® DevCloud is a cluster of CPUs, GPUs, and FPGAs preinstalled with several Intel® oneAPI toolkits. It is kept up to date with the latest hardware and software from Intel, allowing you to evaluate them soon after they are released.

To request access to the Intel® DevCloud, sign up here: Intel® DevCloud Sign-Up.

Once your request is approved, you will receive an e-mail message with the necessary information to configure a connection to the Intel® DevCloud and to sign in.

All users who log in to the Intel DevCloud are directed to a staging area, the login node. Actual work is submitted to dedicated compute nodes composed of CPUs, GPUs, and FPGAs. Interaction with these compute nodes is achieved through the Portable Batch System (PBS); that is, users must employ PBS utilities such as qsub, pbsnodes, and qstat to request and use compute resources. The specific PBS implementation running on the Intel DevCloud is called TORQUE*. For more information about PBS TORQUE and an overview of the PBS commands, see the links in the resource section at the bottom of this document.
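For example, a quick way to orient yourself on the login node is to list the compute nodes and your own jobs (a minimal sketch; the node properties such as gpu and fpga_compile used later in this guide appear in the pbsnodes output):

    pbsnodes | less     # list all compute nodes and their properties
    qstat -u $USER      # show your own queued and running jobs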

The following will help you get started quickly with Intel® oneAPI on the Intel DevCloud using the vector-add sample. vector-add is a simple program that adds two large vectors and verifies the results; it is implemented using SYCL* for CPU, GPU, and FPGA.

CPU/GPU Vector-Add Sample Walkthrough

  1. Connect to the DevCloud.
    ssh devcloud
  2. Download the samples.
    git clone https://github.com/oneapi-src/oneAPI-samples.git
  3. Go to the vector-add sample.
    cd oneAPI-samples/DirectProgramming/C++SYCL/DenseLinearAlgebra/vector-add/

Build and Run the Sample in Batch Mode

The following describes the process of submitting build and run jobs to PBS.

A job is a script submitted to PBS through the qsub utility. By default, a job does not inherit your current environment variables or your current working directory. For this reason, submit jobs as scripts that set up the environment variables themselves. To address the working directory issue, either use absolute paths inside the script or pass the -d <dir> option to qsub to set the job's working directory. A minimal sketch follows.
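For illustration (hello.sh is a hypothetical script name used only for this sketch), a self-contained job script re-creates its own environment and is submitted with -d . so it runs in the current directory:

    #!/bin/bash
    # Runs on the compute node: qsub did not inherit the login shell's
    # environment, so re-create it here.
    source /opt/intel/inteloneapi/setvars.sh
    echo "Running on $(hostname) in $PWD"

Submit it with:

    qsub -d . hello.sh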

NOTE: Using a text editor in your terminal window (for example, vi, vim, or nano), create two scripts, build.sh and run.sh, in the sample directory with the contents shown in steps 1 and 2 (or create them directly from the shell, as in the heredoc sketch after these steps).

Create the Job Scripts

  1. Create a build.sh script with the following contents.
    #!/bin/bash
    source /opt/intel/inteloneapi/setvars.sh
    make clean
    make all
  2. Create a run.sh script with the following contents for executing the sample.
    #!/bin/bash
    source /opt/intel/inteloneapi/setvars.sh
    make run
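If you prefer not to open an editor, the same scripts can be created directly from the shell with a heredoc; a sketch for build.sh (repeat the pattern for run.sh):

    cat > build.sh <<'EOF'
    #!/bin/bash
    source /opt/intel/inteloneapi/setvars.sh
    make clean
    make all
    EOF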

Build and Run

Jobs submitted in batch mode are placed in a queue to wait for the necessary resources (compute nodes) to become available. Jobs are executed on a first-come, first-served basis on the first available node(s) having the requested property or label.

  1. Build the sample on a GPU node.
    qsub -l nodes=1:gpu:ppn=2 -d . build.sh
    Note: -l nodes=1:gpu:ppn=2 (lowercase L) is used to assign one full GPU node to the job.
    The -d . option is used to configure the current folder as the working directory for the job.
  2. In batch mode, the commands return immediately; however, the job itself may take longer to complete. To inspect job progress, use the qstat utility.
    watch -n 1 qstat -n -1
    Note: The watch -n 1 command is used to run qstat -n -1 and display its results every second.
  3. Run the sample on a GPU node after the build job completes successfully.
    qsub -l nodes=1:gpu:ppn=2 -d . run.sh
  4. The best way to determine whether a job has completed is the qstat utility. When a job terminates, two files are written to disk:
    • <script_name>.sh.eXXXX, which is the job stderr
    • <script_name>.sh.oXXXX, which is the job stdout
    Here XXXX is the job ID, which is printed to the screen after each qsub command (see the sketch after this list for capturing it in a script).
  5. Inspect the output of the sample.
    cat run.sh.oXXXX
  6. Remove the stdout and stderr files and clean-up the project files.
    rm build.sh.*; rm run.sh.*; make clean
  7. Disconnect from the Intel DevCloud.
    exit
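Because qsub prints the job ID to stdout, you can capture it in a shell variable and, for example, chain the run job to a successful build using TORQUE job dependencies; a sketch (the -W depend=afterok option is standard TORQUE, but verify it is enabled on your queue):

    # Capture the job IDs printed by qsub.
    BUILD_ID=$(qsub -l nodes=1:gpu:ppn=2 -d . build.sh)
    # Start run.sh only after the build job finishes successfully.
    RUN_ID=$(qsub -l nodes=1:gpu:ppn=2 -d . -W depend=afterok:$BUILD_ID run.sh)
    # Check the status of both jobs.
    qstat "$BUILD_ID" "$RUN_ID"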

Build and Run the Sample in Interactive Mode

Interactive mode offers the most familiar way to work with the DevCloud. In this mode you can work on a compute node just as you would on a local system.

There are some caveats. First, a request for an interactive session is placed in a queue; a compute node will be allocated as soon as possible, but no guarantees can be made about the wait time. Second, any tasks started on a compute node in interactive mode are terminated when the connection is interrupted.

We recommend using interactive mode sparingly, and only for scenarios that are not feasible in batch mode, such as stepping through code while debugging an application.
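If you want the session to end on its own, you can combine the interactive request with an explicit walltime (a sketch; the two-hour value is an arbitrary example):

    qsub -I -l nodes=1:gpu:ppn=2,walltime=02:00:00 -d .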

  1. Request an interactive session on a DevCloud GPU node using qsub.
    qsub -I -l nodes=1:gpu:ppn=2 -d .
    Note: -I (upper case i) is the argument used to request an interactive session.
  2. Within the interactive session on the GPU node, build and run the sample.
    make all && make run
  3. Clean the sample.
    make clean
  4. Terminate the interactive session.
    exit

FPGA Vector-Add Sample Walkthrough

You can compile or run oneAPI designs on the FPGA platform using the Intel DevCloud, which is set up with either an Intel® Programmable Acceleration Card with an Intel® Arria® 10 GX or Intel® Stratix® 10 SX FPGA, or a Terasic* DE10-Agilex Development Kit with an Intel Agilex® 7 FPGA, plus the necessary software stack. For more information, refer to FPGA Design Development and Workloads for Hardware Acceleration.

The same vector-add sample from above can be compiled to target the FPGA Emulator or the FPGA Hardware.

FPGA Emulator

  1. Create the build_fpga_emu.sh and run_fpga_emu.sh scripts for targeting the emulator.
    1. build_fpga_emu.sh
      #!/bin/bash
      source /opt/intel/inteloneapi/setvars.sh
      mkdir build
      cd build
      cmake ..
      make fpga_emu
    2. run_fpga_emu.sh
      #!/bin/bash
      source /opt/intel/inteloneapi/setvars.sh
      cd build
      ./vector_add.fpga_emu
  2. Submit the compilation job.
    Note: Compile jobs should be submitted to compute nodes labeled fpga_compile, which are dedicated to FPGA compile jobs.
    qsub -l nodes=1:fpga_compile:ppn=2 -d . build_fpga_emu.sh
  3. Submit the execution job.
    Note: Execution of the emulated design should also be performed on nodes labeled fpga_compile, which have more CPU compute resources available for intensive workloads.
    qsub -l nodes=1:fpga_compile:ppn=2 -d . run_fpga_emu.sh
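As in the CPU/GPU walkthrough, you can monitor both jobs with qstat and, once they terminate, inspect the emulator run's stdout (XXXX is the job ID printed by qsub):

    watch -n 1 qstat -n -1
    cat run_fpga_emu.sh.oXXXX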

FPGA Hardware

  1. Create the build_fpga_hw.sh and run_fpga_hw.sh scripts for targeting the FPGA hardware.
    1. build_fpga_hw.sh
      #!/bin/bash
      source /opt/intel/inteloneapi/setvars.sh
      export PATH=/glob/intel-python/python2/bin:$PATH
      export QUARTUS_ROOTDIR_OVERRIDE=<QUARTUS_VERSION>
      mkdir build
      cd build
      cmake .. -DFPGA_DEVICE=<BSP_Location>:<Variant>
      make fpga
      Here, the values of <QUARTUS_VERSION>, <BSP_Location>, and <Variant> are determined according to the following table (a filled-in example is shown after these steps).

      BSP             | BSP_Location                                                           | Variant                | QUARTUS_VERSION
      PAC-Arria10GX   | /glob/development-tools/versions/oneapi/2024.0/oneapi/intel_a10gx_pac | pac_a10                | $QUARTUS_PRIME_192
      PAC-Stratix10GX | /glob/development-tools/versions/oneapi/2024.0/oneapi/intel_s10sx_pac | pac_s10, pac_s10_usm   | $QUARTUS_PRIME_192
      DE10-Agilex     | /glob/development-tools/versions/oneapi/2024.0/oneapi/de10_agilex     | B1E1_8GBx4, B2E2_8GBx4 | $QUARTUS_PRIME_212
    2. run_fpga_hw.sh
      #!/bin/bash
      source /opt/intel/inteloneapi/setvars.sh
      cd build
      ./vector_add.fpga
  2. Submit the compilation job.
    Note: A hardware compile job can take a long time. You can increase the timeout of a batch job by using the -l walltime=hh:mm:ss option; the maximum timeout available for FPGA compile jobs is 24 hours (see the sketch at the end of this walkthrough).
    qsub -l nodes=1:fpga_compile:ppn=2 -d . build_fpga_hw.sh
  3. Submit the execution job.
    Note: Execution should be performed on compute nodes labeled fpga_runtime:<fpga type> (where <fpga type> is arria10, stratix10, or agilex and must match the device type you targeted when compiling your executable), which host the FPGA cards.
    qsub -l nodes=1:fpga_runtime:arria10:ppn=2 -d . run_fpga_hw.sh
    or
    qsub -l nodes=1:fpga_runtime:stratix10:ppn=2 -d . run_fpga_hw.sh
    or
    qsub -l nodes=1:fpga_runtime:agilex:ppn=2 -d . run_fpga_hw.sh
    As mentioned before, you can use qstat to monitor the progress of the enqueued jobs. Once a job terminates, its stderr and stdout are saved to disk. You can terminate jobs using the qdel utility, as in the sketch below.
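Putting the pieces together for the PAC-Arria10GX row of the table above, a sketch of the full hardware flow (the placeholder substitutions come straight from the table; the 24-hour walltime is the documented maximum for FPGA compile jobs):

    # Inside build_fpga_hw.sh, the placeholders from the table become:
    export QUARTUS_ROOTDIR_OVERRIDE=$QUARTUS_PRIME_192
    cmake .. -DFPGA_DEVICE=/glob/development-tools/versions/oneapi/2024.0/oneapi/intel_a10gx_pac:pac_a10

    # Submit the compile with the maximum walltime and capture the job ID.
    JOB_ID=$(qsub -l nodes=1:fpga_compile:ppn=2,walltime=24:00:00 -d . build_fpga_hw.sh)
    # To cancel the job while it is queued or running:
    qdel "$JOB_ID"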

* SYCL and the SYCL logo are trademarks of the Khronos Group Inc.