Issue with Instance Creation Limits in the Anbox Cloud Appliance

Asked by Kang Wang

I have successfully set up the Anbox cloud appliance on a dedicated server. Each app instance utilizes the default g4.3 instance type. The system is capable of creating sessions correctly.

However, I have run into a limitation: the system cannot create more than 13 instances.

I have configured the GPU slots on the node to the default 32. Upon monitoring, I observed that the CPU, GPU, and memory usage do not exceed 40%. Additionally, there is ample storage available. Notably, when I stop a running container, I am able to start a new one.
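For reference, the GPU slots were configured with a command along these lines (a sketch based on the AMS CLI; the node name lxd0 is only an assumption):

$ amc node set lxd0 gpu-slots 32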

For detailed information, here is the output from anbox-cloud-appliance.buginfo:
https://1drv.ms/t/s!AhkKpGjamjtljNpuDkxBngCrA3bkSQ?e=FyzY0B

I have reviewed several related topics but have not found a solution. Any advice or suggestions would be greatly appreciated.

Question information

Language:
English
Status:
Solved
For:
Ubuntu
Assignee:
No assignee
Solved by:
Kang Wang
Revision history for this message
Kang Wang (hidetorio) said :
#1

The version of the Anbox Cloud appliance is 1.20.0.

Revision history for this message
Keirthana (keirthana) said :
#2

@hidetorio Could you upload the dump of `ams.log` and also provide information on the exact error you are facing when you launch a new instance after 13 instances?

Revision history for this message
Kang Wang (hidetorio) said :
#3

Here are the error logs from a container that was started after the 13th one:
https://1drv.ms/u/s!AhkKpGjamjtljNp_iH5CzuT1n18zog?e=1HaBoh

Interestingly, this creation limit issue does not occur when using the a4.3 instance type. This leads me to suspect that there might be a problem related to the GPU configuration.

For reference, here are the details of the GPU in use:

Model: Tesla T4
NVIDIA-SMI: 535.129.03
Driver Version: 535.129.03
CUDA Version: 12.2

Revision history for this message
Simon Fels (morphis) said :
#4

Hey Kang!

Thanks for sharing the logs. The following lines from the system.log are most interesting:

Dec 05 14:49:09 ams-clnjgnqgdmrevsmol18g anbox-starter[149]: I1205 14:49:09.489117 229 vk_context.cpp:376] Using Vulkan device: Tesla T4 (driver 535.516.192 api 1.3.242)
Dec 05 14:49:09 ams-clnjgnqgdmrevsmol18g anbox-starter[149]: E1205 14:49:09.645359 229 vk_context.cpp:450] Failed to create Vulkan device: VK_ERROR_INITIALIZATION_FAILED
Dec 05 14:49:09 ams-clnjgnqgdmrevsmol18g anbox-starter[149]: E1205 14:49:09.666379 229 base_gpu_implementation.cpp:57] Failed to create Vulkan context

This hints that resources on the GPU you're using are exhausted and the driver is not able to initialize another context. If you look at the output of

$ nvidia-smi

you will most likely find a high memory consumption by your existing 13 instances. Each Anbox process in there corresponds to one container instance. The T4 has a total of 16 GB of memory.
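For a quick check, the following query summarizes the overall GPU memory usage (a standard nvidia-smi option; the per-process table in the plain nvidia-smi output lists each Anbox process individually):

$ nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv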

Revision history for this message
Kang Wang (hidetorio) said (last edit ):
#5

Hello Simon, thanks for your analysis. I revisited the nvidia-smi log and confirmed that there is still enough free memory for an additional instance.

However, based on your analysis, I began to wonder whether power usage might be the issue.
But with a different setting I was able to start a container even when the power usage hit 70 W.

Here is the screenshot of nvidia-smi output for your reference:
https://1drv.ms/i/s!AhkKpGjamjtljNtAFl9EkOV5wFvy_Q?e=RYMRPq

Revision history for this message
Simon Fels (morphis) said :
#6

Thanks Kang!

Can you run the following

$ amc config set cpu.limit_mode scheduler

and then recreate all instances and see if you hit a similar problem? I saw you have a system with two NUMA nodes and the GPU is connected to node 1. If cpu.limit_mode is set to "pinning", the instances may end up with cores from two NUMA nodes instead of just one, which can cause problems. With cpu.limit_mode set to "scheduler", we pin each instance to the cores of one NUMA node and respect the locality of the GPU.
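To double-check which NUMA node the GPU is attached to, the topology output from the driver is usually enough (standard nvidia-smi and util-linux commands):

$ nvidia-smi topo -m
$ lscpu | grep NUMA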

Revision history for this message
Kang Wang (hidetorio) said :
#7

Thank you, Simon!

I attempted the command you suggested, but now I'm unable to create any instances at all. It seems we might be getting closer to the core of the problem.

Here is the error log from the container. Could you please help me check it?
https://1drv.ms/u/s!AhkKpGjamjtljNxvMxy4EIc8sJ_0qg?e=1LdUZu

Revision history for this message
Gary.Wang (gary-wzl77) said :
#8

Hey Wang,
  Thanks for the logs you shared with us.
  The following logs are concerning:
  https://pastebin.ubuntu.com/p/RtSdC4YT7p/

  Regarding the error logs from a container after the 13th instance, did this occur after setting cpu.limit_mode to scheduler or not?
  Also, would you mind sharing the details of the LXD installation in your environment:
  $ lxc info

  Could you also confirm whether the kernel version is the standard Ubuntu kernel 5.15.0-89-generic?
  Thanks.

BR
Gary

Revision history for this message
Kang Wang (hidetorio) said :
#9

Hello Gary

The error log is from the first container started after setting cpu.limit_mode to scheduler.

The error log of the container that failed after the 13th container is here:
https://1drv.ms/u/s!AhkKpGjamjtljNp_iH5CzuT1n18zog?e=1HaBoh

I also checked lxc info and confirmed that the kernel version is 5.15.0-89-generic.

Revision history for this message
Gary.Wang (gary-wzl77) said :
#10

Could you share the output of `lxc info` with us too?
Thanks

Revision history for this message
Kang Wang (hidetorio) said :
#11

Hi,

Here is the lxc info output.
https://pastebin.ubuntu.com/p/HpqmC4KJ8n/

Revision history for this message
Gary.Wang (gary-wzl77) said :
#12

Okay, thanks for the logs.

LXD 5.0.2 should be fine.

Based on what you described:
1. With the `pinning` cpu.limit_mode, you couldn't launch more than 13 instances from the g4.3-based instance type due to the error `VK_ERROR_INITIALIZATION_FAILED`; however, you were able to create more instances from the a4.3-based instance type.
2. After applying the `scheduler` cpu.limit_mode and recreating the instances, you weren't able to launch even the first container. The exact error comes from LXC (`Failed to setup proc filesystem oom_score_adj to -900`) when launching the nested Android container; however, the Vulkan context is initialized successfully according to the shared system.log. So it's not a GPU-related problem in this case.

Hence the underlying issues should be different. For 1), it's most likely GPU related. The following documentation shows how to query GPU info and also tweak the maximum power:

https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
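For example, the power readings and the power limit can be inspected and adjusted with standard nvidia-smi options (the 70 W value below is only an illustration; the supported range depends on the GPU):

$ nvidia-smi -q -d POWER
$ sudo nvidia-smi -pl 70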

Would you mind giving it a try and checking if that helps?

Also, I am wondering which application/game is running in each container during the testing? Or did you just launch multiple containers and leave them all at the default AOSP launcher once they were up?

Thanks
Gary

Revision history for this message
Kang Wang (hidetorio) said :
#13

Thank you for your advice.

I am trying to analyze the nvidia-smi logs.

The game is just launched and stays at the default AOSP launcher without any interaction.

I also have another problem: the WebView in the game cannot run JavaScript.
It is not related to this topic, but any advice is appreciated.

Revision history for this message
Launchpad Janitor (janitor) said :
#14

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Revision history for this message
Kang Wang (hidetorio) said :
#15

Sorry for not replying for a long time.

I tried to tweak the parameters with nvidia-smi, but instances still fail after the 12th or 13th instance is running.

This time I ran `amc benchmark` for the test. The spec of the machine is the same as before.

Here is some info

- benchmark log
https://paste.ubuntu.com/p/Mt2m7NcrBt/

- nvidia-smi log
https://paste.ubuntu.com/p/xf25npPtdJ/

Revision history for this message
Simon Fels (morphis) said (last edit ):
#16

Hey Kang!

That reminds me of one thing. Can you apply the following modification:

In /etc/modprobe.d/anbox-nvidia.conf change from

options nvidia NVreg_RegistryDwords=RMIncreaseRsvdMemorySizeMB=256

to

options nvidia NVreg_RegistryDwords=RMIncreaseRsvdMemorySizeMB=1024

and see if that allows you to spin up more instances.

You can even set a higher value, e.g. 2048, but that will reduce the amount of GPU memory available to userspace by that amount.

After the modification run

$ sudo update-initramfs -u -k all

and reboot the system.
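After the reboot, one way to confirm the option was picked up (the exact path may vary with the driver version) is to check the module parameters:

$ cat /proc/driver/nvidia/params | grep -i RegistryDwords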

Revision history for this message
Kang Wang (hidetorio) said :
#17

Thanks Simon

I set the value to 1024 and it works.
It allows up to 18 instances to be created.

Then I set the value to 2048, but that only allows 16 instances to be created.
According to the nvidia-smi info, I think there are still enough resources to create more instances.

Are there any other ways to improve this?

Revision history for this message
Simon Fels (morphis) said :
#18

It depends a bit on how the driver allocates memory internally for different things. Using RMIncreaseRsvdMemorySizeMB allows you to go higher in some cases, but it will vary depending on the actual application using the GPU. The current version of Anbox Cloud uses OpenGL as the API layer, where we have no control over memory management and it's up to the driver to do the "right thing".

We have Vulkan support coming in 1.21 in ~1.5 weeks, which will switch (still behind a feature flag until 1.22) to an entirely new graphics stack that reduces GPU memory usage by 20-30% in the best case and allows us to handle things much better. This might help with your specific workload.

Revision history for this message
Kang Wang (hidetorio) said :
#19

Thank you very much for your help.

Since the problem has been partially solved and I now understand why it occurred, I think it's okay to mark this thread as solved.