# AIGameBots
Spatial and temporal attention give almost the same loss.
Does that mean the main agent's current features alone carry most of the signal for predicting the next step?
Need to verify this by training a simple feed-forward model on only the agent's current features (a baseline sketch follows the tiger results below).
Also test joint temporal + spatial attention to see whether the loss drops.
experiment - tiger = predict the agent's next delta_x and delta_y

data = 20 x 10 x 6
20 = timesteps
10 = agents
6 = feature_dim = [team_id, rel_x, rel_z, shr_key, delta_x, delta_y]
dataset = "dataset_exp_tiger_0p02_0p3_100000.h5"
loss = MSE
epochs = 30

| model                            | train loss | val loss |
|----------------------------------|------------|----------|
| temporal only                    | 0.000749   | 0.001431 |
| spatial only                     | 0.001142   | 0.001898 |
| temporal + spatial               | 0.000510   | 0.001491 |
| main agent current features only | 0.001698   | 0.001736 |
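For reference, a minimal sketch of the "main agent current features only" baseline, assuming PyTorch and h5py; the HDF5 key names (`states`, `targets`), the main-agent index 0, and all hyperparameters other than the epoch count are assumptions, not fixed by the notes above.

```python
import h5py
import torch
import torch.nn as nn

# Assumed HDF5 layout: "states" -> (N, 20, 10, 6), "targets" -> (N, 2).
# The key names and the main-agent index (0) are guesses, not given above.
with h5py.File("dataset_exp_tiger_0p02_0p3_100000.h5", "r") as f:
    states = torch.tensor(f["states"][:], dtype=torch.float32)
    targets = torch.tensor(f["targets"][:], dtype=torch.float32)  # [dx, dy]

# Keep only the main agent's features at the latest timestep.
x = states[:, -1, 0, :]  # (N, 6)

model = nn.Sequential(
    nn.Linear(6, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),  # predict [delta_x, delta_y]
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Full-batch loop for brevity; minibatching is the realistic choice.
for epoch in range(30):
    opt.zero_grad()
    loss = loss_fn(model(x), targets)
    loss.backward()
    opt.step()
```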
Need to add long-term prediction: the next 5 steps.
experiment - hawk = predict the agent's next 5 delta_x and delta_y

data = 15 x 10 x 6
15 = timesteps
10 = agents
6 = feature_dim = [team_id, rel_x, rel_z, shr_key, delta_x, delta_y]
output = 5 * [dx, dy]
dataset = "dataset_exp_hawk_0p02_0p3_100000.h5"
loss = MSE
epochs = 20

| model                            | train loss | val loss |
|----------------------------------|------------|----------|
| temporal only                    | 0.002663   | 0.004804 |
| spatial only                     | 0.002712   | 0.003898 |
| temporal + spatial               | 0.001898   | 0.003654 |
| main agent current features only | 0.003980   | 0.004889 |
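A minimal sketch of what the joint temporal + spatial model for hawk could look like, assuming PyTorch. The input and output shapes follow the spec above; `d_model`, the head count, the main-agent index 0, and the pooling choice (last-timestep token) are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalPredictor(nn.Module):
    """Joint spatial + temporal attention; shapes follow the hawk spec,
    everything else (d_model, heads, pooling choice) is an assumption."""
    def __init__(self, feat_dim=6, d_model=64, n_heads=4, horizon=5):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, horizon * 2)  # 5 * [dx, dy]

    def forward(self, x):                     # x: (B, T=15, A=10, F=6)
        B, T, A, _ = x.shape
        h = self.embed(x)                     # (B, T, A, d)
        # Spatial attention: agents attend to each other within a timestep.
        hs = h.reshape(B * T, A, -1)
        hs, _ = self.spatial(hs, hs, hs)
        h = hs.reshape(B, T, A, -1)
        # Temporal attention: the main agent's track (index 0, assumed)
        # attends over its own history.
        ht = h[:, :, 0, :]                    # (B, T, d)
        ht, _ = self.temporal(ht, ht, ht)
        return self.head(ht[:, -1]).view(B, -1, 2)  # next 5 [dx, dy] deltas

model = SpatioTemporalPredictor()
pred = model(torch.randn(2, 15, 10, 6))       # -> (2, 5, 2)
```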
Problem statement - candidate titles:
- Third Gear: a scalable, context-adaptive foundation model for imitation learning on high-FPS 3D games
- Adaptive attention over frames for scalable behavior cloning in FPS games
- Environment-aware dynamic attention for scalable and efficient game bots?
Deep reinforcement learning has shown great results in developing AI bots for highly complex and strategic games, often beating human opponents [cite papers here]. But their gameplay usually does not resemble how a human plays: deep reinforcement learning maximizes reward rather than mimicking a human player. Recent advances in AI have sparked interest among gaming companies in bots that mimic human players instead of bots that reach superhuman performance by maximizing reward. Mimicking human players of varied skill levels helps in (1) improving user interaction and experience in the game and (2) solving the cold-start problem for gaming companies.
Behavior cloning has been applied to games to mimic human players. Behavior cloning combined with offline reinforcement learning [decision transformers] has shown good improvement in simpler games like Atari, but behavior cloning + RL has not been scaled to high-FPS, high-resolution games like Counter-Strike and Doom [https://arxiv.org/pdf/2504.17891], the main reasons being sparse rewards and complex state representations in high-FPS 3D games. Behavior cloning alone has shown promise in mimicking human players from raw pixel data across several high-FPS 3D games, including Counter-Strike [Counter-Strike Deathmatch with Large-Scale Behavioural Cloning] [behavior cloning in VizDoom] [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos] [Pixels to Play: A Foundation Model for 3D Gameplay]. Though promising, behavior cloning requires a large amount of training data. A few solutions have been proposed, such as labeling large-scale unlabeled videos with an inverse dynamics model, but reducing their memory and resource requirements is still an active area of research, and scaling them in large production environments requires enormous compute. In our work, we propose a transformer architecture for behavior cloning from visual data that dynamically adapts its attention over past frames based on the immediate danger it senses. With this approach, the model learns to save resources when not engaged in a dangerous situation by thinking faster with less context.
Inspired by recent behavior cloning research on memory-intensive high-FPS games, we propose a training and inference method that adapts its context window and compute requirement based on environment feedback, varying its computation load over the course of a game. This saves heavily on compute and makes the approach scalable and practical for real-world deployment. Our work is inspired by how a human player thinks while playing: when a player sees no immediate danger and no enemy in sight, their cognitive load and attention are relaxed compared to when they sense an incoming enemy or are engaged in a battle.
Our novel transformer architecture, trained on past visual input frames, tries to mimic this by dynamically adapting the set of past tokens/input frames it attends to based on current environment feedback. Self-attention scales quadratically with the number of input tokens, so computation also grows quadratically with context length. We propose a gated transformer that selects the top-k frames to process at each timestep, where frames are scored by a simple MLP conditioned on environment feedback; k itself is kept variable so the model can flexibly widen or narrow its view of the past.
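A minimal sketch of the gating idea, assuming PyTorch and that frames have already been embedded to vectors. A small MLP predicts k from the current frame (standing in here for "environment feedback"), a linear scorer ranks past frames, and attention runs only over the selected ones. All module sizes are assumptions, and the hard top-k shown here is non-differentiable, so end-to-end training would need a relaxation such as straight-through estimation or Gumbel top-k.

```python
import torch
import torch.nn as nn

class GatedFrameTransformer(nn.Module):
    """Attend only over the top-k past frames, where k is predicted per step.
    Illustrative sizes; hard selection as written needs a differentiable
    relaxation (e.g. straight-through) to train end to end."""
    def __init__(self, d_model=128, n_heads=4, k_max=16):
        super().__init__()
        self.k_max = k_max  # requires at least k_max frames in the input
        # Gate: scores each past frame and predicts how many to keep.
        self.score = nn.Linear(d_model, 1)
        self.k_head = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(),
                                    nn.Linear(32, 1), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frames):                # frames: (B, T, d) embeddings
        cur = frames[:, -1]                   # current frame, the query
        # Variable k from environment feedback (here: the current frame).
        k = (self.k_head(cur) * self.k_max).long().clamp(min=1)   # (B, 1)
        scores = self.score(frames).squeeze(-1)                   # (B, T)
        # Keep the k_max best-scoring frames, then mask out anything beyond
        # each sample's own k, so the batch stays rectangular.
        _, top_idx = scores.topk(self.k_max, dim=1)
        mask = torch.arange(self.k_max,
                            device=frames.device)[None, :] >= k   # True = drop
        selected = torch.gather(
            frames, 1, top_idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))
        out, _ = self.attn(cur.unsqueeze(1), selected, selected,
                           key_padding_mask=mask)
        return out.squeeze(1)                 # context-gated current state

layer = GatedFrameTransformer()
h = layer(torch.randn(2, 64, 128))            # 64 past frames -> (2, 128)
```

Selecting a fixed `k_max` and masking down to each sample's own k keeps the batch rectangular while still letting k vary per step and per sample.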
We show that asymptotic inference time becomes O(n) instead of O(n^2) by keeping the model's context length variable, without losing performance: the model imitates the human player as well as a behavior cloning baseline. With this approach, the number of AI bots simultaneously playing a game can be increased up to n times.
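A back-of-envelope version of the complexity claim, assuming one new frame arrives per step and that per-step cost is dominated by attention over the frames in context:

$$
\underbrace{\sum_{t=1}^{n} O(t)}_{\text{full context}} = O(n^2)
\qquad\text{vs.}\qquad
\underbrace{\sum_{t=1}^{n} O(k)}_{\text{gated top-}k} = O(nk) = O(n)\ \text{ for bounded } k.
$$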