Your model here is running at insane speed, which is really cool.
But it only has a max context window of 32k, with a max output of 8k, which is a bit insufficient even for a single complex problem. As you know, DeepSeek R1 is known for outputting lots of tokens while it thinks, and at your speed 8k fills up within seconds.
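To illustrate what I mean, here's roughly how the cap bites in practice (a minimal sketch assuming an OpenAI-compatible endpoint; the base URL and model id are placeholders I made up, not your actual API):

```python
# Minimal sketch: probing the output-token cap via an OpenAI-compatible API.
# base_url and model are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

resp = client.chat.completions.create(
    model="deepseek-r1",  # placeholder model id
    messages=[{"role": "user", "content": "Solve this hard proof step by step."}],
    max_tokens=8192,      # the 8k output cap; at your speed this fills in seconds
)

# A long reasoning chain gets cut off mid-thought once the cap is hit:
print(resp.choices[0].finish_reason)  # "length" => truncated by max_tokens
```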
I'm wondering what is limiting that number. If you can push it to 128k, like a normal LLM deployment, it will start to have lots of applications.
Does your way of running the model come with an inherent context-window limitation?