> Grok-4 significantly underperformed compared to expectations. Many of its initial responses were extremely short, often consisting only of a final answer without explanation.
This is very weird. Given how verbose most models usually are, there must have been something wrong with the system prompt.

Also: Grok used 89,996 input tokens compared to 591,624 for o3 high. What kind of tokenizer are they using that compresses the input so much? I suppose all the inputs are actually the same, since the math problems and instructions are the same; the only difference is the tokenizer or the system prompt. But I doubt that would make up the difference. Is o3 using 500k more tokens for its system prompt?