
Aligned's IRL Social Media Eval: Which AI Models Can Write Banger Tweets?




In the ever-evolving landscape of social media, artificial intelligence has emerged as a powerful tool for content creation. But which AI models truly excel at crafting engaging tweets that leave readers hungry for more? We've put the top contenders to the test in a comprehensive human evaluation, and the results are in. The short answer: There is still a lot of work to do before the robots go viral.


Before we dive into the results, it's important to note that much of the success or failure of these tweets is driven by the prompting the models receive. In this test, we gave each of the state-of-the-art models the same system prompt, which you can see at the end of this post. We then asked Aligned's creative writing expert pool which of two tweets was better in a blind, side-by-side rating. Experts also explained why they chose the winning tweet and which characteristics of the losing tweet counted against it.
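For readers who want to set up a similar blind comparison, here is a minimal sketch of how the rating tasks could be built. It is not Aligned's actual pipeline: the function name `make_blind_pairs`, the task layout, and the choice to compare every pair of models are assumptions made for illustration. The point is only that the order of "Response A" and "Response B" is randomized and the model names stay in a separate key the rater never sees.

```python
import random
from itertools import combinations


def make_blind_pairs(tweets_by_model, rng):
    """Build side-by-side rating tasks with model identities hidden.

    tweets_by_model maps a model name to the tweet it wrote for one topic.
    Each task shows two tweets labeled "A" and "B" in random order. The
    "key" entry maps labels back to models and is not shown to raters.
    """
    tasks = []
    for first, second in combinations(sorted(tweets_by_model), 2):
        pair = [first, second]
        rng.shuffle(pair)  # randomize which model appears as Response A
        tasks.append({
            "shown": {
                "A": tweets_by_model[pair[0]],
                "B": tweets_by_model[pair[1]],
            },
            "key": {"A": pair[0], "B": pair[1]},
        })
    rng.shuffle(tasks)  # avoid always presenting the same matchup first
    return tasks


# Hypothetical example data; the tweets are placeholders, not eval outputs.
example = {
    "model_x": "Tweet text from model x",
    "model_y": "Tweet text from model y",
    "model_z": "Tweet text from model z",
}
tasks = make_blind_pairs(example, random.Random(0))
```

Passing in a seeded `random.Random` keeps the task order reproducible, which helps when an evaluation has to be audited or rerun.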



The Leaderboard: A Neck-and-Neck Race



1. Llama 3.1 405B Instruct Turbo and Claude 3.5 Sonnet (Tied for 1st): Both models achieved an impressive Elo rating of 1249, with nearly identical win rates of 59.0%. These models were the best at crafting engaging, creative content without annoying the reader.


2. GPT-4o (3rd Place): With a score of 1203 and a 51.0% win rate, GPT-4o showed strong performance but couldn't quite match the top two.


3. Mistral Large (4th Place): Scoring 1188 points with a 48.0% win rate, Mistral Large held its own but fell short of the leaders. Mistral tends to be more creative but also more annoying with lots of hashtags and clickbait.


4. Gemini 1.5 Pro (5th Place): Despite its prowess in other areas, Gemini struggled in this task, achieving only a 34.0% win rate. We saw this in our IRL 25 benchmark as well: Google’s models are relatively weaker when it comes to creative writing.
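The post does not say how the Elo ratings and win rates above were computed, so the following is only a sketch of the standard sequential Elo update applied to pairwise outcomes. The K-factor of 32 and the starting rating of 1200 are assumptions, as are the function names. Note that sequential Elo depends on the order in which matches are processed, so a production leaderboard may average over shuffles or fit a different model.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(matches, k=32.0, start=1200.0):
    """Apply sequential Elo updates to a list of (winner, loser) pairs.

    k and start are assumed values, not taken from the evaluation.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in matches:
        expected_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - expected_win)  # larger gain for an upset win
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def win_rates(matches):
    """Fraction of comparisons each model won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in matches:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical outcomes: model "a" beats "b" twice, then loses once.
matches = [("a", "b"), ("a", "b"), ("b", "a")]
ratings = elo_ratings(matches)
rates = win_rates(matches)
```

Because every update adds to the winner exactly what it takes from the loser, the ratings always sum to the number of models times the starting rating.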



The Secret Sauce: What Makes a Great AI-Generated Tweet?


Aligned's workforce of creative writing experts provided invaluable insights into what set the top-performing models apart. Here's what they had to say:


Most tweets sound robotic and AI-generated (these models love hashtags and emoji):


Both of these seem too AI-written. The emojis and hashtags need to go (I think it disrupts the flow of the reading). - Creative Writing Expert (pk93EBQ0W04aM4Gspzli)

Response A sounds dull, robotic, uninspired and doesn't end with a question, making it less likely to perform well. - Creative Writing Expert (VyRqFL4Z0Szf2Zqna9d6)

Both sound like it's been put together by a robot. - Creative Writing Expert (gaqMHtuH1zDbCoL2BAvZ)

Good tweets tend to be thought-provoking, relatable, and engaging:


Response B is better because it's direct, concise and thought-provoking, while using the right tone of voice for this target market - Creative Writing Expert (BTYA9lZ8mQoQs8YQfDw8)


Models try to use analogies and questions to help get complicated ideas across (partly due to our system prompt) but it doesn’t always work:


Both responses do a good job of posing questions and making the debate around augmented reality relatable to the reader. - Creative Writing Expert (e76I96QL0ZmP4nBcQjDN)

Both demand too much concentration. - Creative Writing Expert (htnYJarNSd9TUGcLaENd)


Strengths and Weaknesses: A Closer Look


| Model | Strengths | Weaknesses |
| --- | --- | --- |
| Llama 3.1 405B Instruct Turbo and Claude 3.5 Sonnet | Consistently produced relevant, engaging content with a natural tone | Occasional overuse of hashtags or emojis |
| GPT-4o | Excelled at crafting creative analogies and thought-provoking questions | Sometimes strayed off-topic or used overly complex language |
| Mistral Large | Demonstrated a good grasp of trending topics and current events | Occasionally produced tweets that felt too promotional or lacked nuance |
| Gemini 1.5 Pro | Showed potential for handling technical topics well | Struggled with maintaining a consistent tone and often produced overly long tweets |



The Human Touch: Still Irreplaceable

While these AI models showed impressive capabilities, our experts unanimously agreed that human oversight remains crucial.

Key Takeaways for Leveraging AI in Social Media

  1. Choose the right tool for the job: Llama and Claude showed particular promise for engaging, click-worthy tweets.

  2. Craft clear prompts: The quality of AI-generated content heavily depends on the clarity and specificity of your instructions.

  3. Embrace the strengths, mitigate the weaknesses: Use AI to generate ideas and rough drafts, but be prepared to refine and polish the output.

  4. Maintain authenticity: Ensure your brand's voice shines through, even when using AI assistance.

  5. Always review and edit: Human oversight is crucial for catching nuances and avoiding potential pitfalls.



Evaluating Your Own Models with Aligned’s Experts

Want to conduct your own human evaluations of AI-generated content? Aligned's Evaluation Platform makes it easy to tap into a pool of expert raters across various domains. Whether you're testing social media posts, article drafts, or any other type of content, our platform provides valuable insights into how your AI models perform in real-world scenarios. With customizable evaluation criteria and detailed feedback from human experts, you can fine-tune your models and ensure they're delivering the results you need. Learn more about how our Evaluation Platform can help you optimize your AI-powered content creation process.



 

SYSTEM PROMPT

You are a world-class creative content writer specializing in Twitter posts. Your task is to create engaging tweets under 180 characters, balancing commercial content with personal opinions and societal observations. Generate posts that are thought-provoking, traffic-driving, and subtly provocative without being inflammatory. Aim for a mix that maintains authenticity, drives engagement, and encourages reflection.


Style Guide:

  1. Use direct, punchy language with minimal fluff.

  2. Incorporate rhetorical questions to engage readers.

  3. Present contrasts or paradoxes to highlight complex issues.

  4. Include specific examples or scenarios to make abstract concepts tangible.

  5. Use hashtags sparingly and strategically.

  6. Avoid excessive enthusiasm or overuse of exclamation points.

  7. Balance promotion with genuine social commentary.

  8. Encourage dialogue without being divisive.

  9. Use metaphors or analogies to explain complex ideas succinctly.

  10. Aim for a tone that's confident yet contemplative.


Provide only the tweet text, without any explanations or preamble.
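Some of the constraints in this prompt can be checked mechanically before a human reviewer ever sees a draft. The sketch below is one possible pre-review check, not part of the evaluation: the 180-character limit comes from the prompt, while the thresholds for "sparingly" (one hashtag) and "overuse" (one exclamation point) are assumptions, and `lint_tweet` is a hypothetical helper. It counts Python characters, which may differ from how Twitter weights certain characters.

```python
import re

MAX_CHARS = 180        # "under 180 characters" from the system prompt
MAX_HASHTAGS = 1       # assumed reading of "sparingly"
MAX_EXCLAMATIONS = 1   # assumed reading of "overuse"

HASHTAG = re.compile(r"#\w+")


def lint_tweet(text):
    """Return a list of style-guide problems found in a draft tweet."""
    problems = []
    if len(text) >= MAX_CHARS:
        problems.append(f"too long: {len(text)} characters")
    hashtag_count = len(HASHTAG.findall(text))
    if hashtag_count > MAX_HASHTAGS:
        problems.append(f"too many hashtags: {hashtag_count}")
    exclamation_count = text.count("!")
    if exclamation_count > MAX_EXCLAMATIONS:
        problems.append(f"too many exclamation points: {exclamation_count}")
    return problems


# Hypothetical drafts for illustration.
clean = lint_tweet("Is your feed informing you or just agreeing with you?")
noisy = lint_tweet("Wow!!! #a #b #c")
```

A check like this only filters obvious violations; tone, relevance, and whether a tweet "sounds robotic" still need a human rater.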
