Aindriya Barua, a non-binary individual and the founder of ShhorAI, a platform designed to filter hate speech against queer communities, encountered significant online harassment while pursuing their Bachelor of Technology in Bengaluru and later during their professional career. The hateful messages became so severe that Barua’s sister and her child were threatened. Recognizing the potential of technology to combat this issue, Barua leveraged their expertise in Indic Natural Language Processing (NLP) to create a solution.
NLP is a field of computer science and artificial intelligence that enables machines to understand and communicate in human language. Indic NLP refers to teaching machines Indian languages. Building large language models in India is challenging due to the lack of appropriate datasets. However, some AI startups have succeeded in creating models that understand and respond in 12 of India’s major languages.
Barua struggled to find datasets suited to their specific use case. Hateful content often slips past existing filters for three reasons: Big Tech’s content moderation does not account for abuse in Indian languages, abusers combine numbers, letters, and symbols to evade filters, and the spelling of abusive terms varies by region. As a result, there is often no single spelling of a word that a filter can flag.
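To make the evasion problem concrete, here is a minimal Python sketch. It is not ShhorAI's method. The blocklist, the harmless placeholder term "badword", the substitution table, and the function names are all assumptions made for illustration, and the sketch only handles text typed in Latin letters. It compares an exact-match keyword filter with one that first undoes a few common character swaps.

```python
import re

# Hypothetical blocklist; "badword" is a harmless placeholder term.
BLOCKLIST = {"badword"}

# Assumed symbol/digit substitutions, for illustration only.
SUBSTITUTIONS = str.maketrans(
    {"4": "a", "@": "a", "0": "o", "1": "i", "3": "e", "$": "s", "5": "s"}
)


def naive_flag(text):
    """Flag text only if a token exactly matches a blocklisted word."""
    return any(token in BLOCKLIST for token in text.lower().split())


def normalize(token):
    """Undo common obfuscations before comparing against the blocklist."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)     # drop leftover symbols and digits
    token = re.sub(r"(.)\1+", r"\1", token)  # collapse repeated letters
    return token


def normalized_flag(text):
    """Flag text if any normalized token matches a normalized blocklist entry."""
    blocklist = {normalize(word) for word in BLOCKLIST}
    return any(normalize(token) in blocklist for token in text.split())


if __name__ == "__main__":
    for message in ["you badword", "you b@dw0rd", "you baaadword!!", "you bedword"]:
        print(message, naive_flag(message), normalized_flag(message))
```

In this sketch the exact-match filter misses "b@dw0rd" and "baaadword!!", while the normalizing filter catches both. Neither catches "bedword", a stand-in for a regional spelling variant, which is the kind of gap that hand-written rules struggle to close and that a model trained on labelled examples is meant to address.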
Barua used their social media to gather links to hate speech and stored them in a spreadsheet. To turn this into a dataset that could train an AI, they manually tagged each link as either hate speech or not. With the help of 45 volunteers during a United Nations Population Fund hackathon, they managed to create a dataset containing 45,000 instances of hate speech, which Barua believes is sufficient for an AI to learn from.