
CROSPELL ENGINE - Natural Language Processing Engine


This slide covers the CROSPELL Engine, an engine built with multiple approaches to Natural Language Processing. It covers a wide variety of topics in text and image processing, from spell checking to topic prediction. The project was made in late 2012 and delivered in early 2013 at the F.I.T.E of Damascus, Syria, as the final project of the NLP course (with Ola Al Naameh and Mhd Hasan Sarhan).

System Specification (implementation details can be found in the doc.)

1. Auto-correction

The auto-correction algorithm makes sure that a misspelled word is matched with a proper, correctly spelled word. Many approaches can be used for this. The one I opted for is the distance between keys on the keyboard map.

But one should make sure to pick the right algorithm. Keys on the keyboard map are not laid out linearly, and the distances between keys are not linear either. The best fit for measuring the right distance is a Gaussian curve.

The CyperSpell Algorithm maps (possibly) misspelled words to their correctly spelled counterparts using a dictionary.

The user can write in real time, and the system will auto-correct (or suggest) the correct words when the user misspells one. The system also remembers which words the user has misspelled before and ranks the corrections they chose higher in the list of suggestions.

2. Language Identification

The user can input text in any language and the system can figure out what that language is (as long as the corresponding corpora are provided).

If there is more than one language in the text, the system lists them, ranked by how frequently they occur in the text.

3. Word Prediction

Using bi-grams and tri-grams, the system can suggest auto-completions while the user is writing.

4. Topic Prediction

Using bi-grams and tri-grams, the system can suggest the topic that best matches the paragraph. In fact, the system lists all the possible topic predictions and ranks them by best match.

5. Dictionary

The system also provides an Arabic-English dictionary.

6. Image Processing using NLP Approaches

Using Minimum Edit Distance (MED), we can match images with others having similar properties (colors, in our case). This approach is shallow, though, since it fails completely when images are resized or rotated. Anyway, it's just for fun!

The system works best when comparing images that are similar in size and have not been transformed.

7. ISRI and Porter Stemming Algorithms

Both the ISRI and Porter stemming algorithms are implemented in the engine.

8. Genome Matching using Minimum Edit Distance

The engine, interestingly, implements genome matching using MED. In the initial interface, the user can input two genomes and the system will find the match between the two.

9. Sentiment Analysis

The system implements a light sentiment analyzer. Just write a sentence or a paragraph and the system will provide the corresponding emotion for it.

You can download the full project documentation [in Arabic - بالعربية] here. I would be happy to upload the engine source code along with its interface for anyone to use, but the language corpora are quite big (the project is 400 MB!), so if anyone is interested, don't hesitate to contact me by mail and I'll figure something out!
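
As a rough illustration of the Minimum Edit Distance idea behind the image and genome matching sections, here is a minimal Python sketch. It is not the engine's code: the function name min_edit_distance, the sub_cost parameter and the unit insert/delete costs are assumptions made for the example, since the post does not describe the cost scheme the engine actually uses.

```python
def min_edit_distance(a, b, sub_cost=1):
    """Minimum edit (Levenshtein) distance between two sequences.

    Hypothetical helper for illustration only. Insertions and deletions
    cost 1; a substitution costs sub_cost (an assumed default of 1).
    Works on any indexable sequences: strings, lists of colors, etc.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i items of a into an empty prefix: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                # delete x
                curr[j - 1] + 1,                            # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two short, made-up genome fragments.
    print(min_edit_distance("ACGT", "AGT"))
    # Two tiny "images" flattened to lists of RGB colors.
    print(min_edit_distance([(255, 0, 0), (0, 255, 0)],
                            [(255, 0, 0), (0, 0, 255)]))
```

Because the function compares items position by position, it also shows why the image matching is shallow: resizing or rotating an image reorders or changes the color sequence, so the distance grows even when the pictures look alike.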