
Gender-identifier

Name based gender identification using NLTK


Gender Identification using NLTK

This project is based on a simple observation: female names usually end in vowels such as 'a', 'e' and 'i', whereas male names usually end in 'k', 'o', 'r', 's' and 't'. Using this feature, we can generate a confidence value for whether an input name belongs to a male or a female. This is the first such study on Hindi names. The study on English names can be found here. The English name corpus is included in NLTK as the 'names' corpus. You can load it with:

>>> from nltk.corpus import names

NLTK

Natural Language Processing, though quite intense and arduous, becomes manageable with the Natural Language Toolkit (NLTK), originally authored by Steven Bird, Edward Loper and Ewan Klein.

Procedure

Data Collection

I started by scraping a few websites that list names of boys and girls and formatted the data as CSV. Two separate corpora were built, one for male names and one for female names.
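A minimal sketch of how the two corpora could be combined into one labeled list. The helper name `load_names`, the one-name-per-row layout and the sample names are assumptions for illustration; the actual scraped files may be laid out differently.

```python
import csv
import io

def load_names(csv_file, label):
    """Read one name per row from an open CSV file and attach a gender label."""
    reader = csv.reader(csv_file)
    # Skip blank rows and strip stray whitespace around each name.
    return [(row[0].strip(), label) for row in reader if row and row[0].strip()]

# In-memory stand-ins for the two scraped files (hypothetical sample data).
male_csv = io.StringIO("Amit\nRajesh\nVikas\n")
female_csv = io.StringIO("Priya\nKavita\nSunidhi\n")

labeled_names = load_names(male_csv, "male") + load_names(female_csv, "female")
print(labeled_names[:2])
```

With real files, the same helper can be called on a handle opened with `open(path, newline="")`.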

Feature Extraction

The last letter of the name is the major distinguishing factor, so only that letter is extracted and used by the classifier.
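The extraction step can be sketched as a small function that returns a feature dictionary, which is the shape NLTK classifiers accept. The function name `gender_features` and the key `last_letter` are illustrative choices, not necessarily the ones used in this project.

```python
def gender_features(name):
    """Return the feature dictionary for a name: its last letter, lower-cased."""
    # Assumes a non-empty name; surrounding whitespace is ignored.
    return {"last_letter": name.strip()[-1].lower()}

print(gender_features("Kavita"))  # {'last_letter': 'a'}
```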

Technique

This is a supervised classification technique: the training data set carries gender labels, and the classifier learns from them. A Naive Bayes classifier is used.
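To make the confidence value concrete, here is a from-scratch sketch of Naive Bayes on the single last-letter feature. It is not NLTK's implementation: the add-one smoothing, the 26-letter alphabet and the tiny training list are assumptions, so the numbers will differ from what NLTK reports.

```python
from collections import Counter

ALPHABET_SIZE = 26  # assumption: names are romanized into a-z

def train(labeled_names):
    """Count labels and (label, last letter) pairs in the training data."""
    label_counts = Counter()
    letter_counts = Counter()
    for name, label in labeled_names:
        label_counts[label] += 1
        letter_counts[(label, name[-1].lower())] += 1
    return label_counts, letter_counts

def confidence(model, name):
    """Return P(label | last letter) for every label, using add-one smoothing."""
    label_counts, letter_counts = model
    total = sum(label_counts.values())
    letter = name[-1].lower()
    scores = {}
    for label, count in label_counts.items():
        prior = count / total
        likelihood = (letter_counts[(label, letter)] + 1) / (count + ALPHABET_SIZE)
        scores[label] = prior * likelihood
    # Normalize so the scores sum to 1 and can be read as confidence values.
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Tiny hypothetical training set, only to make the sketch runnable.
training = [("Amit", "male"), ("Rajesh", "male"), ("Vikas", "male"),
            ("Priya", "female"), ("Kavita", "female"), ("Sunidhi", "female")]
model = train(training)
probs = confidence(model, "Anita")
print(max(probs, key=probs.get), probs)
```

With one feature, the classifier reduces to comparing prior times likelihood for each gender, which is why the last letter alone is enough to produce a confidence value.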

Cross Validation

When provided with a data set, it is advisable to reserve part of it as a test set so that the generated hypothesis can be tested on unseen data. Usually a 70/30 or 60/40 division is used, so 70% or 60% of the data is used for training and the remaining 30% or 40% for testing. Along the same lines, a cross validation set is held out so that the hypothesis generated from the training data can be validated and refined into a better one before the final test. For this a 60/20/20 division is used: training data constitutes 60%, the cross validation set 20% and the test set the remaining 20%. A cross validation set is widely used because it provides a chance to fine-tune the feature vector without touching the test set.
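The 60/20/20 division can be sketched as follows. The function name `split_60_20_20`, the fixed seed and the placeholder data are assumptions for illustration; shuffling first matters because the male and female corpora are stored separately.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle a copy of the data and cut it into train / validation / test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    train_end = len(shuffled) * 6 // 10
    validation_end = len(shuffled) * 8 // 10
    return (shuffled[:train_end],
            shuffled[train_end:validation_end],
            shuffled[validation_end:])

# Placeholder labeled data, only to make the sketch runnable.
data = [("name%d" % i, "male" if i % 2 else "female") for i in range(10)]
train_set, validation_set, test_set = split_60_20_20(data)
print(len(train_set), len(validation_set), len(test_set))  # 6 2 2
```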

Accuracy

Based on the test set, we can measure the accuracy of the classifier using:

print(nltk.classify.accuracy(classifier, test_set))

Initially we had a hypothesis that female names usually end in vowels such as 'a', 'e' and 'i', whereas male names usually end in 'k', 'o', 'r', 's' and 't'. This was based on prior experience. Now let's see whether it actually holds. From the corpus provided to it, the classifier has generated its own hypothesis, and it can tell us which letters it relies on most for the classification.

# show_most_informative_features prints its report itself and returns None
classifier.show_most_informative_features(5)
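The idea behind such a report can be sketched without NLTK: for each last letter, compare how likely it is under each gender and rank letters by that ratio. This is an illustration of the idea, not NLTK's code; the helper name `letter_ratios`, the add-one smoothing and the sample names are assumptions.

```python
from collections import Counter

def letter_ratios(labeled_names, alphabet_size=26):
    """Rank last letters by how strongly they favour one gender over the other."""
    totals = Counter(label for _, label in labeled_names)
    counts = Counter((label, name[-1].lower()) for name, label in labeled_names)
    letters = {letter for _, letter in counts}
    ranked = []
    for letter in letters:
        # Add-one smoothed P(letter | label) for each gender.
        p_male = (counts[("male", letter)] + 1) / (totals["male"] + alphabet_size)
        p_female = (counts[("female", letter)] + 1) / (totals["female"] + alphabet_size)
        if p_male >= p_female:
            ranked.append((p_male / p_female, letter, "male"))
        else:
            ranked.append((p_female / p_male, letter, "female"))
    # Largest ratio first: these letters separate the genders most sharply.
    return sorted(ranked, reverse=True)

# Hypothetical sample data, only to make the sketch runnable.
sample = [("Amit", "male"), ("Rohit", "male"), ("Vikas", "male"),
          ("Priya", "female"), ("Kavita", "female"), ("Anita", "female")]
for ratio, letter, label in letter_ratios(sample):
    print("last_letter = %r  favours %s  %.1f : 1.0" % (letter, label, ratio))
```

A letter with a high ratio supports the initial hypothesis only if it favours the gender that hypothesis predicted, so the ranking is a direct check on the prior assumption.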