Naïve Bayes and Character encoding in Java
One of my most recent assignments for the Introduction to Machine Learning unit has me creating a Naïve Bayes classifier for identifying spam emails. The most recent stage I’ve implemented is the 10-fold cross validation.
10-fold cross validation involves first randomising the order of your set of training emails, then splitting the emails into 10 groups (or folds) of roughly equal size. Next, you train on 9 out of the 10 folds, and test on the remaining fold. You do this 10 times so each of the folds is tested by a classifier trained on the other 9 folds. You then take the average of the 10 success rates to give the final result.
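The procedure can be sketched in Java roughly as follows. This is a minimal illustration, not the assignment's actual code: the class name, the fixed random seed and the `trainAndScore` callback (which stands in for "train Naïve Bayes on these emails, then return the accuracy on those") are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

class CrossValidationSketch {

    /** Shuffles the items and deals them into k folds whose sizes differ by at most one. */
    static <T> List<List<T>> splitIntoFolds(List<T> items, int k, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % k).add(shuffled.get(i));
        }
        return folds;
    }

    /**
     * Runs k rounds. In each round one fold is held out for testing and the
     * other k - 1 folds form the training set. Returns the mean score.
     * trainAndScore is a placeholder: (training set, test set) -> accuracy.
     */
    static <T> double crossValidate(List<T> items, int k, long seed,
                                    BiFunction<List<T>, List<T>, Double> trainAndScore) {
        List<List<T>> folds = splitIntoFolds(items, k, seed);
        double total = 0.0;
        for (int held = 0; held < k; held++) {
            List<T> training = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                if (i != held) {
                    training.addAll(folds.get(i));
                }
            }
            total += trainAndScore.apply(training, folds.get(held));
        }
        return total / k;
    }

    public static void main(String[] args) {
        // Integers stand in for emails; the scorer is a dummy, not a classifier.
        List<Integer> emails = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            emails.add(i);
        }
        double average = crossValidate(emails, 10, 42L,
                (train, test) -> train.size() / 100.0);
        System.out.println("Average score: " + average);
    }
}
```

Fixing the seed is a deliberate choice here: two people running the same code on the same emails should then get the same folds, which makes any remaining difference in results easier to track down.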
The puzzling thing was that a friend and I were getting different end results. We tracked the cause down to our programs reading in a different number of words for a couple of the emails. After further investigation, it turned out that these emails were encoded in Windows-1251, a single-byte character set for the Cyrillic alphabet.
Yes, that’s correct, we were expected to classify Russian emails!
The reason we were getting different results is that the default character encoding Java uses depends on the operating system and its locale settings. I was using Linux (Ubuntu 10.04) and my friend was using Windows (7). When reading in from a file on Linux, Java assumes the file is encoded as UTF-8. The bytes of the Cyrillic letters in a Windows-1251 file are not valid UTF-8, so each of them is decoded as the Unicode replacement character (U+FFFD, usually drawn as a question mark in a diamond). On Windows, on the other hand, Java seems to assume a single-byte Western encoding (ISO 8859-1, or more likely the closely related windows-1252, which is identical for these byte values). That reads each Cyrillic letter as the wrong character, but still as a distinct one, so the words survive as words.
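The difference is easy to reproduce without any files. The sketch below decodes the same Windows-1251 bytes (the two Russian words "Привет мир") as UTF-8 and as ISO 8859-1 and counts the words in each result. The class name and the `countWords` tokeniser (a word is a run of letters) are assumptions for illustration; the tokeniser used in the actual assignment may well differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingDemo {

    // "Привет мир" encoded as Windows-1251: one byte per letter, 0x20 is the space.
    static final byte[] CP1251_BYTES = {
        (byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2,
        (byte) 0x20,
        (byte) 0xEC, (byte) 0xE8, (byte) 0xF0
    };

    /** Counts runs of letters. Hypothetical tokeniser, used only for this demo. */
    static int countWords(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            boolean letter = Character.isLetter(text.charAt(i));
            if (letter && !inWord) {
                count++;
            }
            inWord = letter;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Invalid UTF-8 bytes are replaced with U+FFFD, which is not a letter.
        String asUtf8 = new String(CP1251_BYTES, StandardCharsets.UTF_8);
        // Every byte maps to some Latin-1 character, here all accented letters.
        String asLatin1 = new String(CP1251_BYTES, StandardCharsets.ISO_8859_1);

        System.out.println("Words when decoded as UTF-8:      " + countWords(asUtf8));
        System.out.println("Words when decoded as ISO 8859-1: " + countWords(asLatin1));
    }
}
```

With this tokeniser the UTF-8 decoding yields no words at all, while the ISO 8859-1 decoding yields two (wrong-looking) words, which is the kind of mismatch in word counts described above.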
Simply forcing the InputStreamReader behind my BufferedReader to use ISO 8859-1 seemed to solve the problem, since both machines then decode the bytes the same way:
// Decode with a fixed charset instead of the platform default
stream = new InputStreamReader(new FileInputStream(file), "ISO8859_1");
reader = new BufferedReader(stream);
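For completeness, here is a slightly fuller sketch of the same idea as a self-contained helper. The class and method names are made up for the example; it uses the `StandardCharsets` constants (Java 7 and later) rather than a charset name string, and try-with-resources so the file is always closed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class FixedCharsetReader {

    /** Reads all lines of a file, decoding with the given charset rather than the platform default. */
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FixedCharsetReader <email-file>");
            return;
        }
        // ISO 8859-1 maps every byte to exactly one character, so no input is ever rejected.
        List<String> lines = readLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        System.out.println("Read " + lines.size() + " lines");
    }
}
```

If the encoding of an email is actually known, passing `Charset.forName("windows-1251")` instead should recover the real Cyrillic text, assuming the Java runtime in use ships that charset. For the classifier, what matters most is that everyone decodes the files the same way.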
That’s all for now, back to work!