Classifier settings can be set when creating a new classifier or modified later from the settings tab. These settings can have a great impact on the performance of the classifier, and the correct values depend on the particular classification problem you want to solve.

If you edit any of these settings, you must retrain the project (and redeploy if you are using a live tree) in order to see the changes when classifying.

Language

This setting should match the language of your samples. Currently we support the following languages:

  • English
  • Dutch
  • French
  • German
  • Italian
  • Portuguese
  • Spanish
  • Russian
  • Chinese
  • Japanese
  • Korean
  • Arabic
  • Danish
  • Swedish
  • Romanian
  • Hungarian
  • Finnish
  • Norwegian
  • Other / Multi-language

Selecting the correct language is important: MonkeyLearn uses this information for the stemming and tokenization process, and for the default stopwords selection.

If we don’t support the right language for your data yet, you should definitely try the Other / Multi-language option. You can get very good results without stemming, and you can override the default stopwords with your own if you need to.

If your samples use more than one language, for example in a language detection classifier, Other / Multi-language is the right option for you.

If you edit this setting, remember to save the new configuration, then retrain the project (and redeploy if you are using a live tree) in order to see the changes when classifying.

Normalize weights

This setting tells the classifier whether it should take into account the number of samples in each category when defining its probability. Think of it as the prior probability in Bayesian statistical inference.

If your categories are not balanced, should they be normalized to ignore this fact? Or should the categories with more samples have a higher prior probability? The first option corresponds to normalize weights enabled, and the second to normalize weights disabled.

By default weight normalization is enabled.
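
As a rough illustration of what this choice means, the sketch below computes category priors both ways for an unbalanced training set. The function name category_priors and the sample counts are made up for the example; this is not MonkeyLearn's internal code.

```python
def category_priors(sample_counts, normalize_weights=True):
    """Return a prior probability per category (hypothetical helper)."""
    if normalize_weights:
        # Ignore the imbalance: every category gets the same prior.
        uniform = 1.0 / len(sample_counts)
        return {category: uniform for category in sample_counts}
    # Otherwise the prior is proportional to the number of samples.
    total = sum(sample_counts.values())
    return {category: count / total for category, count in sample_counts.items()}


counts = {"Sports": 900, "Politics": 100}  # made-up, unbalanced sample counts
print(category_priors(counts, normalize_weights=True))
print(category_priors(counts, normalize_weights=False))
```

With normalization enabled both categories start with the same prior; with it disabled, the larger category starts with a head start proportional to its share of the samples.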

Use stemming

This setting determines whether words should be stemmed. The stemming process transforms words into their root form, so inflected and derived words are grouped together. For example, the words fishing, fished and fisher are transformed to the root word fish.

Stemming is enabled by default if a particular language is selected. It usually helps the classifier generalize and improves the classification, but this depends on your data and what you are trying to do with it. You can always edit this setting, retrain and check the stats to decide what suits your case.
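
To make the idea of stemming concrete, here is a deliberately naive suffix-stripping sketch. It is not the stemmer MonkeyLearn uses (real stemmers apply many more rules); toy_stem and its suffix list are assumptions chosen only to group together words like fishing, fished and fisher.

```python
SUFFIXES = ("ing", "ed", "er")  # illustrative only, far from complete


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["fishing", "fished", "fisher", "fish"]
print({word: toy_stem(word) for word in words})
```

All four words map to the same root, so the classifier would treat them as a single feature.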

Filter stopwords

This setting enables or disables stopword filtering. Stopwords are words that usually do not contribute as classification features. They are typically high-frequency words such as articles and connectors.

Stopwords are selected from a predefined set of words that depends on the language you have chosen. If you selected Other / Multi-language, stopwords won’t be filtered.

Custom stopwords

If you want to override the default stopwords for your language or particular case, you can add a comma-separated list of stopwords here, and they will be ignored. This feature is useful when you find (by looking at the keyword cloud) that some unwanted keywords are being used as features in the classifier. In that case, you can filter them out by adding them to the list of stopwords.
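
The effect of stopword filtering can be sketched as follows. The default stopword set here is a tiny made-up sample, and parse_stopwords and filter_stopwords are hypothetical helpers, not part of the MonkeyLearn API. The sketch assumes that a custom list, when given, replaces the defaults.

```python
DEFAULT_STOPWORDS = {"the", "a", "an", "and", "of", "is"}  # tiny made-up sample


def parse_stopwords(text):
    """Turn a comma-separated string into a set of lowercase stopwords."""
    return {word.strip().lower() for word in text.split(",") if word.strip()}


def filter_stopwords(tokens, custom=None):
    """Drop stopwords; a custom set, when given, replaces the defaults."""
    stopwords = custom if custom is not None else DEFAULT_STOPWORDS
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "the battery of the phone is great".split()
print(filter_stopwords(tokens))
print(filter_stopwords(tokens, custom=parse_stopwords("the, of, is, phone")))
```

In the second call the word phone is treated as a stopword too, which is the kind of adjustment you might make after spotting an unhelpful keyword in the keyword cloud.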

N-gram range

N-gram range sets which features will be used to characterize texts:
  • Unigrams, or single words (n-gram size = 1)
  • Bigrams, or terms composed of two words (n-gram size = 2)
  • Trigrams, or terms composed of three words (n-gram size = 3)

Currently we support the following combinations:

  • Unigrams (Default)
  • Unigrams and Bigrams
  • Unigrams, Bigrams and Trigrams
  • Bigrams
  • Bigrams and Trigrams
  • Trigrams

For problems like Sentiment Analysis, n-gram ranges that include bigrams or trigrams can dramatically improve classification accuracy, as they capture more complex expressions formed by combining more than one word. The rationale is that in Sentiment Analysis the outcome depends not only on the frequency of words but also on how they are combined: good has a different meaning alone than when preceded by not, as in not good.
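
The following minimal sketch shows what an n-gram range produces for a short text. The ngrams function and its min_n and max_n parameters are names chosen for this example, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, min_n=1, max_n=1):
    """Return all n-grams with min_n <= n <= max_n, as space-joined strings."""
    features = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = "the movie was not good".split()
print(ngrams(tokens, 1, 1))  # unigrams only
print(ngrams(tokens, 1, 2))  # unigrams and bigrams
```

Only the second call produces the feature not good, which carries the opposite sentiment of the unigram good.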

 

Max features

This sets the maximum number of features used to characterize texts in the training/classification process. This number affects how many computational resources are needed to train the model: more features mean longer computation times. Classification times are affected as well.

Intuitively, adding more features should improve results, but this is not always the case: adding more features can decrease accuracy while also increasing computation times.

The default value is 10,000 features, which is a reasonable value for text mining applications. More features may allow the model to represent the data better, but too many carry the risk of overfitting the data and decreasing generalization performance.
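
One simple way to cap the number of features is to keep only the most frequent ones, as sketched below. This is an assumption made for illustration: how MonkeyLearn actually ranks and selects features is not described here, and select_features is a hypothetical helper.

```python
from collections import Counter


def select_features(documents, max_features):
    """Keep the max_features most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [token for token, _ in counts.most_common(max_features)]


docs = ["good phone good battery", "bad phone", "good screen"]
print(select_features(docs, max_features=2))
```

With max_features=2 only good and phone survive; every other word is ignored by the model.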

Classifier Algorithm

Use this setting to choose which classification algorithm you want to use for this classifier. Currently you have two options:

  • Multinomial Naive Bayes is a very simple and fast algorithm that performs very well in most cases. Read more on Wikipedia.
  • Support Vector Machines is a more complex algorithm, slightly slower than Naive Bayes, but it generally delivers higher accuracy. Read more on Wikipedia.

An interesting advantage of Multinomial Naive Bayes (MNB) is that it gives you more insight into how the model works than SVM does. After training a module with MNB, if you click on a particular sample in the Sandbox/Sample tab, you’ll see a detail of the sample with the corresponding prediction and the positive and negative influence that particular features had on this prediction.
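
To see why Naive Bayes lends itself to this kind of inspection, the sketch below trains a tiny word-count model and reports, for each word in a new text, a log-likelihood ratio between two categories. It only illustrates the general idea: the training samples are made up, train and feature_influence are hypothetical names, category priors are left out, and this is not necessarily how the influence shown in the Sandbox is computed.

```python
import math
from collections import Counter


def train(samples):
    """samples: list of (text, category). Returns token counts per category."""
    counts = {}
    for text, category in samples:
        counts.setdefault(category, Counter()).update(text.lower().split())
    return counts


def feature_influence(counts, text, category, other):
    """Per-token log-likelihood ratio: > 0 favors category, < 0 favors other."""
    vocabulary = set().union(*counts.values())

    def log_prob(token, cat):
        # Laplace (add-one) smoothing so unseen tokens never give log(0).
        total = sum(counts[cat].values()) + len(vocabulary)
        return math.log((counts[cat][token] + 1) / total)

    return {
        token: log_prob(token, category) - log_prob(token, other)
        for token in text.lower().split()
    }


samples = [  # made-up training data
    ("great phone great screen", "Positive"),
    ("love this phone", "Positive"),
    ("terrible battery", "Negative"),
    ("terrible phone bad screen", "Negative"),
]
model = train(samples)
influence = feature_influence(model, "great phone terrible battery", "Positive", "Negative")
for token, score in influence.items():
    print(f"{token:10s} {score:+.2f}")
```

Words with a positive score push the prediction toward Positive and words with a negative score push it toward Negative. Because the Naive Bayes score is a sum of such per-word terms, each feature's contribution can be read off directly.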

Is Multilabel

Defines whether the module is single-label (default) or multi-label. You can only set this option when you first create the module; you cannot change it later.

In a multi-label classifier you can put a single sample into more than one category. This means that when you perform a new classification, it can return more than one possible category. More about this in the API Reference documentation.
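
The difference in output can be sketched as below. The scores, the threshold value and the predict function are all assumptions made for illustration; the actual response format is described in the API Reference documentation.

```python
def predict(scores, multilabel=False, threshold=0.5):
    """scores: dict of category -> confidence (hypothetical values)."""
    if multilabel:
        # Multi-label: every category that clears the threshold is returned.
        return sorted(c for c, s in scores.items() if s >= threshold)
    # Single-label: only the best-scoring category is returned.
    return [max(scores, key=scores.get)]


scores = {"Battery": 0.81, "Screen": 0.64, "Price": 0.12}  # made-up confidences
print(predict(scores))
print(predict(scores, multilabel=True))
```

The single-label call returns one category, while the multi-label call can return several for the same sample.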