MonkeyLearn classifiers accept Comma Separated Values (CSV) or Excel files when importing data, and use CSV when exporting their data (the category tree and its samples). The following sections give more detail on the formats MonkeyLearn accepts.

CSV Primer

CSV files are just plain text files, with a specific format to represent rows and columns: usually each line represents a row and in each line commas are used to separate columns.

Take for example the following table:

Title                    Genre
2001: A Space Odyssey    Science Fiction
Kung Fu Panda            Animation

This same data represented as a CSV file would look like this:
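As a minimal sketch, the snippet below builds that CSV with Python's standard csv module. The second column is labeled Genre here, which is an assumption based on the values in the table.

```python
import csv
import io

# Rows from the table above; the "Genre" header is an assumption based on the values.
rows = [
    ["Title", "Genre"],
    ["2001: A Space Odyssey", "Science Fiction"],
    ["Kung Fu Panda", "Animation"],
]

# Write to an in-memory buffer so the result can be inspected as text.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
# Title,Genre
# 2001: A Space Odyssey,Science Fiction
# Kung Fu Panda,Animation
```

None of these values needs quoting, because none contains a comma, a quotation mark, or a line break.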

If a value itself contains a comma, you can wrap it in quotation marks ("), and if you need to escape a quotation mark inside a quoted value, you can add another quotation mark before it. Let’s add a couple of lines to our CSV to illustrate this:

Finally, note that you can have multi-line values, but you must wrap them in quotation marks. Take a look at the following example:

The last three lines represent a single row, with columns that contain newline characters.
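The quoting rules above can be sketched with Python's standard csv module, which applies them automatically when writing. The rows below are illustrative values chosen to exercise each rule (a comma, quotation marks, and line breaks); they are not taken from a MonkeyLearn dataset.

```python
import csv
import io

# Illustrative rows, one per quoting rule; the values are made up for this sketch.
rows = [
    ["Title", "Genre"],
    ["Monsters, Inc.", "Animation"],             # value containing a comma
    ['The "Original" Cut', "Drama"],             # hypothetical value containing quotation marks
    ["First line\nSecond line", "Multi\nline"],  # values containing line breaks
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
# Title,Genre
# "Monsters, Inc.",Animation
# "The ""Original"" Cut",Drama
# "First line
# Second line","Multi
# line"

# Reading the text back recovers the original values, line breaks included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

In this output, the last three physical lines form one logical row, which is why a line-by-line reader is not enough to parse CSV safely.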

Now that you have the basics of the CSV format, we can describe MonkeyLearn’s specific CSV requirements for representing classifier and dataset data.

MonkeyLearn CSV/Excel format

MonkeyLearn accepts CSV or Excel files when importing data into the classification modules, and produces CSV files when exporting data from them.

When importing data into MonkeyLearn, the CSV or Excel file can have multiple columns, but you’ll have to select two columns in particular, containing the:

  • Sample text
  • Full category path

With these two values for each row, you can describe the full category tree hierarchy, the samples, and the relation between them.

Consider the following CSV example:

When importing a CSV like this, each of the two lines would create a new sample, along with any category in its path that does not already exist.

With this particular data, the first line will:

  • Create a sample with the text Terusan Sutami III Setrasari Bandung Hotel …
  • Create the category Travel & Vacations (as a subcategory of the root category)
  • Create the Hotels category (as a Travel & Vacations subcategory)
  • Associate the sample with the Hotels category.

while the second line will:

  • Create a sample with the text Malacca: 2D1N Stay at 5-Star Hotel Casa Del Rio with Breakfast …
  • Associate the sample with the already created Hotels category.

Note that the category path is an ordered list of categories from the root to the category the sample belongs to, separated by the slash “/” character. A category path must always start with a slash. The same slash notation is used in Excel files to denote category hierarchies.
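To make the path notation concrete, here is a minimal sketch of how a category path could be split and applied to a tree. It is not MonkeyLearn's implementation: the function names are hypothetical, the tree is a plain nested dictionary, and rejecting empty segments is an assumption. The path used is the one implied by the walkthrough above.

```python
def split_category_path(path: str) -> list[str]:
    """Split a path such as '/Travel & Vacations/Hotels' into category names."""
    if not path.startswith("/"):
        raise ValueError(f"category path must start with a slash: {path!r}")
    names = path[1:].split("/")
    # Assumption: empty segments (for example '//' or a trailing slash) are rejected.
    if any(not name for name in names):
        raise ValueError(f"category path has an empty segment: {path!r}")
    return names


def add_path(tree: dict, path: str) -> list[str]:
    """Create any missing categories along the path; return the names created."""
    created = []
    node = tree
    for name in split_category_path(path):
        if name not in node:
            node[name] = {}
            created.append(name)
        node = node[name]
    return created


tree: dict = {}
first = add_path(tree, "/Travel & Vacations/Hotels")   # creates both categories
second = add_path(tree, "/Travel & Vacations/Hotels")  # nothing left to create
print(first)   # ['Travel & Vacations', 'Hotels']
print(second)  # []
```

The second call creates nothing, which mirrors how the second line of the example only associates its sample with the already existing Hotels category.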

MonkeyLearn supports zipped CSV or Excel files. When importing a CSV or Excel file, it’s recommended to always zip it before uploading, so that the upload takes less time and, more importantly, the file is smaller and less likely to reach the file size limit. When you export a classifier, you’ll always get a zipped CSV file to speed up the download.
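Zipping can be done with any archive tool. As one possible sketch, Python's standard zipfile module can produce the archive; the helper name and file names are hypothetical, and placing a single data file at the top level of the archive is an assumption about what the importer expects.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_for_upload(source: Path) -> Path:
    """Write source into a sibling .zip archive and return the archive path."""
    archive = source.with_name(source.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Store only the file name, not the full directory path.
        bundle.write(source, arcname=source.name)
    return archive


# Demo with a throwaway file; the row content is made up.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "samples.csv"
    csv_path.write_text("some sample text,/Some Category\n" * 1000, encoding="utf-8")

    archive_path = zip_for_upload(csv_path)
    original_size = csv_path.stat().st_size
    zipped_size = archive_path.stat().st_size
    with zipfile.ZipFile(archive_path) as bundle:
        names = bundle.namelist()

print(names, original_size, zipped_size)
```

Text data usually compresses well, so the archive is typically much smaller than the original file.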

Note that MonkeyLearn enforces some limitations on the tree structure:

  • A maximum of 4000 categories in total.
  • A maximum of 50 categories with the same parent.
  • A maximum tree depth of 8 levels.
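These limits can be checked before uploading. The sketch below assumes the tree is held as a nested dictionary and makes two assumptions about how the limits are counted: the root itself is not counted as a level (top-level categories are at depth 1), and the 50-category limit also applies to categories directly under the root. The function name is hypothetical.

```python
MAX_CATEGORIES = 4000  # total categories
MAX_CHILDREN = 50      # categories with the same parent
MAX_DEPTH = 8          # tree depth in levels


def check_tree_limits(tree: dict) -> list[str]:
    """Return human-readable limit violations for a nested-dict category tree."""
    problems = []
    total = 0
    deepest = 0
    stack = [(tree, 0, "")]  # (node, depth, path); the root is depth 0 by assumption
    while stack:
        node, depth, path = stack.pop()
        deepest = max(deepest, depth)
        if len(node) > MAX_CHILDREN:
            problems.append(
                f"{path or '/'} has {len(node)} subcategories (max {MAX_CHILDREN})"
            )
        for name, child in node.items():
            total += 1
            stack.append((child, depth + 1, f"{path}/{name}"))
    if total > MAX_CATEGORIES:
        problems.append(f"tree has {total} categories (max {MAX_CATEGORIES})")
    if deepest > MAX_DEPTH:
        problems.append(f"tree depth is {deepest} levels (max {MAX_DEPTH})")
    return problems


print(check_tree_limits({"Travel & Vacations": {"Hotels": {}}}))  # []
```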

Multilabel classifiers CSV/Excel file

If your classifier is multilabel, you can upload the same sample to several categories by using a format like the following CSV example:

As you can see, the format is almost the same as in the single-label case described before, but you can put a sample in as many categories as you want by separating the categories with a colon “:” character.

The same applies to Excel files when denoting multiple labels for a single sample.
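As a rough illustration of the colon-separated notation, the sketch below splits a multilabel cell into its individual category paths. The function name and the second category (Deals) are hypothetical, and the sketch assumes category names themselves never contain a colon.

```python
def split_labels(cell: str) -> list[str]:
    """Split a multilabel cell into category paths, one per colon-separated part."""
    paths = [part for part in cell.split(":") if part]
    for path in paths:
        if not path.startswith("/"):
            raise ValueError(f"category path must start with a slash: {path!r}")
    return paths


# Hypothetical cell value assigning one sample to two categories.
cell = "/Travel & Vacations/Hotels:/Travel & Vacations/Deals"
labels = split_labels(cell)
print(labels)  # ['/Travel & Vacations/Hotels', '/Travel & Vacations/Deals']
```

A cell with a single path and no colon yields a one-element list, so the same helper covers the single-label case.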

Encoding requirements

The CSV file should be UTF-8 encoded; latin-1 should also work if it suits your data. We do our best to auto-detect other encodings, but this process might fail, so we strongly recommend always using UTF-8 if possible.
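If you are unsure whether a file is UTF-8, you can check it and re-encode it before uploading. The following is a minimal sketch using only the Python standard library; the function names, file names, and sample row are hypothetical, and it assumes you already know the file's current encoding (latin-1 in the demo).

```python
import tempfile
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def reencode_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Rewrite a file as UTF-8, working on bytes so line endings are untouched."""
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))


# Demo with a hypothetical latin-1 file; the sample row is made up.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "samples_latin1.csv"
    converted = Path(folder) / "samples_utf8.csv"
    original.write_bytes("Café review,/Food\n".encode("latin-1"))

    was_utf8 = is_valid_utf8(original.read_bytes())
    reencode_to_utf8(original, converted, source_encoding="latin-1")
    is_utf8_now = is_valid_utf8(converted.read_bytes())
    converted_text = converted.read_bytes().decode("utf-8")

print(was_utf8, is_utf8_now)  # False True
```

Note that a file containing only ASCII characters is already valid UTF-8, so the check passes without any conversion.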

If you are exporting the CSV from an XLS file, we don’t recommend using Excel to save it as CSV, since Excel is known to have issues with some Unicode characters. Google Docs Spreadsheets and LibreOffice are good alternatives for this task.

If you have issues with your data encoding when importing a CSV, contact our support team at hello@MonkeyLearn.com.

Uploading data files

Uploading data files is easy using the GUI wizard. Within the classifier, go to the Sandbox/Samples tab and click the Upload button. Then follow the wizard to select the file format (CSV or Excel), select the file from your drive, and choose the column that will be used as the sample content and the column that will be used as the category.

When you upload a tagged dataset, you can specify the column that has the text content and the column that has the category. Use the combo boxes at the top of each column to select “Use as text” or “Use as category” respectively.

File upload size limitations

You must keep your file size (whether it is zipped or not) under 100 MB.
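A quick local check can catch oversized files before you start an upload. This sketch reads "100 MB" conservatively as 100 × 1000 × 1000 bytes, which is an assumption; the exact byte count MonkeyLearn enforces is not stated here, and the function names are hypothetical.

```python
import os

# Assumption: "100 MB" is read conservatively as 100 * 1000 * 1000 bytes.
MAX_UPLOAD_BYTES = 100 * 1000 * 1000


def fits_upload_limit(size_in_bytes: int) -> bool:
    """Return True if a file of this size is strictly under the limit."""
    return size_in_bytes < MAX_UPLOAD_BYTES


def file_fits_upload_limit(path: str) -> bool:
    """Check a file on disk (zipped or not) against the limit."""
    return fits_upload_limit(os.path.getsize(path))


print(fits_upload_limit(5_000_000))    # True
print(fits_upload_limit(100_000_000))  # False
```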