The following quick start shows how to create and use a custom pipeline.

Pipeline structure

Building a pipeline is almost like programming in your favorite language! A pipeline is declared with a JSON object that must have the following properties:

  • steps: This property is an array of steps that define the processing (modules to be executed) on your text input.
  • version: The version of the Pipeline API you are using.
  • input: The type of input your pipeline accepts.
  • input_example: An example of an input for your pipeline.
  • return: An array that defines what your pipeline’s output is going to be.

The execution of a pipeline can be seen as a state machine: the initial state is provided by the input, and each step transforms the current state into a new one. The current state of the execution of the pipeline is stored in the state variable.
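
The state-machine view can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline engine itself: run_pipeline, add_lengths and count_items are hypothetical names, and the two steps stand in for real modules such as classifiers or extractors.

```python
from functools import reduce


def run_pipeline(initial_state, steps):
    """Apply each step in order; a step receives the current state and returns the next one."""
    return reduce(lambda state, step: step(state), steps, initial_state)


# Hypothetical steps standing in for real modules.
def add_lengths(state):
    for item in state["texts"]:
        item["length"] = len(item["text"])
    return state


def count_items(state):
    state["total"] = len(state["texts"])
    return state


final_state = run_pipeline(
    {"texts": [{"text": "hello"}, {"text": "hi"}]},
    [add_lengths, count_items],
)
print(final_state)
```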

Pipeline example

Let’s take a look at our first pipeline:

In this example, the version property first declares which version of the Pipeline API we are going to use.

Next, the input property defines what the input of our pipeline looks like. It is named texts and is defined as an array of objects with a maximum length of 30 objects.

We then define an input example in the input_example property. It shows an example of a valid input, and also defines the structure of each object within the texts input array: each object inside the array has a string property named text that contains a text to be processed by our pipeline.

Each text passed as an input will be processed by the steps defined in the steps property:

  • The first step classifies the text with a language classifier. The result of the classification is stored as a new property named lang inside each input object.
  • The second step is executed only if the text is written in English (this check is made by the when property). If the when condition is true, the text is classified by the English sentiment analysis classifier and the result is stored in the sentiment property.
  • The third step does the same for Spanish: if the text is in Spanish, it is classified by the Spanish sentiment analysis classifier and the result is stored in the sentiment property.
  • Finally, the fourth and last step uses a keyword extractor to extract keywords from the text. The keywords are stored in the keywords property.

The return property defines how the result will be structured. We return a JSON object with a property named ‘sentiment_labels’ that contains an array of objects, where each of these objects has 3 properties: ‘sentiment’, ‘lang’, and ‘keywords’.
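
Putting the description above together, the declaration could look roughly like the following sketch, written as a Python dictionary that serializes to the JSON object. Several details are assumptions made for illustration: the module ids (cl_LANGUAGE, cl_SENTIMENT_EN, cl_SENTIMENT_ES, ex_KEYWORDS) are placeholders, the "type" key that names each step is a guess, and so are the state['texts'] path and the inner single quotes around literal names. Only the property names and the shape of the when condition come from this page.

```python
import json

pipeline = {
    "version": "0.1",
    "input": {
        "texts": {"type": "array", "max_len": 30},
    },
    "input_example": {
        "texts": [{"text": "I love this phone"}, {"text": "Me encanta este telefono"}],
    },
    "steps": [
        {   # 1. Language detection, stored in "lang" (placeholder module id).
            "type": "classify", "module_id": "'cl_LANGUAGE'",
            "item_list": "state['texts']", "in_key": "'text'", "out_key": "'lang'",
        },
        {   # 2. English sentiment, only when the detected language is English.
            "type": "classify", "module_id": "'cl_SENTIMENT_EN'",
            "item_list": "state['texts']", "in_key": "'text'", "out_key": "'sentiment'",
            "when": "item['lang'][0]['label'] == 'English'",
        },
        {   # 3. Spanish sentiment, only when the detected language is Spanish.
            "type": "classify", "module_id": "'cl_SENTIMENT_ES'",
            "item_list": "state['texts']", "in_key": "'text'", "out_key": "'sentiment'",
            "when": "item['lang'][0]['label'] == 'Spanish'",
        },
        {   # 4. Keyword extraction, stored in "keywords".
            "type": "extract", "module_id": "'ex_KEYWORDS'",
            "item_list": "state['texts']", "in_key": "'text'", "out_key": "'keywords'",
        },
    ],
    "return": [
        {
            "item_list": "state['texts']",
            "in_keys": ["'sentiment'", "'lang'", "'keywords'"],
            "out_key": "'sentiment_labels'",
        },
    ],
}

print(json.dumps(pipeline, indent=2))
```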

Pipelines syntax

Expression syntax

Many fields in a pipeline can be an expression to be evaluated. These expressions are very similar to the ones you can find in languages such as JavaScript or Python. For example, 1 + 5 == 6 evaluates to true. Parentheses work as usual.

Available operators:

Operator Meaning
* multiplication
/ division
% modulus
+ addition
- subtraction
< less than
<= less than or equals
> greater than
>= greater than or equals
== equals
!= not equal
&& and
|| or
! not

Variables:

Within a pipeline, there are two useful variables that are defined by default: state and item.

  • state is a global variable that stores the current state of the execution of the pipeline. If we want to reference, for example, a property named property1 within the current state, we would do that with the following expression: state['property1'].
  • item is an auxiliary variable that can be used to reference an item from an item_list. For example, you can use it to check a particular condition on each item in the when clause: "when": "item['lang'][0]['label'] == 'Spanish'".

Constants:

There is only one constant: null. It references the null value.

Brackets []:

Brackets can be used to reference properties inside objects, like state['input'], or to access an element inside an array, like item['lang'][0]['label'].
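
To make the expression rules concrete, here is a minimal Python sketch that evaluates such expressions by translating the operators &&, || and ! and the constant null into their Python counterparts. It is not how the pipeline engine is implemented; the evaluate function is a hypothetical helper, and the naive text substitution would also rewrite those tokens inside string literals.

```python
import re


def evaluate(expression, state=None, item=None):
    """Evaluate a pipeline-style expression by translating it to Python (illustrative only)."""
    translated = expression.replace("&&", " and ").replace("||", " or ")
    translated = re.sub(r"!(?!=)", " not ", translated)    # "!" but not "!="
    translated = re.sub(r"\bnull\b", "None", translated)   # the only constant
    names = {"state": state, "item": item, "len": len, "abs": abs}
    # eval is acceptable for a sketch; never use it on untrusted input.
    return eval(translated.strip(), {"__builtins__": {}}, names)


item = {"lang": [{"label": "Spanish"}]}
print(evaluate("1 + 5 == 6"))
print(evaluate("item['lang'][0]['label'] == 'Spanish'", item=item))
print(evaluate("!(1 > 2) && state['n'] != null", state={"n": 3}))
```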

Functions:

  • len(array): returns the length of array.
  • abs(x): returns the absolute value of number x.
  • range(array, start, end): returns a subarray from array, starting at position start and ending at position end.
  • shuffle(array): returns array randomly shuffled.
  • singleList(array, key): returns an array where each element is: element[key] for each element in array.
  • regex_replace(pattern, repl, string): returns the string obtained by replacing the occurrences of pattern in string by the replacement repl.
  • regex_findall(pattern, string): returns all non-overlapping matches of pattern in string, as an array of strings.
  • get(object, key, default_value): if key is not in object then returns default_value, else returns object[key].
  • join(array, string): returns a string which is the concatenation of the strings in the array. The separator between elements is the string provided as the second argument.
  • lowercase(string): returns a string with all the characters transformed to lowercase.
  • uppercase(string): returns a string with all the characters transformed to uppercase.
  • split(string, separator): returns a list of the words in the string, using separator as the delimiting string.
  • count(string, x): returns the total number of occurrences of x in string.
  • strip(string, chars): returns a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed.
  • replace(string, old, new): returns a copy of the string with all occurrences of substring old replaced by new.
  • append(array, element): element is appended at the end of array. Returns array.
  • extend(array1, array2): array1 changes to be the concatenation of array1 with array2. Returns array1.
  • insert(array, index, x): inserts item x at position index in array. Returns array.
  • remove(array, x): removes item x from array. Returns array.
  • pop(array): removes the last element of array and returns that element.
  • index(array, x): returns the index of item x in array.
  • sort(array): sorts and returns array.
  • reverse(array): reverses the elements of array.
  • in(x, array): returns true if x is in array, false otherwise.
  • zip(array1, array2): returns the array that results from zipping together array1 with array2. For example, zip([1, 2], [3, 4]) == [(1, 3), (2, 4)].
  • set(object, key, x): object[key] = x. Returns object.
  • sorted(object): returns a new sorted array from the items in object.
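
Several of these functions have close Python analogues, which may help when reasoning about what an expression returns. The sketch below mirrors the descriptions above for a handful of them; the Python names (single_list, zip_arrays and so on) are chosen for illustration, and the behaviour in edge cases is an assumption rather than documented behaviour.

```python
import re


def single_list(array, key):
    """singleList(array, key): element[key] for each element in array."""
    return [element[key] for element in array]


def get(obj, key, default_value):
    """get(object, key, default_value): object[key], or default_value if key is missing."""
    return obj[key] if key in obj else default_value


def join(array, separator):
    """join(array, string): concatenate the strings in array, separated by separator."""
    return separator.join(array)


def regex_findall(pattern, string):
    """regex_findall(pattern, string): all non-overlapping matches of pattern."""
    return re.findall(pattern, string)


def zip_arrays(array1, array2):
    """zip(array1, array2): pair up elements by position."""
    return list(zip(array1, array2))


labels = single_list([{"label": "English"}, {"label": "Spanish"}], "label")
print(join(labels, ", "))
print(get({"lang": "English"}, "sentiment", "unknown"))
print(regex_findall(r"#\w+", "Loving #python and #json"))
print(zip_arrays([1, 2], [3, 4]))
```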

Steps syntax

There are 5 different types of pipeline steps:

classify

This step allows you to call a classifier and store its result. You have to specify the following arguments:

  • module_id: An expression that evaluates to a valid classifier id (string).
  • item_list: An expression that evaluates to an array of objects where the classification is going to be performed.
  • in_key: An expression that evaluates to the name of a property of the object that is being iterated. It must contain the text to be classified.
  • out_key: An expression that evaluates to the name of a property where the result will be stored. This property will be added to the current object being iterated.
  • when (optional): An expression that evaluates to a boolean value that indicates if the current object being iterated is going to be classified or not.
  • is_sandbox (optional): An expression that evaluates to a boolean value that indicates if sandbox mode will be used on the classifier. Defaults to 0.

Example:
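
As a rough sketch, a classify step and the way it walks over item_list could look as follows. The "type" key, the module id cl_EXAMPLE and the shape of the classifier result are assumptions; fake_classifier stands in for a real classifier call, and run_classify is a hypothetical helper that only mimics the behaviour described above.

```python
def ev(expression, **names):
    """Evaluate an expression with only the given names in scope (sketch, not safe for untrusted input)."""
    return eval(expression, {"__builtins__": {}}, names)


# Hypothetical declaration; literal names are quoted because every field is an expression.
classify_step = {
    "type": "classify",
    "module_id": "'cl_EXAMPLE'",
    "item_list": "state['texts']",
    "in_key": "'text'",
    "out_key": "'lang'",
    "when": "len(item['text']) > 0",
}


def fake_classifier(text):
    """Stand-in for a language classifier; returns a list of labelled results."""
    label = "Spanish" if "¿" in text or "ñ" in text else "English"
    return [{"label": label}]


def run_classify(step, state, classifier):
    items = ev(step["item_list"], state=state)
    in_key = ev(step["in_key"], state=state)
    out_key = ev(step["out_key"], state=state)
    for item in items:
        # Items for which the optional when expression is false are skipped.
        if ev(step.get("when", "True"), state=state, item=item, len=len):
            item[out_key] = classifier(item[in_key])
    return state


state = {"texts": [{"text": "¿Cómo estás?"}, {"text": "How are you?"}, {"text": ""}]}
run_classify(classify_step, state, fake_classifier)
print(state)
```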

extract

This step allows you to call an extractor and store its result.

  • module_id: An expression that evaluates to a valid extractor id (string).
  • item_list: An expression that evaluates to an array of objects where the extraction is going to be performed.
  • in_key: An expression that evaluates to the name of a property of the object that is being iterated. It must contain the text where the extraction will be performed.
  • out_key: An expression that evaluates to the name of a property where the result will be stored. This property will be added to the current object being iterated.
  • when (optional): An expression that evaluates to a boolean value that indicates if the current object being iterated is going to be used for extraction or not.

Example:
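
An extract step takes the same shape as a classify step, minus is_sandbox. The declaration below is a hypothetical sketch written as a Python dictionary that serializes to JSON; the "type" key and the module id ex_EXAMPLE are placeholders, and the when condition assumes an earlier step has already stored a lang result on each item.

```python
import json

# Hypothetical declaration: extract keywords only from items detected as English.
extract_step = {
    "type": "extract",
    "module_id": "'ex_EXAMPLE'",
    "item_list": "state['texts']",
    "in_key": "'text'",
    "out_key": "'keywords'",
    "when": "item['lang'][0]['label'] == 'English'",
}

print(json.dumps(extract_step, indent=2))
```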

pipeline

This step allows you to call some other pipeline.

  • module_id: An expression that evaluates to a valid pipeline id (string).
  • item_list: An expression that evaluates to an array of objects that the pipeline is going to use at its initial state.
  • pipeline_key: An expression that evaluates to the key used in the initial state of the pipeline.
  • out_key: An expression that evaluates to a name of a property that is going to be added to the current pipeline’s state.
  • when (optional): An expression that evaluates to a boolean value that indicates if the current object being iterated is going to be used in the pipeline or not.

Example:
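
One possible reading of these parameters is sketched below: the items selected by when are placed in the called pipeline's initial state under pipeline_key, and whatever that pipeline returns is stored in the current state under out_key. This interpretation, the "type" key, the module id pi_EXAMPLE and the helper names are all assumptions; fake_call_pipeline stands in for the real call.

```python
def ev(expression, **names):
    """Evaluate an expression with only the given names in scope (sketch, not safe for untrusted input)."""
    return eval(expression, {"__builtins__": {}}, names)


# Hypothetical declaration of a step that delegates to another pipeline.
pipeline_step = {
    "type": "pipeline",
    "module_id": "'pi_EXAMPLE'",
    "item_list": "state['texts']",
    "pipeline_key": "'texts'",
    "out_key": "'analysis'",
    "when": "item['lang'] == 'English'",
}


def fake_call_pipeline(module_id, initial_state):
    """Stand-in for running another pipeline; it just reports how many items it received."""
    return {"received": len(initial_state["texts"])}


def run_pipeline_step(step, state, call_pipeline):
    items = ev(step["item_list"], state=state)
    selected = [i for i in items if ev(step.get("when", "True"), state=state, item=i)]
    # The selected items become the called pipeline's initial state under pipeline_key.
    initial_state = {ev(step["pipeline_key"], state=state): selected}
    module_id = ev(step["module_id"], state=state)
    state[ev(step["out_key"], state=state)] = call_pipeline(module_id, initial_state)
    return state


state = {"texts": [{"text": "Hello", "lang": "English"}, {"text": "Hola", "lang": "Spanish"}]}
run_pipeline_step(pipeline_step, state, fake_call_pipeline)
print(state["analysis"])
```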

transform

This step allows you to take an array of objects and apply some transformation to each of its elements.

  • transform_item: An expression whose result is going to be stored in the out_key property.
  • item_list: An expression that evaluates to an array of objects that are going to be used in the transformation.
  • out_key: An expression that evaluates to a name of a property that is going to be added to the current object being iterated.
  • when (optional): An expression that evaluates to a boolean value that indicates if the current object being iterated is going to be transformed or not.

Example:
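
A transform step could be sketched as follows: transform_item is evaluated once per item, with item bound to the current object, and the result is stored under out_key. The "type" key and the helper names are assumptions; lowercase is re-implemented here in Python only so that the expression can be evaluated outside the pipeline engine.

```python
def ev(expression, **names):
    """Evaluate an expression with only the given names in scope (sketch, not safe for untrusted input)."""
    return eval(expression, {"__builtins__": {}}, names)


def lowercase(string):
    """Python stand-in for the lowercase(string) pipeline function."""
    return string.lower()


# Hypothetical declaration: store a lowercased copy of each sufficiently long text.
transform_step = {
    "type": "transform",
    "transform_item": "lowercase(item['text'])",
    "item_list": "state['texts']",
    "out_key": "'text_lower'",
    "when": "len(item['text']) > 3",
}


def run_transform(step, state):
    items = ev(step["item_list"], state=state)
    out_key = ev(step["out_key"], state=state)
    for item in items:
        scope = {"state": state, "item": item, "lowercase": lowercase, "len": len}
        if ev(step.get("when", "True"), **scope):
            item[out_key] = ev(step["transform_item"], **scope)
    return state


state = {"texts": [{"text": "Hello World"}, {"text": "OK"}]}
run_transform(transform_step, state)
print(state)
```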

twitter.*

This step is of the form twitter.get_user, twitter.followers, etc. The available functions for connecting with the Twitter API can be read in the documentation. The parameters of each of these functions can be passed as additional parameters.

  • twitter_consumer_key: An expression that evaluates to a valid Twitter Consumer Key.
  • twitter_consumer_secret: An expression that evaluates to a valid Twitter Consumer Secret.
  • twitter_access_token_key: An expression that evaluates to a valid Twitter Access Token Key.
  • twitter_access_token_secret: An expression that evaluates to a valid Twitter Access Token Secret.
  • out_key: An expression that evaluates to a name of a property that is going to be added to the pipeline state.

Example:

Return syntax

The return property is an array of objects that define what the pipeline is going to return. Each of these objects needs to have the following properties:

  • item_list: An expression that evaluates to an array of objects that are going to be used to get data to be returned.
  • in_keys: An array of expressions that evaluate to the name of a property of the object that is being iterated. These are the properties that are going to be returned by the API.
  • out_key: An expression that evaluates to a name of a property that is going to be added to the return object.

Example:
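
The way a return entry selects data can be sketched like this: for every object in item_list, only the properties named in in_keys are copied, and the resulting array is stored under out_key in the returned object. build_result is a hypothetical helper, and returning None for a missing property is an assumption.

```python
def ev(expression, **names):
    """Evaluate an expression with only the given names in scope (sketch, not safe for untrusted input)."""
    return eval(expression, {"__builtins__": {}}, names)


# Hypothetical return declaration with a single entry.
return_spec = [
    {
        "item_list": "state['texts']",
        "in_keys": ["'sentiment'", "'lang'"],
        "out_key": "'sentiment_labels'",
    },
]


def build_result(spec, state):
    result = {}
    for entry in spec:
        items = ev(entry["item_list"], state=state)
        keys = [ev(key, state=state) for key in entry["in_keys"]]
        out_key = ev(entry["out_key"], state=state)
        # Keep only the requested properties of each item.
        result[out_key] = [{key: item.get(key) for key in keys} for item in items]
    return result


state = {"texts": [{"text": "Great!", "lang": "English", "sentiment": "positive"}]}
print(build_result(return_spec, state))
```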

Input syntax

The purpose of this property is to declare the input that the pipeline accepts. The input is an object where each key is an input that your pipeline expects to receive in order to give a response. Each of these inputs can have the following properties:

  • type: The type of the input. It can be string, array, object, int or float.
  • help: This is some help text for documentation purposes.
  • max_len: If the type of the input is array, use this property to set the maximum length that is allowed as an input.

Example:
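
An input declaration and the checks it implies could be sketched as follows. The declaration uses only the type, help and max_len properties listed above; validate_input is a hypothetical helper showing how such a declaration might be enforced, not the API's actual validation logic or error messages.

```python
# Hypothetical input declaration for a pipeline that accepts up to 30 texts.
input_spec = {
    "texts": {"type": "array", "help": "Texts to be processed", "max_len": 30},
}

# Mapping from the declared type names to Python types, for this sketch only.
PYTHON_TYPES = {"string": str, "array": list, "object": dict, "int": int, "float": float}


def validate_input(spec, payload):
    """Return a list of problems found in payload; an empty list means it is valid."""
    errors = []
    for name, rules in spec.items():
        if name not in payload:
            errors.append(f"missing input: {name}")
            continue
        value = payload[name]
        if not isinstance(value, PYTHON_TYPES[rules["type"]]):
            errors.append(f"{name} must be of type {rules['type']}")
        elif rules["type"] == "array" and len(value) > rules.get("max_len", len(value)):
            errors.append(f"{name} has more than {rules['max_len']} elements")
    return errors


print(validate_input(input_spec, {"texts": [{"text": "Hello"}]}))
print(validate_input(input_spec, {"texts": [{"text": "x"}] * 31}))
```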

Input example syntax

This is an optional property that shows a helpful input example for documentation.

Example:

Version syntax

This is the version of the Pipeline API you are going to use. The latest version is 0.1.