Setup

Our HTTP API is accessible from any environment that can make HTTP calls and process JSON. We recommend Unirest, a multi-language REST/HTTP/JSON library with support for Node, Ruby, PHP, Java, Python, Objective-C and .NET languages.

For the remainder of this document we'll use

Basic Usage & Error Handling

Sensium has a single endpoint to which you submit JSON via an HTTP POST request:

https://api.sensium.io/v1/extract

Every HTTP POST request must specify two headers (Content-Type and Accept); a third (Accept-Encoding) is optional:

Content-Type: application/json
Accept: application/json
Accept-Encoding: gzip

All requests and responses Sensium understands are encoded as UTF-8 JSON. If your client supports gzip-encoded responses, you can send the optional Accept-Encoding header so that Sensium compresses all responses. While this is optional, we strongly recommend using it, as it saves bandwidth on both your end and ours.

A simple request using curl looks as follows:

curl -i \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    --compressed \
    -X POST \
    -d '{ "apiKey": "YOUR_API_KEY", "text": "I am a little pea, i love the sky and the trees" }' \
    https://api.sensium.io/v1/extract

The POST body is a simple JSON object that defines the request. Every request needs to specify at least the API key (for which you have to sign up) and either the text or the url field, which specifies the source to be analyzed. Sensium will then process the request and return a JSON object as the response.
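The same request can also be assembled in Python. The following is a minimal sketch that uses only the standard library rather than Unirest; the names build_request and send are illustrative helpers, not part of any Sensium client library.

```python
import gzip
import json
import urllib.request

ENDPOINT = "https://api.sensium.io/v1/extract"


def build_request(api_key, text):
    """Build (but do not send) a POST request for the extract endpoint."""
    body = json.dumps({"apiKey": api_key, "text": text}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Accept-Encoding": "gzip",  # optional: ask for compressed responses
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")


def send(request):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # urllib does not decompress automatically, so undo gzip ourselves.
        if response.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for a 400 response; the JSON error body can then be read from the exception object.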

Sensium's client-side error model is very simple. If your request was processed successfully, the JSON response is returned with an HTTP status code 200. If an error occurred, e.g. your request was invalid or you ran out of quota, Sensium will return an HTTP status code 400 and a JSON object containing a reason field describing the details of the error. The following reasons can be returned:

Reason                   Meaning
SourceNotSpecified       Neither the url nor the text field of the request was set.
MalformedUrl             The provided URL was malformed.
UrlCommunicationFailed   Couldn't download the contents from the provided URL.
HtmlCleaningFailed       Couldn't extract the main text from the URL contents.
InvalidMimeType          The provided or guessed mime-type is not supported.
QuotaExceeded            Your quota for the given API key has been exceeded, or you provided an invalid API key, or you omitted an API key.
InternalServerError      Something unforeseen happened, please report!

An example error response looks like this:

{
   "reason": "QuotaExceeded"
}
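On the client side, this error model can be handled by checking the status code and reading the reason field. The sketch below is one possible way to do this; the SensiumError class and the parse_response helper are hypothetical names, and the code assumes that the response body is always JSON.

```python
import json

# Reasons from the table above, with shortened descriptions.
REASONS = {
    "SourceNotSpecified": "neither url nor text was set",
    "MalformedUrl": "the provided URL was malformed",
    "UrlCommunicationFailed": "the URL contents could not be downloaded",
    "HtmlCleaningFailed": "the main text could not be extracted",
    "InvalidMimeType": "the mime-type is not supported",
    "QuotaExceeded": "quota exceeded, or API key invalid or missing",
    "InternalServerError": "something unforeseen happened",
}


class SensiumError(Exception):
    """Hypothetical exception carrying the reason field of an error response."""

    def __init__(self, reason):
        super().__init__(f"{reason}: {REASONS.get(reason, 'unknown reason')}")
        self.reason = reason


def parse_response(status_code, body):
    """Return the decoded JSON for a 200 response, raise SensiumError otherwise."""
    payload = json.loads(body)
    if status_code == 200:
        return payload
    raise SensiumError(payload.get("reason", "Unknown"))
```

A caller can then branch on error.reason, for example to back off when the reason is QuotaExceeded.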

Preparing the Request

The JSON object you provide in your POST request body is your way of telling Sensium what to do. It specifies the source to be analyzed as well as the types of analysis to be performed.

JSON Request field   Meaning
apiKey               your API key
text                 plain text to be analyzed, or omitted if url is set
url                  URL to download content from, or omitted if text is set
mimeType             mime-type, or omitted if Sensium should guess the mime-type; currently only "text/plain" and any "html" mime-type are supported
language             two-letter ISO 639-1 code, or omitted if Sensium should guess the language
extractors           list of extractors to apply to the source, or omitted if all extractors should be applied

The minimal request object has to specify an API key and either a plain text or a URL:

{
   "apiKey": "YOUR_API_KEY",
   "text": "I'm a little pea, i love the sky and the trees."
}

If you specify a URL instead of plain text, Sensium will download the contents of that URL and, if its mime-type contains the string "html", try to find the main text. Alternatively, you can provide your main text in the request's text field.

Sensium will automatically guess the mime-type and language of the source to be analyzed. You can override this by setting the mimeType and language fields in the request.

If you don't specify any extractors, Sensium will apply all available extractors. To minimize the processing time and only return what you really need, you can specify a list of extractors in the extractors field. The following extractors are available:

Extractor Name   Function                                                       Language Support
Tokens           returns tokens, including their stemmed and normalized form    English, French, German, Spanish, Italian
Sentences        returns sentence boundaries                                    English, French, German, Spanish, Italian
PosTags          adds part-of-speech tags to the tokens                         English, French, German, Spanish, Italian
Entities         returns named entities such as persons or locations            English, French, German, Spanish, Italian
TemporalEvents   returns dates                                                  English, French, German, Spanish, Italian
Summary          returns keyphrases and a sentence-based summary                English, French, German, Spanish, Italian
Sentiment        returns the sentiment of sentences and the full text           English

Extractor names are case-sensitive!

Here's a more complicated request:

{
   "apiKey": "YOUR_API_KEY",
   "url": "http://example.org",
   "language": "en",
   "extractors": [ "Summary", "Sentiment" ]
}

With this request we tell Sensium to download the content from http://example.org, extract the main text, and then return a summary as well as the sentiment of the main text. We also tell Sensium that it should assume the main text to be English.
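Request objects like the ones above can be assembled with a small helper. The sketch below is only one way to do it: make_request_body is a hypothetical function, and its insistence on exactly one of text or url is a design choice of this example rather than documented server behaviour. It leaves out every optional field that is not set and rejects extractor names that do not match the case-sensitive names listed above.

```python
# Extractor names as listed above; they are case-sensitive.
EXTRACTORS = {
    "Tokens", "Sentences", "PosTags", "Entities",
    "TemporalEvents", "Summary", "Sentiment",
}


def make_request_body(api_key, text=None, url=None, mime_type=None,
                      language=None, extractors=None):
    """Assemble the JSON request object, omitting every field that is not set."""
    # Assumption of this sketch: require exactly one source.
    if (text is None) == (url is None):
        raise ValueError("set exactly one of text or url")
    unknown = set(extractors or []) - EXTRACTORS
    if unknown:
        raise ValueError(f"unknown extractors: {sorted(unknown)}")
    fields = {
        "apiKey": api_key,
        "text": text,
        "url": url,
        "mimeType": mime_type,
        "language": language,
        "extractors": extractors,
    }
    return {name: value for name, value in fields.items() if value is not None}


body = make_request_body(
    "YOUR_API_KEY",
    url="http://example.org",
    language="en",
    extractors=["Summary", "Sentiment"],
)
print(body)
```

Passing "summary" instead of "Summary" raises a ValueError before any request is sent, which makes case mistakes easier to spot.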

Processing the Response

Depending on the extractors you specified in the request, Sensium will return different results in the response JSON object. The general layout of this JSON object looks like this:

JSON Response field   Meaning
authenticationTime    time in seconds used to authenticate the API key
downloadTime          time in seconds used to download the content from the provided URL, if set
processingTime        time in seconds used to process the request (excluding authentication and download time)
text                  the (extracted) main text that was analyzed
language              two-letter ISO 639-1 code of the language of the main text, as specified in the request or guessed by Sensium
tokens                array of Token objects generated by the Tokens and PosTags extractors
sentences             array of Sentence objects generated by the Sentences extractor
entities              array of NamedEntity objects generated by the Entities extractor
temporalEvents        array of TemporalEvent objects generated by the TemporalEvents extractor
summary               Summary object containing keyphrases and sentences summarizing the main text, generated by the Summary extractor
polarity              Sentiment object specifying which parts of the text are negative/positive, generated by the Sentiment extractor
objectivity           Sentiment object specifying which parts of the text are subjective/objective, generated by the Sentiment extractor

An example response may look like this (for details of array and object fields see below):

{
   "authenticationTime": 0.0002,
   "processingTime": 0.002,
   "downloadTime": 0.432,
   "text": "the main text that was analyzed",
   "language": "en",
   "tokens": [ ... ],
   "sentences": [ ... ],
   "entities": [ ... ],
   "temporalEvents": [ ... ],
   "summary": { ... },
   "polarity": { ... },
   "objectivity": { ... }
}

Fields may be omitted if the respective extractor was not specified.
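Because result fields can be missing, client code should not assume that a key exists. The following sketch shows one defensive way to inspect a response that has already been decoded into a Python dict; available_results and total_time are hypothetical helper names.

```python
# Result fields that depend on the extractors requested.
RESULT_FIELDS = [
    "tokens", "sentences", "entities", "temporalEvents",
    "summary", "polarity", "objectivity",
]


def available_results(response):
    """Return the extractor result fields that are present in the response."""
    return [field for field in RESULT_FIELDS if field in response]


def total_time(response):
    """Sum the timing fields in seconds, treating missing ones as zero."""
    names = ("authenticationTime", "downloadTime", "processingTime")
    return sum(response.get(name, 0.0) for name in names)


response = {
    "authenticationTime": 0.0002,
    "processingTime": 0.002,
    "downloadTime": 0.432,
    "text": "the main text that was analyzed",
    "language": "en",
    "summary": {},
    "polarity": {},
    "objectivity": {},
}
print(available_results(response))
print(round(total_time(response), 4))
```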

Language and Text

No matter what extractors you specify, Sensium will always return the plain text in the text field and the two-letter ISO 639-1 language code in the language field. If you specified a URL in your request, Sensium will try to extract the main text from the URL content if its mime-type contains the string "html".

Tokens, Stems & POS-Tags

If you specified the Tokens or PosTags extractor, the tokens field of the response will contain a list of Token objects. A Token exposes the following fields:

Token field     Meaning
start, end      start and (exclusive) end position of the token in the main text. Zero based.
stem            the stemmed form
normalized      the normalized (language-specific, lower-cased) form
posTag          the part-of-speech tag, using the language's tag set
posTagUnified   the unified part-of-speech tag, using our language-independent, tiny tag set

These are the part-of-speech tag sets we currently support:

Language   Tag Set
English    Penn Treebank Tagset
German     Stuttgart/Tübinger Tagsets
Spanish    Parole Reduced Tagset
French     French Treebank Tagset
Italian    Tani POS Tagset

The tags of these and future tagsets are all mapped to our language independent unified tagset which consists of these tags:

ADJECTIVE, ADVERB, CONJUNCTION, DETERMINER, NOUN, PROPER_NOUN, NUMBER, OTHER, PARTICLE, PRONOUN, PREPOSITION, PUNCTUATION, VERB, UNKNOWN

If you want to process part-of-speech tags in a language agnostic way, then this tagset is for you.

A Token may look like this in JSON:

{
  "start": 14,
  "end": 17,
  "posTagUnified": "NOUN",
  "posTag": "NN",
  "stem": "pea",
  "normalized": "pea"
}
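Since start is inclusive and end is exclusive, the positions map directly onto a string slice. The sketch below recovers the surface form of the token shown above from the text used in the curl example, and filters tokens by unified tag. It assumes that positions count characters the same way Python string indices do, which this document does not state explicitly; token_text and tokens_with_tag are hypothetical helper names.

```python
def token_text(text, token):
    """Return the surface form of a token: the slice [start, end) of the main text."""
    return text[token["start"]:token["end"]]


def tokens_with_tag(tokens, unified_tag):
    """Keep only the tokens whose posTagUnified equals the given tag."""
    return [token for token in tokens if token.get("posTagUnified") == unified_tag]


text = "I am a little pea, i love the sky and the trees"
token = {
    "start": 14,
    "end": 17,
    "posTagUnified": "NOUN",
    "posTag": "NN",
    "stem": "pea",
    "normalized": "pea",
}
print(token_text(text, token))  # pea
```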

Sentences

If you specified the Sentences extractor, the sentences field of the response will contain a list of sentences found in the source. A Sentence has the following fields:

Sentence field   Meaning
start, end       start and (exclusive) end position of the sentence in the main text. Zero based.

A Sentence may look like this in JSON:

{
   "start": 0,
   "end": 47
}

Named Entities

If you specified the Entities extractor, the entities field of the response will contain a list of named entities such as persons, locations or organizations, found in the source. A NamedEntity has the following fields:

NamedEntity field   Meaning
type                type of the entity, one of "Location", "Person", "Organization"
link                link to a linked (open) data resource on DBpedia. May be null.
normalized          normalized surface form, e.g. "Obama" and "B. Obama" would become "Barack Obama"
occurrences         list of Occurrence objects, marking the spans in the main text where this entity can be found

A NamedEntity may look like this in JSON:

{
  "type": "Location",
  "link": "http:\/\/dbpedia.org\/resource\/Switzerland",
  "normalized": "Switzerland",
  "occurrences": [
    {
      "start": 205,
      "end": 216
    }
  ]
}
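The occurrences of an entity can be resolved back to the surface forms that appear in the main text. The following sketch uses a made-up text and made-up positions purely for illustration; entity_mentions is a hypothetical helper, and the code again assumes that positions behave like Python string indices.

```python
def entity_mentions(text, entity):
    """Return every surface form of an entity, in the order of its occurrences."""
    return [text[o["start"]:o["end"]] for o in entity["occurrences"]]


# Made-up text and entity, for illustration only.
text = "Obama visited Bern. B. Obama then left Switzerland."
entity = {
    "type": "Person",
    "link": None,  # the link field may be null
    "normalized": "Barack Obama",
    "occurrences": [
        {"start": 0, "end": 5},
        {"start": 20, "end": 28},
    ],
}
print(entity["normalized"], entity_mentions(text, entity))
```

Both mentions differ from the normalized form, which is why grouping by the normalized field is usually more robust than grouping by surface form.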

Temporal Events

If you specified the TemporalEvents extractor, the temporalEvents field of the response will contain a list of TemporalEvent objects. A TemporalEvent has the following fields:

TemporalEvent field   Meaning
start, end            start and (exclusive) end position of the temporal event in the main text. Zero based.
timestamp             number of milliseconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT

A TemporalEvent may look like this in JSON:

{
  "start": 29,
  "end": 38,
  "timestamp": -3573421200000
}
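Timestamps are in milliseconds and can be negative for dates before 1970, as in the example above. A minimal Python sketch of the conversion follows; it adds a timedelta to the epoch instead of calling datetime.fromtimestamp, because the latter can fail for negative values on some platforms. The name event_datetime is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_datetime(event):
    """Convert a TemporalEvent timestamp (milliseconds since the epoch) to UTC."""
    return EPOCH + timedelta(milliseconds=event["timestamp"])


event = {"start": 29, "end": 38, "timestamp": -3573421200000}
print(event_datetime(event).isoformat())  # 1856-10-05T23:00:00+00:00
```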

Keyphrases & Summaries

If you specified the Summary extractor, the summary field of the response will be set to a Summary object, which has the following fields:

Summary field   Meaning
text            summary of the text consisting of the 5 most important sentences in the text
keySentences    list of Occurrence objects delimiting the sentences making up the summary, ranked by decreasing importance
keyPhrases      list of KeyPhrase objects, describing the 10 most important keyphrases, ranked by decreasing importance

The KeyPhrase class has the following fields:

KeyPhrase field   Meaning
text              textual representation of the keyphrase
score             relative score of the keyphrase, specifying its importance relative to the other keyphrases
occurrences       list of Occurrence objects, marking the occurrences of the keyphrase in the main text

A Summary may look like this in JSON:

{
  "text": "Omega, the official timekeeper for the 2014 Sochi Winter Olympics in Russia, has added a unit capable of transmitting speed, acceleration, G-force and vertical track positioning data during their runs. Technology underpins almost every aspect of the Games: cross-country skiers are tracked by GPS technology, while speed skaters' times are measured to the nearest thousandth of a second using light beams on the surface of the ice at the finish line. If you can secure the Games you can secure pretty much anything else on earth\u201d The rise in the use of such data-transmitting sensors and mobile devices has led to a surge in data collection and usage, with a big knock-on effect for networking and security, IT providers say. At the Vancouver Winter Olympics in 2010, the ratio of wired to wireless devices was four-to-one, according to Dean Frohwerk, head of networking architecture for Avaya, an official IT Olympic Partner providing services to the 40,000 officials, athletes, journalists and support staff at the Games. Russian telecoms provider MegaFon is responsible for providing the local network for spectators, and the US State Department has warned visitors that: \"Russian Federal law permits the monitoring, retention and analysis of all data that traverses Russian communication networks, including internet browsing, email messages, telephone calls, and fax transmissions.\"",
  "keySentences": [
    {
      "start": 169,
      "end": 370
    },
    {
      "start": 957,
      "end": 1205
    },
    {
      "start": 1367,
      "end": 1641
    },
    {
      "start": 1642,
      "end": 1939
    },
    {
      "start": 4813,
      "end": 5176
    }
  ],
  "keyPhrases": [
    {
      "text": "data",
      "score": 0.52432751655579,
      "occurrences": [
        {
          "start": 132,
          "end": 136
        },
        {
          "start": 347,
          "end": 351
        },
        {
          "start": 1541,
          "end": 1545
        },
        {
          "start": 2217,
          "end": 2221
        },
        {
          "start": 2422,
          "end": 2426
        },
        {
          "start": 2539,
          "end": 2543
        },
        {
          "start": 3253,
          "end": 3257
        },
        {
          "start": 3467,
          "end": 3471
        },
        {
          "start": 3813,
          "end": 3817
        },
        {
          "start": 5039,
          "end": 5043
        }
      ]
    },
    {
      "text": "Games",
      "score": 0.41496595740318,
      "occurrences": [
        {
          "start": 609,
          "end": 614
        },
        {
          "start": 941,
          "end": 946
        },
        {
          "start": 1005,
          "end": 1010
        },
        {
          "start": 1290,
          "end": 1295
        },
        {
          "start": 1389,
          "end": 1394
        },
        {
          "start": 1933,
          "end": 1938
        },
        {
          "start": 2933,
          "end": 2938
        },
        {
          "start": 4164,
          "end": 4169
        },
        {
          "start": 4384,
          "end": 4389
        },
        {
          "start": 4622,
          "end": 4627
        },
        {
          "start": 4755,
          "end": 4760
        },
        {
          "start": 5253,
          "end": 5258
        }
      ]
    },
    {
      "text": "networks",
      "score": 0.24062933027744,
      "occurrences": [
        {
          "start": 2162,
          "end": 2170
        },
        {
          "start": 2401,
          "end": 2409
        },
        {
          "start": 3037,
          "end": 3045
        },
        {
          "start": 3385,
          "end": 3393
        },
        {
          "start": 5081,
          "end": 5089
        }
      ]
    },
    {
      "text": "devices",
      "score": 0.21658238768578,
      "occurrences": [
        {
          "start": 1511,
          "end": 1518
        },
        {
          "start": 1715,
          "end": 1722
        },
        {
          "start": 2033,
          "end": 2040
        },
        {
          "start": 2153,
          "end": 2160
        }
      ]
    },
    {
      "text": "security",
      "score": 0.21384307742119,
      "occurrences": [
        {
          "start": 1614,
          "end": 1622
        },
        {
          "start": 4315,
          "end": 4323
        }
      ]
    },
    {
      "text": "data security",
      "score": 0.21384307742119,
      "occurrences": [
        {
          "start": 3879,
          "end": 3892
        }
      ]
    },
    {
      "text": "Olympics",
      "score": 0.18786655366421,
      "occurrences": [
        {
          "start": 4282,
          "end": 4290
        }
      ]
    },
    {
      "text": "technology",
      "score": 0.18376669287682,
      "occurrences": [
        {
          "start": 390,
          "end": 400
        },
        {
          "start": 709,
          "end": 719
        },
        {
          "start": 1052,
          "end": 1062
        },
        {
          "start": 2658,
          "end": 2668
        },
        {
          "start": 4036,
          "end": 4046
        }
      ]
    },
    {
      "text": "Technology",
      "score": 0.18376669287682,
      "occurrences": [
        {
          "start": 957,
          "end": 967
        }
      ]
    },
    {
      "text": "Winter Olympics",
      "score": 0.17981751263142,
      "occurrences": [
        {
          "start": 541,
          "end": 556
        },
        {
          "start": 1659,
          "end": 1674
        }
      ]
    },
    {
      "text": "something",
      "score": 0.16984289884567,
      "occurrences": [
        {
          "start": 4010,
          "end": 4019
        },
        {
          "start": 4463,
          "end": 4472
        },
        {
          "start": 4648,
          "end": 4657
        }
      ]
    },
    {
      "text": "Sochi",
      "score": 0.15481390058994,
      "occurrences": [
        {
          "start": 1944,
          "end": 1949
        },
        {
          "start": 2803,
          "end": 2808
        }
      ]
    },
    {
      "text": "Sochi Winter Olympics",
      "score": 0.15481390058994,
      "occurrences": [
        {
          "start": 213,
          "end": 234
        },
        {
          "start": 5186,
          "end": 5207
        }
      ]
    },
    {
      "text": "security event",
      "score": 0.14266432821751,
      "occurrences": [
        {
          "start": 4441,
          "end": 4455
        },
        {
          "start": 4572,
          "end": 4586
        }
      ]
    },
    {
      "text": "event",
      "score": 0.14266432821751,
      "occurrences": [
        {
          "start": 3652,
          "end": 3657
        }
      ]
    },
    {
      "text": "networking",
      "score": 0.13659007847309,
      "occurrences": [
        {
          "start": 1599,
          "end": 1609
        },
        {
          "start": 3541,
          "end": 3551
        },
        {
          "start": 3606,
          "end": 3616
        }
      ]
    },
    {
      "text": "times",
      "score": 0.13619311153889,
      "occurrences": [
        {
          "start": 883,
          "end": 888
        },
        {
          "start": 1085,
          "end": 1090
        },
        {
          "start": 1262,
          "end": 1267
        },
        {
          "start": 3995,
          "end": 4000
        }
      ]
    },
    {
      "text": "Russian",
      "score": 0.12990795075893,
      "occurrences": [
        {
          "start": 4813,
          "end": 4820
        },
        {
          "start": 4965,
          "end": 4972
        },
        {
          "start": 5059,
          "end": 5066
        }
      ]
    },
    {
      "text": "equipment",
      "score": 0.12056763470173,
      "occurrences": [
        {
          "start": 788,
          "end": 797
        },
        {
          "start": 1356,
          "end": 1365
        }
      ]
    }
  ]
}
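A Summary object is typically consumed by picking the top keyphrases and resolving the key sentences against the main text. The sketch below uses a short made-up text, with keyphrase scores abridged from the example above; top_key_phrases and key_sentences are hypothetical helpers, and positions are assumed to behave like Python string indices.

```python
def top_key_phrases(summary, limit=3):
    """Return (text, score, occurrence count) for the highest-scoring keyphrases."""
    ranked = sorted(summary["keyPhrases"], key=lambda p: p["score"], reverse=True)
    return [(p["text"], p["score"], len(p["occurrences"])) for p in ranked[:limit]]


def key_sentences(text, summary):
    """Resolve the keySentences spans against the main text."""
    return [text[s["start"]:s["end"]] for s in summary["keySentences"]]


# Made-up main text and spans; the scores are taken from the example above.
text = "Peas are green. They love the sky. Trees are tall."
summary = {
    "text": "They love the sky.",
    "keySentences": [{"start": 16, "end": 34}],
    "keyPhrases": [
        {"text": "data", "score": 0.52432751655579,
         "occurrences": [{"start": 132, "end": 136}, {"start": 347, "end": 351}]},
        {"text": "Games", "score": 0.41496595740318,
         "occurrences": [{"start": 609, "end": 614}]},
        {"text": "networks", "score": 0.24062933027744,
         "occurrences": [{"start": 2162, "end": 2170}]},
    ],
}
print(key_sentences(text, summary))
print(top_key_phrases(summary, limit=2))
```

The explicit sort is defensive; according to the table above, keyphrases already arrive ranked by decreasing importance.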

Sentiment

If you specified the Sentiment extractor, the polarity and objectivity fields of the response will be set. Polarity describes whether a text talks positively or negatively about a given topic. Objectivity describes whether a text is written objectively (news article) or subjectively (editorial). Both fields are of type Sentiment which has the following fields:

Sentiment field   Meaning
score             the document-wide sentiment score, between -1 and 1
occurrences       list of SentimentOccurrence objects delimiting spans in the main text for which sentiment exists

The SentimentOccurrence object has the following fields:

SentimentOccurrence field   Meaning
start, end                  start and (exclusive) end position in the main text. Zero based.
score                       score of the occurrence, between -1 and 1

In the case of polarity, a negative score means negative polarity (e.g. a bad review), while a positive score indicates positive polarity (a good review). In the case of objectivity, a negative score indicates subjectivity, and a positive score indicates objectivity. The closer a score is to 0, the more neutral it is.

A Sentiment may look like this in JSON:

{
  "score": -0.046744345359129,
  "occurrences": [
    {
      "start": 0,
      "end": 26,
      "score": -0.087169368026717
    },
    {
      "start": 27,
      "end": 61,
      "score": -0.035232597176918
    },
    {
      "start": 62,
      "end": 200,
      "score": -0.11045730047733
    },
    {
      "start": 294,
      "end": 402,
      "score": -0.010411028077226
    },
    {
      "start": 403,
      "end": 502,
      "score": 0.17532608715019
    },
    {
      "start": 503,
      "end": 723,
      "score": -0.00614353801096
    },
    {
      "start": 918,
      "end": 944,
      "score": -0.087169368026717
    },
    {
      "start": 945,
      "end": 979,
      "score": -0.035232597176918
    },
    {
      "start": 980,
      "end": 1085,
      "score": -0.18513380696119
    },
    {
      "start": 1086,
      "end": 1209,
      "score": 0.074054352578357
    },
    {
      "start": 1210,
      "end": 1361,
      "score": -0.4717752488144
    },
    {
      "start": 1362,
      "end": 1499,
      "score": -0.24717707171131
    },
    {
      "start": 1500,
      "end": 1611,
      "score": -0.30828466458156
    },
    {
      "start": 1612,
      "end": 1660,
      "score": -0.04155703919874
    },
    {
      "start": 1661,
      "end": 1764,
      "score": -0.023096440035794
    },
    {
      "start": 1813,
      "end": 1875,
      "score": 0.12322307952137
    },
    {
      "start": 1876,
      "end": 1920,
      "score": -0.23876018480756
    },
    {
      "start": 1921,
      "end": 2017,
      "score": -0.089300791535217
    },
    {
      "start": 2127,
      "end": 2303,
      "score": -0.19152074634999
    },
    {
      "start": 2304,
      "end": 2381,
      "score": 0.057648045073718
    },
    {
      "start": 2504,
      "end": 2597,
      "score": -0.27649551942229
    },
    {
      "start": 2598,
      "end": 2760,
      "score": -0.010411028077226
    },
    {
      "start": 2761,
      "end": 2841,
      "score": 0.14730090007754
    },
    {
      "start": 2842,
      "end": 2963,
      "score": 0.2441611252635
    },
    {
      "start": 3340,
      "end": 3446,
      "score": -0.03376349177178
    },
    {
      "start": 3447,
      "end": 3496,
      "score": -0.03376349177178
    },
    {
      "start": 3497,
      "end": 3525,
      "score": 0.12151099538396
    }
  ]
}
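Scores like these are often reduced to a coarse label, or scanned for the span with the strongest sentiment. The following sketch shows both for a polarity object abridged from the example above. The width of the neutral band is an arbitrary assumption of this example, not a value defined by Sensium, and polarity_label and most_extreme are hypothetical helpers.

```python
def polarity_label(score, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a coarse label (band width is arbitrary)."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def most_extreme(sentiment):
    """Return the occurrence whose score is farthest from 0."""
    return max(sentiment["occurrences"], key=lambda o: abs(o["score"]))


# Abridged from the example above.
polarity = {
    "score": -0.046744345359129,
    "occurrences": [
        {"start": 0, "end": 26, "score": -0.087169368026717},
        {"start": 403, "end": 502, "score": 0.17532608715019},
        {"start": 1210, "end": 1361, "score": -0.4717752488144},
    ],
}
print(polarity_label(polarity["score"]))
print(most_extreme(polarity))
```

For an objectivity object the same helpers apply, with the labels read as objective and subjective instead of positive and negative.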