Sensium offers a wide variety of data analysis features, from low-level building blocks like tokenization and language detection to higher-level functionality like sentiment analysis and summarization. For each feature we have a simple live demo that lets you specify a text or URL to be analyzed. You can then inspect the JSON request/response as well as the 'pretty' output. For detailed information on the API, visit our HTTP or Java guides!

Parsing & Cleaning

Main Text Extraction

Text data is stored in a multitude of formats, such as plain text, HTML or PDF. Sensium supports parsing such formats and extracting their main text. Currently we support plain text and HTML, with PDF and Office format support in the pipeline.
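
To give a rough idea of what main text extraction from HTML involves, here is a minimal sketch built on Python's standard `html.parser`. It is not Sensium's implementation: the list of skipped tags and the `extract_main_text` helper are assumptions made for illustration, and a production extractor would use far more robust heuristics.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects visible text and skips tags that rarely hold main content."""

    # Assumption: these tags are treated as boilerplate in this sketch.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text outside of skipped elements, joined by spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_main_text` on a page with a navigation bar and an article paragraph would keep the paragraph text and drop the navigation labels.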

Natural Language Processing

Language Recognition

Many text mining tasks start with identifying the language a given text is written in. The result of this recognition step then determines which further analysis steps can or should be applied. Our language recognition is based on n-gram statistics for English, German, French, Spanish and Italian.
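
The idea behind n-gram based recognition can be sketched in a few lines. The example below compares character trigram counts of the input against per-language profiles using cosine similarity. The tiny training samples, the restriction to two languages, and the function names are assumptions for illustration only; a real system would train on large corpora and may use a different similarity measure.

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams of the lower-cased text."""
    padded = "  " + text.lower() + "  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


# Toy samples (assumption); real profiles are built from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house of the people",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist das haus der leute",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: similarity(profile, PROFILES[lang]))
```

Short inputs carry few trigrams, so a sketch like this becomes unreliable for single words or very short phrases.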

Tokenization

Once the language of a text is identified, we can identify individual tokens (words, punctuation, symbols, URLs) within the text. Tokens are the input for higher-level functionality like part-of-speech tagging or keyword extraction. The characteristics of a text's tokens can also help us identify and compare authors, since they indicate an author's style!
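
A simple regular expression is enough to illustrate tokenization with character offsets. The pattern and the `start`/`end` field names below are assumptions chosen for this sketch, not a description of Sensium's tokenizer, which is language dependent.

```python
import re

# URLs first, then words (allowing inner apostrophes or hyphens),
# then any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"https?://\S+|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return one dict per token with its start and end offsets in the text."""
    return [{"start": m.start(), "end": m.end()} for m in TOKEN_RE.finditer(text)]
```

The token text is recovered by slicing the original string, for example `text[token["start"]:token["end"]]`, which avoids storing every token twice.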

Stemming & Normalization

We can process the tokens further by stemming and normalizing them. Stemming strips inflections, making words more comparable. For example, "houses", "housed" and "housing" would all become "hous" in their stemmed form. Normalization converts every word to lower case. These two operations on tokens are a common preprocessing step for text-based machine learning.
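
The effect of both operations can be illustrated with a deliberately naive suffix stripper. This is not a real stemming algorithm such as Porter's, and it is not what Sensium uses; the suffix list and the minimum stem length are assumptions that happen to cover the example words.

```python
# Suffixes are checked in this order; longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word):
    """Lower-case the word."""
    return word.lower()
```

Applying `naive_stem` to "houses", "housed" and "housing" yields "hous" in each case, while `normalize("Houses")` yields "houses".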

Part-of-Speech Tagging

Part-of-speech tagging assigns each token a word category like noun, verb or adjective. Most languages have their own set of tags, like the Stuttgart-Tübingen-Tagset for German or the Penn Treebank Tag Set for English. Sensium supports part-of-speech tagging for English, German, French, Spanish and Italian, returning tags from the respective standard tag set for each language. In addition to the language-specific tag sets we provide a unified tag set of 12 categories. Each tag from the language-specific tag sets is mapped to one unified tag to simplify processing of multiple languages.
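
The mapping from language-specific tags to a unified tag can be pictured as a lookup table per language. The excerpt below is a sketch: the Penn Treebank and STTS tags are real, but the unified category names (`NOUN`, `VERB`, `ADJ`, `DET`, `X`) are an assumption modeled on common universal tag sets, and the tables cover only a handful of tags.

```python
# Small excerpts only; the unified names are assumed for illustration.
PENN_TO_UNIFIED = {"NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "VBD": "VERB", "JJ": "ADJ", "DT": "DET"}
STTS_TO_UNIFIED = {"NN": "NOUN", "VVFIN": "VERB", "ADJA": "ADJ", "ART": "DET"}

MAPPINGS = {"en": PENN_TO_UNIFIED, "de": STTS_TO_UNIFIED}


def unify(language, pos_tag):
    """Map a language-specific tag to a unified tag, or "X" if it is unknown."""
    return MAPPINGS[language].get(pos_tag, "X")
```

With such a table, code that only needs to find verbs can check for a single unified tag instead of handling `VBD` for English and `VVFIN` for German separately.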

Sentence Segmentation

Sentence segmentation allows us to identify the start and end of each sentence within a text. The identified sentences can then serve as the basis for analyzing an author's style or summarizing a text.
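
A rule-based splitter shows what start and end offsets per sentence look like. This sketch assumes that a sentence ends at `.`, `!` or `?` followed by whitespace or the end of the text, so it will split incorrectly after abbreviations such as "e.g."; the field names are likewise assumptions.

```python
import re

# One or more terminators, followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text):
    """Return one dict per sentence with its start and end offsets."""
    spans = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        spans.append({"start": start, "end": match.end()})
        # Skip the whitespace that separates this sentence from the next one.
        start = match.end()
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        # Trailing text without a terminator still counts as a sentence.
        spans.append({"start": start, "end": len(text)})
    return spans
```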

Information Extraction

Named Entity Recognition

By applying Named Entity Recognition we can discover things like persons, locations or organizations within a text. These entities can be used to tag a document or to make it comparable to other documents. Once we find an entity, we try to link it to one of the popular linked open data sources like Wikipedia or Freebase. In this step we first have to figure out the best match for the found entity, which may be ambiguous: Michael Jordan, for example, may refer to the basketball player or the mathematician. Disambiguation of entities is based on their surrounding text.
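
Disambiguation based on surrounding text can be sketched as picking the candidate whose typical context words overlap most with the words around the mention. Everything in the candidate table below is hypothetical: the `candidate:` identifiers are placeholders rather than real links, and the context word sets are made up for illustration.

```python
import re

# Hypothetical candidates with hand-picked context words.
CANDIDATES = {
    "Michael Jordan": [
        {"link": "candidate:basketball-player",
         "context": {"basketball", "nba", "bulls", "player"}},
        {"link": "candidate:mathematician",
         "context": {"mathematician", "statistics", "learning", "professor"}},
    ]
}


def disambiguate(mention, text):
    """Return the candidate sharing the most context words with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return max(CANDIDATES[mention], key=lambda c: len(words & c["context"]))
```

For a sentence that mentions the NBA and the Bulls, the first candidate wins; for one that mentions a professor and statistics, the second does.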

Temporal Event Extraction

Often we are interested in extracting temporal events or dates from a text to figure out which time period it covers. Sensium can extract occurrences of such temporal events within a text.
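
As a small illustration, the sketch below finds ISO-style dates (`YYYY-MM-DD`) and converts them to UTC timestamps. It covers only this one date format, and the field names as well as the choice of seconds since the epoch for `timestamp` are assumptions; real temporal extraction also has to handle expressions like "last Tuesday" or "March 2014".

```python
import re
from datetime import datetime, timezone

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text):
    """Return offsets and UTC timestamps (in seconds) of ISO dates in the text."""
    dates = []
    for match in DATE_RE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            moment = datetime(year, month, day, tzinfo=timezone.utc)
        except ValueError:
            continue  # matches the pattern but is not a valid calendar date
        dates.append({
            "start": match.start(),
            "end": match.end(),
            "timestamp": int(moment.timestamp()),
        })
    return dates
```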

Keyphrase Extraction

In many usage scenarios, users may only be interested in the keyphrases that describe a long text. Sensium can extract the most relevant keyphrases from a text based on statistical analysis.
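
One of the simplest statistical approaches is to score terms by their relative frequency after removing stop words. The sketch below does exactly that and, for brevity, restricts itself to single-word "keyphrases". The stop word list, the scoring formula and the field names are assumptions; the source does not specify which statistics Sensium uses.

```python
import re
from collections import Counter

# A tiny English stop word list, assumed for this example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that", "it"}


def extract_keyphrases(text, limit=5):
    """Return the most frequent non-stop-words with their relative frequency."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [{"text": word, "score": count / total}
            for word, count in counts.most_common(limit)]
```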

Summarization

Media consumption can become an overwhelming task for end users due to the sheer amount of text to read. Summarization can be used to provide users with a short summary composed of the most relevant sentences in the source text.
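
Selecting the most relevant sentences is known as extractive summarization. A minimal sketch, assuming that relevance can be approximated by the average document-wide frequency of a sentence's words, looks as follows. The scoring function, the naive sentence pattern and the field names are illustrative assumptions, not Sensium's method.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return offsets of the highest-scoring sentences in document order."""
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = match.group()
        words = re.findall(r"\w+", chunk.lower())
        if not words:
            continue
        # Exclude surrounding whitespace from the reported offsets.
        start = match.start() + len(chunk) - len(chunk.lstrip())
        end = start + len(chunk.strip())
        score = sum(freq[w] for w in words) / len(words)
        scored.append((score, start, end))
    top = sorted(scored, reverse=True)[:max_sentences]
    return [{"start": s, "end": e} for _, s, e in sorted(top, key=lambda t: t[1])]
```

Returning the selected sentences in their original order keeps the summary readable, even though the selection itself is driven by score.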
