Text Analytics


Service Endpoints

Service | Endpoint | Methods
Twitter analytics | https://text.s4.ontotext.com/v1/twitie | POST
News analytics | https://text.s4.ontotext.com/v1/news | POST
News analytics (German) | https://text.s4.ontotext.com/v1/news-de | POST
Bio-medical analytics | https://text.s4.ontotext.com/v1/sbt | POST
Healthcare-Tagger | https://text.s4.ontotext.com/v1/healthcare-tagger | POST
News Classifier | https://text.s4.ontotext.com/v1/news-classifier | POST

HTTP Headers

Header name | Required | Description | Valid values | Default value
Content-Type | yes | The MIME type of the request. | application/json, multipart/mixed, multipart/form-data | n/a
Accept-Encoding | no | Set if the client supports transparent compression. | gzip | if omitted, response is not compressed
Accept | yes | Determines the response format, as per the usual HTTP content negotiation rules. | application/json, text/mate | n/a
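
As a rough illustration of the Accept-Encoding header, the sketch below shows how a client might decode a response body depending on whether the service compressed it. The helper name decode_body is hypothetical. The sketch assumes that the service signals compression through the standard Content-Encoding response header and that the payload is UTF-8; neither detail is stated on this page.

```python
import gzip
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str]) -> str:
    """Return the response body as text, decompressing it first if needed."""
    # Assumption: compression is reported via the Content-Encoding header.
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    # Assumption: the payload is UTF-8 encoded.
    return raw.decode("utf-8")
```

A client that does not send Accept-Encoding: gzip can skip this step, since the response is then not compressed.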

POST Request

Parameters

The POST request takes no parameters. All configuration information is provided in a JSON structure in the request body.

Request Body

The request body is a JSON structure containing the input text document, either included directly in the request (document) or given as a reference to a remote URL (documentUrl). In either case, the documentType property should specify the format of the input data (plain text, HTML page, Twitter message, MS Word, etc.).

The following table provides the details on the attributes of the JSON request structure:

Attribute name | Required | Description | Valid values | Default value
document | No | The document to be processed. Either the document or documentUrl parameter must be specified. Specifying both parameters is an error. | JSON String representing the document content | n/a
documentUrl | No | The URL of the document to be processed. Either the documentUrl or document parameter must be specified. Specifying both parameters is an error. The URL must be accessible to the service, i.e. it must be publicly accessible and should not require any authentication or setting cookies for access. | JSON String representing a publicly accessible URL | n/a
documentType | Yes | The MIME type of the document to be processed | One of the MIME types listed below | n/a

Valid values for documentType:

  • text/plain for plain text documents
  • text/xml for XML documents
  • text/html for HTML documents
  • text/x-json-twitter for Twitter JSON format
  • application/msword for MS Word documents (.doc)
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document for MS Word documents (.docx)
  • application/rtf for Rich Text Format documents
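
The attribute rules above can be sketched as a small helper that serializes the request structure and rejects requests that set both document and documentUrl, or neither. The function name build_request_body and its Python parameter names are hypothetical; only the JSON attribute names come from this page.

```python
import json
from typing import Optional


def build_request_body(document_type: str,
                       document: Optional[str] = None,
                       document_url: Optional[str] = None) -> str:
    """Serialize the JSON request structure described above."""
    # Exactly one of document / documentUrl must be set;
    # supplying both (or neither) is treated as an error.
    if (document is None) == (document_url is None):
        raise ValueError("specify exactly one of document or documentUrl")
    body = {"documentType": document_type}
    if document is not None:
        body["document"] = document
    else:
        body["documentUrl"] = document_url
    return json.dumps(body)
```

For example, build_request_body("text/html", document_url="http://example.com/page.html") yields a body that references a remote HTML page, where the URL is a placeholder.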

Plain Text Processing

For plain text input, the document content can be included directly in the request. The documentType property should specify the format of the input data (for example, "text/plain" for plain text).
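
A minimal sketch of such a request, built with Python's standard urllib, might look as follows. It only constructs the request object and does not send it. The helper name build_plain_text_request is hypothetical, the News analytics endpoint is used purely as an example target, and any authentication the service requires is not covered in this section and is therefore not shown.

```python
import json
import urllib.request

# Example target: the News analytics endpoint from the Service Endpoints table.
NEWS_ENDPOINT = "https://text.s4.ontotext.com/v1/news"


def build_plain_text_request(text: str,
                             endpoint: str = NEWS_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying inline plain text."""
    body = json.dumps({"document": text, "documentType": "text/plain"})
    headers = {
        "Content-Type": "application/json",  # the request body is JSON
        "Accept": "application/json",        # ask for the JSON response format
    }
    return urllib.request.Request(endpoint, data=body.encode("utf-8"),
                                  headers=headers, method="POST")
```

The returned object can be passed to urllib.request.urlopen once whatever credentials the service expects have been added.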

HTML Page Processing

To process an HTML page, the document URL has to be specified in the request. The documentType property should specify the format of the input data ("text/html").

Office Document Formats Processing

If the input for the S4 text processing services is neither plain text nor an HTML document, the structure of the service request is different. It consists of two sections:

  1. request metadata specifying the documentType and annotationSelectors;
  2. the input data (MS Word file) as a binary attachment

The request is implemented as an HTTP multipart message with two attachments:

  • Annotation Descriptor (application/json)
  • Binary Data (application/octet-stream)

The content type (Content-Type) header of each of the attachments should be: application/json for the metadata and application/octet-stream for the binary data.
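
The two-attachment layout can be sketched by assembling the multipart body by hand. This is an illustration under assumptions rather than the service's reference client: the helper name build_multipart_body is hypothetical, multipart/mixed is chosen from the Content-Type values listed under HTTP Headers, and the parts carry no headers other than Content-Type. The annotationSelectors attribute is left out of the usage note because its structure is not described on this page.

```python
import json
import uuid
from typing import Optional, Tuple


def build_multipart_body(metadata: dict, file_bytes: bytes,
                         boundary: Optional[str] = None) -> Tuple[str, bytes]:
    """Return (Content-Type header value, body) for a two-part multipart/mixed message."""
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b"Content-Type: application/json",          # part 1: request metadata
        b"",
        json.dumps(metadata).encode("utf-8"),
        delimiter,
        b"Content-Type: application/octet-stream",  # part 2: the binary document
        b"",
        file_bytes,
        delimiter + b"--",                          # closing delimiter
        b"",
    ]
    return "multipart/mixed; boundary=" + boundary, b"\r\n".join(lines)
```

A .doc file could then be sent by calling build_multipart_body({"documentType": "application/msword"}, data), where data holds the file's bytes, and using the returned header value as the request's Content-Type.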

Response Format

News, Bio-medical and Twitter Analytics Services

JSON format

The simplest response format for the S4 text annotation services (news, bio-medical, Twitter) is application/json. For each annotated document, it consists of a JSON object with two properties:

  • text - containing the plain text of the original document, stripped of any markup (e.g. HTML or XML tags)
  • entities - containing the annotations for the entities identified in the text

    The annotation position within the plain text ("text" field) is represented as "indices":[start,end] (zero-based character offsets, start inclusive, end exclusive). The annotation features are represented as the other JSON properties of this object.
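
To illustrate how the "indices" offsets relate to the "text" field, the sketch below slices out the text covered by each annotation. The function name annotated_spans is hypothetical, and the sketch assumes that "entities" maps an annotation type name to a list of annotation objects. That shape is not spelled out on this page, so the loop may need adjusting to the actual response.

```python
from typing import List, Tuple


def annotated_spans(response: dict) -> List[Tuple[str, str]]:
    """Return (annotation type, covered text) pairs for a JSON response."""
    text = response["text"]
    spans = []
    # Assumption: "entities" is a mapping of type name -> list of annotations.
    for ann_type, annotations in response["entities"].items():
        for ann in annotations:
            # Start is inclusive and end is exclusive, like a Python slice.
            start, end = ann["indices"]
            spans.append((ann_type, text[start:end]))
    return spans
```

Because the offsets are zero-based with an exclusive end, they can be used directly as slice bounds without any adjustment.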

If the original document is Twitter JSON (i.e. is sent with text/x-json-twitter MIME type), the output JSON will attempt to preserve the JSON structure of the original Tweet as much as possible. If the original Tweet contains "entities", the output annotations will be merged with the ones from the original JSON.

MATE format

MATE format is a special kind of HTML containing annotation results from the S4 Text Analytics services. The purpose of this format is to preserve the original layout of the documents while adding the metadata in HTML elements that are not rendered visually.

A document in MATE format is:

  • a valid HTML document
  • enriched with annotations represented as JSON
  • containing additional SPAN elements that mark the location of annotations in the text.

Using this format requires HTML input documents.

Format Description

The annotation representation is included in the header of the input HTML document as a JSON object containing all the annotations generated by the service. Each annotation is a collection of HTML elements referenced by IDs ("data-custom-id" attribute). The IDs are unique within the document and are generated at processing time.

Each annotation (JSON) object has the following structure:

  • type - annotation type
  • features - annotation features
  • nodes - the identifiers of all elements that the annotation covers

JSON representation of a single annotation:

The features element contains the annotation properties for the entities identified in the text, as key-value pairs. For semantic annotations, the relevant properties can be:

  • class - the ontology class URI of the recognized entity
  • inst - the URI of the recognized entity
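
The link between an annotation's nodes and the marked-up text can be sketched with Python's standard HTML parser. This is a sketch under assumptions: the names SpanTextCollector and annotation_text are hypothetical, the annotated elements are assumed to be span elements carrying a "data-custom-id" attribute, and node identifiers are compared as strings. The sample fragment and annotation in the usage note are invented for illustration and are not service output.

```python
from html.parser import HTMLParser
from typing import List


class SpanTextCollector(HTMLParser):
    """Collect the text inside every span that has a data-custom-id attribute."""

    def __init__(self):
        super().__init__()
        self._open_spans = []   # ids (or None) of the spans currently open
        self.text_by_id = {}    # data-custom-id -> text covered by that span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            node_id = dict(attrs).get("data-custom-id")
            self._open_spans.append(node_id)
            if node_id is not None:
                self.text_by_id.setdefault(node_id, "")

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()

    def handle_data(self, data):
        # Text belongs to every identified span that is open at this point.
        for node_id in self._open_spans:
            if node_id is not None:
                self.text_by_id[node_id] += data


def annotation_text(html: str, annotation: dict) -> List[str]:
    """Return the text of each node referenced by one annotation object."""
    collector = SpanTextCollector()
    collector.feed(html)
    collector.close()
    return [collector.text_by_id.get(str(n), "") for n in annotation["nodes"]]
```

For a fragment such as '<p><span data-custom-id="7">Amazon</span> is testing drones.</p>' and an annotation whose nodes list is ["7"], annotation_text returns the text of that single span.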

Refer to the example in the following section for complete annotation representations.

MATE Example

The following example is based on a BBC news article introducing the new Amazon delivery drone. For brevity, the result is cut down to a few annotations on a fragment of the original document. The annotations are generated by the News Annotation service.

The proper request to the service should provide a reference to the original document:

The resulting HTML document contains a JSON object, enclosed in <script> tags, that describes the recognized entities, such as persons and organizations. The annotation boundaries are marked by span elements in the document content.

News Classification Service

The response format for the S4 news classification service is application/json. The result is a JSON object providing document classification information as well as a ranked list of the top 3 category candidates. The latter gives the confidence level of the selected category relative to the other best category candidates.

Example

You may refer to the News analytics example which has a detailed explanation of the input request format as well as the JSON response for a sample document.

Swagger

The Swagger description of the Text Analytics REST API is available at http://swagger.s4.ontotext.com/
