Twitter IE

Introduction

Twitter IE is a named entity recognition service specifically tuned for Twitter data. It performs:

  • tokenisation, sentence splitting and part-of-speech tagging, using a model trained specifically for Tweets
  • normalisation of abbreviations and shortened word forms frequently found in Tweets ("brb", "ttyl", "gr8", "2day", etc.)
  • tagging of Twitter-specific entities such as hashtags and @mentions, as well as URLs and emoticons
  • general named-entity recognition to identify basic entity types such as Person, Location, Organization, Money amounts, Time and Date expressions, etc.

Acknowledgements

The Twitter analytics service of S4 is based on TwitIE, the open source information extraction pipeline from the GATE platform. More information about GATE TwitIE is available on the GATE website.

Supported annotations

Type            Description
:Person         Standard named entity type
:Location       Standard named entity type
:Organization   Standard named entity type
:Date           Standard named entity type
:Address        Includes email and IP addresses as well as street addresses
:Token          The individual tokens of the text, with a "category" feature holding the POS tag
:Emoticon       Emoticons
:Hashtag        Hashtags, including the leading # character
:URL            URL mentions
:UserID         The username part of @user mentions, not including the leading @ sign
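
The offset conventions in the table (a Hashtag keeps its leading #, a UserID drops the leading @) can be illustrated with a small sketch. This is not how TwitIE works internally: the regular expressions, the function name `tag_twitter_entities` and the sample Tweet are all assumptions made for illustration only.

```python
import re

# Illustrative patterns only; the real service uses the GATE TwitIE pipeline,
# not these regular expressions.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "Hashtag": re.compile(r"#\w+"),       # the match includes the leading '#'
    "UserID": re.compile(r"(?<=@)\w+"),   # the match excludes the leading '@'
}


def tag_twitter_entities(text):
    """Return (type, start, end, string) tuples sorted by start offset."""
    annotations = []
    for ann_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            annotations.append((ann_type, match.start(), match.end(), match.group()))
    return sorted(annotations, key=lambda annotation: annotation[1])


# Made-up Tweet text used only to exercise the sketch.
tweet = "Thanks @gate_team for #TwitIE! http://example.com/demo"
for annotation in tag_twitter_entities(tweet):
    print(annotation)
```

The start and end values are character offsets into the Tweet text, so `tweet[start:end]` always reproduces the annotated string.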

Supported entities

Name    Description
type    The entity type
class   The entity type from DBpedia:
        http://dbpedia.org/ontology/Person for persons
        http://dbpedia.org/ontology/Organization for organisations
        http://dbpedia.org/ontology/Place for locations
inst    The unique URI of the extracted entity (person, location, organisation) mapped to DBpedia (e.g. a URI starting with "http://dbpedia.org/")

REST API

Details of the REST API for the Twitter Analytics service are available on the Text Analytics page.

Example

In our example we will use a very simple request with the Tweet text embedded directly in the request document.

(Please refer to the Text Analytics page for details on the JSON input/output formats)
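
As a rough sketch of such a request document, the snippet below wraps plain Tweet text in JSON. The field names `document` and `documentType` are assumptions, as is the helper name `build_request_document`; the Text Analytics page remains the authoritative description of the input format. The Tweet text is a placeholder, not the text used in the original example.

```python
import json


def build_request_document(tweet_text):
    """Wrap plain Tweet text in a JSON request document."""
    # "document" and "documentType" are assumed field names; check them
    # against the Text Analytics page before relying on them.
    return json.dumps({"document": tweet_text, "documentType": "text/plain"})


# Placeholder text, not the Tweet from the original example.
sample_tweet = "Obama comments on Donald Trump http://nydn.us/23TGeo6"
request_document = build_request_document(sample_tweet)
print(request_document)
```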

RESTful Request

We are now ready to send a simple RESTful request to the S4 text analytics services using a command-line tool such as curl.

Let's go step by step through the sample request:

  1. we specify the API key and secret - all S4 requests need a valid API key and secret pair, which can be generated from the S4 Management Console
  2. we specify the S4 RESTful service to be used - in this case the "TwitIE" analytics service. Note that the API key and secret are also provided as part of the endpoint URL
  3. we have chosen to analyse the Tweet text directly (the original twitter-json obtained from the Twitter API can also be analysed by S4)
  4. we construct the JSON request document, consisting of the Tweet content plus "text/plain" as its content type
  5. we make a RESTful request to the S4 service via curl, providing the JSON request document (from step 4) and the S4 service endpoint (from step 2), and specifying in the HTTP header that the content type of the request itself is "application/json" (note that this is different from the content type of the Tweet, which is "text/plain")
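
The five steps above can also be sketched in Python instead of curl. Several details here are assumptions rather than facts from this page: the endpoint URL, the JSON field names, the placeholder credentials and the helper names. curl turns credentials embedded in a URL into an HTTP Basic authentication header; the sketch sets that header explicitly, on the assumption that the service accepts Basic authentication.

```python
import base64
import json
import urllib.request

# Step 1: API key and secret (placeholders; generate real ones in the S4 Management Console).
API_KEY = "your-api-key"
API_SECRET = "your-api-secret"

# Step 2: service endpoint (assumed URL; take the real one from the Text Analytics page).
ENDPOINT = "https://text.s4.ontotext.com/v1/twitie"


def build_twitie_request(tweet_text, api_key, api_secret, endpoint=ENDPOINT):
    """Build, but do not send, a POST request for the given Tweet text."""
    # Steps 3 and 4: plain Tweet text wrapped in a JSON request document
    # ("document" and "documentType" are assumed field names).
    body = json.dumps({"document": tweet_text, "documentType": "text/plain"})
    credentials = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8"))
    # Step 5: the request itself is JSON, even though the Tweet is plain text.
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + credentials.decode("ascii"),
        },
        method="POST",
    )


def send_request(request):
    """Send the request and return the decoded JSON result."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_twitie_request("Example Tweet text", API_KEY, API_SECRET)
# result = send_request(request)  # requires valid credentials and network access
```

Building the request separately from sending it makes it easy to inspect the headers and body before any credentials leave the machine.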

JSON Result

The result of the service invocation is another JSON document (the structure is described on the Text Analytics page) which contains annotations and their offsets for the various entities found in the text:

  • Person ("Obama")
  • Person ("Donald Trump")
  • URL ("http://nydn.us/23TGeo6")
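
A minimal sketch of reading such a result is shown below. The response layout (a "text" field plus an "entities" map keyed by annotation type, each entry carrying an "indices" pair of start and end offsets) is an assumption, and `SAMPLE_RESPONSE` is a hand-written stand-in rather than real service output; the Text Analytics page describes the actual structure.

```python
import json

# Hand-written stand-in for a service response; the layout is an assumption.
SAMPLE_RESPONSE = json.dumps({
    "text": "Obama comments on Donald Trump http://nydn.us/23TGeo6",
    "entities": {
        "Person": [
            {"indices": [0, 5], "type": "Person"},
            {"indices": [18, 30], "type": "Person"},
        ],
        "URL": [{"indices": [31, 53]}],
    },
})


def list_annotations(response_json):
    """Return (type, start, end, covered text) tuples sorted by start offset."""
    result = json.loads(response_json)
    text = result["text"]
    found = []
    for ann_type, annotations in result.get("entities", {}).items():
        for annotation in annotations:
            start, end = annotation["indices"]
            found.append((ann_type, start, end, text[start:end]))
    return sorted(found, key=lambda item: item[1])


for ann_type, start, end, covered in list_annotations(SAMPLE_RESPONSE):
    print(f"{ann_type} [{start}:{end}] {covered}")
```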