We encounter text in many forms—emails, messages, documents, reports, and social media posts, to name a few. These contain a wealth of information as they capture the essence of human and business interactions. But most of this text is unstructured data, with only loosely defined formats, and extracting information from unstructured text isn’t straightforward.
Organizations large and small, however, ignore unstructured data sources at their peril, as those who analyze unstructured data stand to gain significant benefits and actionable insights.
In this article, we explore how Natural Language Processing (NLP) is used to analyze and manage unstructured data in the form of text. In particular, we look at 12 NLP techniques and workflows that play a key role in structuring unstructured data.
We also highlight how text analysis is more accessible and cost-effective with the help of no-code NLP solutions—these break down the barriers of unstructured data analysis for a wide range of users and use cases.
What is Unstructured Data?
Unstructured data is data that’s in a form that hasn’t been sorted and structured according to a pre-defined data model.
While text data has some inherent structure, due to the rules of grammar and language usage, text that lacks a well-defined format is unstructured text. Web pages, tweets, books, reports, and emails are examples of unstructured text.
Scale and Scope of Unstructured Data
There’s a vast and growing volume of structured and unstructured data that surrounds us, and most of it is unstructured.
A recent article by CIO magazine estimates that unstructured data accounts for 80–90 percent of all digital data available today. This includes both qualitative data (e.g., text) and quantitative data.
Classification of Unstructured Data
Data can be classified as structured, semi-structured, or unstructured, as follows:
- Structured data is highly organized into pre-defined data models and easily processed by computers
- Semi-structured data has loosely organized elements of structure without a fixed schema or data model
- Unstructured data is unorganized and difficult to process
The focus of this article is on unstructured text data.
How to Structure Unstructured Data Using NLP
Unstructured data offers an opportunity for businesses and researchers to harvest valuable information and insights.
There have been significant advances in unstructured data analysis of text using algorithms, collectively called natural language processing, and we’ll discuss several of these in this article.
What is Natural Language Processing (NLP)?
Natural language processing (NLP) is a set of techniques used for analyzing unstructured data in the form of human (natural) language.
NLP may be categorized into natural language understanding (NLU) and natural language generation (NLG). The focus of this article is NLU, that is, the techniques used for understanding text data through machine comprehension.
How to Convert Unstructured Data to Structured Data: The NLP Workflow
NLP transforms unstructured data, in the form of text, into structured data and outputs that are either useful in their own right for understanding the text or serve as inputs for further analysis.
To understand the sentiment of a document, for instance, you can use NLP to extract sentiment categories that may be helpful in their own right or could be used for further analysis.
NLP processes unstructured text in three stages, often referred to as the NLP workflow:
- Text pre-processing—prepares input text data for analysis
- Text representation—converts text into data forms (e.g., numeric) that can be easily analyzed
- Analysis and modeling—applies a range of data analytics techniques to extract information and meaning
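As a minimal illustration of the three stages, here is a toy pipeline in plain Python (the document and the word-count "analysis" are stand-ins for real pre-processing, representation, and modeling steps):

```python
import re
from collections import Counter

def nlp_workflow(doc):
    """Minimal sketch of the three NLP workflow stages on one document."""
    # Stage 1 -- text pre-processing: lowercase and split into word tokens
    tokens = re.findall(r"[a-z']+", doc.lower())
    # Stage 2 -- text representation: turn tokens into numeric word counts
    counts = Counter(tokens)
    # Stage 3 -- analysis: here, simply report the most frequent words
    return counts

counts = nlp_workflow("The dog jumped over the fence. The dog barked.")
print(counts.most_common(2))  # → [('the', 3), ('dog', 2)]
```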
Challenges of NLP
NLP is an evolving field with powerful capabilities, but it’s not always easy to implement due to:
- Difficulties sourcing data—suitably labeled open-source data is hard to find
- Computational demands—data analysis using NLP is computationally intensive, expensive, and time-consuming to deploy
- Relevant skills and expertise—suitably skilled individuals who can analyze unstructured data using NLP are in short supply
Despite these challenges, many organizations persevere with NLP to stay relevant and competitive.
Fortunately, there are no-code NLP solutions that you can use to analyze unstructured data in a more accessible and cost-effective way.
Accern, for instance, offers a range of text analytics models with no-code solutions using curated data for financial services applications including asset management, ESG, crypto, and commercial banking.
By using automation and economies of scale, Accern offers effective yet customizable solutions for individual applications.
Applications of NLP
NLP is enjoying a rapid uptake as its use cases continue to evolve. Common applications include:
- Machine translation—automatic translation of text, HTML, social media feeds, financial and legal documents, etc., from one language into another
- Chatbots—simulated human conversations to assist companies with customer queries, sales enquiries, and service requests
- Email classification—segmentation of emails to filter out spam and send emails to the correct folders
- Sentiment analysis—understanding customer perceptions about products or services by analyzing social media and other repositories of customer interactions
12 Techniques for Extracting Information from Unstructured Data
Let’s now look at several techniques that underpin NLP unstructured data analysis.
The first three techniques fall into the first stage of the NLP workflow, i.e., text pre-processing.
1. Tokenization
Tokenization separates text into smaller units called tokens. These can be words, parts of words (e.g., syllables), or terms that contain more than one word.
As an example, consider the sentence: The rock band, “Genesis”, was popular in 1986.
This could be tokenized as: [The] [rock band] [,] [“] [Genesis] [”] [,] [was] [popular] [in] [1986] [.]
Notice how the sentence is broken up into separate units based on words, punctuation, numbers, and commonly used terms.
The term ‘rock band’, for instance, is treated as an individual token rather than separate tokens for ‘rock’ and ‘band’. This makes sense in the context of the sentence. The punctuation is separated out as well.
The exact tokenization approach will vary with the NLP model that you use. Some models may not recognize ‘rock band’ as a single token, for example, or may treat the word ‘popular’ as three tokens, i.e., [pop], [u], and [lar].
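As a rough sketch of the idea, a regex-based tokenizer with a hypothetical list of multi-word terms can reproduce the example above (real tokenizers use learned vocabularies rather than hand-written lists):

```python
import re

# Toy tokenizer: multi-word terms from a small (hypothetical) term list
# become single tokens; everything else splits on words, numbers,
# and punctuation.
MULTI_WORD_TERMS = ["rock band"]

def tokenize(text):
    # Temporarily join multi-word terms so the regex keeps them together
    for term in MULTI_WORD_TERMS:
        text = text.replace(term, term.replace(" ", "_"))
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t.replace("_", " ") for t in tokens]

print(tokenize('The rock band, "Genesis", was popular in 1986.'))
```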
2. Normalization
Normalization identifies the base form of a word by using methods such as stemming and lemmatization.
Consider the word: studies.
- Stemming, as its name suggests, identifies the stem of a word by removing the suffix. So, the stem of studies is simply studi. The stem in this case isn’t a proper word but it’s closer to the base form of the word (i.e., study).
- Lemmatization goes a step further—it removes both suffixes and prefixes (when required) and uses vocabulary to identify words in their proper form. So, the lemmatization of studies would be study, i.e., a word used in natural language.
Not surprisingly, lemmatization is often preferred to stemming as it results in more meaningful words.
3. Part-of-Speech (POS) Tagging
Part-of-speech (POS) tagging uses a word’s morphology (its internal structure) and its context in the sentence to tag words based on their functions. It’s also referred to as grammatical tagging.
In its simplest form, POS tagging applies tags to words in a sentence based on whether they are nouns, verbs, adjectives, or adverbs, for instance.
More complex POS tagging would identify more ambiguous word usage.
For example, consider the sentence: The sailor dogs the hatch.
Is the word dogs a (plural) noun or is it a verb in this case? Correct POS tagging would tag dogs as a verb in this context rather than a noun, provided the tagger can draw on sufficiently sophisticated grammatical rules.
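A toy tagger with a hand-made lexicon and a single context rule can illustrate this kind of disambiguation (real taggers learn such rules from large annotated corpora rather than hard-coding them):

```python
# Toy POS tagger: a lexicon lists each word's possible tags, and one
# context rule disambiguates "dogs" (verb after a noun, else noun).
LEXICON = {
    "the": ["DET"], "sailor": ["NOUN"],
    "dogs": ["NOUN", "VERB"], "hatch": ["NOUN"],
}

def pos_tag(words):
    tags = []
    for word in words:
        options = LEXICON.get(word, ["UNK"])
        if len(options) == 1:
            tags.append(options[0])
        else:
            # Ambiguous word: directly after a noun, read it as a verb
            tags.append("VERB" if tags and tags[-1] == "NOUN" else "NOUN")
    return list(zip(words, tags))

print(pos_tag(["the", "sailor", "dogs", "the", "hatch"]))
```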
The next six techniques focus on representing text data for subsequent analysis, i.e., text representation.
4. Bag of Words (BoW)
Bag of words (BoW) represents text by keeping a tally of the number of times that words appear in a document. It compares the words with a known list of reference words, i.e., a vocabulary, to form vectors, or bags, of word counts.
There’s no information about the structure of the document or the context of the words in each bag, only the word count.
To illustrate BoW, consider the sentence: The dog jumped over the fence.
The words in this sentence are: the, dog, jumped, over, and fence. The BoW vector for these words is [2, 1, 1, 1, 1, 0, …, 0].
Here, each word occurs once in the sentence, other than the word the, which occurs twice. The zeros are for words that are contained in the vocabulary used to create the word counts but aren’t contained in the sentence—the larger the vocabulary relative to the sentence, the more zeros there will be.
BoW is simple and easy to understand but has two drawbacks:
- When there are too many zeros in a BoW vector, it becomes a sparse vector that’s computationally difficult to work with and contains little information
- The lack of word context may make subsequent analysis less meaningful
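A minimal BoW sketch over a toy vocabulary (the vocabulary and its ordering here are assumptions for illustration) reproduces the vector above:

```python
# BoW sketch: count each vocabulary word's occurrences in the text.
# The last two vocabulary entries don't occur, producing trailing zeros.
VOCAB = ["the", "dog", "jumped", "over", "fence", "cat", "sat"]

def bag_of_words(text):
    words = text.lower().replace(".", "").split()
    return [words.count(term) for term in VOCAB]

print(bag_of_words("The dog jumped over the fence."))  # → [2, 1, 1, 1, 1, 0, 0]
```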
5. Bag of n-grams
A better approach than BoW is to count groups of words, or n-grams, rather than individual words. This approach is called bag of n-grams.
In our previous example the words over the fence, for instance, form a 3-gram and have better context than the individual words on their own.
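Extracting n-grams is a simple sliding-window operation over the token list, sketched here in plain Python:

```python
def ngrams(words, n):
    """Slide a window of size n across the token list."""
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]

tokens = ["the", "dog", "jumped", "over", "the", "fence"]
print(ngrams(tokens, 3))  # includes ('over', 'the', 'fence')
```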
6. TF-IDF
Word-count approaches, such as BoW or bag of n-grams, tend to result in certain words dominating the word counts. Words like the or and, for instance, appear more frequently than other words but don’t convey much information.
TF-IDF, or term frequency–inverse document frequency, compensates for this by discounting the counts of words that appear more frequently in a text collection. It counterbalances a word’s term frequency with its inverse document frequency, where:
- The term frequency is the number of times a word appears in a document
- The inverse document frequency is a measure of how rare the word is across all documents in a text collection
TF-IDF often produces better results than BoW or bag of n-grams by emphasizing words that have more meaning in sentences.
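The calculation can be sketched over a toy three-document corpus (this version uses raw term counts; real implementations often normalize or smooth both factors):

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document
    tf = doc.count(term)
    # Inverse document frequency: log of (total docs / docs containing term)
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "barked"],
]
# "the" appears twice but in most documents, so it is discounted
print(tf_idf("the", corpus[0], corpus))    # → 0.8109...
print(tf_idf("fence", corpus[0], corpus))  # → 1.0986...
```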
7. Word Embedding
Word embedding is a more sophisticated approach than BoW or bag of n-grams for mapping vectors to words.
It produces vectors that don’t have unnecessary zeros in them, hence they’re called dense vectors in comparison to the sparse vectors of simpler approaches.
Word embedding considers each word and the words around it when forming vectors. It captures context better in this way, allowing more meaningful analysis and comparisons between words.
Modern word embedding approaches use machine learning or deep learning to map words to vectors. Popular approaches include Word2Vec, GloVe, FastText, and BERT.
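To show why dense vectors enable meaningful comparisons, here is a cosine-similarity check over hand-made toy embeddings (real models learn vectors with hundreds of dimensions from large corpora; these four-dimensional values are pure illustration):

```python
import math

# Toy 4-dimensional embeddings: similar words get similar vectors.
EMBEDDINGS = {
    "dog":  [0.9, 0.1, 0.8, 0.2],
    "cat":  [0.8, 0.2, 0.9, 0.1],
    "bank": [0.1, 0.9, 0.0, 0.7],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["cat"]))   # high
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["bank"]))  # low
```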
8. Named Entity Recognition (NER)
Named entity recognition (NER) identifies parts of text that are associated with known categories such as organizations, locations, quantities, monetary values, time expressions, and names of people.
Consider the following sentence: Jeff Bezos sold USD 9 million worth of Amazon stock and donated it to the Smithsonian’s Air and Space Museum.
NER would annotate this sentence as follows:
Jeff Bezos [PERSON] sold USD [CURRENCY] 9 million [NUM] worth of Amazon [ORG] stock and donated it to the Smithsonian’s Air and Space Museum [ORG].
NER is very helpful in understanding text. If you know the persons, locations, amounts, or organizations involved in a section of text, for instance, this goes a long way in helping you make sense of what the text means.
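The shape of NER output can be sketched with hypothetical regex patterns and gazetteer entries (production NER models are trained on labeled data rather than hand-written rules like these):

```python
import re

# Toy rule-based NER: each pattern maps matched spans to an entity label.
PATTERNS = [
    ("MONEY", r"USD \d+(?: million| billion)?"),
    ("PERSON", r"Jeff Bezos"),       # hypothetical gazetteer entry
    ("ORG", r"Amazon|Smithsonian"),  # hypothetical gazetteer entry
]

def ner(text):
    entities = []
    for label, pattern in PATTERNS:
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label))
    return entities

print(ner("Jeff Bezos sold USD 9 million worth of Amazon stock."))
```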
9. Semantic Representation
Semantic representation is the process of identifying the meaning of words used in sentences.
Some words have more than one meaning. The word pass, for instance, can mean:
- To hand over something to someone
- A decision to not participate in something
- A score on a test or exam
The actual meaning of the word depends on the way it’s used in text, i.e., the words that appear before and after it.
Semantic representation assigns meaning to words based on their usage in the text document being analyzed.
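A simplified Lesk-style sketch illustrates the idea: choose the sense of pass whose gloss shares the most words with the surrounding sentence (the glosses here are toy, hand-written word sets, not a real sense inventory):

```python
# Simplified Lesk-style disambiguation for the word "pass".
SENSES = {
    "hand over": {"hand", "over", "give", "someone", "something"},
    "decline": {"decision", "not", "participate", "skip", "turn", "down"},
    "exam result": {"score", "test", "exam", "grade", "mark"},
}

def disambiguate(context):
    # Pick the sense whose gloss overlaps most with the context words
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("she was thrilled with her pass on the final exam"))
# → "exam result"
```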
The final three techniques highlight popular use cases of NLP data extraction through analysis and modeling.
10. Text Summarization
Text summarization in NLP is a process of creating summaries of text documents to capture the most relevant information from them.
It aims to produce concise, accurate, and informative summaries that retain the meaning of the original documents.
Text summarization uses two broad approaches:
- Extraction—summaries are formed using a subset of the words contained in a document
- Abstraction—summaries are formed by, first, understanding the semantic meaning of the text being analyzed, then generating text to form a summary that may contain words not found in the original text
Most text summarization uses the extraction approach following a three-step process:
Step 1—Captures the key aspects of the original text and stores the results as an intermediate representation
Step 2—Scores sentences (in the original text) based on the results of Step 1
Step 3—Forms a summary using the sentences that were scored in Step 2
The intermediate representation in Step 1 uses one or more of the methods outlined earlier in this article, such as TF-IDF or word embeddings.
Scoring is based on how well sentences capture the meaning of the original documents. It may be determined using a variety of approaches, such as applying machine learning or traditional data mining tools to data collected from multiple indicators.
Sentences are selected for summarization using approaches such as ranking sentences by their summary lengths, selecting sentences according to relevance measures and iterative techniques, or applying optimization procedures.
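The three steps can be sketched with a simple frequency-based scorer (one of many possible scoring approaches; the example document is a toy):

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Frequency-based extractive summarization following the three steps."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Step 1: intermediate representation -- word frequencies over the text
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(words)
    # Step 2: score each sentence by the frequencies of its words
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))
    # Step 3: form the summary from the top-scoring sentences
    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

doc = ("NLP structures text. NLP techniques analyze unstructured text data. "
       "Summaries help readers.")
print(summarize(doc))  # → "NLP techniques analyze unstructured text data."
```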
Text summarization is a useful technique in a range of applications including:
- Media monitoring—automatically summarize key media pieces to assist in monitoring the news flow
- Newsletters—prepare concise summaries of key points from newsletters
- Corporate knowledge base—review summaries of existing documents before adding them to the corporate knowledge base
- Financial research—cover a large body of financial news flow through convenient summaries that make the research process more timely and efficient
- Legal contract analysis—extract summaries of the key clauses from legal documents
- Help desk and customer support—provide summaries of help documents to assist staff in quickly reviewing and finding information for customer support
11. Sentiment Analysis
Sentiment Analysis is one of the most widely used applications of unstructured text analysis. It’s used by businesses to understand what customers are saying about their products and services and by researchers to gauge people’s attitudes toward the economy, their financial circumstances, or other areas of their lives.
In its simplest form, sentiment analysis classifies text into three categories: positive, negative, or neutral sentiment. More sophisticated forms of sentiment analysis include:
- Fine-grained analysis—assigns more granular sentiment classifications, i.e., more than three categories, such as a 5-point scale
- Aspect-based analysis—assesses sentiment about specific aspects of a product or service, such as the ease of operating a product or whether specific features are well- or poorly-received
- Emotion detection—identifies emotions that are expressed about products or services, such as joy, sadness, fear, or worry, e.g., “This coffee machine is awful!”
- Intent analysis—recognizes whether consumers have purchase-intent or are merely browsing a product or service website
Sentiment analysis can be applied to social media feeds, website browsing histories, customer conversations, and other repositories of consumer feedback.
Beyond its value for businesses, sentiment analysis is also useful for financial researchers when assessing the prospects of companies based on the types of sentiment that they are receiving.
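In its simplest lexicon-based form, sentiment classification can be sketched as follows (the tiny word list here is a stand-in for real sentiment lexicons, which contain thousands of scored words):

```python
# Lexicon-based sentiment sketch: sum word polarities from a toy lexicon.
LEXICON = {"great": 1, "love": 1, "good": 1, "awful": -1, "bad": -1, "hate": -1}

def sentiment(text):
    # Strip trailing punctuation and look up each word's polarity
    score = sum(LEXICON.get(w.strip("!.,").lower(), 0) for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("This coffee machine is awful!"))  # → "negative"
print(sentiment("I love this great product"))      # → "positive"
```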
12. Topic Modeling
Topic modeling identifies hidden relationships in data. It works in an exploratory manner, looking for the themes or topics that lie within a set of text documents.
A key benefit of topic modeling is that it’s an efficient form of unsupervised learning. This means that you can apply it to unlabeled (or unannotated) data—a great benefit, as most of the unstructured text available today doesn’t have annotations or labels, and labeling is time-consuming and expensive to do.
There are several algorithms for topic modeling, including Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Correlated Topic Modeling (CTM). One of the most popular methods, however, is an approach called Latent Dirichlet Allocation (LDA).
LDA works by inferring the relationships between words in text documents. It uncovers the hidden, or latent, topics in documents using Dirichlet distributions.
LDA works by using an iterative process as follows:
- Initialize—Randomly assign topics to each word in a set of documents and tally frequency counts
- Update—Update the topic assignments based on the frequency counts, but subject to variability generated by Dirichlet distributions
- Repeat—Again, update the topic assignments for all words in all documents
- Iterate—Repeat the whole process, resulting in better and better topic assignments
With each iteration of LDA, words gravitate towards each other to form more likely word and topic mixes—the more iterations, the better the results.
When running an LDA algorithm, you need to choose the number of iterations, balancing the quality of results against the computational demands of running more iterations.
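The iterative process above can be sketched as a toy collapsed Gibbs sampler, a common way to fit LDA (this minimal version is for illustration on tiny corpora only; libraries such as gensim or scikit-learn should be used in practice):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Initialize: random topic for every word, plus the frequency tallies
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    # Update and iterate: repeatedly resample each word's topic, with
    # probabilities shaped by the current counts and the Dirichlet priors
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word

docs = [["stocks", "market", "trading"], ["goals", "match", "football"],
        ["market", "stocks", "prices"], ["football", "goals", "league"]]
topics = lda_gibbs(docs)
for t, counts in enumerate(topics):
    print(t, sorted(counts, key=counts.get, reverse=True)[:3])
```

With each pass, co-occurring words (stocks/market, football/goals) drift toward the same topic, which is the gravitating behavior described above.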
LDA is used widely in:
- eDiscovery—efficient identification of the key themes in legal document searches
- Content recommendations—personalized content recommendations for users of media portals such as the New York Times
- Search engine optimization (SEO)—efficient search algorithms based on topic maps and clusters
- Word sense disambiguation (WSD)—assisting language models in understanding the meaning of words through topic identification
Extracting information from unstructured text represents a significant opportunity for organizations. They can gain valuable insights into business and research outcomes through the range of techniques outlined in this article.
But it’s not always easy to do—the skills, expertise, resources, and time required for the analysis are beyond the reach of many.
Fortunately, there are no-code NLP solutions that lower the barriers to entry and make cost-effective yet powerful analysis possible.
Accern, for instance, offers no-code solutions to help a wide range of organizations leverage the capabilities of NLP and make the most of the rich data that surrounds them.
Schedule a free demo to learn more about how Accern can power ROI for hedge funds, commercial banks, ESG investing, and insurance.