Let’s begin with a fact, over the past 20 years, nothing has evolved as significantly as search engines. Gone are the days of simply matching keywords to web pages. Today’s search engines are much more sophisticated and intelligent. One of the key factors behind this is the concept of entities.
Entities refer to the people, places, things, and concepts that make up our living world, and it would be an understatement to say that search engines have gotten better at understanding the relationships between them. This has led to more accurate and relevant search results for users like you and me.
Back in the early days of search engines, it was all about keyword matching. Websites would stuff their pages with keywords in the hopes of ranking higher in search results. But this approach was flawed and often led to irrelevant or low-quality content ranking higher than more useful pages.
My opinion is that it still does
As search engines became more advanced, they started to use a variety of signals to determine the quality and relevance of web pages. This included the concept of « PageRank, » which was introduced by Larry Page and Sergey Brin in the late 1990s. PageRank was based on the idea that a web page’s importance could be determined by the number of other pages linking to it.
However, even PageRank had its limitations. It didn’t take into account the actual content of the page and could be manipulated through several tactics. This led search engines to look for new ways to determine relevance and quality, which is where entities came into play.
By understanding the entities on a web page and their relationships to each other, search engines can better determine the relevance and usefulness of the content. This has brought to more personalized and accurate search results, and it’s a trend that’s likely to continue as search engines become even more advanced in the years ahead.
What is an entity ?
An entity is a specific object or concept, such as a person, place, thing, event or idea, that can be described and distinguished from other entities. In the context of search engines and natural language processing, an entity can be anything that is referred to by a name or noun in a given text or language. Here 5 examples :
- Tokyo
- Tuesday
- 100 euros
- Car
- The Mandalorian
This uniqueness is a key aspect of entities, as it enables search engines and other natural language processing systems to differentiate between similar entities and better understand the relationships between them – take a look at the entity-relationship model proposed by Chen in 1976.
For example, in a news article about a recent film release, the film’s title, director, cast, and plot could all be considered entities. Similarly, in a review of a restaurant, the name of the restaurant, the type of cuisine, and the location could all be entities.
Entities are important for search engines because they help to understand the context and relationships between different pieces of information, which in turn helps to deliver more accurate and relevant search results.
Defining what constitutes an entity can be a complex task, as it can depend on the context and the specific domain in question. Different fields and disciplines may have their own definitions of what constitutes an entity, which can further complicate matters.
For example: a pastry chef, a baker and a kitchen assistant may, or may not, have the same idea of what French cuisine is.
Krisztian Balog, in his book « Entity-Oriented Search, » discusses the challenges of defining entities in the context of search engines and information retrieval. He notes that entities can be difficult to define because they can take on different forms and have different properties, depending on the specific domain they belong to. For example, a person entity in a social media context may have different properties and relationships than a person entity in a scientific database.
Balog suggests that a useful way to approach the definition of entities is to focus on their characteristics and how they are used in the context of information retrieval. In particular, he suggests that entities can be defined as « objects that have identity, a type, and are used to represent a particular aspect of the world. » This definition takes into account the unique identity of entities, as well as their properties and relationships, and how they are used in the context of search and retrieval.
What differs from named entities to concepts ?
Named entities and concepts are related but distinct concepts in natural language processing and information retrieval.
Named entities refer to specific objects or entities that can be uniquely identified by a name or label, such as a person, organization, or location. Named entities are typically extracted from text using named entity recognition (NER) techniques, which can identify and classify them based on their syntactic and semantic properties.
Concepts, on the other hand, refer to abstract ideas or categories that can be represented by a set of words or phrases. Concepts may or may not be explicitly named in text, but can be inferred from the context and the relationships between different pieces of information. For example, in a text about artificial intelligence, concepts such as « machine learning » or « neural networks » may be inferred from the text, even if they are not explicitly named.
Here 5 examples :
- Distance
- Authority
- Peace
- Wind
- Quantity
While named entities and concepts are distinct concepts, they can be related in that named entities may be used to represent specific instances or examples of a given concept. For example, the named entities « Apple » and « Microsoft » may be used to represent the concept of « technology companies. » In this way, named entities and concepts can work together to help us understand and organize information in natural language text.
Properties of entities
Entities are objects or concepts that can be uniquely identified and described. In the context of data modeling and database design, entities are typically represented as tables with rows and columns, where each row represents a specific instance of the entity and each column represents a specific property or attribute of the entity. Some common properties of entities include:
- Unique Identifier: Entities must have a unique identifier that distinguishes them from other entities in the same domain or context. This identifier can be a primary key in a database table or a combination of attributes that uniquely identify the entity.
- Name : Entities are known and referred to by their name, usually a proper name. Unlike the unique identifier, names do not identify a single entity. Several entities may share the same name:
- Attributes: Entities can have one or more attributes that describe their characteristics or properties.The type entity “person” may have the attribute “birth date”.
- Relationships: Relationships describe how 2 entities can be related to each other. From a linguistic perspective, 2 entities can be proper nouns linked by verbs:
- Type: Entities can belong to a specific class or type that defines their set of properties and relationships. For example, in a university database, « student » and « faculty » could be different entity types with different sets of attributes and relationships.
Example : The entity Albert Einstein is an instance of the entity type « scientist », which itself is a subtype of « person ».
One last word
Entities are fundamental units of information that represent people, places, things, concepts, and more in natural language processing. Entities are crucial in extracting valuable insights from unstructured text data, as they enable us to identify and classify different types of information within a document or text corpus.
By utilizing entity recognition and entity extraction techniques, businesses can gain a deeper understanding of their customers, competitors, and industry trends. This information can help organizations make data-driven decisions, improve customer experience, and enhance operational efficiency.
However, identifying and extracting entities from unstructured text data can be a complex task, as it requires an understanding of the context and domain-specific knowledge. As such, it is important to choose the right entity recognition tool or software and to have a team of experienced data scientists or linguists to perform the analysis.