GLiNER2: Extracting Structured Data from Text

January 6, 2026 · Artificial Intelligence


For years, we had spaCy, the de facto NLP library for both newcomers and advanced users. It made it easy to dip your toes into NLP, even if you weren't a deep learning expert. However, with the rise of ChatGPT and other LLMs, it seems to have been pushed aside.

While LLMs like Claude or Gemini can do all kinds of NLP tasks automagically, you don't always want to bring a rocket launcher to a fist fight. GLiNER is spearheading the return of smaller, focused models for classic NLP techniques like entity and relation extraction. It's lightweight enough to run on a CPU, yet powerful enough to have built a thriving community around it.
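
To give a sense of that simplicity, here is a minimal sketch of first-generation GLiNER usage; the checkpoint name is one publicly available GLiNER model and is only an illustrative assumption, not something this post evaluates:

from gliner import GLiNER

# Load a first-generation GLiNER checkpoint; small enough to run on a CPU
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

sample = "Ada Lovelace worked with Charles Babbage on the analytical engine."
labels = ["person", "invention"]

# predict_entities returns text spans with predicted labels and confidence scores
for entity in model.predict_entities(sample, labels):
    print(entity["text"], "=>", entity["label"])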

Released earlier this year, GLiNER2 is a big leap forward. Where the original GLiNER focused on entity recognition (spawning various spin-offs like GLiREL for relations and GLiClass for classification), GLiNER2 unifies named entity recognition, text classification, relation extraction, and structured data extraction into a single framework.

The core shift in GLiNER2 is its schema-driven approach, which lets you define extraction requirements declaratively and execute multiple tasks in a single inference call. Despite these expanded capabilities, the model remains CPU-efficient, making it an ideal solution for transforming messy, unstructured text into clean data without the overhead of a large language model.

As a knowledge graph enthusiast at Neo4j, I've been particularly drawn to the newly added structured data extraction via the extract_json method. While entity and relation extraction are useful on their own, the ability to define a schema and pull structured JSON directly from text is what really excites me. It's a natural fit for knowledge graph ingestion, where structured, consistent output is essential.

Constructing knowledge graphs with GLiNER2. Image by author.

In this blog post, we'll evaluate GLiNER2's capabilities, specifically the model fastino/gliner2-large-v1, with a focus on how well it can help us build clean, structured knowledge graphs.
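
Before the experiments, the extractor needs to be loaded. A minimal setup sketch, assuming the gliner2 package exposes a from_pretrained loader (check the project README for the exact import path):

from gliner2 import GLiNER2

# Load the GLiNER2 checkpoint evaluated in this post; a CPU is sufficient
extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")

# `text` holds the test passage (the Ada Lovelace excerpt shown below)
text = "Augusta Ada King, Countess of Lovelace (10 December 1815 - 27 November 1852), ..."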

The code is available on GitHub.

Dataset selection

We're not running formal benchmarks here, just a quick vibe check to see what GLiNER2 can do. Here's our test text, pulled from the Ada Lovelace Wikipedia page:

Augusta Ada King, Countess of Lovelace (10 December 1815 – 27 November 1852), also known as Ada Lovelace, was an English mathematician and writer chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the analytical engine. She was the first to recognise that the machine had applications beyond pure calculation. Lovelace is often considered the first computer programmer. Lovelace was the only legitimate child of poet Lord Byron and reformer Anne Isabella Milbanke. All her half-siblings, Lord Byron's other children, were born out of wedlock to other women. Lord Byron separated from his wife a month after Ada was born and left England forever. He died in Greece during the Greek War of Independence, when she was eight. Lady Byron was anxious about her daughter's upbringing and promoted Lovelace's interest in mathematics and logic, to prevent her developing her father's perceived insanity. Despite this, Lovelace remained interested in her father, naming one son Byron and the other, for her father's middle name, Gordon. Lovelace was buried next to her father at her request. Although often ill in childhood, Lovelace pursued her studies assiduously. She married William King in 1835. King was a Baron, and was created Viscount Ockham and 1st Earl of Lovelace in 1838. The name Lovelace was chosen because Ada was descended from the extinct Barons Lovelace. The title given to her husband thus made Ada the Countess of Lovelace.

At 322 tokens, it's a solid chunk of text to work with. Let's dive in.

Entity extraction

Let's start with entity extraction. At its core, entity extraction is the process of automatically identifying and categorizing key entities within text, such as people, locations, organizations, or technical concepts. The original GLiNER already handled this well, but GLiNER2 takes it further by letting you add descriptions to entity types, giving you finer control over what gets extracted.

entities = extractor.extract_entities(
    text,
    {
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    }
)

The results are as follows:

Entity extraction results. Image by author.

Providing custom descriptions for each entity type helps resolve ambiguity and improves extraction accuracy. This is especially useful for broad categories like Event, where on its own the model might not know whether to include wars, ceremonies, or personal milestones. Adding "historical events, wars, or conflicts" clarifies the intended scope.

Relation extraction

Relation extraction identifies relationships between pairs of entities in text. For example, in the sentence "Steve Jobs founded Apple", a relation extraction model would identify the relationship Founded between the entities Steve Jobs and Apple.

With GLiNER2, you define only the relation types you want to extract; you can't constrain which entity types are allowed as the head or tail of each relation. This simplifies the interface but may require post-processing to filter unwanted pairings.
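
If your data model does need typed endpoints, a small post-processing pass can drop invalid pairs. A hypothetical sketch, assuming relation results come back as (head, tail) pairs grouped by relation type (the same shape the Cypher import later in this post consumes) and that an entity-to-type lookup is available from the entity extraction step:

# Allowed (head, tail) entity types per relation; non-matching pairs are dropped
ALLOWED_TYPES = {
    "parent_of": ("Person", "Person"),
    "married_to": ("Person", "Person"),
    "worked_on": ("Person", "Invention"),
    "invented": ("Person", "Invention"),
}

def filter_relations(relations: dict, entity_types: dict) -> dict:
    """Keep only pairs whose head and tail entity types match the allowed schema.

    relations:    {"parent_of": [("Lord Byron", "Ada Lovelace"), ...], ...}
    entity_types: {"Lord Byron": "Person", "analytical engine": "Invention", ...}
    """
    filtered = {}
    for rel_type, pairs in relations.items():
        head_type, tail_type = ALLOWED_TYPES.get(rel_type, (None, None))
        filtered[rel_type] = [
            (head, tail) for head, tail in pairs
            if entity_types.get(head) == head_type and entity_types.get(tail) == tail_type
        ]
    return filtered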

Here, I added a simple experiment by including both the alias and same_as relationship definitions.

relations = extractor.extract_relations(
    text,
    {
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
        "alias": "Entity is an alias, nickname, title, or alternate reference for another entity",
        "same_as": "Entity is an alias, nickname, title, or alternate reference for another entity"
    }
)

The results are as follows:

Relation extraction results. Image by author.

The extraction correctly identified key relationships: Lord Byron and Anne Isabella Milbanke as Ada's parents, her marriage to William King, Babbage as inventor of the analytical engine, and Ada's work on it. Notably, the model detected Augusta Ada King as an alias of Ada Lovelace, but same_as wasn't captured despite having an identical description. The choice doesn't seem random, as the model always populates alias and never same_as. This highlights how sensitive relation extraction is to label naming, not just descriptions.

Conveniently, GLiNER2 allows combining multiple extraction types in a single call, so you can get entity types alongside relation types in one pass. However, the operations are independent: entity extraction doesn't filter or constrain which entities appear in relation extraction, and vice versa. Think of it as running both extractions in parallel rather than as a pipeline.

schema = (extractor.create_schema()
    .entities({
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    })
    .relations({
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
        "alias": "Entity is an alias, nickname, title, or alternate reference for another entity"
    })
)

results = extractor.extract(text, schema)

The results are as follows:

Combined entity and relation extraction results. Image by author.

The combined extraction now gives us entity types, which are distinguished by color. However, several nodes appear isolated (Greece, England, Greek War of Independence), since not every extracted entity participates in a detected relationship.

Structured JSON extraction

Perhaps the most powerful feature is structured data extraction via extract_json. It mimics the structured output functionality of LLMs like ChatGPT or Gemini but runs entirely on a CPU. Unlike entity and relation extraction, it lets you define arbitrary fields and pull them into structured data. The syntax follows a field_name::type::description pattern, where type is str or list.

results = extractor.extract_json(
    text,
    {
        "particular person": [
            "name::str",
            "gender::str::male or female",
            "alias::str::brief summary of included information about the person",
            "description::str",
            "birth_date::str",
            "death_date::str",
            "parent_of::str",
            "married_to::str"
        ]
    }
)

Here we're experimenting with some overlap: alias, parent_of, and married_to could also be modeled as relations. It's worth exploring which approach works better for your use case. One interesting addition is the description field, which pushes the boundaries a bit: it's closer to summary generation than pure extraction.

The results are as follows:

{
  "particular person": [
    {
      "name": "Augusta Ada King",
      "gender": null,
      "alias": "Ada Lovelace",
      "description": "English mathematician and writer",
      "birth_date": "10 December 1815",
      "death_date": "27 November 1852",
      "parent_of": "Ada Lovelace",
      "married_to": "William King"
    },
    {
      "name": "Charles Babbage",
      "gender": null,
      "alias": null,
      "description": null,
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "Lord Byron",
      "gender": null,
      "alias": null,
      "description": "reformer",
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "Anne Isabella Milbanke",
      "gender": null,
      "alias": null,
      "description": "reformer",
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "William King",
      "gender": null,
      "alias": null,
      "description": null,
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    }
  ]
}

The results reveal some limitations. All gender fields are null: although Ada is explicitly called a daughter, the model doesn't infer that she's female. The description field captures only surface-level phrases ("English mathematician and writer", "reformer") rather than producing meaningful summaries, which isn't useful for workflows like Microsoft's GraphRAG that rely on richer entity descriptions. There are also clear errors: Charles Babbage and William King are incorrectly marked as parent_of Ada, and Lord Byron is labeled a reformer (that's Anne Isabella). These parent_of errors didn't come up during relation extraction, so perhaps that's the better method here. Overall, the results suggest the model excels at extraction but struggles with reasoning or inference, likely a tradeoff of its compact size.

Additionally, all attributes are optional, which makes sense and simplifies things. However, you should be careful, as sometimes the name attribute will be null, making the record invalid. Finally, we could use something like Pydantic to validate the results, cast values to appropriate types like floats or dates, and handle invalid records.
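
A minimal sketch of that validation step, where the fields mirror the extract_json schema above (treating all extracted values as plain strings is an assumption):

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str                      # required; records with a null name fail validation
    gender: str | None = None
    alias: str | None = None
    description: str | None = None
    birth_date: str | None = None  # a field validator could parse this into a date
    death_date: str | None = None
    parent_of: str | None = None
    married_to: str | None = None

valid_people = []
for record in results["person"]:
    try:
        valid_people.append(Person(**record))
    except ValidationError:
        pass  # e.g. name is null; skip or log the invalid record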

Constructing knowledge graphs

Since GLiNER2 allows multiple extraction types in a single pass, we can combine all of the above methods to construct a knowledge graph. Rather than running separate pipelines for entity, relation, and structured data extraction, a single schema definition handles all three. This makes it straightforward to go from raw text to a rich, interconnected representation.

schema = (extractor.create_schema()
    .entities({
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    })
    .relations({
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
    })
    .structure("person")
        .field("name", dtype="str")
        .field("alias", dtype="str")
        .field("description", dtype="str")
        .field("birth_date", dtype="str")
)

results = extractor.extract(text, schema)

How you map these outputs to your graph (nodes, relationships, properties) depends on your data model. In this example, we use the following data model:

Knowledge graph construction result. Image by author.

Notice that we include the original text chunk in the graph as well, which allows us to retrieve and reference the source material when querying the graph, enabling more accurate and traceable results. The import Cypher looks like the following:

import_cypher_query = """
// Create Chunk node from text
CREATE (c:Chunk {text: $text})

// Create Person nodes with properties
WITH c
CALL (c) {
  UNWIND $data.person AS p
  WITH p
  WHERE p.name IS NOT NULL
  MERGE (n:__Entity__ {name: p.name})
  SET n.description = p.description,
      n.birth_date = p.birth_date
  MERGE (c)-[:MENTIONS]->(n)
  WITH p, n WHERE p.alias IS NOT NULL
  MERGE (m:__Entity__ {name: p.alias})
  MERGE (n)-[:ALIAS_OF]->(m)
}

// Create entity nodes dynamically with __Entity__ base label + dynamic label
CALL (c) {
  UNWIND keys($data.entities) AS label
  UNWIND $data.entities[label] AS entityName
  MERGE (n:__Entity__ {name: entityName})
  SET n:$(label)
  MERGE (c)-[:MENTIONS]->(n)
}

// Create relationships dynamically
CALL (c) {
  UNWIND keys($data.relation_extraction) AS relType
  UNWIND $data.relation_extraction[relType] AS rel
  MATCH (a:__Entity__ {name: rel[0]})
  MATCH (b:__Entity__ {name: rel[1]})
  MERGE (a)-[:$(toUpper(relType))]->(b)
}
RETURN DISTINCT 'import completed' AS result
"""

The Cypher query takes the GLiNER2 output and stores it in Neo4j. We could also include embeddings for the text chunks, entities, and so on.
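
Executing the import with the official Neo4j Python driver could look like the sketch below; the connection details are placeholders, and it assumes the results dict carries the person, entities, and relation_extraction keys the query expects:

from neo4j import GraphDatabase

# Placeholder connection details; point these at your own Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# `results` is the combined extractor.extract(text, schema) output
driver.execute_query(import_cypher_query, {"text": text, "data": results})
driver.close()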

Summary

GLiNER2 is a step in the right direction for structured data extraction. With the rise of LLMs, it's easy to reach for ChatGPT or Claude whenever you need to pull information from text, but that's often overkill. Running a multi-billion-parameter model to extract a few entities and relationships feels wasteful when smaller, specialized tools can do the job on a CPU.

GLiNER2 unifies named entity recognition, relation extraction, and structured JSON output into a single framework. It's well-suited for tasks like knowledge graph construction, where you need consistent, schema-driven extraction rather than open-ended generation.

The model has its limitations: it works best for direct extraction rather than inference or reasoning, and results can be inconsistent. But the progress from the original GLiNER to GLiNER2 is encouraging, and hopefully we'll see continued development in this space. For many use cases, a focused extraction model beats an LLM that's doing far more than you need.

The code is available on GitHub.
