In today's data-rich world, valuable insights are often locked away in unstructured text, such as detailed clinical notes, lengthy legal documents, customer feedback threads and evolving news reports. Manually sifting through this information or building bespoke code to process the data is time-consuming and error-prone, and using modern large language models (LLMs) naively may introduce errors. What if you could programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to their source?
Today, we're excited to introduce LangExtract, a new open-source Python library designed to empower developers to do just that. LangExtract provides a lightweight interface to various LLMs such as our Gemini models for processing large volumes of unstructured text into structured information based on your custom instructions, ensuring both flexibility and traceability.
Whether you're working with medical reports, financial summaries, or any other text-heavy domain, LangExtract offers a flexible and powerful way to unlock the data within.
LangExtract offers a unique combination of capabilities that make it useful for information extraction:
Here's how to extract character details from a line of Shakespeare.
First, install the library:
pip install langextract
For more detailed setup instructions, including virtual environments and API key configuration, please see the project README.
Next, define your extraction task. Provide a clear prompt and a high-quality "few-shot" example to guide the model.
import textwrap
import langextract as lx

# 1. Define a concise prompt
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text=(
            "ROMEO. But soft! What light through yonder window breaks? It is"
            " the east, and Juliet is the sun."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]

# 3. Run the extraction on your input text
input_text = (
    "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
)
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)
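The prompt above asks the model to use exact text and to avoid overlapping entities. As a rough illustration of what that constraint buys you, the sketch below checks whether a list of extracted strings can be found verbatim, in order, in the source text. It uses only the standard library; `Span` and `locate_spans` are hypothetical helper names introduced here, and this is not how LangExtract itself aligns extractions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical helper type: where an extraction sits in the source text."""
    text: str
    start: Optional[int]  # None when the text is not found verbatim
    end: Optional[int]


def locate_spans(source: str, extraction_texts: List[str]) -> List[Span]:
    """Find each extraction verbatim in `source`, scanning left to right."""
    spans = []
    cursor = 0
    for text in extraction_texts:
        # Search only after the previous match, so matches stay in order
        # of appearance and cannot overlap.
        start = source.find(text, cursor)
        if start == -1:
            spans.append(Span(text, None, None))
            continue
        end = start + len(text)
        spans.append(Span(text, start, end))
        cursor = end
    return spans


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
spans = locate_spans(source, ["Lady Juliet", "longingly", "aching for her Romeo"])
for span in spans:
    print(span)
```

The third string is a paraphrase rather than an exact quote, so it comes back with no position. A check like this is one simple way to flag extractions that cannot be tied back to the source.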
The result object contains the extracted entities, which can be saved to a JSONL file. From there, you can generate an interactive HTML file to view the annotations. This visualization is great for demos or evaluating the extraction quality, saving valuable time. It works seamlessly in environments like Google Colab or can be saved as a standalone HTML file, viewable from your browser.
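# Save the results to a JSONL file in the current directory
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
# Generate the interactive visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In notebooks, lx.visualize may return a display object with a .data string
    if hasattr(html_content, "data"):
        f.write(html_content.data)
    else:
        f.write(html_content)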
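Because the results are saved as JSONL (one JSON object per line), they are also easy to inspect with the standard library. The sketch below tallies extractions by class. The field names `extractions` and `extraction_class` are assumptions that mirror the Python objects shown earlier, not a documented file schema, and the in-memory sample stands in for a real saved file.

```python
import io
import json
from collections import Counter


def count_extraction_classes(lines) -> Counter:
    """Tally extractions by class across JSONL records (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed schema: {"extractions": [{"extraction_class": ...}, ...]}
        for extraction in record.get("extractions") or []:
            counts[extraction.get("extraction_class", "unknown")] += 1
    return counts


# Stand-in for a saved file; the field names are assumptions.
sample = io.StringIO(
    json.dumps({
        "text": "Lady Juliet gazed longingly at the stars",
        "extractions": [
            {"extraction_class": "character", "extraction_text": "Lady Juliet"},
            {"extraction_class": "emotion", "extraction_text": "longingly"},
        ],
    }) + "\n"
)
class_counts = count_extraction_classes(sample)
print(class_counts)
```

With a real results file, you would pass an open file handle in place of the in-memory sample.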
The same principles apply to specialized domains like medicine, finance, engineering or law. The ideas behind LangExtract were first applied to medical information extraction and can be effective at processing clinical text. For example, LangExtract can identify medications, dosages, and other medication attributes, and then map the relationships between them. This capability was a core part of the research that led to this library, which you can read about in our paper on accelerating medical information extraction.
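One way to picture the relationship mapping is as a grouping step: each extracted attribute carries a key naming the medication it belongs to, and the attributes are then collected per medication. The sketch below does this with plain dictionaries. The `medication_group` attribute name, the `group_by_medication` helper, and the sample records are all illustrative assumptions, not output from LangExtract.

```python
from collections import defaultdict


def group_by_medication(extractions, group_key="medication_group"):
    """Collect (class, text) pairs under the medication each one refers to."""
    groups = defaultdict(list)
    for item in extractions:
        # Items without the grouping attribute land under "ungrouped".
        name = item.get("attributes", {}).get(group_key, "ungrouped")
        groups[name].append((item["extraction_class"], item["extraction_text"]))
    return dict(groups)


# Invented sample records, for illustration only.
sample_extractions = [
    {"extraction_class": "medication", "extraction_text": "Lisinopril",
     "attributes": {"medication_group": "Lisinopril"}},
    {"extraction_class": "dosage", "extraction_text": "10 mg",
     "attributes": {"medication_group": "Lisinopril"}},
    {"extraction_class": "medication", "extraction_text": "Metformin",
     "attributes": {"medication_group": "Metformin"}},
    {"extraction_class": "frequency", "extraction_text": "twice daily",
     "attributes": {"medication_group": "Metformin"}},
]
grouped = group_by_medication(sample_extractions)
for medication, details in grouped.items():
    print(medication, details)
```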
The animation below shows LangExtract processing clinical text to extract medication-related entities and group them by their source medication.
To showcase LangExtract's power in a specialized field, we developed an interactive demonstration for structured radiology reporting called RadExtract on Hugging Face. This demo shows how LangExtract can process a free-text radiology report and automatically convert its key findings into a structured format, also highlighting important findings. This approach is important in radiology, where structuring reports enhances clarity, ensures completeness, and improves data interoperability for research and clinical care.
Disclaimer: The medication extraction example and structured reporting demo above are intended only to illustrate LangExtract's baseline capability. They do not represent a finished or approved product, are not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.
We're excited to see the innovative ways developers will use LangExtract to unlock insights from text. Dive into the documentation, explore the examples on our GitHub repository, and start transforming your unstructured data today.