Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning.
This 60-second video describes more about what Unstructured does and its benefits:
This 40-second video demonstrates a simple use case that Unstructured helps solve:
This one-minute video shows why using Unstructured is preferable to building your own similar solution:
Unstructured offers the Unstructured user interface (UI) and the Unstructured API. Read on to learn more.
Unstructured UI: a no-code, production-ready user interface.
Here is a screenshot of the Unstructured UI Start page:
This 90-second video provides a brief overview of the Unstructured UI:
Unstructured API: production-ready; use scripts or code.
The Unstructured API consists of two parts:
Here is a screenshot of some Python code that calls the Unstructured Workflow Endpoint:
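As a rough, text-only stand-in for that screenshot, the minimal sketch below lists your workflows by calling the Workflow Endpoint over HTTP with the requests library. The base URL, the API-key header name, and the response shape are assumptions here, so confirm them against the Unstructured API reference before relying on this.

```python
# Minimal sketch (not the screenshot's exact code): list workflows by calling
# the Unstructured Workflow Endpoint with the requests library.
# The base URL and API-key header name are assumptions; confirm both against
# the Unstructured API reference.
import os

import requests

BASE_URL = "https://platform.unstructuredapp.io/api/v1"  # assumed base URL
API_KEY = os.environ["UNSTRUCTURED_API_KEY"]             # your Unstructured API key

response = requests.get(
    f"{BASE_URL}/workflows",
    headers={"unstructured-api-key": API_KEY},  # assumed header name
    timeout=30,
)
response.raise_for_status()

# The response is assumed to be a JSON array of workflow objects.
for workflow in response.json():
    print(workflow.get("id"), workflow.get("name"))
```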
The Unstructured user interface (UI) and Unstructured API support processing of the following file types:
By file extension:
File extension |
---|
.abw |
.bmp |
.csv |
.cwk |
.dbf |
.dif * |
.doc |
.docm |
.docx |
.dot |
.dotm |
.eml |
.epub |
.et |
.eth |
.fods |
.heic |
.htm |
.html |
.hwp |
.jpeg |
.jpg |
.md |
.mcw |
.msg |
.mw |
.odt |
.org |
.p7s |
.pbd |
.pdf |
.png |
.pot |
.ppt |
.pptm |
.pptx |
.prn |
.rst |
.rtf |
.sdp |
.sxg |
.tiff |
.txt |
.tsv |
.xls |
.xlsx |
.xml |
.zabw |
By file type:
Category | File types |
---|---|
Apple | .cwk , .mcw |
CSV | .csv |
Data Interchange | .dif * |
dBase | .dbf |
Email | .eml , .msg , .p7s |
EPUB | .epub |
HTML | .htm , .html |
Image | .bmp , .heic , .jpeg , .jpg , .png , .prn , .tiff |
Markdown | .md |
OpenOffice | .odt |
Org Mode | .org |
Other | .eth , .pbd , .sdp |
PDF | .pdf |
Plain text | .txt |
PowerPoint | .pot , .ppt , .pptm , .pptx |
reStructured Text | .rst |
Rich Text | .rtf |
Spreadsheet | .et , .fods , .mw , .xls , .xlsx |
StarOffice | .sxg |
TSV | .tsv |
Word processing | .abw , .doc , .docm , .docx , .dot , .dotm , .hwp , .zabw |
XML | .xml |
* For `.dif` files, `\n` characters are supported, but `\r\n` characters will raise the error `UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type`.
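If a `.dif` file was saved with Windows-style line endings, you can normalize it before sending it to Unstructured. The following is a minimal sketch; the filename is hypothetical.

```python
# Minimal sketch: rewrite a .dif file with \n line endings so that \r\n
# characters do not trigger UnsupportedFileFormatError during partitioning.
from pathlib import Path

path = Path("example.dif")  # hypothetical filename; use your own file
data = path.read_bytes()
path.write_bytes(data.replace(b"\r\n", b"\n"))
```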
To use the Unstructured UI or call the Unstructured API, you must have an Unstructured account.
Unstructured offers three account pricing plans:
For more details, see the Unstructured Pricing page.
Some of these plans are billed on a per-page basis.
Unstructured calculates a page as follows:
- For files of type `.pdf`, `.pptx`, and `.tiff`, each page in the file is counted as a page.
- For `.docx` files that have page metadata, Unstructured calculates the number of pages based on that metadata.

Quickstart: Unstructured UI

This quickstart uses a no-code, point-and-click user interface (UI) in your web browser to have Unstructured process a single file that is stored on your local machine.
The file is first processed on Unstructured-hosted compute resources. The UI then shows the processed data that Unstructured generates for that file.
You can download that processed data as a `.json` file to your local machine.
This approach enables rapid, local, run-adjust-repeat prototyping of end-to-end Unstructured ETL+ workflows with a full range of Unstructured features. After you get the results you want, you can then attach remote source and destination connectors to both ends of your existing workflow to begin processing remote files and data at scale in production.
To run this quickstart, you will need a local file with a size of 10 MB or less and one of the following file types:
File type |
---|
.bmp |
.csv |
.doc |
.docx |
.email |
.epub |
.heic |
.html |
.jpg |
.md |
.odt |
.org |
.pdf |
.pot |
.potm |
.ppt |
.pptm |
.pptx |
.rst |
.rtf |
.sgl |
.tiff |
.txt |
.tsv |
.xls |
.xlsx |
.xml |
For processing remote files at scale in production, Unstructured supports many more file types than these. See the list of supported file types.
Unstructured also supports processing files from remote object stores, and data from remote sources in websites, web apps, databases, and vector stores. For more information, see the source connector overview and the remote quickstart for how to set up and run production-ready Unstructured ETL+ workflows at scale.
If you do not have any files available, you can use one of the sample files that Unstructured offers in the UI. Or, you can download one or more sample files from the example-docs folder in the Unstructured repo on GitHub.
Sign up and sign in
Create a workflow
In the Unstructured UI, on the sidebar, click Workflows.
Click New Workflow.
Select Build it Myself, if it is not already selected.
Click Continue. The visual workflow editor appears.
The workflow is represented visually as a series of directed acyclic graph (DAG) nodes. Each node represents a step in the workflow. The workflow proceeds end to end from left to right. By default, the workflow starts with three nodes: Source, Partitioner, and Destination.
Process a local file
Drag the file that you want Unstructured to process from your local machine’s file browser app and drop it into the Source node’s Drop file to test area. The file must have a size of 10 MB or less and one of the file types listed at the beginning of this quickstart.
If you are not able to drag and drop the file, you can click Drop file to test and then browse to and select the file instead.
Alternatively, you can use a sample file that Unstructured offers. To do this, click the Source node, and then in the Source pane, with Details selected, on the Local file tab, click one of the files under Or use a provided sample file. To view the file’s contents before you select it, click the eyes button next to the file.
Above the Source node, click Test.
Unstructured displays a visual representation of the file and begins processing its contents, sending it through each of the workflow’s nodes in sequence. Depending on the file’s size and the workflow’s complexity, this processing could take several minutes.
After Unstructured has finished its processing, the processed data appears in the Test output pane, as a series of structured elements in JSON format.
In the Test output pane, you can download the processed data as a `.json` file to your local machine by clicking Download full JSON.

When you are done, click the Close button in the Test output pane.
Add more nodes to the workflow
You can now add more nodes to the workflow to test more Unstructured features, with the option of eventually moving the workflow into production. For example, you can:
Add a Chunker node after the Partitioner node, to chunk the partitioned data into smaller pieces for your retrieval augmented generation (RAG) applications. To do this, click the add (+) button to the right of the Partitioner node, and then click Enrich > Chunker. Click the new Chunker node and specify its settings. For help, click the FAQ button in the Chunker node’s pane. Learn more about chunking and chunker settings.
Add an Enrichment node after the Chunker node, to apply enrichments to the chunked data such as image summaries, table summaries, table-to-HTML transforms, and named entity recognition (NER). To do this, click the add (+) button to the right of the Chunker node, and then click Enrich > Enrichment. Click the new Enrichment node and specify its settings. For help, click the FAQ button in the Enrichment node’s pane. Learn more about enrichments and enrichment settings.
Image summary descriptions, table summary descriptions, and table-to-HTML output are generated only when the Partitioner node in a workflow is set to use the High Res partitioning strategy and the workflow also contains an image description, table description, or table-to-HTML enrichment node.
Setting the Partitioner node to use Auto, VLM, or Fast in a workflow that also contains an image description, table description, or table-to-HTML enrichment node will not generate any image summary descriptions, table summary descriptions, or table-to-HTML output, and it could also cause the workflow to stop running or produce unexpected results.
Add an Embedder node after the Enrichment node, to generate vector embeddings for performing vector-based searches. To do this, click the add (+) button to the right of the Enrichment node, and then click Transform > Embedder. Click the new Embedder node and specify its settings. For help, click the FAQ button in the Embedder node’s pane. Learn more about embedding and embedding settings.
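To see what those embeddings enable, you can run a quick similarity check over the downloaded test output. The sketch below is illustrative only: it assumes each element in the JSON stores its vector under an "embeddings" key (verify the key name in your own output) and that the filename is the hypothetical one used earlier.

```python
# Minimal sketch: rank elements in the downloaded test output by cosine
# similarity to the first embedded element. Assumes each element stores its
# vector under an "embeddings" key; verify the key name in your own output.
import json
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

with open("test-output.json", encoding="utf-8") as f:  # hypothetical filename
    elements = [e for e in json.load(f) if e.get("embeddings")]

if len(elements) < 2:
    raise SystemExit("Not enough embedded elements found in the output.")

query = elements[0]
scores = sorted(
    ((cosine(query["embeddings"], e["embeddings"]), e) for e in elements[1:]),
    key=lambda pair: pair[0],
    reverse=True,
)
for score, element in scores[:5]:
    print(f"{score:.3f}  {element.get('text', '')[:70]}")
```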
Each time you add a node or change its settings, you can click Test above the Source node again to test the current workflow end to end and see the results of the changes, if any.
Keep repeating this step as many times as you want, until you get the results you want.
Next steps
After you get the results you want, you have the option of moving your workflow into production. To do this, complete the following instructions.
The following instructions have you create a new workflow that is suitable for production. In a future release, you will be able to update the workflow that you just created instead of creating a new one.
With your workflow remaining open in the visual workflow editor, open a new tab in your web browser, and in this new tab, sign in to your Unstructured account:
In this new tab, create a source connector for your remote source location. This is the location in production where you have files or data in a file or object store, website, database, or vector store that you want Unstructured to process.
Create a destination connector for your remote destination location. This is the location in production where you want Unstructured to put the processed data, as `.json` files in a file or object store, or as records in a database or vector store.
Create a workflow: on the sidebar, click Workflows, and then click New Workflow. Select Build it Myself, and then click Continue to open the visual workflow editor.
In the visual workflow editor, click Source.
In the Source pane, with Details selected, on the Connectors tab, select the source connector that you just created.
Click the Destination node.
In the Destination pane, with Details selected, select the destination connector that you just created.
Using your original workflow on the other tab as a guide, add any additional nodes to this new workflow as needed, and configure those new nodes’ settings to match the other ones.
Click Save.
To run the workflow:
a. Make sure to click Save first.
b. Click the Close button next to the workflow’s name in the top navigation bar.
c. On the sidebar, click Workflows.
d. In the list of available workflows, click the Run button for the workflow that you just saved.
e. On the sidebar, click Jobs.
f. In the list of available jobs, click the job that you just ran.
g. After the job status shows Finished, go to your destination location to see the processed files or data that Unstructured put there.
See also the remote quickstart for more coverage about how to set up and run production-ready Unstructured ETL+ workflows at scale.
Quickstart: Unstructured API

This quickstart uses the Unstructured Python SDK to call the Unstructured Workflow Endpoint to get your data RAG-ready. The Python code for this quickstart is in a remotely hosted Google Colab notebook. Data is processed on Unstructured-hosted compute resources.
The requirements are as follows:

- Some compatible files to process. You can use the sample files in the `Unstructured-IO/unstructured-ingest` repository in GitHub.

Sign up, sign in, and get your API key
Sign in to your Unstructured account:
Get your Unstructured API key:
a. In the Unstructured UI, click API Keys on the sidebar.
b. Click Generate API Key.
c. Follow the on-screen instructions to finish generating the key.
d. Click the Copy icon next to your new key to add the key to your system’s clipboard. If you lose this key, simply return and click the Copy icon again.
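With the key copied, you can make it available to your code. The following is a minimal sketch, assuming you export the key in an environment variable named UNSTRUCTURED_API_KEY and have installed the unstructured-client package (pip install unstructured-client); confirm the client's parameter name against the Python SDK documentation.

```python
# Minimal sketch: create an Unstructured Python SDK client with your API key.
# Assumes the key is exported as UNSTRUCTURED_API_KEY and that the client
# accepts it via api_key_auth; confirm against the SDK documentation.
import os

from unstructured_client import UnstructuredClient

client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])
```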
Create and set up the S3 bucket
This quickstart uses an Amazon S3 bucket as both the source location and the destination location. (You can use other source and destination types that are supported by Unstructured. If you use a different source or destination type, or if you use a different S3 bucket for the destination location, you will need to modify the quickstart notebook accordingly.)
Inside the S3 bucket, a folder named `input` represents the source location. This is where your files to be processed will be stored. The S3 URI to the source location will be `s3://<your-bucket-name>/input`.

Inside the same S3 bucket, a folder named `output` represents the destination location. This is where Unstructured will put the processed data. The S3 URI to the destination location will be `s3://<your-bucket-name>/output`.
Learn how to create an S3 bucket and set it up for Unstructured. (Do not run the Python SDK code or REST commands at the end of those setup instructions.)
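If you prefer to script the bucket setup rather than use the AWS console, a rough boto3 sketch is shown below. It only creates the bucket and the two folder prefixes; the linked instructions also cover the credentials and permissions that Unstructured needs, which this sketch does not replace. The bucket name and region are placeholders.

```python
# Rough sketch: create the S3 bucket and the input/ and output/ folder
# prefixes with boto3. Bucket name and region are placeholders; the linked
# setup instructions also cover the IAM permissions Unstructured needs.
import boto3

BUCKET = "your-bucket-name"  # placeholder
REGION = "us-east-2"         # placeholder

s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},  # omit for us-east-1
)

# S3 has no real folders; zero-byte keys ending in "/" act as folder markers.
s3.put_object(Bucket=BUCKET, Key="input/")
s3.put_object(Bucket=BUCKET, Key="output/")
```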
Run the quickstart notebook
After your S3 bucket is created and set up, follow the instructions in this quickstart notebook.
View the processed data
After you run the quickstart notebook, go to your destination location to view the processed data.
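To check the results from a script instead of the AWS console, something like the following boto3 sketch lists and downloads whatever Unstructured wrote under the output prefix. The bucket name is a placeholder.

```python
# Rough sketch: list the processed .json files under output/ and download them
# to the current directory. The bucket name is a placeholder.
from pathlib import Path

import boto3

BUCKET = "your-bucket-name"  # placeholder
s3 = boto3.client("s3")

listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="output/")
for obj in listing.get("Contents", []):
    key = obj["Key"]
    if key.endswith("/"):
        continue  # skip the folder marker itself
    target = Path(key).name
    print(f"Downloading {key} -> {target}")
    s3.download_file(BUCKET, key, target)
```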
Learn more about the Unstructured API.
If you can’t find the information you’re looking for in the documentation, or if you need help, contact us directly, or join our Slack where our team and community can help you.