When you use Unstructured, here are some techniques that you can try to help speed up the processing of large files and large batches of files.
For faster processing of individual files, consider the fast partitioning strategy. Try the fast strategy on a few of your documents before you try using the hi_res strategy.
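For instance, if you partition through the Unstructured Ingest Python library, the strategy can be set on the partitioner configuration. The following is a minimal sketch, assuming the v2 import path and parameter names shown; check the version of the library that you have installed:

```python
import os

# Assumed import path; PartitionerConfig's location varies across versions of
# the unstructured-ingest library.
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

# Start with the fast strategy; switch to hi_res only if the output quality requires it.
partitioner_config = PartitionerConfig(
    partition_by_api=True,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    strategy="fast",
)
```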
For speeding up file processing during ingestion, the Unstructured Ingest CLI and the Unstructured Ingest Python library enable you to instruct Unstructured to use additional local CPUs where applicable.
Using additional local CPUs applies only to pipeline steps that Unstructured logs as being processed across CPUs. It does not apply to pipeline steps that are logged as being processed asynchronously. To get a list of which operations are processed where, look for the following log messages when you run an ingest pipeline:
- processing content across processes. These steps might benefit from setting a higher number of local CPUs to be used.
- Messages indicating that content is being processed asynchronously. Any settings to use a higher number of local CPUs are ignored for these steps.

For the Unstructured Ingest CLI, you can set --num-processes to the maximum number of available local CPUs that you want to use where applicable, for example:
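The following is a minimal sketch that uses the local source connector; the connector name, paths, and the other flags shown here are illustrative assumptions, and only --num-processes is the setting under discussion:

```bash
# Fan the applicable pipeline steps out across 8 local processes.
unstructured-ingest \
  local \
  --input-path ./input-files \
  --output-dir ./output-json \
  --num-processes 8 \
  --partition-by-api \
  --api-key "$UNSTRUCTURED_API_KEY"
```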
To find the maximum number of available local logical CPUs on your machine, see your operating system’s documentation.
For the Unstructured Ingest Python library, you can set the ProcessorConfig object’s num_processes parameter to the maximum number of available local CPUs that you want to use where applicable, for example:
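The following is a minimal sketch, assuming the v2 import path shown; adjust it to the version of the library that you have installed:

```python
import os

# Assumed import path; ProcessorConfig's location varies across versions of
# the unstructured-ingest library.
from unstructured_ingest.v2.interfaces import ProcessorConfig

# Use every available logical CPU for the pipeline steps that run across processes.
processor_config = ProcessorConfig(
    num_processes=os.cpu_count() or 1,  # os.cpu_count() can return None
)

# Pass processor_config to the ingest pipeline when you build it.
```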
In Python, to determine the number of available local logical CPUs, you can call functions such as os.cpu_count() or multiprocessing.cpu_count().
To help speed up the processing of a large PDF file, the Unstructured Ingest CLI and the Unstructured Ingest Python library provide the following parameters; a usage sketch follows the list:
- split_pdf_page, when set to true, splits the PDF file on the client side before sending it as batches to Unstructured for processing. The number of pages in each batch is determined internally. Batches can contain between 2 and 20 pages.
- split_pdf_concurrency_level is an integer that specifies the number of parallel requests. The default is 5. The maximum is 15. This setting is ignored unless split_pdf_page is also set to true.
- split_pdf_allow_failed, when set to true, allows partitioning to continue even if some pages fail.
- split_pdf_page_range is a list of two integers that specify the beginning and ending page numbers of the PDF file to be sent. A ValueError is raised if the specified range is not valid. This setting is ignored unless split_pdf_page is also set to true.
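As a sketch of how these settings can be passed from the Unstructured Ingest Python library, the following assumes that the partitioner configuration forwards them through an additional_partition_args mapping; the import path and parameter names are assumptions, so check the version of the library that you have installed:

```python
import os

# Assumed import path; PartitionerConfig's location varies across versions of
# the unstructured-ingest library.
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

partitioner_config = PartitionerConfig(
    partition_by_api=True,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
    # PDF-splitting settings forwarded with each partitioning request:
    additional_partition_args={
        "split_pdf_page": True,             # split the PDF client-side into page batches
        "split_pdf_allow_failed": True,     # keep going if some pages fail
        "split_pdf_concurrency_level": 15,  # up to 15 parallel requests
        "split_pdf_page_range": [1, 50],    # hypothetical range: pages 1 through 50
    },
)
```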