I'm training on an NVIDIA GeForce RTX 2070 SUPER with 8 GB of VRAM, and I have 64 GB of RAM on my PC.
I've used Keras to create a Sequential model to predict POS tags, and I've used the same model architecture to train models for text in several different languages. The models all trained fine, and when I run model.evaluate(test_data) they all produce a score. Similarly, when I run model.predict(test_data) most models produce the expected results, but there is one model, for one language, which behaves differently.
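For context, every language's model follows the same pattern, roughly like the sketch below. The vocabulary size, tag count and layer choices here are illustrative placeholders rather than my real configuration, and test_data stands for the prepared test set that yields (inputs, labels) pairs:

import tensorflow as tf

# Illustrative sketch only: the vocabulary size, tag count and layer sizes
# are placeholders, not the real configuration.
vocab_size, num_tags = 20000, 18
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(num_tags, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

score = model.evaluate(test_data)       # every model produces a score here
predictions = model.predict(test_data)  # this is where the one model fails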
This model was trained in exactly the same way as all the others, so as far as I can tell there should be no difference. When I run model.predict(test_data) with this model, at first it seems to be working. It starts applying the model to the dataset:
6/152 [=>............................] - ETA: 19s
It even appears to successfully complete this step, though it never gets as far as producing any results:
152/152 [==============================] - 20s 126ms/step
Unfortunately, at this point it hangs and then produces the following traceback:
2024-01-05 23:08:38.977923: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.61GiB (rounded to 2804106240)requested by op ConcatV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
...
...
...
2024-01-05 23:08:38.998922: I tensorflow/core/common_runtime/bfc_allocator.cc:1101] Sum Total of in-use chunks: 4.04GiB
2024-01-05 23:08:38.998977: I tensorflow/core/common_runtime/bfc_allocator.cc:1103] total_region_allocated_bytes_: 6263144448 memory_limit_: 6263144448 available bytes: 0 curr_region_allocation_bytes_: 8589934592
2024-01-05 23:08:38.999071: I tensorflow/core/common_runtime/bfc_allocator.cc:1109] Stats:
Limit: 6263144448
InUse: 4335309312
MaxInUse: 4520417536
NumAllocs: 1293
MaxAllocSize: 536870912
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2024-01-05 23:08:38.999241: W tensorflow/core/common_runtime/bfc_allocator.cc:491] ****************x*****************************************************______________________________
2024-01-05 23:08:38.999336: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at concat_op.cc:158 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[38688,18120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "C:\Users\admd9\PycharmProjects\codalab-sigtyp2024\generate_results.py", line 131, in
predictions = task_model.predict(test_gen)
File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node __wrapped__ConcatV2_N_152_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[38688,18120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
I can't work out why this is only happening with this one model, or why there would be a problem with memory allocation here when it works for all the other models. It doesn't seem like it should be using that much memory either. So why am I getting this error, and how can I fix it?
I've tried setting memory growth, but it didn't help:

import tensorflow as tf

# Note: set_memory_growth has to run before the GPU is first used
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
I've also reduced the batch size, which didn't help. I've even gone back and retrained the model in case there was something wrong with the model itself, but the new model has the same problem. As a last resort, I tried splitting the test set into smaller chunks, running model.predict() on each chunk, and then recombining the results (roughly like the sketch below). It sometimes successfully predicts the first chunk, but it always runs out of memory and gives me the same error by the second chunk.
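This is approximately what the chunked prediction looked like; chunk_size and the variable names are illustrative, and test_data is assumed here to be an indexable array of prepared inputs rather than the generator from the traceback:

import numpy as np

# Illustrative sketch of the chunked prediction attempt; chunk_size and the
# variable names are made up, and test_data is assumed to be an indexable
# NumPy array of prepared inputs.
chunk_size = 1000
chunk_predictions = []
for start in range(0, len(test_data), chunk_size):
    chunk = test_data[start:start + chunk_size]
    preds = task_model.predict(chunk)  # the OOM usually appears on the second chunk
    chunk_predictions.append(preds)

predictions = np.concatenate(chunk_predictions, axis=0)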
Is there anything I can do?
Thanks