

UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1
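The warning in the title is straightforward to address: starting with PyTorch 2.1, `torch.set_default_tensor_type()` is deprecated in favor of `torch.set_default_dtype()` and `torch.set_default_device()`. A minimal migration sketch follows; the half-precision CUDA tensor type in the commented-out line is only an assumption about what the calling code originally set.

```python
import torch

# Old style (emits the UserWarning on PyTorch >= 2.1); the concrete
# tensor type here is an assumed example, not taken from the log below:
# torch.set_default_tensor_type(torch.cuda.HalfTensor)

# Replacement suggested by the warning message:
torch.set_default_dtype(torch.float16)   # default floating-point dtype for new tensors
torch.set_default_device("cuda")         # default device for newly created tensors
```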

E:\DiagGPT-main\Scripts\python.exe E:/chrome/DiagGPT-main/sa.py
Issue #913: Solution to loading Llama 2 70B on 8 GPUs?
State: open
Body:
So, I have a server with 8 Tesla V100 GPUs, 480 GB of RAM, and 64 TB of storage, but when I run llama-2-70b-chat (not the HF version), I get this result:

```
[2023-11-09 02:30:35,043] torch.distributed.run: [WARNING] 
[2023-11-09 02:30:35,043] torch.distributed.run: [WARNING] *****************************************
[2023-11-09 02:30:35,043] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2023-11-09 02:30:35,043] torch.distributed.run: [WARNING] *****************************************
> initializing model parallel with size 8
> initializing ddp with size 1
> initializing pipeline with size 1
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 5 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 3 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 7 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 1 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 2 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 4 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/llama/example_chat_completion.py", line 96, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "/mnt/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
  File "/mnt/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/llama/llama/model.py", line 376, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/llama/llama/model.py", line 340, in __init__
    self.w2 = RowParallelLinear(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 359, in __init__
    self.master_weight = _initialize_affine_weight(
  File "/usr/local/lib/python3.10/dist-packages/fairscale/nn/model_parallel/layers.py", line 68, in _initialize_affine_weight
    master_weight = torch.empty(out_features, in_features, dtype=weight.dtype, requires_grad=False)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 6 has a total capacty of 15.77 GiB of which 289.38 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 14.66 GiB is allocated by PyTorch, and 483.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-11-09 02:31:00,074] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54329 closing signal SIGTERM
[2023-11-09 02:31:01,090] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 54327) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
```

What exactly is the minimum amount of VRAM needed to run a 70B-parameter model?
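As a rough back-of-the-envelope check (an estimate only, ignoring activations, the KV cache, and framework overhead), the fp16 weights of a 70B-parameter model alone already slightly exceed what an 8-way split across 16 GB V100s can hold, which is consistent with the ~15.77 GiB capacity reported in the OOM messages above:

```python
# Rough per-GPU memory estimate for a 70B-parameter model in fp16 with
# 8-way model parallelism. Illustrative only, not an official requirement.
params = 70e9          # approximate parameter count
bytes_per_param = 2    # fp16 / bf16
num_gpus = 8

weights_gib = params * bytes_per_param / 1024**3
per_gpu_gib = weights_gib / num_gpus

print(f"Total weight memory: {weights_gib:.1f} GiB")  # ~130 GiB
print(f"Per GPU (8-way):     {per_gpu_gib:.1f} GiB")  # ~16.3 GiB, above a 16 GB V100's usable ~15.8 GiB
```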

Issue #912: How to cache Llama2-chat-7b-hf when using HuggingFace
State: open
Body:

## Describe the bug

### Minimal reproducible example
This code, taken from the Hugging Face Llama intro page, works perfectly:

```python
# sample code that works
from transformers import AutoTokenizer
import transformers
import torch
import os
from transformers import AutoModelForCausalLM

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```

However, modifying it slightly to allow caching of the model causes issues.

```python
# sample code to repro the bug
from transformers import AutoTokenizer
import transformers
import torch
import os
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"
model_cache = '/mnt/cache_folder/'


tokenizer = AutoTokenizer.from_pretrained(model_name)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

# New code here:::
tokenizer.save_pretrained(model_cache)
pipeline.save_pretrained(model_cache)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_cache,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

### Output

```
NotImplementedError: Cannot copy out of meta tensor; no data!
```

## Runtime Environment
- Model: Llama-2-7b-chat-hf
- Using via huggingface?: yes
- OS: Ubuntu
- GPU: 1x NVIDIA A10G
- GPU VRAM: 23.84 GB total, 20.43 GB allocated, 20.44 GB reserved, 3.41 GB free


**Additional context**
Essentially, how do I run a Llama 2 chat model through Hugging Face in a way that lets me cache the model?
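A workaround that is often suggested for this kind of setup (a sketch only, not an answer given in this thread) is to let `transformers` manage the on-disk cache through the `cache_dir` argument, instead of calling `save_pretrained` on a pipeline built with `device_map="auto"`, since the latter can leave offloaded weights as meta tensors:

```python
from transformers import AutoTokenizer
import transformers
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
model_cache = "/mnt/cache_folder/"  # same hypothetical cache path as in the issue

# First run downloads into model_cache; later runs reuse the cached files.
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_cache)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    model_kwargs={"cache_dir": model_cache},  # forwarded to from_pretrained
)
```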

Issue #911: LLaMA 1 access form not working
State: open
Body:
Hi, you provide a Google form for requesting access to the LLaMA 1 weights, but it does not work, either for me or for other PhD students in my department. Nothing happens after filling out the form, and we have never heard back. An old GitHub issue on this topic is also not getting any responses. Could you please advise on how to proceed? We really need the 30B model to replicate the results of a paper, and that model size is only available for LLaMA 1.

Issue #910: llava_v1_5_mix665k dataset
State: open
Body:
Hello,
Looking at the dataset list, which dataset do the prompts with an empty "model" field belong to?
For example:

"id": "wgByO4Y_0",
    "model": "",
    
Thanks

Issue #909: How to obtain the hidden state of the last layer of Llama2 after calling the chat_completion method?
State: open
Body:
I've been thinking about how to obtain the hidden state of the last layer of Llama2 after calling the chat_completion method, but I'm not sure how to implement it. I hope to get some answers.
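The native `chat_completion` API in this repo does not return hidden states, so one common approach (a sketch that assumes the Hugging Face port of the model is acceptable for the use case) is to run a forward pass with `output_hidden_states=True` and take the last entry of the returned tuple:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"  # HF port of the chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple: (embedding output, layer 1, ..., last layer)
last_hidden = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_size)
print(last_hidden.shape)
```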

Issue #908: Meta model Conversion to the Hugging Face friendly version
State: closed
Body:
Hi,
I am trying to use the Llama 2 model I downloaded from Meta, but it needs to be converted to the Hugging Face-friendly version. I cannot use the models on Hugging Face because the GPU server I am using cannot connect to the internet. I saw the code for conversion, but it is not clear where to run it. Also, should the input path be the directory where I have all the files including the tokenizer and model, or the path that contains just the model with the .chk and .json files for the weights? I would appreciate it if someone could help me with this problem; I have been stuck for about two weeks.

Hi, closing this thread as it is a duplicate of your open issue at https://github.com/facebookresearch/llama/issues/904. Please continue the discussion there. Thanks
Issue #907: License of Llama2 derivative model
State: closed
Body:
Our customers are interested in training a model using Llama2 as a starting point. Before investing significant time and compute resources into this work, I wanted to request clarification on how derivative models should be licensed. 

Based on my reading of the [Llama2 license](https://ai.meta.com/llama/license/) especially section `1-b-i` , my understanding is that any model derived from Llama2 - whether by fine-tuning the weights or training from scratch using the codebase - would need to be released under the LLAMA 2 Community License. These derivative models could not be released under a more permissive license like MIT or Apache 2.0.

The key points are:

- Models fine-tuned from Llama2 weights need the LLAMA 2 Community License.
- New models trained from scratch using the Llama2 codebase also need the LLAMA 2 Community License. 
- The LLAMA 2 Community License does not allow derivative works to be re-licensed under permissive licenses like MIT or Apache 2.0 that were not written for AI systems.
  - If a codebase is implemented from scratch by referring to the [Llama2 paper](https://arxiv.org/abs/2302.13971), it does not need to inherit the license, because the paper itself is not included in the "Llama Materials".

Please let me know if this interpretation is accurate. I want to be certain I understand the obligations for derivative works before proceeding with model development using Llama2. Thank you again for the clarification.

## Related issues

* https://github.com/facebookresearch/llama/issues/240
* https://github.com/facebookresearch/llama/issues/226

I am not a lawyer, so I would advocate that you check with your legal team for any clarification. But generally yes, the license requires that any derivative models be redistributed under it.
Issue #906: docs. Correct the URL to the FAQ.md file
State: open
Body:
Correct the URL to the FAQ.md file.

Issue #905: Vertical lines on token embeddings visualization
State: open
Body:
I've visualized the token embedding weights (loaded from /llama2/7B/consolidated.00.pth) as an image (4096x32000 pixels), and I spotted some vertical lines that I don't understand. Here's a crop of the full image with these vertical lines clearly visible:

![llama-word-embeddings-crop](https://github.com/facebookresearch/llama/assets/139641/5e34f6a6-5643-457f-8ff4-9091e4907f31)

(the link to the full image is [here](https://drive.google.com/file/d/1AjO1Zb1XTuCTIvcQMRRKKZrPH95mo758/view?usp=share_link))

Any explanation why some dimensions of the token embedding would be special?

Sounds interesting, but I'm not sure embeddings can be meaningfully visualized like this. Perhaps approaches like t-SNE/UMAP might provide more insight?

cc @melanierk 
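For anyone who wants to quantify which dimensions stand out rather than eyeball the image, a minimal sketch (assuming the official checkpoint layout, where the embedding matrix is stored under the `tok_embeddings.weight` key) is:

```python
import torch

# Load the raw checkpoint on CPU; the key name below is an assumption about
# the official Meta checkpoint layout and may differ for other exports.
ckpt = torch.load("/llama2/7B/consolidated.00.pth", map_location="cpu")
emb = ckpt["tok_embeddings.weight"].float()  # shape: (vocab_size, dim), e.g. (32000, 4096)

# Vertical lines in a (dim x vocab) image correspond to embedding dimensions whose
# values are unusually large across the whole vocabulary, so a per-dimension
# statistic makes them easy to list:
per_dim_std = emb.std(dim=0)          # one value per embedding dimension
top = torch.topk(per_dim_std, k=10)   # the 10 most "special" dimensions
print(top.indices.tolist())
print(top.values.tolist())
```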
Issue #904: ERROR:  OSError:lama-2-7b-chat does not appear to have a file named config.json. 
State: open
Body:
Hi,
I am trying to run the Llama-2-7b chat model that I downloaded from Meta locally. I got this configuration error because I am using Transformers, and I do not know how to change the code so it can run with Transformers. Also, my local system is a remote GPU server, which does not have permission to connect to the internet.


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

# Define the model name or directory path
model_name_or_path = ".../llama-2-7b-chat"  # Replace with the actual model name or path

# Load the configuration
config = AutoConfig.from_pretrained(model_name_or_path)

# Check if a GPU is available, and if so, use it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, config=config).to(device)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

print("Model loaded.")

```

### Output
```
OSError:lama-2-7b-chat does not appear to have a file named config.json.
```

Hello @Yarmohamadshr, the locally downloaded model can only be loaded with their code (found in the same repo: https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py). If you want to use the HuggingFace code, you need to use the model from HuggingFace: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
My server does not have permission to connect to the internet for the Hugging Face login. Do you know how to use this Hugging Face version offline, without a connection?
@Yarmohamadshr You should be able to use the downloaded checkpoint files on your server without the HF login. Before that, you will have to convert those checkpoints to the HF format as mentioned here: https://github.com/facebookresearch/llama/issues/875#issuecomment-1779647981


@subramen I am trying to use the conversion code in a Conda virtual environment, but I get an error from GitHub. I am sorry, I am not that familiar with terminal commands. Should I run it in cmd by calling Python, or in my Conda prompt? Below is a screenshot of the git clone error.
![Screenshot 2023-11-08 113309](https://github.com/facebookresearch/llama/assets/73134971/0a753900-18a5-4ed3-9b64-0ef32ed38e14)


@Yarmohamadshr It's unclear why you're trying to clone the transformers repo; you should clone this one: https://github.com/facebookresearch/llama.

Second, you need to add your SSH key to GitHub to be able to clone repos:

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account
@Yarmohamadshr For installing transformers, you can just do: pip install transformers --upgrade
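For reference, once the Meta-format checkpoint has been converted to the HF layout (for example with the `convert_llama_weights_to_hf.py` script that ships with `transformers`), the converted directory can be loaded fully offline. A sketch, with the path below being a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the directory produced by the conversion script
# (it should contain config.json, the tokenizer files, and the model weights).
converted_dir = "/path/to/llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(converted_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    converted_dir,
    torch_dtype=torch.float16,
    device_map="auto",
    local_files_only=True,  # never try to reach the Hugging Face Hub
)
print("Model loaded offline.")
```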
Issue #903: Authorization to translate documentation (to PT-BR)
State: open
Body:
Hello Llama 2's team.

First of all, I want to deeply thank you for all your contributions to AI, and to the world. Llama 2 is undoubtedly a significant step toward democratizing AI. Meta is probably the most important player in terms of making AI truly accessible to **everyone** without charging for it, and, moreover, in contributing to academia and individual students by making it open source.

Thank you!

And speaking of democratizing AI and information: we run a non-profit student community here in Brazil, where language is still a barrier, focused on bringing high-quality material about ML and AI into Portuguese so that Brazilian students have access to it. Our community is called [**BRAINS - Brazilian AI Networks**](https://brains.dev/).

I have recently read your post [Getting started with Llama](https://ai.meta.com/llama/get-started/) on Meta AI's blog, and it is a masterpiece from start to end: very well written, concise, and valuable at the same time. I want to apologize if I'm on the wrong channel to make such a request, but I'd like your permission to translate this blog post and make it available to our community, with proper credits, of course!

If it is not up to you to give such authorization, I'd deeply appreciate it if you could point me in the right direction. I'm confident thousands of Brazilian students, like me, would benefit from having this content accessible in Portuguese.

Once again, thank you very much. For everything you've done and are still doing for the AI community.

And I hope we can take access to this blog post even further by translating it into other languages.

#NoBrains #NoGains
