【Interesting Facts About Tokens】 Why Is AI Charged Based on Tokens? Let's Dive In! — Slowly Learn AI040
Introduction
- Have you heard that AI charges by Token?
- It really consumes a lot of Tokens.
- My computer ran all night and burned through so many Tokens that it feels like losing an apartment.
- Why use Tokens for billing?
- I’ve heard that Token charges are two-way.
- Asking AI a question costs money, and getting an answer does too. Isn’t that a bit excessive?
- So couldn't the AI just ramble on to run up the bill?
- Are Tokens words or letters?
- How do they bill for Chinese characters?
- What about Arabic?
- What different meanings do Tokens have in the process of corporate digital transformation?
- Traditional information systems might just involve setting up structures and databases.
- Why is Tokenization an issue with AI applications?
This article attempts to address these questions and provide a better understanding of what Tokens really are. It’s a long read, so let’s dig in!
In the history of computing, many terms that once sounded impressive have gradually become part of everyday life. Take "Prompt," for instance: it has certainly made its mark, and Token is following the same path; by now it too seems to have crossed into the mainstream.
So is Token billing simply a method OpenAI proposed and the rest of the industry agreed was great, or is there another reason behind it?
Let’s start with its origins.
In a corporate environment, utilizing AI technology to reduce costs and increase efficiency is key, and understanding Tokens helps us better grasp how AI can be implemented in businesses. Simply put, think of Tokens as building blocks — we assemble these blocks to create the applications we need, thus enhancing productivity.
Basics of Tokens
Basic Concept of Tokens
Let’s first review how Tokens are described by OpenAI:
- 1 Token ~= 4 English characters
- 1 Token ~= ¾ of a word
- 100 Tokens ~= 75 words
Or, put another way:
- 1-2 sentences ~= 30 Tokens
- 1 paragraph ~= 100 Tokens
- 1,500 words ~= 2,048 Tokens
Feeling a bit confused? It's like pondering how many ways the character "茴" can be written (a nod to Kong Yiji's famous four). Let's take a closer look at what it's about:
Learning AI Meticulously, Sharing Knowledge Joyfully
Can you guess how many Tokens this sentence has? It’s 6 words, so it should be 6 Tokens, right? Not quite!
In ChatGPT 4, it's actually 10 Tokens. You can see that punctuation is counted separately, and "Joyfully" is split into "Joy" and "fully".
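If you would like to verify this yourself, OpenAI's open-source tiktoken library exposes the same encodings the GPT models use. Below is a minimal sketch, assuming tiktoken is installed; the exact count and the way words get split depend on which encoding you pick.

```python
# Minimal sketch: count the Tokens in the example sentence with tiktoken
# (pip install tiktoken). Counts and splits depend on the chosen encoding.
import tiktoken

text = "Learning AI Meticulously, Sharing Knowledge Joyfully"

enc = tiktoken.encoding_for_model("gpt-4")      # selects the cl100k_base encoding
token_ids = enc.encode(text)

print(len(token_ids))                           # number of Tokens you would be billed for
print([enc.decode([t]) for t in token_ids])     # the text fragment behind each Token
```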
From Code to Conversation: The Necessity of Introducing Tokens
The core language of computers consists of binary code made up of 0s and 1s, which serves as the most basic representation of all programs and data. Whether it’s high-level programming languages like Python and Java or various multimedia files such as images and videos, everything gets converted into this machine language. In traditional computer science, experts have worked hard to abstract the complexity of the real world by defining clear data types such as strings (a series of characters) and integers (numbers). This approach is indeed effective for handling structured data like mathematical calculations or database queries.
However, as technology has advanced and demands have increased, we want computers not only to process numbers and code but also to understand and process natural language — our everyday human languages. This ushers in the field of Natural Language Processing (NLP), which aims to enable computers to understand, interpret, and generate human language.
Given the characteristics of natural language, which include diversity, context-dependence, and ambiguity, we no longer face simple problems like 1+1=2. Instead, we are tasked with making computers comprehend statements like, "Today is Friday. Where should we go for the weekend? How about staying home to learn AI?", analyzing the emotion they carry, or translating them into other languages. In these scenarios, traditional data types are no longer sufficient.
This is where the concept of Tokens comes into play. Tokenization is the process of breaking down complex text data into smaller, more manageable units that are easier for computers to process, such as words, phrases, or punctuation. This enables computers to engage in more effective language processing, extracting meaning from text rather than merely counting characters.
- From Determinism to Ambiguity: Traditional programming deals with precise and predictable data, whereas NLP entails interpreting polysemous, context-dependent language.
- From Structured to Unstructured: Unlike structured databases or algorithms, NLP handles natural language text that is fluid and free-form.
What Are Tokens? Why Convert Text to Tokens?
Imagine a common application scenario in generative AI involving quick summarization. We want to grasp key information without going word by word. Tokens play a vital role, helping computers “understand” and process large volumes of text.
What are Tokens?
In NLP, Tokens generally refer to meaningful segments of text. These segments can be words, phrases, or punctuation, just like in the earlier example.
Why Convert to Tokens?
Converting text to Tokens is similar to breaking down a complex business report into key sections, or distilling an email into bullet points. This breakdown allows computers to handle and analyze language more efficiently, facilitating tasks like keyword searching, automatic translation, or sentiment analysis.
For instance, suppose someone runs a chain of restaurants on Meituan and wants to analyze customer reviews to improve the product (let's assume improvement really is the goal). Breaking the reviews down into Tokens can help surface common issues and negative feedback.
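As a toy illustration of that idea, the sketch below tokenizes a few made-up reviews and counts the most frequent complaint words. The reviews, stop-word list, and splitting rule are all assumptions for demonstration, not anything Meituan actually uses.

```python
# Toy sketch: tokenize hypothetical reviews and surface frequent complaint terms.
import re
from collections import Counter

reviews = [
    "Delivery was slow and the soup arrived cold.",
    "Great flavour, but delivery was slow again.",
    "Cold noodles, slow delivery, friendly staff though.",
]

stop_words = {"the", "and", "was", "but", "though", "a"}

tokens = []
for review in reviews:
    words = re.findall(r"[a-z']+", review.lower())       # crude word-level Tokens
    tokens.extend(w for w in words if w not in stop_words)

print(Counter(tokens).most_common(5))                     # terms like "slow", "delivery", "cold" rise to the top
```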
Tokens may seem like words, but what’s the real story?
The Distinction and Connection Between Tokens, Characters, and Words

| Unit | Definition | Characteristics | Example |
| --- | --- | --- | --- |
| Character | The basic building block of text | May not convey complete meaning on its own; combines with others to form words | happy |
| Word | Composed of characters and conveys some meaning | The basic unit for communicating information, richer than a single character | I'm happy |
| Token | Usually corresponds to a word but is more flexible; can be a phrase, punctuation, or even a root or prefix | The definition of a Token depends on the application, e.g., text analysis or machine translation | I, 'm, happy |
By now you may feel you're starting to grasp it; how well you do depends largely on your feel for language itself.
Despite the technical differences among characters, words, and Tokens, they are closely interconnected in text processing. Characters serve as the foundation for building words, and words are elements that form Tokens. In practice, recognizing and using Tokens relies on an understanding of characters and words.
For example, if we need to analyze a report about market trends, Tokenization allows us to rapidly identify key terms (like “growth,” “risk,” and “opportunity”), helping executives quickly grasp the core content of the report.
Overall, Tokens are a method that aids computers in processing and “understanding” text, making automated text handling viable, thus enabling businesses to utilize language information more effectively in data-driven decision-making.
So how are Tokens generated, and how are they processed? This requires us to step beyond traditional programming mindsets.
Token Generation and Processing
How are Tokens generated? The specific process of converting text into Tokens.
```mermaid
graph LR
    A[Text Processing Workflow]
    A1[Preprocessing]
    A2[Segmentation]
    A3[Tokenization]
    A4[Post-processing]
    A --> A1
    A --> A2
    A --> A3
    A --> A4
    A1 --> B1[Remove irrelevant characters]
    B1 --> B1a[Such as webpage code]
    A1 --> B2[Standardize text]
    B2 --> B2a[Unify case]
    B2 --> B2b[Convert between simplified and traditional characters]
    A1 --> B3[Remove stop words]
    B3 --> B3a[Such as "的", "了", etc.]
    A2 --> C1[Word segmentation for English]
    C1 --> C1a[Based on spaces and punctuation]
    A2 --> C2[Word segmentation for Chinese]
    C2 --> C2a[Algorithm-dependent recognition of word boundaries]
    A3 --> D1[Merge vocabulary]
    D1 --> D1a[Like proper nouns "New York"]
    D1 --> D2[Identify phrases or fixed collocations]
    D1 --> D3[Treat punctuation as standalone Tokens]
    A4 --> E1[Part-of-speech tagging]
    A4 --> E2[Semantic role labeling]
```
Different models may vary in their processing steps; the stages above are outlined simply to aid understanding. When mining the value of the data accumulated during a corporate digital transformation, it is worth weighing the value of that data against the cost of processing it, so the effort can be evaluated appropriately.
For example:
(Figure: an example of Token generation)
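To make the workflow concrete, here is a minimal sketch of the preprocessing, segmentation, and post-processing stages for English text. The rules and the stop-word list are illustrative assumptions, not what any particular model does.

```python
# Minimal sketch of the pipeline stages above (illustrative rules only).
import re

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)     # remove irrelevant characters, e.g. webpage tags
    return text.lower()                      # standardize text: unify case

def segment(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)  # split on spaces; punctuation becomes its own piece

def postprocess(tokens: list[str]) -> list[str]:
    stop_words = {"the", "a", "an"}          # drop stop words
    return [t for t in tokens if t not in stop_words]

raw = "<p>The quick brown fox jumps over the lazy dog.</p>"
print(postprocess(segment(preprocess(raw))))
# -> ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', '.']
```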
The Role of Vocabulary in Token Generation
From our earlier discussion, we recognize that vocabulary plays a significant role in Token generation.
A well-maintained vocabulary supports boundary recognition, consistency, information compression, faster processing, and the preservation of semantics.
By maintaining and updating the vocabulary, we can continuously optimize the Token generation processes to adapt to language changes and the emergence of new words, thereby enhancing the adaptability and accuracy of entire systems.
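As a toy sketch of how a vocabulary drives Token generation, the function below does a greedy longest-match against a hand-written vocabulary and falls back to single characters when nothing matches. Real tokenizers (for example, BPE-based ones) learn their vocabularies from data rather than listing them by hand.

```python
# Toy vocabulary-driven tokenizer: greedy longest-match with a hand-written vocabulary.
def tokenize_with_vocab(text: str, vocab: set[str]) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            piece = text[i:j]
            if piece in vocab or j == i + 1:     # fall back to a single character
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"New York", "New", "York", "lives", "She", "in", " "}
print(tokenize_with_vocab("She lives in New York", vocab))
# -> ['She', ' ', 'lives', ' ', 'in', ' ', 'New York']
```

Because "New York" is in the vocabulary, it survives as one Token, which is exactly the boundary recognition and semantic preservation described above.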
Handling Special Characters (like punctuation and spaces)
In Token generation, dealing with special characters is a crucial point that requires careful attention. Special characters like punctuation and spaces often carry significant structural and semantic functions in the text:
Punctuation: Punctuation is typically used to indicate sentence structure, such as periods (。) that denote the end, commas (,) used to separate items in a list or clauses, or quotation marks (“”) that highlight direct speech. In Tokenization, punctuation is generally treated as standalone Tokens because they can affect the tone and structure of sentences; sometimes, they can even change a sentence’s meaning.
Spaces: In English and other languages that use Latin characters, spaces serve as the primary means to separate words. Thus, spaces are generally not kept as Tokens during Tokenization; however, their presence is essential for determining word boundaries. In some formatted texts, spaces might also be used for aesthetic purposes — in this case, the treatment may depend on the context.
Special formatting characters: Characters like tabs (Tab) and newlines (\n) also control formatting in the text. In some situations, these characters may need to be ignored or treated specially, especially when handling plain text files.
Correctly handling these special characters is crucial for ensuring proper Tokenization. The strategies for dealing with these characters directly impact the effectiveness of subsequent text analysis and applications. When designing NLP systems, we must carefully consider how these characters should be managed to meet diverse application needs and data characteristics.
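The sketch below shows one way those policy choices can be expressed in code: punctuation becomes a standalone Token, spaces only mark boundaries, and tabs or newlines can either be normalized away or kept as structural Tokens. The rules are illustrative, not those of any specific tokenizer.

```python
# Illustrative handling of punctuation, spaces, tabs, and newlines.
import re

def tokenize(text: str, keep_newlines: bool = False) -> list[str]:
    if keep_newlines:
        pattern = r"\n|\w+|[^\w\s]"              # keep newline as its own Token
    else:
        text = re.sub(r"[\t\n]+", " ", text)     # treat tabs/newlines as ordinary spaces
        pattern = r"\w+|[^\w\s]"
    return re.findall(pattern, text)             # spaces vanish; punctuation stands alone

sample = "Prices rose 3%,\nbut demand stayed flat.\tWhy?"
print(tokenize(sample))                          # whitespace dropped, punctuation separate
print(tokenize(sample, keep_newlines=True))      # "\n" preserved as a structural Token
```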
From the previous content, we have also seen that Tokens are handled somewhat differently across languages, and these differences help us better understand the diversity and adaptability of Tokens.
Diversity and Adaptability of Tokens
Tokenization Methods for Different Languages
The structural and grammatical differences among languages require Tokenization methods to be highly adaptable and flexible. For example:
English and other Western European languages: These languages typically use spaces as separators between words, making Tokenization relatively straightforward. For instance, the sentence “The quick brown fox” can be easily split into “The,” “quick,” “brown,” and “fox” based on spaces.
Chinese, Japanese, and Korean: These languages lack explicit word separators, making Tokenization more complicated. Chinese segmentation typically relies on dictionaries or statistical models to decide which characters belong together; for example, the algorithm must decide whether "快速发展" ("rapid development") is treated as one unit or split into "快速" and "发展."
Arabic and Hebrew: These right-to-left languages pose unique challenges for Tokenization, since character joining and writing direction must be taken into account, placing special demands on Tokenization algorithms.
Understanding these differences aids in better handling multilingual data in global operations, optimizing multilingual user interfaces and content creation, and enhancing user experience and market expansion.
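To see how this plays out in practice (and in billing), the sketch below uses the open-source tiktoken library to compare how many Tokens an English sentence and a roughly equivalent Chinese sentence consume; the exact numbers depend on the encoding used.

```python
# Compare Token counts for the "same" sentence in two languages (counts vary by encoding).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Chinese": "敏捷的棕色狐狸跳过了懒狗。",
}

for lang, text in samples.items():
    print(f"{lang}: {len(text)} characters -> {len(enc.encode(text))} Tokens")
```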
How Is the Size and Granularity of Tokens Determined?
The size and granularity of Tokens depend on the specific needs of the application and the expected depth of processing:
Fine-grained Tokens: Generally used in scenarios requiring deep language understanding, such as sentiment analysis or semantic searching. For example, further breaking down compound words can help models capture nuanced changes in meaning.
Coarse-grained Tokens: Suitable for scenarios that require quick processing of large volumes of text data, such as document classification or preliminary keyword extraction. Coarse-grained Tokenization reduces complexity and computing demands.
Finding the right granularity for Tokens often involves a trade-off between processing speed and semantic accuracy. Understanding this helps executives make informed decisions about choosing appropriate technologies and tools to meet business needs while implementing AI projects.
Grasping the Tokenization methods of various languages and the principles of determining the size and granularity of Tokens can help you:
- Better evaluate AI projects: Understanding the complexity and challenges of Tokenization aids in making informed decisions when purchasing or developing related AI solutions.
- Optimize global operations: Being able to adapt Tokenization capabilities to multilingual environments is key for successful globalization, improving cross-cultural communication and user interaction.
- Enhance data processing efficiency: Choosing the appropriate Token granularity can optimize data processing efficiency and costs while meeting business demands.
So, what impact do Tokens have on models?
Tokens and AI Model Performance
Token strategies influence how much context a large model can hold. When we converse with an AI, the model may forget earlier content if the exchange grows too long; this is the limit of its context window. Below are the context limits of large language models as of last year.
src: https://s10251.pcdn.co/pdf/2023-Alan-D-Thompson-2023-Context-Windows-Rev-0.pdf
This is last year’s data; here’s a diagram for Gemini.
src: https://beebom.com/gemini-1-5-pro-announced/
In China, Kimi can handle 100M PDF files; the size of the context window has become a key marketing point. So what impact does this have?
Under the current scaling-law paradigm, Token strategy still sits at the underlying-algorithm layer of the discussion; in other words, optimizing the Token strategy cannot hold a candle to simply buying more GPUs.
Impact of Tokens on Model Performance
```mermaid
sequenceDiagram
    participant U as User
    participant I as Input Processing
    participant M as Model Calculation
    participant S as Storage System
    U->>+I: Input conversation history (Token count)
    I->>+M: Parse Tokens and prepare data
    M->>+M: Compute self-attention
    Note over M: Calculate relationships between Tokens
    M->>+S: Request additional memory
    Note over S: Increase memory allocation based on Token count
    S-->>-M: Confirm memory allocation
    M->>M: Continue to compute response
    M-->>-I: Return generated response
    I-->>-U: Display response
```
How Does the Number of Tokens Affect the Model’s Computational Complexity and Memory Usage?
In generative AI models like GPT-4 or other Transformer-based models, the number of Tokens directly correlates with the computational complexity and memory usage during processing. Each additional Token means the model needs to process more data points, which increases the computational burden during both training and inference, as well as memory needs. For example, when training a language model, the model must store and compute the relationships between each Token and all other Tokens, which becomes especially pronounced in the model’s self-attention mechanism.
Example: Consider a generative chatbot project. If the input conversation history is too long (i.e., a high Token count), the model may slow down in generating responses while consuming more computational resources. For instance, a chat history with thousands of Tokens may lead to noticeably decreased processing speeds, particularly on resource-constrained devices.
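A back-of-the-envelope calculation makes the quadratic growth tangible. The sketch below estimates the size of a single attention-score matrix at fp16 precision for different context lengths; real systems use optimizations (such as attention kernels that avoid materializing the full matrix), so treat this only as an intuition for why Token count matters.

```python
# Back-of-the-envelope: one attention-score matrix grows with the square of the Token count.
BYTES_PER_VALUE = 2  # fp16

def attention_matrix_mb(num_tokens: int) -> float:
    return num_tokens ** 2 * BYTES_PER_VALUE / 1024 ** 2

for n in (1_000, 8_000, 32_000, 128_000):
    print(f"{n:>7} Tokens -> ~{attention_matrix_mb(n):,.0f} MB for a single head's score matrix")
```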
A more intuitive way to see it: large-model companies have practical reasons for not expanding capacity without limit. Is bigger always better?
Do More Tokens Guarantee Better Model Performance?
Not necessarily. In generative AI, the right number of Tokens can help the model capture and understand context more accurately, boosting the relevance and accuracy of generated content. However, excessive Tokens may introduce irrelevant information, diminishing the model’s efficiency and output quality.
Example: In an AI system generating market reports, precise Token segmentation can ensure important information is highlighted rather than buried under unnecessary details. If the system must generate concise summaries from a wealth of financial news, excessive Tokens may lead to disorganized reports that fail to capture core information.
At present, large-model companies may use strategies similar to cloud storage when handling large files. For instance, if User B uploads the same file that User A uploaded earlier, the service can reuse User A's parsing results instead of parsing it again. As the volume of such content grows, this can become a distinctive product advantage.
Optimizing Token Usage
How to Find a Balance Between Token Count and Model Performance?
The "Token strategy" here mainly refers to the tactics ordinary users employ in their Prompts to nudge the output closer to what they expect.
Finding the optimal balance between the number of Tokens and model performance is key to ensuring generative AI models are both efficient and accurate. Typically, this involves a process of trial and error and utilizing advanced model tuning techniques.
Example: In an automatic content generation system, balancing Token usage poses typical challenges. The system may need to extract key information from lengthy texts to create summaries. In this situation, choosing an appropriate number of Tokens to retain enough information while avoiding overly complex model architectures is crucial.
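One common tactic is to trim the conversation history to a fixed Token budget before each request, keeping the most recent messages. Below is a minimal sketch that counts Tokens with tiktoken; the budget, message format, and choice of what to drop are assumptions you would tune for your own application.

```python
# Minimal sketch: keep only as much recent history as fits the Token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages: list[dict], max_tokens: int = 3000) -> list[dict]:
    kept, total = [], 0
    for msg in reversed(messages):                  # walk from newest to oldest
        cost = len(enc.encode(msg["content"]))
        if total + cost > max_tokens:
            break                                   # older messages no longer fit
        kept.append(msg)
        total += cost
    return list(reversed(kept))                     # restore chronological order

history = [
    {"role": "user", "content": "Summarize last quarter's sales report."},
    {"role": "assistant", "content": "Revenue grew 12%, driven mainly by the APAC region..."},
    {"role": "user", "content": "Now draft an email to the regional managers."},
]
print(trim_history(history, max_tokens=3000))
```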
The Relationship Between Tokens and Context Windows and Its Impact on Text Generation Quality
In generative AI, the configuration of Tokens and context windows directly influences the coherence and logic of the generated text. The larger the context window, the more historical information the model can consider when generating text, enabling the creation of more coherent and natural text.
Example: Suppose we use an AI model to create articles for a technical blog. If the context window is set too small, the model might struggle to effectively link various sections of the article, resulting in content that feels disjointed. By optimizing Token usage and adjusting the context window size, we can significantly enhance the quality and readability of the article.
Next, we will address the topic we initially mentioned: for application systems, we desire good user experiences while also considering costs.
The Commercial Application of Tokens and Billing Models
First, let’s look at a table detailing the current billing situation for large models.
src: https://yourgpt.ai/tools/openai-and-other-llm-api-pricing-calculator
Generally speaking, using large language models can be divided into web-based conversations and API calls. Using OpenAI on the web has generally standardized at $20 a month. However, API calls can vary widely and can be quite expensive.
It's a game of cat and mouse: even with ChatGPT Plus, there is still a cap on the number of messages within a three-hour window. Many people have tried to scrape the web interface instead of going through the API, and most of that open-source code has since been taken down.
Telecommunications billing was once based on call duration, which was also a highly profitable era, before monthly subscription plans emerged. The current Token billing mechanism bears some resemblance to that earlier, usage-based stage.
Token Billing Logic
Why Use Token Billing? Its Reasonableness and Commercial Model.
The Token billing model is very common in AI services, especially when using language models provided by firms like OpenAI. This billing model is based on the specific amount of service usage by users, charging based on the number of Tokens processed in each request.
Reasonableness:
The reasonableness of the Token billing model lies in its ability to accurately reflect users’ actual consumption of resources. Each Token represents a unit of information that the model needs to process; more Tokens mean greater computational resource consumption. Therefore, this billing method ensures users pay based on their actual usage while encouraging them to optimize their input to avoid unnecessary wastage.
Commercial Model:
From a business perspective, the Token billing model provides AI service providers with a flexible and fair pricing framework. It allows providers to set different pricing tiers based on system load and operating costs, attracting a diverse range of users from small developers to large enterprises.
Comparing Token Billing to Other Billing Methods (Such as Word Count, Character Count, Time)
Compared to other common billing models, Token billing has its unique advantages and limitations:
Word and Character Count Billing: These methods are straightforward and easy to understand. However, they often do not consider the complexity of processing and the actual computational resource utilized. For instance, processing a lengthy sentence with simple vocabulary may be easier than dealing with a technical term, yet the longer sentence might incur higher costs based solely on word count.
Time-Based Billing: Time-based billing models (such as charged by the minute or hour) are suitable for continuous services like streaming data processing or online learning. However, for short, request-based tasks, this model may lead to unfair or inaccurate billing.
```mermaid
graph TD
    A[Token Billing] -->|Reflects actual resource consumption| B[Equitable resource allocation]
    A -->|Optimizes input efficiency| C[Encourages input simplification]
    D[Word/Character Billing] -->|Straightforward| E[Easy to understand and budget]
    D -->|Does not consider complexity| F[May lead to inaccurate costs]
    G[Time-Based Billing] -->|Suitable for continuous services| H[Streaming data processing/online learning]
    G -->|Not suitable for short tasks| I[Potential for unfair billing]
```
Token billing provides a more nuanced measurement, and can fairly reflect users’ actual resource consumption.
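As a quick worked example of how two-way Token billing turns into a bill, the sketch below multiplies input and output Token counts by per-1K-Token prices. The prices are hypothetical placeholders; real prices differ by model and change often (see the pricing calculator linked above).

```python
# Estimate the cost of one API call from its Token counts (prices are hypothetical).
PRICE_PER_1K_INPUT = 0.01    # USD per 1K input Tokens, placeholder
PRICE_PER_1K_OUTPUT = 0.03   # USD per 1K output Tokens, placeholder

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    # Both the question (input) and the answer (output) are billed.
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(f"${estimate_cost(input_tokens=1_200, output_tokens=800):.4f}")  # -> $0.0360
```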
The costs for large model companies can roughly include:
- Research and development costs (labor + experimentation)
- Training expenses (computing resources + data processing)
- Deployment costs (infrastructure + storage)
- Maintenance and upgrade costs
- Ethical compliance costs (data security, data compliance)
Whether Token fees alone can realistically cover all of these costs is something only industry insiders can judge; for now, it may simply be the most workable way to charge.
Actual Impact of Token Billing
Different Billing Methods’ Impacts on Users and Developers.
The Token billing model means that users need to manage their API requests more carefully to control costs. Developers have to design efficient queries to reduce redundant Token usage, maximizing the value of each request. This billing method encourages developers to optimize input data and processing workflows but may also increase the complexity of development and the initial optimization effort.
For providers, Token billing can help balance server load, forecast income, and optimize resource allocation. It can also act as a feedback mechanism for product optimization and adjustments to pricing strategies, aiding providers in better meeting market demands.
How to Optimize Token Usage to Reduce Costs?
Optimizing Token usage is key to controlling costs. This can be achieved through the following methods:
- Streamline input data: Before sending a request, remove unnecessary text and redundant data, keeping only the key information.
- Design efficient queries: Formulate well-conceived queries to avoid overly complex or deep chains of requests.
- Utilize caching strategies: Use cached results for common or repetitive requests to reduce backend service queries.
- Monitoring and analysis: Regularly analyze Token consumption data, identify optimization points, and adjust strategies to minimize waste.
Through these methods, we can not only reduce costs but also improve the system’s response speed and enhance user satisfaction, thereby gaining an advantage in a competitive market.
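To illustrate the caching idea from the list above, here is a minimal sketch that keys responses by a hash of the prompt so that identical requests never spend Tokens twice. call_llm is a hypothetical stand-in for whatever client function your system actually uses.

```python
# Minimal caching sketch: identical prompts are answered from the cache, spending no Tokens.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)   # Tokens are only consumed on a cache miss
    return _cache[key]

# Usage with a stand-in for the real API call:
fake_llm = lambda p: f"(answer to: {p})"
print(cached_completion("What are Tokens?", fake_llm))
print(cached_completion("What are Tokens?", fake_llm))   # served from the cache, no new Tokens
```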
The Commercial Value of Tokens and Application Cases
Practical Applications of Tokens in Business
In enterprise operations, the application of Tokenization technology can significantly enhance data processing efficiency and decision quality. For non-technical executives, understanding Token applications can aid in better evaluating technology investments and promoting business innovation.
```mermaid
graph LR
    A[Technical Perspective: The Role of Tokens in Natural Language Processing]
    B[Business Perspective: The Role of Tokens in Enhancing Enterprise Value]
    A --> A1[Information Extraction\nQuickly derive key insights]
    A --> A2[Sentiment Analysis\nIdentify customer emotions]
    A --> A3[Automatic Summarization\nGenerate document summaries]
    B --> B1[Improve Customer Interaction\n24/7 customer service]
    B --> B2[Market Analysis\nAcquire trend information]
    B --> B3[Personalized Recommendations\nIncrease transaction volumes]
    style A fill:#8ecae6,stroke:#333,stroke-width:4px
    style B fill:#90be6d,stroke:#333,stroke-width:4px
    style A1 fill:#219ebc,stroke:#333,stroke-width:2px
    style A2 fill:#219ebc,stroke:#333,stroke-width:2px
    style A3 fill:#219ebc,stroke:#333,stroke-width:2px
    style B1 fill:#ffb703,stroke:#333,stroke-width:2px
    style B2 fill:#ffb703,stroke:#333,stroke-width:2px
    style B3 fill:#ffb703,stroke:#333,stroke-width:2px
```
Technical Perspective: The Role of Tokens in Natural Language Processing
Tokenization is the technical process of breaking down complex text data into manageable units, enabling AI systems to perform effective data analysis and processing. This process is especially critical in Natural Language Processing (NLP), as it allows machines to “understand” human language and perform tasks such as:
- Information Extraction: Tokenization facilitates the quick extraction of key information from extensive text, like pinpointing relevant clauses in legal documents.
- Sentiment Analysis: By analyzing the Tokens in customer feedback, enterprises can gauge customer emotional tendencies, enabling them to adjust products or services.
- Automatic Summarization: Tokenization technology can automatically generate document summaries, enhancing the efficiency of knowledge workers.
Business Perspective: The Role of Tokens in Enhancing Enterprise Value
From a business standpoint, Tokens not only improve operational efficiency, but they can also unlock new business models and revenue streams:
- Improved Customer Interaction: Utilizing Tokenized chatbots enables 24/7 customer service, enhancing customer satisfaction while reducing service costs.
- Market Analysis: Tokenized processing can help businesses quickly extract trend information from market reports, guiding strategic decisions.
- Personalized Recommendations: On e-commerce platforms, Tokenization can analyze users’ purchase histories and browsing behaviors to deliver personalized product recommendations, boosting sales.
Case Analysis
Customer Service Chatbots
A typical application is customer service chatbots. For instance, a large telecommunications company deployed a Token-based customer service chatbot to manage user inquiries, such as billing issues or service disruptions. The chatbot quickly provides accurate answers by analyzing the user’s questions (which have been Tokenized) and routing queries to the appropriate service departments when necessary.
Content Recommendation Systems
In the media and entertainment industry, content recommendation systems utilize Tokenization technology to analyze users’ viewing or reading habits, thereby suggesting new movies, books, or articles that users may find interesting. For example, Netflix’s recommendation system analyzes the description Tokens of previously watched programs to predict which other shows a user might enjoy.
The Commercial Value of Tokens and Future Applications
In enterprise applications, understanding and effectively utilizing Tokens is key to the success of AI projects. Grasping the commercial value and challenges of Tokens is especially important for strategic planning and navigating technological innovation.
Commercial Applications of Tokens
Technical Perspective: The Role of Tokens
Tokens in Natural Language Processing (NLP) enable the effective processing of textual information by AI systems. In brief, Tokenization is the process of breaking down large sections of text into smaller units that can be processed, providing the groundwork for machine learning models to operate.
- Data Processing: When handling customer inquiries, analyzing market feedback, or managing large volumes of documents, Tokenization makes complex text data easier to manage and analyze.
- Efficiency Enhancement: By using Tokenization, AI models can quickly identify key information, thus speeding up decision-making processes and improving business response times.
Business Perspective: The Economic Value of Tokens
From a commercial perspective, Tokens are not just components of technical implementation; they are directly tied to enhancing operational efficiency, improving customer experience, and enabling new business models.
- Optimizing Customer Service: Tokenization makes customer service automation feasible, allowing for quick and accurate responses to customer requests, significantly boosting customer satisfaction and brand loyalty.
- Personalized Marketing: Leveraging Token analysis of user behavior and preferences enables organizations to offer highly personalized marketing content, thus increasing conversion rates in sales.
Future Outlook and Challenges of Tokens
Future Development Trends
As AI technology continues to advance, the application of Tokens is expected to become more intelligent and diverse:
- Cross-modal Applications: Token technology will extend beyond text processing to encompass the analysis of multimedia content like videos and audio, supporting a broader range of application scenarios.
- Intelligent Optimization: The methods for generating and processing Tokens will evolve to become more intelligent, such as using AI to automatically adjust Token size and quantity according to various business requirements.
Business Challenges and Opportunities
- Data Security and Privacy: Ensuring data security and user privacy in Tokenization processes will be a major challenge, particularly when dealing with sensitive information.
- Technology Integration: Seamlessly integrating Token technology with existing IT systems and business processes will be key to achieving technological transformation.
- Fairness and Explainability: Ensuring that AI decisions derived from Tokenization are fair and transparent will enhance trust among all stakeholders.
Conclusion
While writing this article, Lin Miao offered pointers on current directions (thank you): https://arxiv.org/abs/2104.12369. Judging from the practice of Huawei's Pangu model, Token development in the Chinese domain may lean toward more engineering-oriented approaches; whether it does remains to be seen.
Before writing this article, my understanding of Tokens was limited to the vague notion that one Chinese character equals one Token, and I often conflated Tokens with vectorization as well. In fact, before vectorization there is still the work of Tokenization to do. To better embrace AI and adapt to change, start by asking how the data already sitting in existing enterprise application systems can be put to better use. It can start right here!
Reference Links
- https://platform.openai.com/tokenizer
- https://arxiv.org/abs/2104.12369
- https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
- https://www.coindesk.com/learn/a-beginners-guide-to-ai-tokens/
- https://ogre51.medium.com/context-window-of-language-models-a530ffa49989
- https://cobusgreyling.medium.com/rag-llm-context-size-6728a2f44beb
- https://www.humanfirst.ai/blog/how-does-large-language-models-use-long-contexts
- https://slator.com/10-large-language-models-that-matter-to-the-language-industry/
- https://yourgpt.ai/blog/general/long-context-window-vs-rag
- https://github.com/datawhalechina/hugging-llm/blob/main/content/chapter1/ChatGPT%E5%9F%BA%E7%A1%80%E7%A7%91%E6%99%AE%E2%80%94%E2%80%94%E7%9F%A5%E5%85%B6%E4%B8%80%E7%82%B9%E6%89%80%E4%BB%A5%E7%84%B6.md
- https://gpt-tokenizer.dev/