SAMPLING DEVICE
20220121490 · 2022-04-21
Assignee
Inventors
Cpc classification
G06F9/4881
PHYSICS
International classification
Abstract
A sampling device capable of balancing workload among units in data parallel processing is provided. The sampling device includes a priority assignment method S201 that assigns higher priority to units with more remaining workload; a priority-aware scheduling method S202-S204 that, when a conflict occurs, enables the units with higher priority to perform the sampling and model update; a modified priority-aware scheduling method, shown in FIG. 10, that reduces scheduling overhead by re-assigning priority only every several iterations; and a further modified priority-aware scheduling method, shown in FIG. 12, that explores different priority re-assignment frequencies and stores the sorted sequences in memory.
Claims
1. A sampling device comprising: a plurality of processing units, each including: a token fetcher configured to fetch word tokens from a dataset, a token sampler configured to sample topics for tokens, and a model updater configured to update an LDA model; a priority assignment device configured to determine priority levels of the processing units; and a priority-aware scheduler configured to schedule the processing units based on the priority level of each processing unit.
2. A sampling device according to claim 1, wherein the priority assignment device includes: a sorter that sorts processing units based on their remaining workload, a workload array configured to store the amount of remaining workload in each subset, and a unit ID array configured to store the sorted unit IDs.
3. A priority assignment device according to claim 2, further comprising: a sorted sequence array configured to record sequences of sorted processing units.
4. A sampling device according to claim 1, wherein the priority-aware scheduler includes: a word ID array configured to record a word ID of a next token of each unit, an enable/disable array configured to store a sampling-enable indicator of each unit, and an indicator updater configured to update the sampling-enable indicators.
5. A priority-aware scheduler according to claim 4, wherein the indicator updater contains a unit ID array that records a unit ID for each word and a comparator that checks which units acquire a lock for words.
6. A parallelized Latent Dirichlet Allocation method comprising the steps of: fetching tokens from a dataset in a parallel manner among multiple parallel processing units; sampling topics for tokens fetched in each of the processing units; updating an LDA model, in a parallel manner, based on the sampling of topics for tokens by each of the processing units; determining a priority level of the processing units; and scheduling processing by the processing units based on the determined priority level of each processing unit.
7. A non-transitory computer readable storage medium containing instructions to cause a computer to execute: fetching tokens from a dataset in a parallel manner among multiple parallel processing units; sampling topics for tokens fetched in each of the processing units; updating an LDA model, in a parallel manner, based on the sampling of topics for tokens by each of the processing units; determining a priority level of the processing units; and scheduling processing by the processing units based on the determined priority level of each processing unit.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
[0029] Hereinafter, embodiments of the present invention will be described in detail with reference to accompanying drawings.
Embodiment 1
[0030] First, a first example embodiment of the invention is elaborated below referring to accompanying drawings.
[0032] Training data 102 is a collection of documents, and each document contains multiple word tokens. Each token is mapped to a word ID, and multiple tokens may belong to the same word ID. Training data 102 records the topic distribution of each document, as well as the word ID and assigned topic ID of each word token. In the parallelized LDA, the dataset is partitioned into P subsets, each of which contains multiple documents. The partitioning is performed so that the number of tokens in each subset is as close to N/P as possible, where N is the total number of tokens in the dataset and P is the number of available processing units in the sampling device. Each data subset is processed by one unit in parallel with the others.
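The partitioning described above can be sketched as follows. This is a minimal illustration, not the claimed device: the greedy largest-first heuristic is an assumption, since the specification only requires each subset to hold close to N/P tokens.

```python
# Hypothetical sketch: partition documents into P subsets with
# near-equal token counts via greedy largest-first assignment.
# The heuristic is an assumption; the specification only states
# that each subset should hold close to N/P tokens.

def partition(documents, P):
    """documents: list of token lists; returns P subsets of documents."""
    subsets = [[] for _ in range(P)]
    counts = [0] * P                      # tokens assigned to each subset
    # Assign the largest documents first, always to the least-loaded subset.
    for doc in sorted(documents, key=len, reverse=True):
        i = counts.index(min(counts))
        subsets[i].append(doc)
        counts[i] += len(doc)
    return subsets
```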
[0033] The LDA model 103 stores the topic distribution of each word. At each training epoch, the sampling device 101 re-assigns the topic to each token based on current distributions, and then updates the model according to the new topic ID.
[0034] The sampling device 101 is composed of a priority assignment device 107, a priority-aware scheduler 108 and multiple processing units. As shown in the figure, each processing unit includes a token fetcher 104, a sampler 105, and a model updater 106.
[0035] The priority-aware scheduler 108 is designed to resolve model update conflicts and balance the workload among units at runtime. If two or more units sample tokens with the same word ID, they will update the topic distribution of the same word, which causes a memory access conflict. In this case, the scheduler allows only the unit with the highest priority to do the sampling.
[0037] In step S101, each fetcher 104 fetches the next token in its data subset, including the word ID and current topic ID of the token.
[0038] In step S102, the scheduler 108 schedules the P units based on their priority levels. This step ensures workload balancing and avoids model update conflicts. The output of step S102 is a set of sampling-enable indicators that determines which units are enabled to do the sampling at the current iteration. The detailed operations in this step are presented later.
[0039] In step S103, token samplers 105 of enabled units sample the new topic for their next token in parallel.
[0040] In step S104, model updaters 106 of enabled units update the topic distribution model according to the new topic assignment in parallel.
[0041] After finishing one iteration, the sampling device checks whether or not all tokens in the dataset have been sampled (step S105). If the device has processed all the data, the epoch ends. Otherwise, the device repeats steps S101-S105.
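The per-epoch loop of steps S101-S105 can be sketched as below. All names are assumptions, and the sampler, updater, and scheduler bodies are passed in as stubs; this only illustrates the control flow, not the claimed hardware.

```python
# Minimal sketch of one epoch (steps S101-S105), assuming per-unit token
# queues and pluggable schedule/sample/update callables (all hypothetical).

def run_epoch(subsets, schedule, sample, update):
    """subsets: per-unit token lists; schedule returns enable flags."""
    cursors = [0] * len(subsets)          # next-token index per unit
    while any(c < len(s) for c, s in zip(cursors, subsets)):   # S105
        # S101: each unit fetches its next token (None if exhausted).
        fetched = [s[c] if c < len(s) else None
                   for c, s in zip(cursors, subsets)]
        enabled = schedule(fetched, cursors)                   # S102
        for u, ok in enumerate(enabled):
            if ok and fetched[u] is not None:
                topic = sample(fetched[u])                     # S103
                update(fetched[u], topic)                      # S104
                cursors[u] += 1
```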
[0043] The size of each of arrays 201-204 is P, the number of processing units in the device. The workload array 201 stores the remaining workload in each data subset, specifically the number of unprocessed tokens. The unit ID array 202 stores the processing unit IDs in ascending order of priority level. The word ID array 203 stores the word IDs of the P tokens fetched by the P units at the current iteration. The enable/disable array 204 stores the sampling-enable indicators of the units.
[0044] The indicator updater 206 contains an indicator unit ID array 207 and a comparator 208. The indicator unit ID array 207 has W entries, where W is the number of unique words that appear in the training data. Each entry records the ID of the unit that can sample for the corresponding word at the current iteration, and thus acts as a "lock" on that word. At each iteration, at most one unit can acquire the "lock" of a word, so as to ensure there is no model update conflict.
[0046] In step S201, the sorter 205 takes the workload array 201 as sorting keys and sorts the units in ascending order of priority level. To balance the workload among units, higher priority levels are assigned to units with more unprocessed tokens. The sorted sequence of unit IDs is stored in the unit ID array 202.
[0047] In step S202, the updater 206 updates the unit IDs for words. Initially, all entries of the indicator unit ID array 207 are initialized to −1, which means all words are "unlocked". The updater 206 writes unit IDs to the word entries corresponding to the fetched tokens. The ID of the unit with the lowest priority level is written to the array first, and units with higher priorities can overwrite entries written by prior units that fetched the same word. In this way, it is guaranteed that, when conflicts exist, units with higher priorities do the sampling and those with lower priorities wait.
[0048] In step S203, the comparator 208 reads values from the indicator unit ID array 207 and checks which units have acquired the lock.
[0049] In step S204, the updater 206 sets the sampling-enable indicators for all units according to the comparison results.
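Steps S201-S204 can be sketched together as one function. This is a software illustration under stated assumptions, not the claimed circuit: sorting by remaining workload stands in for the sorter 205, a list stands in for the indicator unit ID array 207, and writing unit IDs in ascending priority order reproduces the overwrite rule of step S202.

```python
# Sketch of steps S201-S204: sort units by remaining workload (S201),
# write unit IDs to a per-word "lock" array in ascending priority order
# so higher-priority units overwrite lower-priority ones (S202), then
# enable exactly the units that still hold their word's lock (S203-S204).

def schedule(word_ids, workloads, W):
    """word_ids[u]: word fetched by unit u (None if exhausted);
    workloads[u]: unprocessed tokens of unit u; returns enable flags."""
    P = len(word_ids)
    # S201: ascending priority order == ascending remaining workload.
    order = sorted(range(P), key=lambda u: workloads[u])
    lock = [-1] * W                       # -1 means the word is unlocked
    # S202: lowest priority writes first; later (higher) writes overwrite.
    for u in order:
        if word_ids[u] is not None:
            lock[word_ids[u]] = u
    # S203-S204: a unit may sample only if it holds its word's lock.
    return [word_ids[u] is not None and lock[word_ids[u]] == u
            for u in range(P)]
```

With four units fetching words [5, 3, 0, 5] and workloads [2, 9, 4, 7], units 1 and 4 conflict on word 5; unit 4 has more unprocessed tokens and wins, so unit 1 is disabled, mirroring the example in the description.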
[0051] As shown in
[0054] First, the unit IDs that acquired the locks are loaded into a first temporary array 301, which is then compared with a second temporary array 302. The comparator 208 performs an element-wise comparison of the first temporary array 301 and the second temporary array 302. If the elements are equal, the corresponding indicator is set to true; otherwise, it is set to false. In the presented example, unit 1 cannot sample at this iteration because it conflicts with unit 4, which has more unprocessed tokens and thus a higher priority level than unit 1.
[0055] Although four units and eight words are presented in the example, the number of units and words is not limited when applying the present invention.
[0056] As the present example embodiment schedules units according to their amount of remaining workload, it is possible to balance the workload among units at runtime.
Embodiment 2
[0057] Next, a second example embodiment of the present invention is described referring to accompanying drawings.
[0058] Referring to
[0059] The detailed structure of the priority assignment device 407 is shown in
[0061] In step S301, the iteration count is reset to 0.
[0062] In step S302, the priority assignment device 407 checks whether the iteration count is divisible by N.
[0063] If the output of S302 is true, the sorter 205 sorts units by the amount of remaining workload in step S303, and then saves the sorted sequence in the priority sequence array 409 in step S305.
[0064] Otherwise, the updater 206 directly loads the priority sequence from the priority sequence array 409.
[0065] In step S306, the updater 206 updates the indicators.
[0066] In step S307, units enabled by the scheduler perform sampling for tokens, and update related models in parallel.
[0067] In step S308, the device checks whether all tokens have been sampled or not. If there are unprocessed tokens, the iteration count is increased by one and the device repeats from step S302.
[0068] Since the present example embodiment is configured in such a manner that priority levels of units are re-assigned every N iterations instead of at each iteration, the present example embodiment is capable of reducing the overhead of scheduling.
Embodiment 3
[0069] Next, a third example embodiment of the present invention is described with reference to the accompanying drawings.
[0071] The modified priority assignment device 507 contains a sorter 205, a workload array 201, a unit ID array 202, and a sorted sequence array 509. The components 201, 202, and 205 are the same as those in the first embodiment. The sorted sequence array 509 is used to store sequences of sorted arrays that are needed during one learning epoch.
[0072] For a fixed sorting frequency N, the scheduling of units remains unchanged from one epoch to another, so there is no need to sort units at each epoch. Instead, the sequences of sorted units can be stored in the sorted sequence array 509 and reused across multiple epochs.
[0073] When applying the present embodiment, users can either fix N at the beginning, or explore different values during the initial learning epochs and then fix it to the frequency that leads to the smallest number of iterations.
[0075] In step S401, the sampling device checks whether the epoch count is smaller than m. During the first m epochs, N is set to one of the candidate values, Ni, in step S402, where i is the epoch count.
[0076] After the exploration phase, N is set to the frequency that provides the best performance, Nopt, in step S403.
[0077] Steps S404-S406 are the same as S301-S308 shown in
[0078] In step S407, after finishing sampling for all tokens, the device checks the iteration count of the current epoch. During the first m epochs, if the iteration count of the current epoch is smaller than the minimum, the minimum iteration count is updated and the optimal frequency Nopt is set to the current frequency Ni in step S408.
[0079] As steps S401 and S403 indicate, the frequency of sorting units is fixed to Nopt after m learning epochs. At the (m+1)th epoch, the priority sequences of the units are recorded in the sorted sequence array 509 in step S410.
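The exploration phase can be sketched as a simple argmin over the candidate frequencies. The callable `run_epoch_count` is a hypothetical stand-in for running one full learning epoch with a given frequency and returning its iteration count; the selection logic mirrors steps S402, S407-S408, and S403.

```python
# Sketch of the exploration phase: each of the first m epochs uses one
# candidate frequency N_i (S402); the frequency yielding the fewest
# iterations is kept as N_opt (S407-S408) and used thereafter (S403).
# `run_epoch_count` is a hypothetical stand-in for one full epoch.

def explore_frequency(candidates, run_epoch_count):
    """candidates: list of N_i values; returns the best frequency N_opt."""
    best_iters = float('inf')
    n_opt = None
    for n_i in candidates:                 # epoch i uses candidate N_i
        iters = run_epoch_count(n_i)
        if iters < best_iters:             # S407-S408: keep the minimum
            best_iters, n_opt = iters, n_i
    return n_opt                           # S403: fixed after m epochs
```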
[0081] To avoid the sorted sequence array 509 occupying too much memory space, the user can limit the times of priority assignment. In the example above, if the priority sequence of units is allowed to change at most three times at each epoch, the sorted sequence array 509 becomes
[0082] Next, the effect of the present example embodiment is described. The present embodiment determines the frequency of priority re-assignment by exploring different values, so it is possible to reduce the number of iterations needed to finish processing all tokens, thus speeding up the execution.
[0083] In addition, the priority sequences are stored in memory and reused during the learning process, so that the scheduling overhead is further reduced.
[0084] It should be noted that the above described arrays may be implemented in one or more memory storage devices such as RAM.
[0085] Furthermore, example embodiments in accordance with the present example embodiments may be implemented as an apparatus, a device, a method, or a computer program product. Accordingly, the present example embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, the present example embodiments may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. For example, all of the functions performed by the individual hardware units of the example embodiments may be performed in software on a general purpose computer having a basic structure as that shown in
[0086] Additionally, any examples or illustrations given herein are not to be regarded in any way as restrictions on, limits to, or express definitions of any term or terms with which they are utilized. Instead, these examples or illustrations are to be regarded as being described with respect to one particular embodiment and as being illustrative only. Those of ordinary skill in the art will appreciate that any term or terms with which these examples or illustrations are utilized will encompass other embodiments which may or may not be given therewith or elsewhere in the specification and all such embodiments are intended to be included within the scope of that term or terms. Language designating such non-limiting examples and illustrations includes, but is not limited to: “for example,” “for instance,” “e.g.,” and “in one embodiment.”
REFERENCE SIGNS LIST
[0087] 1 Processing Unit [0088] 2 Processing Unit [0089] 3 Processing Unit [0090] 4 Processing Unit [0091] 101 Sampling Device [0092] 102 Training Data [0093] 103 Shared Model [0094] 104 Token Fetcher [0095] 105 Token Sampler [0096] 106 Model Updater [0097] 107 Priority Assignment Device [0098] 108 Priority-Aware Scheduler [0099] 201 Workload Array [0100] 202 Unit ID Array [0101] 203 Word ID Array [0102] 204 Enable/Disable Array [0103] 205 Sorter [0104] 206 Indicator Updater [0105] 207 Indicator Unit ID Array [0106] 208 Comparator [0107] 301 First Temporary Array [0108] 302 Second Temporary Array [0109] 401 Sampling Device [0110] 407 Priority Assignment Device [0111] 409 Priority Sequence Array [0112] 507 Modified Priority Assignment Device [0113] 509 Sorted Sequence Array