I got little help with this problem on the IBM forum, so maybe somebody here can help me.
I want to implement a barrier between the SPEs, with no PPE intervention.
A barrier is a synchronization object which can be used to create meeting points between different processors. A processor which has called the barrier remains blocked until all other processors have also called the barrier.
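To make the definition concrete, here is a minimal host-side sketch using a POSIX thread barrier instead of SPEs. It is only an illustration of the semantics, not part of my test program; `WORKERS`, `worker` and `barrier_demo` are names made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define WORKERS 4                 /* hypothetical thread count for the demo */

static pthread_barrier_t barrier;
static atomic_int arrived;        /* threads that have reached the barrier */
static atomic_int released_early; /* threads released before all had arrived */

static void *worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&arrived, 1);   /* announce arrival */
    pthread_barrier_wait(&barrier);  /* blocks until WORKERS threads have called it */
    /* Past the barrier, every thread must already have arrived. */
    if (atomic_load(&arrived) != WORKERS)
        atomic_fetch_add(&released_early, 1);
    return NULL;
}

/* Returns the number of threads that left the barrier too early (0 is expected). */
int barrier_demo(void) {
    pthread_t t[WORKERS];
    int rc;

    atomic_store(&arrived, 0);
    atomic_store(&released_early, 0);
    rc = pthread_barrier_init(&barrier, NULL, WORKERS);
    assert(rc == 0);
    for (int i = 0; i < WORKERS; i++) {
        rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    (void)rc;
    return atomic_load(&released_early);
}
```

Calling `barrier_demo()` from a `main` should return 0, since no thread can get past `pthread_barrier_wait` before all of them have reached it.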
My approach uses signals in ORed mode. My code looks okay to me, but unfortunately it hangs on my PS3. The program runs as expected on the Cell simulator (SDK 3.0).
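The idea can be sketched on the host as a purely sequential model, assuming that a write in OR mode accumulates bits in the destination register and that a read returns the pending bits and clears them. The names `sig_reg`, `send_signal`, `read_signal`, `simulate_barrier` and `expected_mask` are hypothetical, and the model ignores timing and the fact that a real signal read blocks while nothing is pending.

```c
#include <assert.h>
#include <stdint.h>

#define N_SPE 6  /* same count as in the test program */

/* Hypothetical model: one signal-notification register per SPE. */
static uint32_t sig_reg[N_SPE];

/* A signal write in OR mode: incoming bits accumulate in the register. */
static void send_signal(int dest, uint32_t bits) {
    sig_reg[dest] |= bits;
}

/* A signal read: returns the pending bits and clears the register. */
static uint32_t read_signal(int spe) {
    uint32_t v = sig_reg[spe];
    sig_reg[spe] = 0;
    return v;
}

/* Mask with one bit set per participating SPE. */
uint32_t expected_mask(int count) {
    return (1u << count) - 1u;
}

/* Runs one barrier round and returns the mask accumulated by SPE `rank`. */
uint32_t simulate_barrier(int rank) {
    uint32_t received[N_SPE];

    assert(rank >= 0 && rank < N_SPE);
    for (int i = 0; i < N_SPE; i++) {
        sig_reg[i] = 0;
        received[i] = 1u << i;       /* each SPE starts with its own bit */
    }
    /* Every SPE posts its own bit to all the others. */
    for (int src = 0; src < N_SPE; src++)
        for (int dst = 0; dst < N_SPE; dst++)
            if (dst != src)
                send_signal(dst, 1u << src);
    /* Every SPE folds the pending bits into its running mask. */
    for (int i = 0; i < N_SPE; i++)
        received[i] |= read_signal(i);
    return received[rank];
}
```

With six SPEs the expected mask is 0x3F, and in this model every rank reaches it after a single read because all the signals are posted before anyone reads.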
The test code follows. First, a global definition (shared by the PPE and the SPE code):
typedef struct {
    int spe_rank;
    int spe_count;
    int __dummy[2];
    uint64_t sig1[6];
} spe_data;
Then, the PPE program (includes removed):
#define SPE_COUNT 6

extern spe_program_handle_t barrier_sig_spu_handle;

typedef struct {
    spe_data d __attribute((aligned(16)));
    spe_context_ptr_t context;
    pthread_t thread;
} spe_id;

void *main_thr(void *ptr) {
    spe_id *id = (spe_id*)ptr;
    unsigned int entry_point = SPE_DEFAULT_ENTRY;
    int retval;

    do
        retval = spe_context_run(id->context, &entry_point, 0, &id->d, NULL, NULL);
    while (retval > 0); /* Run until exit or error */
    if (retval)
        perror("An error occurred running the SPE program");
    return NULL;
}

int main(int argc, char *argv[])
{
    spe_id spe[SPE_COUNT] __attribute((aligned(16)));
    int i, j;
    uint32_t dummy;
    uint64_t sig1_ea;

    for (i=0; i<SPE_COUNT; i++) {
        spe[i].d.spe_rank = i;
        spe[i].d.spe_count = SPE_COUNT;
        spe[i].context = spe_context_create(SPE_EVENTS_ENABLE
            | SPE_CFG_SIGNOTIFY1_OR | SPE_MAP_PS, NULL);
        spe_program_load(spe[i].context, &barrier_sig_spu_handle);
        sig1_ea = (unsigned int)spe_ps_area_get(spe[i].context,
            SPE_SIG_NOTIFY_1_AREA) + 12;
        for (j=0; j<SPE_COUNT; j++)
            spe[j].d.sig1[i] = sig1_ea;
    }
    for (i=0; i<SPE_COUNT; i++) {
        pthread_create(&spe[i].thread, NULL, main_thr, &spe[i]);
    }
    for (i=0; i<SPE_COUNT; i++) {
        while (!spe_out_mbox_status(spe[i].context));
        spe_out_mbox_read(spe[i].context, &dummy, 1);
    }
    dummy = 0;
    for (i=0; i<SPE_COUNT; i++)
        spe_in_mbox_write(spe[i].context, &dummy, 1, SPE_MBOX_ALL_BLOCKING);
    for (i=0; i<SPE_COUNT; i++) {
        pthread_join(spe[i].thread, NULL);
    }
    return 0;
}
And finally, the SPE code:
spe_data d __attribute((aligned(16)));

void spe_barrier(void) {
    volatile vec_uint4 signal;
    int i;
    void *ls = ((char*)&signal)+12;
    uint32_t expected = (1<<d.spe_count)-1;
    uint32_t received = 1<<d.spe_rank;

    signal = spu_promote(received, 3);
    for (i=0; i<d.spe_count; i++)
        if (i != d.spe_rank) {
            mfc_sndsig(ls, d.sig1[i], 4, 0, 0);
        }
    while (received != expected) {
        received |= spu_read_signal1();
        /*printf("spe%d received = %d\n", d.spe_rank, received);*/
    }
}

int main(unsigned long long spe_id, unsigned long long pdata)
{
    mfc_get(&d, pdata, sizeof(d), 0, 0, 0);
    mfc_write_tag_mask(1<<0);
    spu_mfcstat(MFC_TAG_UPDATE_ALL);
    spu_write_out_mbox(0);
    spu_read_in_mbox();
    printf("spe%d/%d: ready\n", d.spe_rank, d.spe_count);
    spe_barrier();
    printf("spe%d passed the barrier\n", d.spe_rank);
    return 0;
}
On the PS3, there are several cases:
- If the SPE code is compiled with -O2, only one SPE passes the barrier; all the others hang.
- If the SPE code is compiled without optimization, one, two or three SPEs pass the barrier.
- If I comment out the printf in the spe_barrier function, the code works whatever the optimization level.
On the Cell simulator, every case works.
This dependence on the optimization level and on the printf makes me suspect a synchronization problem, but I lack the experience to track down its source.
I really wonder how an SPE can fail to receive all the signals that have been sent to it.
What do you think? Could it be a Linux kernel problem?
Thank you in advance,
François