Hi. I need to process some HTML code on Linux, but I have run into diacritics that I can't deal with. How can I simply strip the diacritics from characters? E.g. convert the word "čeština" to "cestina".
I don't know how to do it in C, but in PHP some script did it roughly like this: you simply define the characters that carry diacritics, then their equivalents without them, and translate one set to the other.
Code:
$bez_diakritiky = StrTr($s_diakritikou, "áäčďéěëíňóöřšťúůüýžÁÄČĎÉĚËÍŇÓÖŘŠŤÚŮÜÝŽ", "aacdeeeinoorstuuuyzAACDEEEINOORSTUUUYZ");
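Since the question asks for C, here is a minimal sketch of the same idea: two parallel strings, one with the accented bytes and one with their plain equivalents. The function name `translate_bytes` is made up for this example. It works byte by byte, so it only makes sense for single-byte encodings such as cp1250 or ISO-8859-2; in UTF-8 each accented letter takes two bytes and this approach would not apply.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, modelled on the three-argument PHP strtr():
 * every byte of s that occurs in `from` is replaced, in place,
 * by the byte at the same position in `to`. */
void translate_bytes(char *s, const char *from, const char *to)
{
    unsigned char table[256];
    size_t i, n = strlen(from);

    if (strlen(to) < n)
        n = strlen(to);                 /* ignore the unpaired tail of the longer string */

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;    /* identity mapping by default */
    for (i = 0; i < n; i++)
        table[(unsigned char)from[i]] = (unsigned char)to[i];

    for (; *s != '\0'; s++)
        *s = (char)table[(unsigned char)*s];
}
```

For cp1250 input you would call it with the accented letters written as byte escapes, e.g. `translate_bytes(word, "\xE8\x9A", "cs")` for č and š, so that the result does not depend on the encoding of the source file.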
(In reply to the post by Gappa:) PHP is no use to me :]. My function looks like this:
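Code:
#include <stdio.h>

void DiakritikaPryc()
{
    /* FILE *, not int *: the handles are passed to fgetc() and fwrite() */
    FILE *fd = OtevriSoubor(Stranka, "r"), *fds = OtevriSoubor(StrankaBezDiakritiky, "w");
    unsigned char c, buffer[PocetBytu(fd)];
    long j=0;
    /* character map for accented characters - cp1250 sucks */
    /* unsigned char, because the codes 128-255 do not fit into a signed char */
    unsigned char mapa[128][2] = {
    // http://www.columbia.edu/kermit/cp1250.html
    //dec char dec col/row oct hex description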
    128, 'E', // [€] 128 08/00 200 80 EURO SYMBOL
    129, ' ', // [ ] 129 08/01 201 81 ( UNDEFINED )
    130, ' ', // [‚] 130 08/02 202 82 LOW 9 SINGLE QUOTE
    131, ' ', // [ ] 131 08/03 203 83 ( UNDEFINED )
    132, ' ', // [„] 132 08/04 204 84 LOW 9 DOUBLE QUOTE
    133, ' ', // […] 133 08/05 205 85 ELLIPSIS
    134, ' ', // [†] 134 08/06 206 86 DAGGER
    135, ' ', // [‡] 135 08/07 207 87 DOUBLE DAGGER
    136, ' ', // [ ] 136 08/08 210 88 ( UNDEFINED )
    137, ' ', // [‰] 137 08/09 211 89 PER MIL SIGN
    138, 'S', // [Š] 138 08/10 212 8A CAPITAL LETTER S WITH CARON
    139, '<', // [‹] 139 08/11 213 8B LEFT SINGLE QUOTE BRACKET
    140, 'S', // [Ś] 140 08/12 214 8C CAPITAL LETTER S WITH ACUTE ACCENT
    141, 'T', // [Ť] 141 08/13 215 8D CAPITAL LETTER T WITH CARON
    142, 'Z', // [Ž] 142 08/14 216 8E CAPITAL LETTER Z WITH CARON
    143, 'Z', // [Ź] 143 08/15 217 8F CAPITAL LETTER Z WITH ACUTE ACCENT
    144, ' ', // [ ] 144 09/00 220 90 ( UNDEFINED )
    145, ' ', // [‘] 145 09/01 221 91 HIGH 6 SINGLE QUOTE
    146, '\'', // [’] 146 09/02 222 92 HIGH 9 SINGLE QUOTE
    147, '"', // [“] 147 09/03 223 93 HIGH 6 DOUBLE QUOTE
    148, '"', // [”] 148 09/04 224 94 HIGH 9 DOUBLE QUOTE
    149, ' ', // [•] 149 09/05 225 95 LARGE CENTERED DOT
    150, ' ', // [–] 150 09/06 226 96 EN DASH
    151, ' ', // [—] 151 09/07 227 97 EM DASH
    152, ' ', // [ ] 152 09/08 230 98 ( UNDEFINED )
    153, ' ', // [™] 153 09/09 231 99 TRADEMARK SIGN
    154, 's', // [š] 154 09/10 232 9A SMALL LETTER S WITH CARON
    155, '>', // [›] 155 09/11 233 9B RIGHT SINGLE QUOTE BRACKET
    156, 's', // [ś] 156 09/12 234 9C SMALL LETTER S WITH ACUTE ACCENT
    157, 't', // [ť] 157 09/13 235 9D SMALL LETTER T WITH CARON
    158, 'z', // [ž] 158 09/14 236 9E SMALL LETTER Z WITH CARON
    159, 'z', // [ź] 159 09/15 237 9F SMALL LETTER Z WITH ACUTE ACCENT
    160, ' ', // [ ] 160 10/00 240 A0 NO-BREAK SPACE
    161, ' ', // [ˇ] 161 10/01 241 A1 CARON
    162, ' ', // [˘] 162 10/02 242 A2 BREVE
    163, ' ', // [Ł] 163 10/03 243 A3 CAPITAL LETTER L WITH STROKE
    164, ' ', // [¤] 164 10/04 244 A4 CURRENCY SIGN
    165, 'A', // [Ą] 165 10/05 245 A5 CAPITAL LETTER A WITH OGONEK
    166, ' ', // [¦] 166 10/06 246 A6 BROKEN BAR
    167, ' ', // [§] 167 10/07 247 A7 PARAGRAPH SIGN
    168, ' ', // [¨] 168 10/08 250 A8 DIAERESIS
    169, ' ', // [©] 169 10/09 251 A9 COPYRIGHT SIGN
    170, 'S', // [Ş] 170 10/10 252 AA CAPITAL LETTER S WITH CEDILLA
    171, ' ', // [«] 171 10/11 253 AB LEFT ANGLE QUOTATION MARK
    172, ' ', // [¬] 172 10/12 254 AC NOT SIGN
    173, ' ', // [ ] 173 10/13 255 AD SOFT HYPHEN
    174, ' ', // [®] 174 10/14 256 AE REGISTERED TRADE MARK SIGN
    175, 'Z', // [Ż] 175 10/15 257 AF CAPITAL LETTER Z WITH DOT ABOVE
    176, ' ', // [°] 176 11/00 260 B0 DEGREE SIGN, RING ABOVE
    177, ' ', // [±] 177 11/01 261 B1 PLUS-MINUS SIGN
    178, ' ', // [˛] 178 11/02 262 B2 OGONEK
    179, ' ', // [ł] 179 11/03 263 B3 SMALL LETTER L WITH STROKE
    180, ' ', // [´] 180 11/04 264 B4 ACUTE ACCENT
    181, ' ', // [µ] 181 11/05 265 B5 MICRO SIGN
    182, ' ', // [¶] 182 11/06 266 B6 PILCROW SIGN
    183, ' ', // [·] 183 11/07 267 B7 MIDDLE DOT
    184, ' ', // [¸] 184 11/08 270 B8 CEDILLA
    185, 'a', // [ą] 185 11/09 271 B9 SMALL LETTER A WITH OGONEK
    186, 's', // [ş] 186 11/10 272 BA SMALL LETTER S WITH CEDILLA
    187, ' ', // [»] 187 11/11 273 BB RIGHT ANGLE QUOTATION MARK
    188, 'L', // [Ľ] 188 11/12 274 BC CAPITAL LETTER L WITH CARON
    189, ' ', // [˝] 189 11/13 275 BD DOUBLE ACUTE ACCENT
    190, 'l', // [ľ] 190 11/14 276 BE SMALL LETTER L WITH CARON
    191, 'z', // [ż] 191 11/15 277 BF SMALL LETTER Z WITH DOT ABOVE
    192, 'R', // [Ŕ] 192 12/00 300 C0 CAPITAL LETTER R WITH ACUTE ACCENT
    193, 'A', // [Á] 193 12/01 301 C1 CAPITAL LETTER A WITH ACUTE ACCENT
    194, 'A', // [Â] 194 12/02 302 C2 CAPITAL LETTER A WITH CIRCUMFLEX
    195, 'A', // [Ă] 195 12/03 303 C3 CAPITAL LETTER A WITH BREVE
    196, 'A', // [Ä] 196 12/04 304 C4 CAPITAL LETTER A WITH DIAERESIS
    197, 'L', // [Ĺ] 197 12/05 305 C5 CAPITAL LETTER L WITH ACUTE ACCENT
    198, 'C', // [Ć] 198 12/06 306 C6 CAPITAL LETTER C WITH ACUTE ACCENT
    199, 'C', // [Ç] 199 12/07 307 C7 CAPITAL LETTER C WITH CEDILLA
    200, 'C', // [Č] 200 12/08 310 C8 CAPITAL LETTER C WITH CARON
    201, 'E', // [É] 201 12/09 311 C9 CAPITAL LETTER E WITH ACUTE ACCENT
    202, 'E', // [Ę] 202 12/10 312 CA CAPITAL LETTER E WITH OGONEK
    203, 'E', // [Ë] 203 12/11 313 CB CAPITAL LETTER E WITH DIAERESIS
    204, 'E', // [Ě] 204 12/12 314 CC CAPITAL LETTER E WITH CARON
    205, 'I', // [Í] 205 12/13 315 CD CAPITAL LETTER I WITH ACUTE ACCENT
    206, 'I', // [Î] 206 12/14 316 CE CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
    207, 'D', // [Ď] 207 12/15 317 CF CAPITAL LETTER D WITH CARON
    208, 'D', // [Đ] 208 13/00 320 D0 CAPITAL LETTER D WITH STROKE
    209, 'N', // [Ń] 209 13/01 321 D1 CAPITAL LETTER N WITH ACUTE ACCENT
    210, 'N', // [Ň] 210 13/02 322 D2 CAPITAL LETTER N WITH CARON
    211, 'O', // [Ó] 211 13/03 323 D3 CAPITAL LETTER O WITH ACUTE ACCENT
    212, 'O', // [Ô] 212 13/04 324 D4 CAPITAL LETTER O WITH CIRCUMFLEX
    213, 'O', // [Ő] 213 13/05 325 D5 CAPITAL LETTER O WITH DOUBLE ACUTE ACCENT
    214, 'O', // [Ö] 214 13/06 326 D6 CAPITAL LETTER O WITH DIAERESIS
    215, '*', // [×] 215 13/07 327 D7 MULTIPLICATION SIGN
    216, 'R', // [Ř] 216 13/08 330 D8 CAPITAL LETTER R WITH CARON
    217, 'U', // [Ů] 217 13/09 331 D9 CAPITAL LETTER U WITH RING ABOVE
    218, 'U', // [Ú] 218 13/10 332 DA CAPITAL LETTER U WITH ACUTE ACCENT
    219, 'U', // [Ű] 219 13/11 333 DB CAPITAL LETTER U WITH DOUBLE ACUTE ACCENT
    220, 'U', // [Ü] 220 13/12 334 DC CAPITAL LETTER U WITH DIAERESIS
    221, 'Y', // [Ý] 221 13/13 335 DD CAPITAL LETTER Y WITH ACUTE ACCENT
    222, 'T', // [Ţ] 222 13/14 336 DE CAPITAL LETTER T WITH CEDILLA
    223, ' ', // [ß] 223 13/15 337 DF SMALL GERMAN LETTER SHARP s
    224, 'r', // [ŕ] 224 14/00 340 E0 SMALL LETTER R WITH ACUTE ACCENT
    225, 'a', // [á] 225 14/01 341 E1 SMALL LETTER A WITH ACUTE ACCENT
    226, 'a', // [â] 226 14/02 342 E2 SMALL LETTER A WITH CIRCUMFLEX
    227, 'a', // [ă] 227 14/03 343 E3 SMALL LETTER A WITH BREVE
    228, 'a', // [ä] 228 14/04 344 E4 SMALL LETTER A WITH DIAERESIS
    229, 'l', // [ĺ] 229 14/05 345 E5 SMALL LETTER L WITH ACUTE ACCENT
    230, 'c', // [ć] 230 14/06 346 E6 SMALL LETTER C WITH ACUTE ACCENT
    231, 'c', // [ç] 231 14/07 347 E7 SMALL LETTER C WITH CEDILLA
    232, 'c', // [č] 232 14/08 350 E8 SMALL LETTER C WITH CARON
    233, 'e', // [é] 233 14/09 351 E9 SMALL LETTER E WITH ACUTE ACCENT
    234, 'e', // [ę] 234 14/10 352 EA SMALL LETTER E WITH OGONEK
    235, 'e', // [ë] 235 14/11 353 EB SMALL LETTER E WITH DIAERESIS
    236, 'e', // [ě] 236 14/12 354 EC SMALL LETTER E WITH CARON
    237, 'i', // [í] 237 14/13 355 ED SMALL LETTER I WITH ACUTE ACCENT
    238, 'i', // [î] 238 14/14 356 EE SMALL LETTER I WITH CIRCUMFLEX ACCENT
    239, 'd', // [ď] 239 14/15 357 EF SMALL LETTER D WITH CARON
    240, 'd', // [đ] 240 15/00 360 F0 SMALL LETTER D WITH STROKE
    241, 'n', // [ń] 241 15/01 361 F1 SMALL LETTER N WITH ACUTE ACCENT
    242, 'n', // [ň] 242 15/02 362 F2 SMALL LETTER N WITH CARON
    243, 'o', // [ó] 243 15/03 363 F3 SMALL LETTER O WITH ACUTE ACCENT
    244, 'o', // [ô] 244 15/04 364 F4 SMALL LETTER O WITH CIRCUMFLEX
    245, 'o', // [ő] 245 15/05 365 F5 SMALL LETTER O WITH DOUBLE ACUTE ACCENT
    246, 'o', // [ö] 246 15/06 366 F6 SMALL LETTER O WITH DIAERESIS
    247, ' ', // [÷] 247 15/07 367 F7 DIVISION SIGN
    248, 'r', // [ř] 248 15/08 370 F8 SMALL LETTER R WITH CARON
    249, 'u', // [ů] 249 15/09 371 F9 SMALL LETTER U WITH RING ABOVE
    250, 'u', // [ú] 250 15/10 372 FA SMALL LETTER U WITH ACUTE ACCENT
    251, 'u', // [ű] 251 15/11 373 FB SMALL LETTER U WITH DOUBLE ACUTE ACCENT
    252, 'u', // [ü] 252 15/12 374 FC SMALL LETTER U WITH DIAERESIS
    253, 'y', // [ý] 253 15/13 375 FD SMALL LETTER Y WITH ACUTE ACCENT
    254, 't', // [ţ] 254 15/14 376 FE SMALL LETTER T WITH CEDILLA
    255, ' '  // [˙] 255 15/15 377 FF DOT ABOVE
    };
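    int ch; /* int, so that EOF can be told apart from the byte 255 */
    while ((ch = fgetc(fd)) != EOF)
    {
        if (ch >= 128)
        {
            /* row (ch - 128) of the map holds the replacement, no search needed */
            buffer[j] = mapa[ch - 128][1];
        }
        else
        {
            buffer[j] = ch;
        }
        j++;
    }
    /* write the j bytes that were stored; buffer is not NUL-terminated, so strlen() is unsafe here */
    fwrite(buffer, sizeof(char), j, fds);
    ZavriSoubor(fds);
    ZavriSoubor(fd);
}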
Well, C is no use to me, since I don't know it :o)))))
If you wanted it in Perl (or just wanted to pull the tables out of it), look for cstocs... or I think they should also be in the MySQL sources...
In my program I have this for removing diacritics (in Unicode, with the letters stored as UTF-8 byte pairs):
Code:
#include <string>
using std::string;

// UTF-8 byte pairs of the accented letters, written as signed char values (two ints per letter).
int arrayutf[96] = {-61, -127, -60, -116, -60, -114, -61, -119, -60, -102, -61, -115, -60, -67, -59, -121, -61, -109, -59, -104, -59, -96, -59, -92, -61, -102, -59, -82, -61, -99, -59, -67, -61, -95, -60, -115, -60, -113, -61, -87, -60, -101, -61, -83, -60, -66, -59, -120, -61, -77, -59, -103, -59, -95, -59, -91, -61, -70, -59, -81, -61, -67, -59, -66, -61, -124, -61, -117, -61, -106, -61, -100, -61, -92, -61, -85, -61, -74, -61, -68, -61, -76, -61, -108, -60, -71, -60, -70, -60, -67, -60, -66, -59, -108, -59, -107};
// ASCII replacement for each byte pair above, in the same order.
int arraywin[48] = {65, 67, 68, 69, 69, 73, 76, 78, 79, 82, 83, 84, 85, 85, 89, 90, 97, 99, 100, 101, 101, 105, 108, 110, 111, 114, 115, 116, 117, 117, 121, 122, 65, 69, 79, 85, 97, 101, 111, 117, 111, 111, 76, 108, 76, 108, 82, 114};

// Util is a class declared elsewhere in the program.
string Util::disableCzChars(string message) {
    string s = "";
    for(unsigned int j = 0; j < message.length(); j++) {
        // Go through signed char so the negative table values match even where plain char is unsigned.
        int zn = (signed char)message[j];
        // Past the last character use 0, which never matches a pair.
        int zn2 = (j + 1 < message.length()) ? (signed char)message[j+1] : 0;
        int zzz = -1;
        for(int l = 0; l < 96; l+=2) {
            if ((zn == arrayutf[l])&&(zn2 == arrayutf[l+1])) {
                zzz = l/2;
                break;
            }
        }
        if (zzz >= 0) {
            s += (char)(arraywin[zzz]);
            j++; // skip the second byte of the pair
        } else {
            s += message[j];
        }
    }
    return s;
}
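For the original poster, who wanted C, the same UTF-8 idea can be sketched by decoding each two-byte sequence into its Unicode code point and looking that up, instead of comparing raw byte pairs. This is only an illustration: the names `strip_czech_utf8` and `folds` are invented here, and the table covers just the Czech letters, not everything in the arrays above.

```c
#include <stddef.h>

/* Illustrative subset: Unicode code point of a Czech letter -> plain ASCII letter. */
struct fold { unsigned cp; char ascii; };

static const struct fold folds[] = {
    {0x00E1, 'a'}, {0x010D, 'c'}, {0x010F, 'd'}, {0x00E9, 'e'}, {0x011B, 'e'},
    {0x00ED, 'i'}, {0x0148, 'n'}, {0x00F3, 'o'}, {0x0159, 'r'}, {0x0161, 's'},
    {0x0165, 't'}, {0x00FA, 'u'}, {0x016F, 'u'}, {0x00FD, 'y'}, {0x017E, 'z'},
    {0x00C1, 'A'}, {0x010C, 'C'}, {0x010E, 'D'}, {0x00C9, 'E'}, {0x011A, 'E'},
    {0x00CD, 'I'}, {0x0147, 'N'}, {0x00D3, 'O'}, {0x0158, 'R'}, {0x0160, 'S'},
    {0x0164, 'T'}, {0x00DA, 'U'}, {0x016E, 'U'}, {0x00DD, 'Y'}, {0x017D, 'Z'}
};

/* Copies `in` to `out`, replacing known accented letters by their ASCII base.
 * `out` must have room for strlen(in) + 1 bytes. Returns the output length. */
size_t strip_czech_utf8(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    const size_t count = sizeof folds / sizeof folds[0];
    size_t n = 0;

    while (*p != '\0') {
        /* 110xxxxx 10xxxxxx is a two-byte UTF-8 sequence */
        if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            unsigned cp = ((unsigned)(p[0] & 0x1F) << 6) | (unsigned)(p[1] & 0x3F);
            size_t k;
            for (k = 0; k < count; k++)
                if (folds[k].cp == cp)
                    break;
            if (k < count) {
                out[n++] = folds[k].ascii;
                p += 2;              /* consume both bytes of the letter */
                continue;
            }
        }
        out[n++] = (char)*p++;       /* everything else is copied unchanged */
    }
    out[n] = '\0';
    return n;
}
```

Letters that are not in the table pass through untouched, so the table can be extended one row at a time.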
Ehm, do you really want to retype the encoding tables of every possible and impossible code page? There is 100% a ready-made library for this, but if you insist on doing it yourself, don't forget to restrict the range you use to characters 0-127 (or even further, to alphanumerics only).
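The last suggestion, cutting the output down to the 0-127 range or to alphanumerics only, could look roughly like this in C. Both function names are invented for this sketch, and the choice of a fill character is an assumption; run it after whatever mapping you use so that anything the table missed does not leak through.

```c
#include <ctype.h>

/* Hypothetical helper: replaces every byte outside 0-127 by `fill`, in place. */
void replace_non_ascii(char *s, char fill)
{
    for (; *s != '\0'; s++)
        if ((unsigned char)*s > 127)
            *s = fill;
}

/* Hypothetical stricter variant: keeps only ASCII letters and digits. */
void keep_ascii_alnum(char *s, char fill)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c > 127 || !isalnum(c))
            *s = fill;
    }
}
```

The stricter variant also wipes spaces and HTML punctuation, so it is probably only useful for things like identifiers or file names, not for whole pages.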