Sankaku Complex viewed from Google

  1. CC said:
    Google's image search seems to do a pretty poor job of indexing them too.
    Looking at the generated source code for an image page, a few possible causes come to light.

    1. Alt text repeats the URL's keywords, which might be considered keyword spam.
    2. Non-standard orig_height and orig_width attributes on the img element; I'm guessing the JavaScript functions need them for something.
    3. The count is visibility: hidden for people with JavaScript disabled; the site might get blacklisted for that.

    Not sure how any of the other image search engines work as I don't use them :). Search engine optimization has only been worth it for Google for me, as in the country where I live Google's market share is over 9000! errr, like 98%.

    Alt text is the main determinant of the image search's keywords. If it isn't there you can pretty much forget about the image coming up; but as you say, duplicates may not be good...

    Count? Which part were we talking about? Not sure, but I heard "display: none;" is ok? Certainly it is used on the main site without issue.

    Posted 7 years ago
  2. CC

    <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
    <noscript>
    <div style="display:none;" class="statcounter">

    </div>
    </noscript>

    That bit might cause some problems. No idea how it would be useful to serve something hidden to someone who does not have JavaScript enabled.
    Something like "No stats for you unless you enable javascript" would make more sense.

    No idea on what basis crawlers decide that hidden content is something to blacklist, but if it is, for example, hidden content as a percentage of total page content, then a gallery would need very little hidden content to get blacklisted.

    [edit]
    Looks like I found the real cause..
    Remember me having problems fetching the xml stuff in PHP? I'm guessing it's blocking google as well.

    You might find this interesting. Normal google results:
    a whopping 12,500 results
    Now let's do an image search limited to the same domain:
    a mere 83 results.

    Even worse, for comparison a moe.imouto.org image search gets 9 results!

    Normally when blocking automatic requests, exceptions are added for search engines.. think someone forgot to do that.

    Posted 7 years ago
  3. You broke the site CC! Nice job.

    (looks like the HTML in your post was not sanitised by the forum feed display... I believe it has a strip tags feature so I will restore it when I have time to fiddle)

    Posted 7 years ago
  4. CC

    Huh? What did I do?
    The [ code ] tag overflowing?
    Add the following CSS for the pre and code selectors:

    pre, code
    {
    display: block;
    width: 580px;
    overflow-x: scroll; /* CSS3, won't validate */
    }

    To fix the brs:
    pre br, code br
    {
    display: none;
    }

    If it was anything else.. my apologies though I've no idea what I did.

    Anyway, it's strange. Using the Firefox user-agent add-on, I loaded an XML file with a ton of user-agent strings, and according to http://www.useragentstring.com/ I am currently the Googlebot. I can still access chan.sankakucomplex.com just fine though. I even tried accessing an image directly and did a hard refresh; it shows up no problem.
    (I logged out to prevent it from filtering me out that way.)
    So maybe I nubed up my own code before and Google is biting the dust due to robots.txt or something?
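The same check can be scripted instead of done through a browser add-on. A minimal sketch in Python, assuming the commonly published Googlebot user-agent string; `build_spoofed_request` is a hypothetical helper name, and nothing is sent over the network unless the commented-out lines are enabled:

```python
import urllib.request

# Commonly published Googlebot user-agent string (assumed here for illustration).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_spoofed_request(url, user_agent=GOOGLEBOT_UA):
    """Build a GET request that announces itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_spoofed_request("http://chan.sankakucomplex.com/")
# To actually send it and inspect the HTTP status:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
print(req.get_header("User-agent"))
```

This only changes the header. A server that identifies crawlers by IP address rather than by user agent would still see an ordinary client.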

    EDIT
    Ah I see, the latest forum posts thing doesn't do a htmlspecialchars(). In that case I blame bad programming instead of myself ;)
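For illustration, this is what escaping does to a snippet like the one posted earlier. The sketch uses Python's `html.escape`, which plays roughly the role htmlspecialchars() plays in PHP; `render_feed_item` is a hypothetical helper, not the forum's actual code:

```python
import html


def render_feed_item(post_body):
    """Escape a post body so markup in it is displayed as text, not interpreted."""
    return html.escape(post_body, quote=True)


snippet = '<script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>'
print(render_feed_item(snippet))
```

After escaping, the angle brackets and quotes become entities, so the browser shows the tag instead of running it.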

    Posted 7 years ago
  5. CC

    Bleh edit time exceeded.

    Looks like my "I R Searchengine" disguise failed, as linkinstreets page managed to sniff me out as a fake. So.. maybe google gets served different content from me-the-pretender still.
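One common way for a site to catch a faked crawler user agent (not necessarily what that page does) is the reverse-then-forward DNS check that Google documents for verifying Googlebot. A minimal sketch in Python; the function names are hypothetical, and the lookups are passed in as arguments so the logic can be tried without network access:

```python
import socket

# Domains that Google documents for its crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """True if the reverse-DNS name sits under a Google crawler domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def _reverse(ip):
    """Reverse-resolve an IP address to its primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def is_real_googlebot(ip, reverse_lookup=_reverse, forward_lookup=socket.gethostbyname):
    """Reverse-resolve the IP, check the domain, then confirm the name resolves back."""
    try:
        hostname = reverse_lookup(ip)
        return is_google_hostname(hostname) and forward_lookup(hostname) == ip
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

A client that merely sets a Googlebot user agent fails this check, because its IP address does not reverse-resolve to a Google hostname.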

    Posted 7 years ago
  6. You're quite right of course, it was bad coding on the defaults. It would have happened eventually... Going to take a look at this.

    Did you check the robots.txt? I think it is mostly unmodified. Perhaps I should add a blanket exception for the image crawler, or Google completely?

    Interestingly, I saw that it is using the softsubs as the description for the images indexed - Beako gets "Damn all you lolicon!" as her description.

    By the way, I see 227 images (using .com) - different datacentre perhaps...

    Posted 7 years ago
  7. The stats for chan.sankakucomplex.com are ludicrous:

    images.google.com / referral 90 45.66 00:14:21 81.11% 18.89%

    The site gets on average 7,000 visits a day, and these stats are for a whole month. Need to get indexed.

    Posted 7 years ago
  8. CC

    The robots.txt is
    User-agent: ia_archiver
    User-agent: Internet Ninja 6.0
    Disallow: /
    User-agent: *
    Disallow: /artist/edit
    Disallow: /artist/update
    Disallow: /comment/show
    Disallow: /comment/create
    Disallow: /pool/add_post
    Disallow: /pool/remove_post
    Disallow: /post/atom
    Disallow: /post/upload
    Disallow: /post/create
    Disallow: /post/destroy
    Disallow: /post/tag_history
    Disallow: /post/update
    Disallow: /note/history
    Disallow: /tag/edit
    Disallow: /tag/update
    Disallow: /tag/mass_edit
    Disallow: /wiki/edit
    Disallow: /wiki/update
    Disallow: /wiki/revert
    Disallow: /wiki/history
    Disallow: /wiki/rename
    Disallow: /user

    User-agent: HTTrack
    Disallow: /post

    Can't say I have much experience with robots.txt, other than manually adding a sitemap.xml location for the 3 big spiders (Yahoo/MSN/Google) to guarantee they'll find every relevant link, but yours seems to:
    Disallow ia_archiver and Internet Ninja 6.0 from doing anything.
    Disallow HTTrack from accessing anything in /post

    And apply the big list to ANY 'robot'/spider.
    Link to an image page is like:
    http://chan.sankakucomplex.com/post/show/122870/c-c-chibi-code_geass-lelouch_vi_brittania
    Which doesn't seem to conflict with any of the rules in robots.txt.

    Path to an image is:
    http://chan.sankakucomplex.com/data/f5/bf/f5bfec62b8e3f5e05fc70935ac4dc5a0.jpg
    Thumbnail, similar:
    http://chan.sankakucomplex.com/data/preview/f5/bf/f5bfec62b8e3f5e05fc70935ac4dc5a0.jpg
    Again, both seem safe.
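The manual reading above can be cross-checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the rules are an abridged copy of the file quoted in this post, `is_allowed` is a hypothetical helper, and Python's parser is not guaranteed to interpret every rule the way a given search engine does:

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the robots.txt quoted above (not the complete file).
ROBOTS_TXT = """\
User-agent: ia_archiver
User-agent: Internet Ninja 6.0
Disallow: /
User-agent: *
Disallow: /post/atom
Disallow: /post/upload
Disallow: /post/create
Disallow: /user

User-agent: HTTrack
Disallow: /post
"""


def is_allowed(user_agent, url):
    """Check a URL against the rules above for the given crawler name."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "http://chan.sankakucomplex.com/post/show/122870/c-c-chibi-code_geass-lelouch_vi_brittania"
IMAGE = "http://chan.sankakucomplex.com/data/f5/bf/f5bfec62b8e3f5e05fc70935ac4dc5a0.jpg"

for agent in ("Googlebot", "HTTrack", "ia_archiver"):
    print(agent, is_allowed(agent, PAGE), is_allowed(agent, IMAGE))
```

With these rules the parser should agree with the reading above: a generic crawler may fetch both the post page and the image file, while HTTrack is kept out of /post and ia_archiver out of everything.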

    So all I can think of is that the danbooru devs added something to sniff out crawlers and prevent them from indexing every image, as that would require an impressive amount of bandwidth. MSN + Yahoo + Google image search = 3.
    3 * size of average image * total number of images = errr, many terabytes?

    Don't think not displaying something to spiders would be all that permanently lethal. The other way around is worse, displaying contents to bots that isn't normally accessible (and getting reported).
    A good example would be Experts Exchange. They used to do that, and you could visit the Google cache to find the answer you were looking for, as visiting the page yourself would leave you with a "sign up to see the answer". I think they were reported for it, as they have changed their ways: instead of sniffing user-agent strings they took a cookie-based approach.
    Block the cookies from their domain and you can see the answers at the bottom of the page ( I don't see much wrong in scamming scammers, but feel free to moderate my post if you consider this tip 'borderline' )

    If any file is to blame, I suspect it would be /app/controllers/application.rb
    but as I have zero experience with Ruby, I don't trust myself ;).
    It seems to be the only file that contains the string 'google' where it's not meant as an example, and it handles the logic of whether to throw an error or display a page. (Though I'm sure that's old news to Ruby on Rails developers.)

    Posted 7 years ago
  9. I have reasonably extensive knowledge of robots.txt and nofollow and so on.

    You think it would be safe to allow the crawl if it is being impeded? The bandwidth can't be that much of a problem, the site is not that large...

    "Don't think not displaying something to spiders would be all that permanently lethal. The other way around is worse, displaying contents to bots that isn't normally accessible (and getting reported)."

    What do you mean here? Not normally accessible? Getting reported for what?

    I noticed Experts Exchange's little scam. Now they sneakily bury the answers down at the bottom...

    Posted 7 years ago
  10. lol, Experts Exchange. I usually use the Google cache feature, as it shows the answer at the top.

    Posted 7 years ago
  11. Stranger and stranger. It seems there is code to hand off requests to 503 oblivion if load is at a predetermined level, but it is all disabled via load_average_threshold=false here. Furthermore, it explicitly allows Google at all times if I am not mistaken?
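The behaviour described in this post can be sketched roughly as follows. This is a Python illustration of the logic as the post describes it, not the danbooru source (which is Ruby); `should_reject`, its parameters, and the substring test on the user agent are all assumptions:

```python
def should_reject(load_average, threshold, user_agent):
    """Return True if the request should get a 503 because the server is overloaded."""
    if not threshold:
        # Threshold disabled (False or None), as in load_average_threshold=false.
        return False
    if "googlebot" in user_agent.lower():
        # The crawler is let through regardless of load.
        return False
    return load_average > threshold


# With the threshold disabled, nothing is rejected however high the load is.
print(should_reject(25.0, False, "Mozilla/5.0"))
```

If the real code follows this shape, a disabled threshold means the 503 path is never taken, which would rule it out as the reason the crawler stays away.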

    The other Gbot mention explicitly disables caching for Gbot, XML and JSON...

    Posted 7 years ago
  12. CC

    I thought the 503 part was commented out.
    At least, the whole block is in the same color when the file is opened with Eclipse + the Aptana Studio Ruby on Rails plugin. So I assumed that an = sign at the start of a line begins a comment, which ends at the next line with an = sign at the start
    (like the /* and */ combo in most languages).
    Even more so because the block above is prefixed with number signs (#) and is in the same color, and # is a one-line comment prefix in PHP.

    But I'm not the one to answer these kinds of questions; my knowledge is limited to Java, C#, VB, PHP, ASP.net :)
    I think the best thing to do would be to contact the danbooru developers, as I'm sure this is either a conscious decision to prevent bandwidth abuse or an 'oops' they're interested in fixing.

    [EDIT]
    Pictures say more than words; this is what I'm seeing ;)
    (If you're interested, both Eclipse and the Ruby on Rails plugin are available at no cost.)
    ( Ye I blurred out the other project as it's probably better off not being associated with umm sites with unicef questionable contents ;) )

    Attachments

    1. eclipse-aptana-ror.png 7 years old
    Posted 7 years ago
  13. Working with Disney there perhaps?

    I just inspected the site with Webmaster Tools. No real errors. I drastically increased the crawl rate as this may be a factor.

    Perhaps this is an issue with the image bot itself? As I understand it Google will index only images they feel they need, rather than the lot. This does not sit well with ignoring all the other sites too though.

    Posted 7 years ago
  14. CC

    orig_width="450"
    orig_height="629"
    After all then?

    You could try setting up a test case: simply a single HTML page with those invalid attributes on an image. Force Google to attempt to index it, then see if you can find the picture via image search.
    It seems awfully strange for all 3 major search engines to trip over such a minor issue though; most parsers would simply ignore it afaik.
    Keep reading, more probable cause below ;)
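A test page along those lines can be checked locally first. The sketch below uses Python's standard `html.parser` to show that a tolerant parser reads the non-standard attributes as ordinary name/value pairs rather than failing. It says nothing about how any search engine's own parser behaves. `ImgAttrAudit`, `KNOWN_IMG_ATTRS` and `test.jpg` are made up for the example:

```python
from html.parser import HTMLParser

# Attributes treated as "known" for <img> in this sketch (a deliberately short list).
KNOWN_IMG_ATTRS = {"src", "alt", "width", "height", "title", "id", "class"}


class ImgAttrAudit(HTMLParser):
    """Collect attributes on <img> tags that are not in KNOWN_IMG_ATTRS."""

    def __init__(self):
        super().__init__()
        self.unknown = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.unknown += [name for name, _ in attrs if name not in KNOWN_IMG_ATTRS]


TEST_PAGE = '<html><body><img src="test.jpg" alt="test" orig_width="450" orig_height="629"></body></html>'

audit = ImgAttrAudit()
audit.feed(TEST_PAGE)
print(audit.unknown)
```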

    Maybe this will shed a little light..
    Awfully long Microsoft Live Search link:
    First log off from chan.sankakucomplex.com as being logged in seems to result in an image displayed.

    Clicking any of the images on the left and then "Show full image size" on the right results in:
    403 Forbidden
    nginx/0.5.33

    Now open chan.sankakucomplex.com in a tab and log back in, then go back to the tab with the 403 error and hit refresh (Ctrl+F5 for a hard forced refresh). The image still refuses to load.
    Go to the address bar and hit Enter to force a new page request, and the image loads.

    There is some referer sniffing going on it seems.
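The observed behaviour fits a simple Referer allow-list. A minimal sketch in Python of that kind of check, assuming the rule is "serve the image when there is no Referer or when it comes from the site itself"; the real nginx configuration is not shown in this thread, and `status_for_image_request` and `ALLOWED_REFERER_HOSTS` are hypothetical names:

```python
from urllib.parse import urlparse

# Hosts whose pages may load images directly. Only the site's own host is assumed
# here; further hosts could be added to the set as exceptions.
ALLOWED_REFERER_HOSTS = {"chan.sankakucomplex.com"}


def status_for_image_request(referer):
    """Return 200 if the image may be served for this Referer header, else 403."""
    if not referer:
        # No Referer at all, e.g. the URL was entered in the address bar.
        return 200
    host = (urlparse(referer).hostname or "").lower()
    return 200 if host in ALLOWED_REFERER_HOSTS else 403
```

Under this rule a refresh on the search results page keeps sending the search engine's Referer and keeps getting a 403, while pressing Enter in the address bar sends no Referer and gets the image.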

    Posted 7 years ago
  15. They index 15,000 DB images. But moe, SC and gel (PHP DB clone) only get a few hundred each. It can't be the tags if DB is doing so well? Perhaps a PageRank issue?

    Posted 7 years ago
  16. CC

    Umm bump :)
    You replied before I edited the post with new info, so read the above post again ;)

    Posted 7 years ago
  17. I know about the referrer sniffing because I set it up. The site is not an image server for random forums and blogs.

    I could add exceptions for search engine referrers - have you checked the other sites though? Do they restrict referrers? If they do not then it suggests a different issue still.

    Posted 7 years ago
  18. Moe allows all kinds of hotlinking. All its image results are actually from other sites hotlinking it. DB has no such issue - its pages are actually the sources.

    Posted 7 years ago
  19. CC

    Well, I ran out of ideas besides giving the danbooru devs a poke.

    Posted 7 years ago